San Francisco Neighborhood Analysis

Victory Sharaf
5 min read · Jan 22, 2020

This project is part of the Capstone Project of Coursera's "IBM Applied Data Science Capstone" course. I tried to write as clearly as possible for beginner data scientists. You can check my Jupyter Notebook on GitHub or ask questions in the comments.


Introduction

San Francisco is a vibrant city with many neighborhoods, each with its own character. Some are quiet and cozy, with conveniently located stores, while others offer plenty of fun and nightlife. Choosing a neighborhood to live in or open a business in can be a complicated task, but with the help of Foursquare and crime data, we can make it a little bit easier.

Target Audience

Who would be interested in this project?

  1. People interested in moving to San Francisco and looking for a perfect neighborhood for their needs
  2. Business owners looking to expand their business to a new location
  3. A beginner data scientist who may use this research as an example

Data Preparation

For this project we will pull the data from multiple sources:

  1. Wikipedia — list of SF neighborhoods
  2. Google Geocoding — geocoordinates of neighborhoods
  3. Kaggle SF Police Incidents Dataset 2018–2019 — crime data
  4. Foursquare — venues data for closest stores, parks, and attractions for each neighborhood

1. Location Data

First, we need a full list of all SF neighborhoods. The Wikipedia article List of neighborhoods in San Francisco is a great place to start. Can you guess how many there are in SF? 119!

Beautiful Soup is a Python library for pulling data out of HTML. We will use it to parse the Wikipedia page and grab all headings with the "mw-headline" class, and we are all set:

How to use BeautifulSoup for parsing HTML
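A minimal sketch of the parsing step. The function name and the inline sample HTML are illustrative; the real code would fetch the live Wikipedia page (e.g. with `requests`) and feed its HTML into the same function.

```python
from bs4 import BeautifulSoup

def parse_neighborhoods(html):
    """Extract neighborhood names from headline spans with the 'mw-headline' class."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True)
            for span in soup.find_all("span", class_="mw-headline")]

# A tiny sample in the same shape as the Wikipedia page markup:
sample = """
<h2><span class="mw-headline" id="Alamo_Square">Alamo Square</span></h2>
<h2><span class="mw-headline" id="Anza_Vista">Anza Vista</span></h2>
"""
neighborhoods_list = parse_neighborhoods(sample)
print(neighborhoods_list)  # ['Alamo Square', 'Anza Vista']
```

On the real page, the same call returns all 119 neighborhood names.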

For geolocation data, we will use the Geocoding API. To get more information about it, follow the Geocoding Developer Guide.

The get_json_neig_data function takes the neighborhoods_list (which we just built with Beautiful Soup) and returns JSON. We don't want to hit the API too often, so we cache the response in a file.

get_json_neig_data
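A sketch of the caching idea, under assumptions: the cache file name and the `fetch` parameter are hypothetical (in the real notebook the fetch step is a Geocoding API request), but the cache-then-call pattern is the point.

```python
import json
import os

def get_json_neig_data(neighborhoods_list, fetch, cache_file="sf_geo_cache.json"):
    """Return geocoding JSON per neighborhood, reading from a local cache
    so the API is hit at most once per neighborhood."""
    cache = {}
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)
    for name in neighborhoods_list:
        if name not in cache:
            # Only uncached neighborhoods trigger an API call
            cache[name] = fetch(name)
    with open(cache_file, "w") as f:
        json.dump(cache, f)
    return cache
```

Here `fetch` would wrap the actual Geocoding API request; a second run with the same list is served entirely from the file.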

We create a new pandas DataFrame with neighborhood names, latitudes, and longitudes from the JSON response from the previous step.

sf_df
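A sketch of this step. The shape of `geo_json` below is an assumption (the real Google response nests coordinates deeper, under `results[0]["geometry"]["location"]`); the column names follow the article's convention.

```python
import pandas as pd

# Assumed shape of the cached geocoding response: {name: {"lat": ..., "lng": ...}}
geo_json = {
    "Alamo Square": {"lat": 37.7764, "lng": -122.4346},
    "Anza Vista":   {"lat": 37.7809, "lng": -122.4424},
}

sf_df = pd.DataFrame(
    [(name, coords["lat"], coords["lng"]) for name, coords in geo_json.items()],
    columns=["Neighborhood", "Latitude", "Longitude"],
)
print(sf_df)
```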

Sometimes the Geocoding API makes mistakes, so be careful: verify that all columns are correct.

With this data, we can build a map of the neighborhoods using the Python visualization library Folium.

Build the map with Folium

Map of San Francisco neighborhoods

This is all the data about the locations that we need, so we can continue.

2. Crime Data

To analyze criminal activity in each neighborhood, we use the Police Department Incident Reports: 2018 to Present dataset from Kaggle. It contains location, time, category, and other miscellaneous data from the SF Police Department.

We filter the data to exclude crime categories, such as traffic collisions and suspicious activity, that don't relate to the quality of life in a neighborhood. We also drop miscellaneous incident fields that play no role in our analysis.

After dropping everything we don't need and counting the number of crimes in each neighborhood, we create the crime_count_df DataFrame.

crime_count_df (sorted by Number of Crimes)
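A sketch of the filter-and-count step on toy data. The column names and excluded categories mirror the Kaggle dataset but are assumptions here:

```python
import pandas as pd

# Toy incident data in the same shape as the Kaggle dataset (column names assumed)
incidents_df = pd.DataFrame({
    "Analysis Neighborhood": ["Mission", "Mission", "Tenderloin", "Oceanview"],
    "Incident Category": ["Larceny Theft", "Traffic Collision", "Assault", "Burglary"],
})

# Drop categories that don't reflect quality of life in a neighborhood
excluded = ["Traffic Collision", "Suspicious Occ"]
filtered = incidents_df[~incidents_df["Incident Category"].isin(excluded)]

crime_count_df = (filtered.groupby("Analysis Neighborhood")
                  .size()
                  .reset_index(name="Number of Crimes")
                  .sort_values("Number of Crimes", ascending=False))
print(crime_count_df)
```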

The police department labels some areas with slash-separated neighborhood names. In those cases, we split the number of incidents equally between the component areas.
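The splitting rule can be sketched like this (the area names and counts are made up for illustration):

```python
import pandas as pd

# Toy counts where the police data uses a slash-separated area name
crime_count_df = pd.DataFrame({
    "Neighborhood": ["Oceanview/Merced/Ingleside", "Mission"],
    "Number of Crimes": [300, 900],
})

rows = []
for _, row in crime_count_df.iterrows():
    parts = row["Neighborhood"].split("/")
    for part in parts:
        # Split the incident count equally between the slash-separated areas
        rows.append({"Neighborhood": part.strip(),
                     "Number of Crimes": row["Number of Crimes"] / len(parts)})
split_df = pd.DataFrame(rows)
print(split_df)
```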

Then we join the neighborhoods DataFrame with the crimes DataFrame, filtering out areas with no available data.

sf_crime_join_df (sorted by Number of Crimes)
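One way to express the join, on toy frames; an inner merge drops the areas that appear in only one of the two DataFrames:

```python
import pandas as pd

sf_df = pd.DataFrame({"Neighborhood": ["Mission", "Oceanview", "Presidio"],
                      "Latitude": [37.76, 37.72, 37.80],
                      "Longitude": [-122.42, -122.46, -122.47]})
crime_count_df = pd.DataFrame({"Neighborhood": ["Mission", "Oceanview"],
                               "Number of Crimes": [900, 100]})

# Inner join keeps only neighborhoods present in both frames,
# so areas with no crime data are filtered out automatically
sf_crime_join_df = sf_df.merge(crime_count_df, on="Neighborhood", how="inner")
print(sf_crime_join_df)
```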

Now we can find the most criminal neighborhoods (Mission, Tenderloin, and South of Market lead by a wide margin) and the most peaceful (Treasure Island, Oceanview, and Ingleside).

plotting

3. Venues Data (Foursquare API)

The Foursquare API provides information about venues and their locations. We will take every neighborhood location from sf_crime_join_df and look for the nearest venues.

The code below takes neighborhood names, latitudes, and longitudes from sf_crime_join_df, finds nearby venues, and returns the venues_df DataFrame.

getNearbyVenues
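A sketch of the Foursquare lookup, split into a network call and a pure parser. The function names and default parameters are assumptions; the request targets the v2 `venues/explore` endpoint that this Coursera course uses, and the parser handles its typical response shape:

```python
import requests

def get_nearby_venues_json(lat, lng, client_id, client_secret, radius=500, limit=100):
    """Query the Foursquare 'explore' endpoint for venues near a point."""
    url = "https://api.foursquare.com/v2/venues/explore"
    params = {"client_id": client_id, "client_secret": client_secret,
              "v": "20200101", "ll": f"{lat},{lng}",
              "radius": radius, "limit": limit}
    return requests.get(url, params=params).json()

def parse_venues(neighborhood, resp_json):
    """Flatten one explore response into (neighborhood, venue, category) rows."""
    items = resp_json["response"]["groups"][0]["items"]
    return [(neighborhood,
             item["venue"]["name"],
             item["venue"]["categories"][0]["name"]) for item in items]
```

Looping `parse_venues` over every neighborhood's response and wrapping the rows in `pd.DataFrame` yields `venues_df`.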

venues_df

Next, we need to convert each specific venue category into a general top-level category and remove excess categories such as Professional & Other Places or Residence.

venues_df
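A sketch of the category roll-up. The mapping dict here is hand-made for illustration; in practice it would come from Foursquare's category hierarchy:

```python
import pandas as pd

# Hypothetical mapping from specific categories to Foursquare's top-level ones
category_map = {
    "Coffee Shop": "Food",
    "Pizza Place": "Food",
    "Cocktail Bar": "Nightlife Spot",
    "Doctor's Office": "Professional & Other Places",
}

venues_df = pd.DataFrame({"Neighborhood": ["Mission"] * 4,
                          "Venue Category": list(category_map.keys())})
venues_df["General Category"] = venues_df["Venue Category"].map(category_map)

# Drop categories that don't help compare neighborhoods
drop = ["Professional & Other Places", "Residence"]
venues_df = venues_df[~venues_df["General Category"].isin(drop)]
print(venues_df)
```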

This is all the data we needed for our study. Let’s move on.

Methodology

We’ll be using k-means clustering. For this, we join the existing DataFrames sf_crime_join_df and venues_df; for venues_df, we count the venues per general category:

ven_categoty_crime_df

Here is the result:

Empirically, it is clear that the optimal number of clusters is five.
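A minimal sketch of the clustering step with scikit-learn. The toy feature frame stands in for the joined category counts plus crime numbers; scaling first keeps the large crime counts from dominating the venue counts:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy per-neighborhood features (column names illustrative): venue-category
# counts joined with each neighborhood's crime count
features_df = pd.DataFrame({
    "Food": [12, 3, 25, 1, 8, 30],
    "Nightlife Spot": [2, 0, 15, 0, 1, 20],
    "Shop & Service": [10, 4, 18, 2, 6, 22],
    "Number of Crimes": [2748, 900, 20000, 400, 1500, 18000],
})

# Standardize so every feature contributes on a comparable scale
X = StandardScaler().fit_transform(features_df)
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10).fit(X)
features_df["Cluster"] = kmeans.labels_
print(features_df["Cluster"].tolist())
```

The real run fits the same pipeline on all the joined neighborhoods, and each neighborhood's label is what gets colored on the map below.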

Clusters on map

Results

Cluster 1 — Fun and Convenient

The first cluster is relatively safe and livable, and great for food and shopping. The mean number of crimes is 2,748, which is low compared to the rest.

Cluster 2 — Wild Side of the City

There are a lot of entertainment and nightlife spots and recreation areas here. But the mean number of crimes in the second cluster is 20,000, a record! These are the three most dangerous neighborhoods in all of San Francisco.

Cluster 3 — Downtown

A very average cluster. There are a lot of shops, food places, and nightlife here, and still many crimes (though not as many as in the second cluster).

Cluster 4 — Safe suburbs

The safest and most livable cluster. Crime is low, and there are many shops, restaurants, and transport options.

Cluster 5 — Not so Safe Suburbs

Another average cluster, with few interesting places. Crime here is relatively low, but still not as low as in the fourth cluster.

Conclusion

The purpose of this project was to find a neighborhood to live in or open a business in. The venues were identified using the Foursquare and Geocoding APIs, and the neighborhoods were then grouped into five clusters.
