In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:
| Table | Description |
|---|---|
| crime | Crime reports dating back to 2010. |
| mobile_food_locations | List of all locations where mobile food vendors sell. |
| mobile_food_permits | List of all mobile food vendor permits. More details here. |
| mobile_food_schedule | Schedules for mobile food vendors. |
| noise | Noise complaints dating back to August 2015. |
| parking | List of all parking lots. |
| parks | List of all parks. |
| schools | List of all schools. |
| zillow | Zillow rent and housing statistics dating back to 1996. More details here. |
The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.
Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
Shapefiles for San Francisco Neighborhoods are available here.
import sqlite3 as sql
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np
import geopandas as gpd
plt.rcParams['figure.figsize'] = (12, 12)
import geocoder
from matplotlib.collections import PatchCollection
import shapely.geometry as geom
con = sql.connect('sf_data.sqlite')
Exercise 1.1. Which mobile food vendor(s) sells at the most locations?
# Join the mobile_food_ tables
vendors = pd.read_sql("SELECT FacilityType, Applicant, Address FROM mobile_food_schedule "
"INNER JOIN mobile_food_locations ON mobile_food_schedule.locationid = mobile_food_locations.locationid "
"INNER JOIN mobile_food_permits ON mobile_food_schedule.permit = mobile_food_permits.permit", con)
# Create a vendors table in the database
vendors.to_sql("vendors", con, if_exists = 'replace', index = False)
# Count the distinct addresses per applicant, then keep the applicant with the most
vendor_counts = pd.read_sql("SELECT Applicant, COUNT(DISTINCT Address) AS Count "
                            "FROM vendors GROUP BY Applicant ORDER BY Count DESC", con)
vendor_counts.head(1)
May Catering sells at the most locations.
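The same answer can be cross-checked without the round trip through SQL, using pandas directly on the joined frame. A minimal sketch with made-up vendor data (the real frame comes from the join above):

```python
import pandas as pd

# Made-up stand-in for the joined vendors frame built above
toy_vendors = pd.DataFrame({
    "Applicant": ["May Catering", "May Catering", "May Catering", "Taco Cart", "Taco Cart"],
    "Address": ["1 Main St", "2 Oak Ave", "3 Pine Rd", "1 Main St", "1 Main St"],
})

# nunique counts distinct addresses per vendor; idxmax picks the largest
locations_per_vendor = toy_vendors.groupby("Applicant")["Address"].nunique()
top_vendor = locations_per_vendor.idxmax()
print(top_vendor)  # May Catering
```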
Exercise 1.2. Ask 5 questions about San Francisco and use the database to analyze them. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.
You should try to come up with some questions on your own.
Please make sure to clearly state each of your questions in your submission.
crime = pd.read_sql("SELECT Category, PdDistrict, Lon, Lat, DayOfWeek, COUNT(*) "
"AS COUNT FROM crime GROUP BY PdDistrict, Category", con)
crime.to_sql("crime_count", con, if_exists = 'replace', index = False)
Find the most popular crime in each district (crime_pop)
crime_pop = crime.loc[crime.groupby('PdDistrict').COUNT.agg('idxmax')]
Find the popular crimes in San Francisco (see crime_sort)
crime_cat = crime.loc[crime.groupby('Category').COUNT.agg('idxmax')]
crime_sort = crime_cat.sort_values("COUNT", ascending = False)
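The groupby + idxmax pattern used for crime_pop above can be sanity-checked on a tiny frame; a minimal sketch with made-up counts:

```python
import pandas as pd

# Made-up stand-in for the aggregated crime counts above
toy_crime = pd.DataFrame({
    "PdDistrict": ["SOUTHERN", "SOUTHERN", "PARK", "PARK"],
    "Category":   ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "ARSON"],
    "COUNT":      [500, 300, 120, 10],
})

# idxmax returns the row label of the largest COUNT within each district,
# so .loc recovers the full row, i.e. each district's most common crime
toy_pop = toy_crime.loc[toy_crime.groupby("PdDistrict")["COUNT"].idxmax()]
print(toy_pop)
```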
crime_graph = pd.read_sql("SELECT Category, PdDistrict, Lon, Lat, DayOfWeek, COUNT FROM crime_count WHERE "
"(Category LIKE '%ARSON%' OR Category LIKE '%ASSAULT%' "
"OR Category LIKE '%DRUG/NARCOTIC%' OR Category LIKE '%LARCENY/THEFT%' "
"OR Category LIKE '%SEX OFFENSES, FORCIBLE%') ", con)
Use bar graphs to find the most dangerous and safest districts in San Francisco
import bokeh.io
bokeh.io.output_notebook()
from bokeh.charts import Bar, output_file, show
from bokeh.models import HoverTool
from bokeh.models import ColumnDataSource
source = ColumnDataSource(crime_graph)
bar = Bar(crime_graph, label='PdDistrict', values='COUNT', agg='mean', stack='Category', title="Popular Crime Distribution in Each District", tools = 'wheel_zoom, pan, hover')
hover = bar.select(dict(type=HoverTool))
hover.tooltips = [('District', '@PdDistrict'),('Type: ', '@Category')]
show(bar)
Use bar graphs to find the most dangerous and safest days of the week
bar = Bar(crime_graph, label='DayOfWeek', values='COUNT', agg='sum', stack='Category', title="Popular Crime Distribution by Day of Week", tools = 'wheel_zoom, pan, hover')
hover = bar.select(dict(type=HoverTool))
hover.tooltips = [('Day', '@DayOfWeek'),('Type: ', '@Category')]
show(bar)
As we can see from the graphs above, the Southern district is the most dangerous, while Park and Richmond are the safest. The most common crime in San Francisco is Larceny/Theft, while Arson is the least common among the top 5 crimes. Saturday is when the most crimes occurred, followed by Friday, Tuesday, and Monday; Wednesday is the safest day of the week in San Francisco.
parks = pd.read_sql("SELECT * FROM parks", con)
parks = parks.dropna()
parks.Acreage.describe()
my_map = Basemap(llcrnrlon=-122.55, llcrnrlat=37.7, urcrnrlon=-122.37, urcrnrlat=37.81, resolution="h",
projection="merc"
)
my_map.drawcoastlines()
my_map.readshapefile("sf_zip_shp/sf_zip_shp", "zipcodes")
for lon, lat, acre in zip(parks.Lon, parks.Lat, parks.Acreage):
    x, y = my_map(lon, lat)
    marker = 'o'
    # below the 25th percentile of acreage
    if acre < 0.32:
        marker = '+'
    # 25th-50th percentile
    elif acre <= 1.26:
        marker = '*'
    # 50th-75th percentile
    elif acre <= 4.39:
        marker = 'x'
    my_map.plot(x, y, marker=marker, color='m', markersize=12, markeredgewidth=2)
I used Basemap to display all the park locations in San Francisco. Circles represent parks with acreages above the 75th percentile of all SF parks; x marks parks between the 50th and 75th percentiles; stars mark the 25th to 50th percentiles; and plus signs mark parks below the 25th percentile. We can see that San Francisco has parks of all sizes, and the majority are spread fairly evenly across the city, so many residents can reach a park within a short distance. There are, however, not many parks near the coast.
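The quartile cutoffs hard-coded in the loop above (0.32, 1.26, 4.39) come from parks.Acreage.describe(); the same marker rule can be sketched with quantile. The acreages below are illustrative made-up numbers, not the real parks data:

```python
import pandas as pd

# Made-up acreages standing in for parks.Acreage
acreage = pd.Series([0.1, 0.3, 0.5, 1.0, 1.5, 3.0, 5.0, 10.0])
q25, q50, q75 = acreage.quantile([0.25, 0.50, 0.75])

def pick_marker(acre):
    # Same rule as the plotting loop: quartile -> marker shape
    if acre < q25:
        return '+'
    elif acre <= q50:
        return '*'
    elif acre <= q75:
        return 'x'
    return 'o'

markers = acreage.map(pick_marker)
print(markers.tolist())  # ['+', '+', '*', '*', 'x', 'x', 'o', 'o']
```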
# get the zipcodes for each district
district = crime.groupby('PdDistrict').head(1)
zillow = pd.read_sql("SELECT * FROM zillow", con)
zips = gpd.read_file("cb_2015_us_zcta510_500k")
zips = zips[zips.ZCTA5CE10.str.startswith("9")]
# Found zipcodes on Google
zips = zips[(pd.to_numeric(zips.ZCTA5CE10) >= 94102) & (pd.to_numeric(zips.ZCTA5CE10) <= 94188)]
def get_zip(df):
    """
    Look up the ZIP code for each row of the dataframe.
    Argument: df with Lon and Lat columns
    Return: GeoDataFrame with a new "zip" column
    """
    lonlat = [geom.Point(lon, lat) for lon, lat in zip(df.Lon, df.Lat)]
    df = gpd.GeoDataFrame(df, geometry = lonlat, crs = {'init' :'epsg:4326'})
    # Reproject the points to the same CRS as the zips shapefile.
    df = df.to_crs(zips.crs)
    zipcode = []
    for pt in df.geometry:
        matched = zips[zips.geometry.contains(pt)]
        if matched.shape[0] == 0:
            zipcode.append(np.NaN)
        else:
            zipcode.append(matched.ZCTA5CE10.iloc[0])
    df["zip"] = zipcode
    return df
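get_zip runs a point-in-polygon test against every ZIP polygon in a Python loop; GeoPandas can also do this in one vectorized call with gpd.sjoin, which is usually much faster (worth trying if the crime lookups are too slow). The core matching logic reduces to point-in-polygon lookups; here is a stdlib-only sketch with hypothetical bounding boxes standing in for the real ZIP polygons:

```python
import math

# Hypothetical ZIP "polygons" simplified to bounding boxes:
# zip -> (min_lon, min_lat, max_lon, max_lat). Real data uses the zips shapefile.
zip_boxes = {
    "94102": (-122.43, 37.77, -122.41, 37.79),
    "94110": (-122.43, 37.74, -122.40, 37.77),
}

def lookup_zip(lon, lat):
    # Mirror get_zip: return the first containing ZIP, or NaN when nothing matches
    for zcta, (x0, y0, x1, y1) in zip_boxes.items():
        if x0 <= lon <= x1 and y0 <= lat <= y1:
            return zcta
    return math.nan

print(lookup_zip(-122.42, 37.78))  # 94102
print(lookup_zip(-122.41, 37.75))  # 94110
print(lookup_zip(-122.50, 37.70))  # nan
```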
school = pd.read_sql("SELECT Entity, Name, LowerGrade, UpperGrade, "
"GradeRange, Address, Lat, Lon, COUNT (*) "
"AS COUNT FROM schools GROUP BY Entity, GradeRange", con)
school_zip = get_zip(school)
# change dtypes
school_zip['zipcode'] = school_zip['zip'].astype(int)
district_zip = get_zip(district)
# drop Nan and change dtypes
district_zip = district_zip.dropna()
district_zip['zipcode'] = district_zip['zip'].astype(int)
school_zillow = pd.merge(school_zip, zillow, left_on=['zipcode'], right_on=['RegionName'], how = 'inner')
crime_zillow = pd.merge(district_zip, zillow, left_on=['zipcode'], right_on=['RegionName'], how = 'inner')
from bokeh.charts import BoxPlot, output_file, show
from bokeh.layouts import row
box = BoxPlot(school_zillow, values='MedianSoldPricePerSqft_AllHomes', label='Entity',
title="Home Price Box Plot in Each Entity", plot_width=500, legend = False, tools = 'wheel_zoom, pan,')
box2 = BoxPlot(crime_zillow, values='MedianSoldPricePerSqft_AllHomes', label='PdDistrict',
title="Home Price Box Plot in Each District", plot_width=500, legend = False, tools = 'wheel_zoom, pan,')
show(row(box, box2))
I first find the zip codes for all the districts and entities in the database, then separately merge these two dataframes with the zillow dataframe. Lastly, I use boxplots to display the housing price distributions. I only look up one zip code per district because finding a zip code for every record in the crime dataset takes too long.
As the entity boxplots show, there are no obvious differences in housing price (measured by median sold price per square foot) between the Private and SFUSD entities, and both categories have outliers. On the other hand, housing prices in SFCCD are much lower than in the other two entities. The SFCCD boxplot is also comparatively short, which suggests the prices cluster tightly together.
From the district boxplots, we see that the mean median sold prices for Central, Northern, Park, Southern, and Tenderloin are relatively close, with similar distributions, so there are no obvious differences in housing prices among these districts. The mean median sold prices for Bayview and Taraval are comparatively lower than the other districts, and these two districts also have very similar distributions.
school_sf = pd.read_sql("SELECT Entity, Name, LowerGrade, UpperGrade, "
"GradeRange, Address, Lat, Lon, COUNT (*) "
"AS COUNT FROM schools GROUP BY Entity, GradeRange", con)
school = pd.read_sql("SELECT Entity, Name, LowerGrade, UpperGrade, Address, Lat, Lon FROM schools", con)
def gpd_Geo(df):
    """Convert a dataframe with Lon and Lat columns into a GeoDataFrame of points."""
    lonlat = [geom.Point(lon, lat) for lon, lat in zip(df.Lon, df.Lat)]
    df = gpd.GeoDataFrame(df, geometry = lonlat, crs = {'init' :'epsg:4326'})
    return df
school = gpd_Geo(school)
base = zips.plot()
school.plot(ax = base, marker = "o", color = "red")
plt.title("School Distribution in San Francisco")
def category(school):
    """
    Assign a school category based on the UpperGrade level.
    Argument: school (one row of the dataframe)
    Return: category string
    """
    if school['UpperGrade'] <= 5:
        category = "Elementary School"
    elif school['UpperGrade'] <= 8:
        category = "Middle School"
    elif school['UpperGrade'] <= 12:
        category = "High School"
    else:
        category = "College"
    return category
school_sf['category'] = school_sf.apply(category, axis=1)
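The apply call with axis=1 passes each row to the function as a Series, so the row's columns are available by name. A minimal self-contained check of this pattern with made-up grade levels (grade_category repeats the same rule as category above):

```python
import pandas as pd

# Toy frame with only the column the rule needs
toy = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                    "UpperGrade": [5, 8, 12, 14]})

def grade_category(school):
    # Same thresholds as category() above
    if school["UpperGrade"] <= 5:
        return "Elementary School"
    elif school["UpperGrade"] <= 8:
        return "Middle School"
    elif school["UpperGrade"] <= 12:
        return "High School"
    return "College"

# axis=1 applies the function row by row
toy["category"] = toy.apply(grade_category, axis=1)
print(toy["category"].tolist())
```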
from bokeh.charts import Donut, show, output_file
from bokeh.models import HoverTool
from bokeh.plotting import show
from bokeh.models.sources import ColumnDataSource
from bokeh.charts import defaults
defaults.width = 700
defaults.height = 700
d = Donut(school_sf, label=['Entity', 'category'], values='COUNT',
text_font_size='8pt', hover_text='count', tools = "wheel_zoom,pan", active_scroll = "wheel_zoom")
show(d)
From the basemap, we can conclude that schools are evenly distributed across San Francisco, so most students can live within commuting distance of their schools. From the pie chart, we can conclude that private schools make up the largest entity, with SFUSD second; there are not many colleges in SF. Within the private school category, the most common type is elementary school, followed by middle school and high school. SFUSD shows the same ordering, although its numbers of middle schools and high schools are relatively close.
parking = pd.read_sql("SELECT * FROM parking", con)
parking = gpd_Geo(parking)
base = zips.plot()
parking.plot(ax = base, marker = "o", color = "red")
parking = get_zip(parking)
parking.groupby("zip").size().sort_values().tail(5)
In order to find the popular parking areas in San Francisco, I first plot all the lots on a basemap. We can see that the top right corner of the map has the most red dots (parking lots). I then verify this with groupby and display the top 5 zip codes with the most parking lots. Googling those zip codes shows they are located in the Financial District area (94103, 94102), popular visitor spots (94107, 94124), and the Mission District (94110). These areas draw many residents and visitors, so it is not surprising that they have more parking lots than the rest of the city.