Assignment 6

In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:

Table                   Description
crime                   Crime reports dating back to 2010.
mobile_food_locations   List of all locations where mobile food vendors sell.
mobile_food_permits     List of all mobile food vendor permits. More details here.
mobile_food_schedule    Schedules for mobile food vendors.
noise                   Noise complaints dating back to August 2015.
parking                 List of all parking lots.
parks                   List of all parks.
schools                 List of all schools.
zillow                  Zillow rent and housing statistics dating back to 1996. More details here.

The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.

Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.

Shapefiles for San Francisco Neighborhoods are available here.

In [1]:
import sqlite3 as sql
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np
import geopandas as gpd
plt.rcParams['figure.figsize'] = (12, 12)
import geocoder
from matplotlib.collections import PatchCollection
import shapely.geometry as geom
In [2]:
con = sql.connect('sf_data.sqlite')

Exercise 1.1. Which mobile food vendor(s) sells at the most locations?

In [3]:
# Join the mobile_food_ tables
vendors = pd.read_sql("SELECT FacilityType, Applicant, Address FROM mobile_food_schedule "
            "INNER JOIN mobile_food_locations ON mobile_food_schedule.locationid = mobile_food_locations.locationid "
            "INNER JOIN mobile_food_permits ON mobile_food_schedule.permit = mobile_food_permits.permit", con)
In [4]:
# Create a vendors table in the database
vendors.to_sql("vendors", con, if_exists = 'replace', index = False)
In [5]:
# Read the table from the database and count the number of distinct addresses per applicant
vendors_loc = pd.read_sql("SELECT *, COUNT(distinct Address) AS Count FROM vendors GROUP BY Applicant", con)
In [6]:
# replace the previous table with a new one that has a column for counts

vendors_loc.to_sql("vendors", con, if_exists = 'replace', index = False)
In [7]:
final_vendor = pd.read_sql("SELECT *, SUM(Count) AS Sum_Count FROM vendors GROUP BY Applicant", con)
In [8]:
final_vendor.to_sql("vendors", con, if_exists = 'replace', index = False)
In [9]:
pd.read_sql("SELECT *, Max(Sum_Count) FROM vendors", con)
Out[9]:
FacilityType Applicant Address Count Sum_Count Max(Sum_Count)
0 Truck May Catering 1059 EVANS AVE 58 58 58

May Catering sells at the most locations.
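The same answer can also be reached with a single query, avoiding the intermediate table rewrites. A minimal sketch on toy data (the table contents below are hypothetical, not the real vendors table):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")

# Hypothetical stand-in for the joined vendors table.
pd.DataFrame({
    "Applicant": ["A", "A", "A", "B", "B"],
    "Address":   ["1 X ST", "2 Y ST", "3 Z ST", "1 X ST", "1 X ST"],
}).to_sql("vendors", con, index=False)

# Count distinct addresses per applicant and keep the top applicant,
# collapsing the count/sum/max round trips into one statement.
top = pd.read_sql(
    "SELECT Applicant, COUNT(DISTINCT Address) AS n_locations "
    "FROM vendors GROUP BY Applicant "
    "ORDER BY n_locations DESC LIMIT 1", con)
print(top)
```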

Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.

You should try to come up with some questions on your own, but these are examples of reasonable questions:

  • Which parts of the city are the most and least expensive?
  • Which parts of the city are the most dangerous (and at what times)?
  • Are noise complaints and mobile food vendors related?
  • What are the best times and places to find food trucks?
  • Is there a relationship between housing prices and any of the other tables?

Please make sure to clearly state each of your questions in your submission.

Question 1: Which parts of the city are the most dangerous (and at what times)?

In [10]:
crime = pd.read_sql("SELECT Category, PdDistrict, Lon, Lat, DayOfWeek, COUNT(*) "
                    "AS COUNT FROM crime GROUP BY PdDistrict, Category", con)
In [11]:
crime.to_sql("crime_count", con, if_exists = 'replace', index = False)

Find the most common crime in each district (crime_pop)

In [12]:
crime_pop = crime.loc[crime.groupby('PdDistrict').COUNT.agg('idxmax')]

Find the most common crimes in San Francisco (see crime_sort)

In [13]:
crime_cat = crime.loc[crime.groupby('Category').COUNT.agg('idxmax')]
In [14]:
crime_sort = crime_cat.sort_values(["COUNT"], ascending = False)
In [15]:
crime_graph = pd.read_sql("SELECT Category, PdDistrict, Lon, Lat, DayOfWeek, COUNT FROM crime_count WHERE "
                    "(Category LIKE '%ARSON%' OR Category LIKE '%ASSAULT%' " 
                    "OR Category LIKE '%DRUG/NARCOTIC%' OR Category LIKE '%LARCENY/THEFT%' "
                   "OR Category LIKE '%SEX OFFENSES, FORCIBLE%') ", con)

Use a bar graph to find the most dangerous and safest districts in San Francisco

In [86]:
import bokeh.io

bokeh.io.output_notebook()
Loading BokehJS ...
In [18]:
from bokeh.charts import Bar, output_file, show
from bokeh.models import HoverTool
from bokeh.models import ColumnDataSource    

source = ColumnDataSource(crime_graph)

bar = Bar(crime_graph, label='PdDistrict', values='COUNT', agg='mean', stack='Category', title="Popular Crime Distribution in Each District", tools = 'wheel_zoom, pan, hover')
hover = bar.select(dict(type=HoverTool))
hover.tooltips = [('District', '@PdDistrict'),('Type: ', '@Category')]
show(bar)

Use a bar graph to find the most dangerous and safest days of the week

In [19]:
bar = Bar(crime_graph, label='DayOfWeek', values='COUNT', agg='sum', stack='Category', title="Popular Crime Distribution by Day of Week", tools = 'wheel_zoom, pan, hover')
hover = bar.select(dict(type=HoverTool))
hover.tooltips = [('Day', '@DayOfWeek'),('Type: ', '@Category')]
show(bar)

As the graphs above show, the Southern district is the most dangerous, while Park and Richmond are the safest. The most common crime in San Francisco is Larceny/Theft, while Arson is the least common of the top 5 crime categories. Saturday is when most crimes occurred, followed by Friday, Tuesday, and Monday; Wednesday is the safest day of the week in San Francisco.
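The same peaks can be checked directly in pandas without a plot. A quick sketch on made-up counts shaped like the crime_graph columns (values invented for illustration):

```python
import pandas as pd

# Made-up incident counts with the same columns as crime_graph.
crime_graph = pd.DataFrame({
    "PdDistrict": ["SOUTHERN", "SOUTHERN", "PARK", "RICHMOND"],
    "DayOfWeek":  ["Saturday", "Friday", "Saturday", "Wednesday"],
    "COUNT":      [500, 400, 50, 30],
})

# Total incidents per district and per weekday; idxmax picks the peak.
by_district = crime_graph.groupby("PdDistrict")["COUNT"].sum()
by_day = crime_graph.groupby("DayOfWeek")["COUNT"].sum()
print(by_district.idxmax())
print(by_day.idxmax())
```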

Question 2: Which parts of the city have the most parks?

In [20]:
parks = pd.read_sql("SELECT * FROM parks", con)
In [21]:
parks = parks.dropna()
In [22]:
parks.Acreage.describe()
Out[22]:
count     202.000000
mean       15.913416
std        84.890148
min         0.000000
25%         0.325000
50%         1.260000
75%         4.390000
max      1031.980000
Name: Acreage, dtype: float64
In [23]:
my_map = Basemap(llcrnrlon=-122.55, llcrnrlat=37.7, urcrnrlon=-122.37, urcrnrlat=37.81, resolution="h",
    projection="merc"
)

my_map.drawcoastlines()
my_map.readshapefile("sf_zip_shp/sf_zip_shp", "zipcodes")

for lon, lat, acre in zip(parks.Lon, parks.Lat, parks.Acreage):
    x, y = my_map(lon, lat)
    marker = 'o'
    # 25%
    if acre < 0.32:
        marker = '+'
    #50%
    elif acre <= 1.26:
        marker = '*'
    #75%
    elif acre <= 4.39:
        marker = 'x'
    my_map.plot(x, y, marker=marker, color='m', markersize=12, markeredgewidth=2)

I used Basemap to display all the park locations in San Francisco. Circles represent parks in the top quartile by acreage (larger than 75% of parks), x marks parks between the 50th and 75th percentiles, stars parks between the 25th and 50th, and plus signs the bottom quartile. Parks of all sizes are spread fairly evenly across San Francisco, so many residents can reach a park within a short distance. In addition, there are not many parks near the coast.
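The hard-coded quartile cutoffs (0.32, 1.26, 4.39) could also be derived from the data itself with pd.qcut, so the markers stay correct if the data changes. A small sketch on toy acreage values (hypothetical, not the real parks table):

```python
import pandas as pd

# Toy acreage values; qcut bins them into quartiles, mirroring
# the hand-coded marker thresholds used in the Basemap loop.
acreage = pd.Series([0.1, 0.5, 2.0, 10.0, 500.0])
markers = pd.qcut(acreage, 4, labels=["+", "*", "x", "o"])
print(list(markers))
```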

Question 3: What is the housing price distribution, as measured by the median sold price per sqft of all homes, in each Entity and District?

In [24]:
# take one representative crime record per district, to look up each district's zip code later
district = crime.groupby('PdDistrict').head(1)
In [25]:
zillow = pd.read_sql("SELECT * FROM zillow", con)
In [26]:
zips = gpd.read_file("cb_2015_us_zcta510_500k")
zips = zips[zips.ZCTA5CE10.str.startswith("9")]
# Found zipcodes on Google
zips = zips[(pd.to_numeric(zips.ZCTA5CE10) >= 94102) & (pd.to_numeric(zips.ZCTA5CE10) <= 94188)] 
In [27]:
def get_zip(df):
    """
    get the zipcodes for the dataframe
    
    Argument: df
    
    Return: df
    """
    lonlat = [geom.Point(lon, lat) for lon, lat in zip(df.Lon, df.Lat)]
    df = df.drop(["Lat", "Lon"], axis = 1)

    df = gpd.GeoDataFrame(df, geometry = lonlat, crs = {'init' :'epsg:4326'})
    
    # Reproject the points to the same CRS as the zips shapefile.
    df = df.to_crs(zips.crs)
    zipcode = []
    for pt in df.geometry:
        matched = zips[zips.geometry.contains(pt)]
        if matched.shape[0] == 0:
            zipcode.append(np.NaN)
        else:
            zipcode.append(matched.ZCTA5CE10.iloc[0])

    df["zip"] = zipcode
    return df
In [28]:
school = pd.read_sql("SELECT Entity, Name, LowerGrade, UpperGrade, "
                     "GradeRange, Address, Lat, Lon, COUNT (*) "
                     "AS COUNT FROM schools GROUP BY Entity, GradeRange", con)
In [29]:
school_zip = get_zip(school)
In [30]:
# change dtypes
school_zip['zipcode'] = school_zip['zip'].astype(int)
district_zip = get_zip(district)
In [31]:
# drop Nan and change dtypes
district_zip = district_zip.dropna()
district_zip['zipcode'] = district_zip['zip'].astype(int)
In [32]:
school_zillow = pd.merge(school_zip, zillow, left_on=['zipcode'], right_on=['RegionName'], how = 'inner')
In [33]:
crime_zillow = pd.merge(district_zip, zillow, left_on=['zipcode'], right_on=['RegionName'], how = 'inner')
In [34]:
from bokeh.charts import BoxPlot, output_file, show
from bokeh.layouts import row

box = BoxPlot(school_zillow, values='MedianSoldPricePerSqft_AllHomes', label='Entity', 
              title="Home Price Box Plot in Each Entity", plot_width=500, legend = False, tools = 'wheel_zoom, pan,')
box2 = BoxPlot(crime_zillow, values='MedianSoldPricePerSqft_AllHomes', label='PdDistrict',
               title="Home Price Box Plot in Each District", plot_width=500, legend = False, tools = 'wheel_zoom, pan,')

show(row(box, box2))

I first find the zip codes for all the districts and entities in the database, then separately merge these two dataframes with the zillow dataframe. Lastly, I use boxplots to display the housing price distribution. I only look up one zip code per district because it takes too long to find the zip code for every record in the crime dataset.
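The merge step can be illustrated on tiny stand-in frames (the zip codes and prices below are invented for illustration):

```python
import pandas as pd

# Invented stand-ins for district_zip and zillow.
district_zip = pd.DataFrame({
    "PdDistrict": ["CENTRAL", "BAYVIEW"],
    "zipcode":    [94102, 94124],
})
zillow = pd.DataFrame({
    "RegionName": [94102, 94124, 94110],
    "MedianSoldPricePerSqft_AllHomes": [950.0, 520.0, 700.0],
})

# An inner merge keeps only zip codes present in both tables, so
# districts without Zillow coverage drop out automatically.
crime_zillow = pd.merge(district_zip, zillow,
                        left_on="zipcode", right_on="RegionName",
                        how="inner")
print(crime_zillow[["PdDistrict", "MedianSoldPricePerSqft_AllHomes"]])
```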

The entity boxplots suggest that there is no obvious difference in housing price (measured by median sold price per sqft) between the Private and SFUSD entities, and both categories have outliers. On the other hand, housing prices in SFCCD are much lower than in the other two entities. In addition, the SFCCD boxplot is comparatively short, which suggests that its housing prices cluster tightly together.

From the district boxplots, the median sold prices for Central, Northern, Park, Southern, and Tenderloin are relatively close, with similar distributions, so there are no obvious differences in housing prices among these districts. The median sold prices for Bayview and Taraval are comparatively lower than in the other districts, and these two districts also have very similar distributions.

In [69]:
school_sf = pd.read_sql("SELECT Entity, Name, LowerGrade, UpperGrade, "
                     "GradeRange, Address, Lat, Lon, COUNT (*) "
                     "AS COUNT FROM schools GROUP BY Entity, GradeRange", con)
In [70]:
school = pd.read_sql("SELECT Entity, Name, LowerGrade, UpperGrade, Address, Lat, Lon FROM schools", con)
In [58]:
def gpd_Geo(df):
    """Convert a dataframe with Lon and Lat columns into a GeoDataFrame of points."""
    lonlat = [geom.Point(lon, lat) for lon, lat in zip(df.Lon, df.Lat)]
    # Keep the Lon/Lat columns; get_zip() reads them again later.
    df = gpd.GeoDataFrame(df, geometry = lonlat, crs = {'init' :'epsg:4326'})
    return df
In [71]:
school = gpd_Geo(school)
In [82]:
base = zips.plot()
school.plot(ax = base, marker = "o", color = "red")
plt.title("School Distribution in San Francisco")
Out[82]:
<matplotlib.text.Text at 0x11f1cee50>
In [83]:
def category(school):
    """
    create school category based on uppergrade level
    
    Argument: school
    
    Return: category
    """
    if school['UpperGrade'] <= 5:
        category = "Elementary School"
    elif school['UpperGrade'] <= 8:
        category =  "Middle School"
    elif school['UpperGrade'] <= 12:
        category = "High School"
    else:
        category = "College"
    return category

school_sf['category'] = school_sf.apply(category, axis=1)
In [85]:
from bokeh.charts import Donut, show, output_file
from bokeh.models import HoverTool
from bokeh.plotting import show
from bokeh.models.sources import ColumnDataSource
from bokeh.charts import defaults

defaults.width = 700
defaults.height = 700

d = Donut(school_sf, label=['Entity', 'category'], values='COUNT',
          text_font_size='8pt', hover_text='count', tools = "wheel_zoom,pan", active_scroll = "wheel_zoom")

show(d)

From the basemap, we can conclude that schools are evenly distributed across San Francisco, so most students can live within commuting distance of their schools. From the donut chart, we can conclude that most students attend private schools, with SFUSD second; there are not many colleges in SF. Within the private category, the most common type is elementary school, followed by middle school and high school. We see the same ordering in SFUSD, although the numbers of middle schools and high schools there are relatively close.

In [87]:
parking = pd.read_sql("SELECT * FROM parking", con)
In [76]:
parking = gpd_Geo(parking)
base = zips.plot()
parking.plot(ax = base, marker = "o", color = "red")
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d4f69d0>
In [77]:
parking = get_zip(parking)
In [80]:
parking.groupby("zip").size().sort_values().tail(5)
Out[80]:
zip
94102     78
94110     98
94124    108
94107    116
94103    165
dtype: int64

In order to find the popular parking areas in San Francisco, I first plot all the lots on a basemap. The top right corner of the map has the most red dots (parking lots). I then verify the result with a groupby, displaying the top 5 zip codes with the most parking lots. Googling those zip codes shows that they are located in the financial district (94103, 94102), popular visitor areas (94107, 94124), and the Mission District (94110). These areas are very popular with residents and visitors, so it is not surprising that they have more parking lots than the rest of the city.