In this assignment, you'll combine the assignment 3 data set with nutrition data from the USDA Food Composition Databases. The CSV file fresh.csv contains the fresh fruits and vegetables data you extracted in assignment 3.
The USDA Food Composition Databases have a documented web API that returns data in JSON format . You need a key in order to use the API. Only 1000 requests are allowed per hour, so it would be a good idea to use caching.
Sign up for an API key here. The key will work with any Data.gov API. You may need the key again later in the quarter, so make sure you save it.
These modules may be useful:
Exercise 1.1. Read the search request documentation, then write a function called ndb_search() that makes a search request. The function should accept the search term as an argument. The function should return the search result items as a list (for 0 items, return an empty list).
Note that the search url is: https://api.nal.usda.gov/ndb/search
As an example, a search for "quail eggs" should return this list:
[{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'CHAOKOH, QUAIL EGG IN BRINE, UPC: 044738074186',
u'ndbno': u'45094707',
u'offset': 0},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'L&W, QUAIL EGGS, UPC: 024072000256',
u'ndbno': u'45094890',
u'offset': 1},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'BUDDHA, QUAIL EGGS IN BRINE, UPC: 761934535098',
u'ndbno': u'45099560',
u'offset': 2},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'GRAN SABANA, QUAIL EGGS, UPC: 819140010103',
u'ndbno': u'45169279',
u'offset': 3},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u"D'ARTAGNAN, QUAIL EGGS, UPC: 736622102630",
u'ndbno': u'45178254',
u'offset': 4},
{u'ds': u'SR',
u'group': u'Dairy and Egg Products',
u'name': u'Egg, quail, whole, fresh, raw',
u'ndbno': u'01140',
u'offset': 5}]
As usual, make sure you document and test your function.
import requests_cache
import requests
import pandas as pd
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
from pandas.tools.plotting import scatter_matrix
import numpy as np
import seaborn as sns
requests_cache.install_cache("cache")
key = "KzsGUKXTRFCNq9WSYbTcwVi5FA7SjWzFC15sr7rO"
def ndb_search(term):
"""
makes a search request
Argument: search term
Return: search result items as a list (for 0 items, return an empty list)
"""
url = "https://api.nal.usda.gov/ndb/search"
response = requests.get(url, params = {
"q": term,
"api_key": key,
"format":"json"
})
output = response.json()
if "list" not in output.keys():
return {}
else:
return output["list"]["item"]
ndb_search("quail eggs")
Exercise 1.2. Use your search function to get NDB numbers for the foods in the fresh.csv file. It's okay if you don't get an NDB number for every food, but try to come up with a strategy that gets most of them. Discuss your strategy in a short paragraph.
Hints:
ndb_search() to a data frame with pd.DataFrame().pd.merge().My strategy:
path = "/Users/Chloechen/Downloads/fresh.csv"
df_fruit = pd.read_csv(path, header = 0)
# manipulate the name in the "food" column
df_fruit['new_food'] = df_fruit['food'].str.replace('_', ", ")
df_fruit["food"][df_fruit["food"].str.contains('green') == False] = df_fruit['new_food']
df_fruit["food"][df_fruit["food"].str.contains('green_peppers') == True] = "green peppers"
df_fruit["food"][df_fruit["food"].str.contains('red, peppers') == True] = "Peppers, red"
def search(key):
"""
create a dataframe for the output from the ndb_search function
Argument: key/search term
Return: dataframe
"""
if key == "kiwi":
term = "kiwifruit"
elif key == "apples":
term = "apples, raw"
elif key == "collard_greens":
term = "Collards"
elif key == "green_beans":
term = "Beans, snap, green, raw"
elif key == "red_peppers":
term = "Peppers, sweet, red, raw"
else:
term = key
result = ndb_search(term)
if result == {}:
df_result = pd.DataFrame(result)
df_result["food"] = key
else:
df_result = pd.DataFrame(result)
df_result["food"] = key
raw = df_result["name"].str.lower().str.contains(', raw')
return df_result[raw]
fruit_list = [search(fruit) for fruit in df_fruit["food"]]
df_result = pd.concat(fruit_list)
df_result = df_result.loc[:, ["name", "ndbno", "food"]]
final_df = pd.merge(df_fruit, df_result, on = "food", how = "inner")
fresh_final_df = final_df.drop_duplicates(['food'])
del fresh_final_df['new_food']
fresh_final_df
Exercise 1.3. Read the food reports V2 documentation, then write a function called ndb_report() that requests a basic food report. The function should accept the NDB number as an argument and return the list of nutrients for the food.
Note that the report url is: https://api.nal.usda.gov/ndb/V2/reports
For example, for "09279" (raw plums) the first element of the returned list should be:
{u'group': u'Proximates',
u'measures': [{u'eqv': 165.0,
u'eunit': u'g',
u'label': u'cup, sliced',
u'qty': 1.0,
u'value': u'143.93'},
{u'eqv': 66.0,
u'eunit': u'g',
u'label': u'fruit (2-1/8" dia)',
u'qty': 1.0,
u'value': u'57.57'},
{u'eqv': 151.0,
u'eunit': u'g',
u'label': u'NLEA serving',
u'qty': 1.0,
u'value': u'131.72'}],
u'name': u'Water',
u'nutrient_id': u'255',
u'unit': u'g',
u'value': u'87.23'}
Be sure to document and test your function.
requests_cache.install_cache("cache")
key = "KzsGUKXTRFCNq9WSYbTcwVi5FA7SjWzFC15sr7rO"
def ndb_report(number):
"""
Requests a basic food report
Argument: NDB number
Return: A list of nutrients for the food
"""
url = "https://api.nal.usda.gov/ndb/V2/reports"
response = requests.get(url, params = {
"ndbno": number,
"api_key": key,
"format": "json"
})
output = response.json()
return output["foods"][0]['food']['nutrients']
number = "09279"
output = ndb_report(number)
output[0]
Exercise 1.4. Which foods provide the best combination of price, yield, and nutrition? You can use kilocalories as a measure of "nutrition" here, but more a detailed analysis is better. Use plots to support your analysis.
def nutri_search(number):
"""
Create a dataframe for the output from the ndb_report function
Argument: ndb number
Return: dataframe
"""
result = ndb_report(number)
df_number = pd.DataFrame(result)
df_number["ndbno"] = number
return df_number
Step 1: Merge the fresh_final_df and the dataframe generated from the nutri_search function
nutri_list = [nutri_search(number) for number in fresh_final_df["ndbno"]]
df_number = pd.concat(nutri_list)
df_number = df_number.loc[:, ["name", "value", "unit", "ndbno"]]
fresh_df = pd.merge(fresh_final_df, df_number, on = "ndbno", how = "inner")
Step 2: Select the rows that contains the kilocalories information
df = fresh_df[fresh_df["name_y"].str.contains("Energy")]
df = df.rename(columns = {"value": "nutrition"})
df["nutrition"] = df["nutrition"].convert_objects(convert_numeric=True)
df
Step 3: Find the correlation between the variables using scatter plots
fig, ax = plt.subplots(1, 1)
def scatter(group):
plt.plot(group['nutrition'], group['price_per_lb'], 'o', label = group.name)
fig.suptitle('Correlation between Nutrition and Price for Fruit and Vegetables', fontsize=15)
df.groupby("type").apply(scatter)
ax.set(xlabel = 'nutrition', ylabel = 'price_per_lb')
ax.legend(loc = 4)
# plot a regression line
x = df['nutrition']
y = df['price_per_lb']
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0] * x + fit[1], color='red')
ax.set_ylim([0, 8])
plt.show()
fig, ax = plt.subplots(1, 1)
def scatter(group):
plt.plot(group['nutrition'], group['yield'], 'o', label = group.name)
fig.suptitle('Correlation between Nutrition and Yield for Fruit and Vegetables', fontsize=15)
df.groupby("type").apply(scatter)
ax.set(xlabel = 'nutrition', ylabel = 'yield')
ax.legend(loc = 4)
# plot a regression line
x = df['nutrition']
y = df['yield']
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0] * x + fit[1], color='red')
plt.show()
fig, ax = plt.subplots(1, 1)
def scatter(group):
plt.plot(group['price_per_lb'], group['yield'], 'o', label = group.name)
fig.suptitle('Correlation between Price and Yield for Fruit and Vegetables', fontsize=15)
df.groupby("type").apply(scatter)
ax.set(xlabel = 'price_per_lb', ylabel = 'yield')
ax.legend(loc = 4)
# plot a regression line
x = df['price_per_lb']
y = df['yield']
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0] * x + fit[1], color='red')
ax.set_xlim([0, 8])
plt.show()
There is no significant correlation between price and nutrition, nutrition and yield for fruit and vegetables. However, there is a moderate positive correlation between price and yield.
Step 4: Use boxplot to illustrate the price distribution for different nutrition and yield category for fruit and vegetables.
df["nutrition_category"] = pd.cut(df["nutrition"], 3, labels = ["low", "medium", "high"])
df["price_category"] = pd.cut(df["price_per_lb"], 3, labels = ["low", "medium", "high"])
df["yield_category"] = pd.cut(df["yield"], 3, labels = ["low", "medium", "high"])
g = sns.FacetGrid(df, col="type", size=4, aspect=.7)
(g.map(sns.boxplot, "nutrition_category", "price_per_lb", "yield_category").despine(left=True).add_legend(title="yield_category"))
The majority of the fruits and vegetables are in the low nutrition category.
Boxplots Interpretation:
For Fruits: Low Nutrition Category with low yield has a comparatively short boxplot, which suggests that the overall prices for these fruits are relatively close.
Low Nutrition Category with medium yield has a comparatively short boxplot, which suggests that the overall prices for these fruits are relatively close. However, it has an outlier.
Low Nutrition Category with high yield has a tall boxplot, which suggests that the overall prices for these fruits are quite different.
For Vegetables: Low Nutrition Category with low, medium and high yield has comparatively tall boxplots, which suggests that the overall prices for these vegetables are different. It also suggests a difference in prices between the yield groups.
Step 5: Separate the yield, nutrition and price into three categories, use swarmplot to find the best combination
import seaborn as sns
sns.swarmplot(x="nutrition", y="yield_category", data=df, hue="price_category", size=10)
As we can see the graph above, the best combination is the food with high nutrition, low price and medium yield. Therefore, avocados provides the best combination of price, yield, and nutrition.
Step 6: Create a heatmap for the price, yield and nutrition
data_pivoted = df.pivot("yield", "nutrition", "price_per_lb")
ax = sns.heatmap(data_pivoted, annot=True, fmt=".1f", cmap="YlGnBu", annot_kws={"size": 8})
ax.set_title('Heatmap for Price, Yield and Nutrition')
plt.show()
I get the same conclusion using the heatmap above. The fruit with price $2.2 and relatively high yield and nutrituion is the avocados.