Assignment 4

In this assignment, you'll combine the assignment 3 data set with nutrition data from the USDA Food Composition Databases. The CSV file fresh.csv contains the fresh fruits and vegetables data you extracted in assignment 3.

The USDA Food Composition Databases have a documented web API that returns data in JSON format. You need a key in order to use the API. Only 1000 requests are allowed per hour, so it is a good idea to cache responses.
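To illustrate why caching helps with the hourly request limit, here is a minimal sketch that memoizes a stand-in fetch function with functools.lru_cache. The function fake_fetch and its call counter are hypothetical and make no network calls; for real HTTP requests, a library such as requests_cache can store responses instead.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the stand-in "API" is really hit


@lru_cache(maxsize=None)
def fake_fetch(term):
    """Hypothetical stand-in for an API request; returns a dummy payload."""
    calls["count"] += 1
    return "results for " + term


# The second call with the same term is served from the cache,
# so it would not count against the request limit.
first = fake_fetch("quail eggs")
second = fake_fetch("quail eggs")
print(calls["count"])
```

Repeating a search while you debug then costs only one real request per distinct term.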

Sign up for an API key here. The key will work with any Data.gov API. You may need the key again later in the quarter, so make sure you save it.

These modules may be useful:

Exercise 1.1. Read the search request documentation, then write a function called ndb_search() that makes a search request. The function should accept the search term as an argument. The function should return the search result items as a list (for 0 items, return an empty list).

Note that the search url is: https://api.nal.usda.gov/ndb/search
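As a rough sketch of what a search request looks like, the snippet below builds the full request URL without sending it. It uses Python 3's urllib.parse; the parameter names q, api_key, and format are the ones used in the solution code in this notebook, and "YOUR_API_KEY" is a placeholder rather than a working key.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.nal.usda.gov/ndb/search"


def build_search_url(term, api_key):
    """Return the full search URL for a term (no request is sent)."""
    params = {"q": term, "api_key": api_key, "format": "json"}
    return SEARCH_URL + "?" + urlencode(params)


url = build_search_url("quail eggs", "YOUR_API_KEY")
print(url)
```

Passing the same dictionary as the params argument of requests.get() lets the requests library do this encoding for you.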

As an example, a search for "quail eggs" should return this list:

[{u'ds': u'BL',
  u'group': u'Branded Food Products Database',
  u'name': u'CHAOKOH, QUAIL EGG IN BRINE, UPC: 044738074186',
  u'ndbno': u'45094707',
  u'offset': 0},
 {u'ds': u'BL',
  u'group': u'Branded Food Products Database',
  u'name': u'L&W, QUAIL EGGS, UPC: 024072000256',
  u'ndbno': u'45094890',
  u'offset': 1},
 {u'ds': u'BL',
  u'group': u'Branded Food Products Database',
  u'name': u'BUDDHA, QUAIL EGGS IN BRINE, UPC: 761934535098',
  u'ndbno': u'45099560',
  u'offset': 2},
 {u'ds': u'BL',
  u'group': u'Branded Food Products Database',
  u'name': u'GRAN SABANA, QUAIL EGGS, UPC: 819140010103',
  u'ndbno': u'45169279',
  u'offset': 3},
 {u'ds': u'BL',
  u'group': u'Branded Food Products Database',
  u'name': u"D'ARTAGNAN, QUAIL EGGS, UPC: 736622102630",
  u'ndbno': u'45178254',
  u'offset': 4},
 {u'ds': u'SR',
  u'group': u'Dairy and Egg Products',
  u'name': u'Egg, quail, whole, fresh, raw',
  u'ndbno': u'01140',
  u'offset': 5}]

As usual, make sure you document and test your function.

In [1]:
import requests_cache
import requests
import pandas as pd
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
from pandas.tools.plotting import scatter_matrix
import numpy as np
import seaborn as sns
In [2]:
requests_cache.install_cache("cache")
key = "KzsGUKXTRFCNq9WSYbTcwVi5FA7SjWzFC15sr7rO"


def ndb_search(term):
    """
    makes a search request
    
    Argument: search term
    
    Return: search result items as a list (for 0 items, return an empty list)
    """
    url = "https://api.nal.usda.gov/ndb/search"
    response = requests.get(url, params = {
            "q": term,
            "api_key": key,
            "format":"json"
        })
    
    output = response.json()
    
    # no "list" entry means no results, so return an empty list as documented
    if "list" not in output:
        return []
    return output["list"]["item"]
In [3]:
ndb_search("quail eggs")
Out[3]:
[{u'ds': u'BL',
  u'group': u'Branded Food Products Database',
  u'name': u'CHAOKOH, QUAIL EGG IN BRINE, UPC: 044738074186',
  u'ndbno': u'45094707',
  u'offset': 0},
 {u'ds': u'BL',
  u'group': u'Branded Food Products Database',
  u'name': u'L&W, QUAIL EGGS, UPC: 024072000256',
  u'ndbno': u'45094890',
  u'offset': 1},
 {u'ds': u'BL',
  u'group': u'Branded Food Products Database',
  u'name': u'BUDDHA, QUAIL EGGS IN BRINE, UPC: 761934535098',
  u'ndbno': u'45099560',
  u'offset': 2},
 {u'ds': u'BL',
  u'group': u'Branded Food Products Database',
  u'name': u'GRAN SABANA, QUAIL EGGS, UPC: 819140010103',
  u'ndbno': u'45169279',
  u'offset': 3},
 {u'ds': u'BL',
  u'group': u'Branded Food Products Database',
  u'name': u"D'ARTAGNAN, QUAIL EGGS, UPC: 736622102630",
  u'ndbno': u'45178254',
  u'offset': 4},
 {u'ds': u'SR',
  u'group': u'Dairy and Egg Products',
  u'name': u'Egg, quail, whole, fresh, raw',
  u'ndbno': u'01140',
  u'offset': 5}]

Exercise 1.2. Use your search function to get NDB numbers for the foods in the fresh.csv file. It's okay if you don't get an NDB number for every food, but try to come up with a strategy that gets most of them. Discuss your strategy in a short paragraph.

Hints:

  • The foods are all raw and unbranded.
  • You can test search terms with the online search page.
  • You can convert the output of ndb_search() to a data frame with pd.DataFrame().
  • The string methods for Python and Pandas are useful here. It's okay if you use simple regular expressions in the Pandas methods, although this exercise can be solved without them.
  • You can merge data frames that have a column in common with pd.merge().
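Since the foods are all raw and unbranded, one possible filtering step is sketched below in plain Python. The helper name pick_raw_unbranded is hypothetical, and the sample items are copied from the "quail eggs" example output; the filter assumes that branded items always carry the group name shown there.

```python
def pick_raw_unbranded(items):
    """Keep search results that are unbranded and whose name contains ', raw'."""
    return [
        item for item in items
        if item["group"] != "Branded Food Products Database"
        and ", raw" in item["name"].lower()
    ]


# Two entries copied from the "quail eggs" example output.
sample = [
    {"ds": "BL", "group": "Branded Food Products Database",
     "name": "L&W, QUAIL EGGS, UPC: 024072000256",
     "ndbno": "45094890", "offset": 1},
    {"ds": "SR", "group": "Dairy and Egg Products",
     "name": "Egg, quail, whole, fresh, raw",
     "ndbno": "01140", "offset": 5},
]

matches = pick_raw_unbranded(sample)
print([m["ndbno"] for m in matches])
```

The same test can be written with the Pandas string methods once the results are in a data frame.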

My strategy:

  1. Clean up the names in the "food" column so that the search returns valid results
  2. Write a search function that returns the output of ndb_search() as a dataframe, keeping only the raw foods
  3. Search for every food in fresh.csv
  4. Merge the fresh.csv dataframe with the dataframe of search results
  5. Remove the duplicate rows so that each food keeps its first match
In [4]:
path = "/Users/Chloechen/Downloads/fresh.csv"
df_fruit = pd.read_csv(path, header = 0)
In [5]:
# manipulate the name in the "food" column
# (.loc assignments avoid pandas' SettingWithCopyWarning)
df_fruit['new_food'] = df_fruit['food'].str.replace('_', ", ")
no_green = ~df_fruit["food"].str.contains('green')
df_fruit.loc[no_green, "food"] = df_fruit.loc[no_green, "new_food"]
df_fruit.loc[df_fruit["food"].str.contains('green_peppers'), "food"] = "green peppers"
df_fruit.loc[df_fruit["food"].str.contains('red, peppers'), "food"] = "Peppers, red"
In [6]:
def search(food):
    """
    create a dataframe for the output from the ndb_search function
    
    Argument: food name from fresh.csv (used as the search term unless overridden)
    
    Return: dataframe of the matching raw foods (empty if there are no results)
    """
    # search terms that work better than the name in fresh.csv
    overrides = {
        "kiwi": "kiwifruit",
        "apples": "apples, raw",
        "collard_greens": "Collards",
        "green_beans": "Beans, snap, green, raw",
        "red_peppers": "Peppers, sweet, red, raw",
    }
    term = overrides.get(food, food)
    result = ndb_search(term)
    
    df_result = pd.DataFrame(result)
    df_result["food"] = food
    if df_result.empty:
        # no results: return the empty dataframe instead of falling through
        return df_result
    
    # keep only the raw foods
    raw = df_result["name"].str.lower().str.contains(', raw')
    return df_result[raw]
In [7]:
fruit_list = [search(fruit) for fruit in df_fruit["food"]]
df_result = pd.concat(fruit_list)
df_result = df_result.loc[:, ["name", "ndbno", "food"]]
In [8]:
final_df = pd.merge(df_fruit, df_result, on = "food", how = "inner")
In [9]:
fresh_final_df = final_df.drop_duplicates(['food'])
del fresh_final_df['new_food']
In [10]:
fresh_final_df
Out[10]:
form price_per_lb yield lb_per_cup price_per_cup food type name ndbno
0 Fresh1 0.333412 0.520000 0.330693 0.212033 watermelon fruit Watermelon, raw 09326
1 Fresh1 0.535874 0.510000 0.374786 0.393800 cantaloupe fruit Melons, cantaloupe, raw 09181
2 Fresh1 1.377962 0.740000 0.407855 0.759471 tangerines fruit Tangerines, (mandarin oranges), raw 09218
4 Fresh1 2.358808 0.940000 0.319670 0.802171 strawberries fruit Strawberries, raw 09316
6 Fresh1 1.827416 0.940000 0.363763 0.707176 plums fruit Plums, raw 09279
10 Fresh1 1.035173 0.730000 0.407855 0.578357 oranges fruit Oranges, raw, California, valencias 09201
18 Fresh1 6.975811 0.960000 0.319670 2.322874 raspberries fruit Raspberries, raw 09302
19 Fresh1 2.173590 0.560000 0.341717 1.326342 pomegranate fruit Pomegranates, raw 09286
20 Fresh1 0.627662 0.510000 0.363763 0.447686 pineapple fruit Pineapple, raw, all varieties 09266
23 Fresh1 3.040072 0.930000 0.363763 1.189102 apricots fruit Apricots, raw 09021
24 Fresh1 0.796656 0.460000 0.374786 0.649077 honeydew fruit Melons, honeydew, raw 09184
25 Fresh1 1.298012 0.620000 0.308647 0.646174 papaya fruit Papayas, raw 09226
26 Fresh1 2.044683 0.760000 0.385809 1.037970 kiwi fruit Kiwifruit, green, raw 09148
28 Fresh1 3.592990 0.920000 0.341717 1.334548 cherries fruit Cherries, sour, red, raw 09063
32 Fresh1 0.566983 0.640000 0.330693 0.292965 bananas fruit Bananas, raw 09040
34 Fresh1 1.567515 0.900000 0.242508 0.422373 apples fruit Apples, raw, with skin 09003
49 Fresh1 1.591187 0.960000 0.341717 0.566390 peaches fruit Peaches, yellow, raw 09236
50 Fresh1 1.761148 0.910000 0.319670 0.618667 nectarines fruit Nectarines, raw 09191
51 Fresh1 1.461575 0.900000 0.363763 0.590740 pears fruit Pears, raw 09252
61 Fresh1 0.897802 0.490000 0.462971 0.848278 grapefruit fruit Grapefruit, raw, white, California 09117
70 Fresh1 5.774708 0.960000 0.319670 1.922919 blackberries fruit Blackberries, raw 09042
73 Fresh1 2.093827 0.960000 0.330693 0.721266 grapes fruit Grapes, muscadine, raw 09129
77 Fresh1 4.734622 0.950000 0.319670 1.593177 blueberries fruit Blueberries, raw 09050
79 Fresh1 1.377563 0.710000 0.363763 0.705783 mangoes fruit Mangos, raw 09176
80 Fresh1 3.213494 0.493835 0.396832 2.582272 asparagus vegetables Asparagus, raw 11011
81 Fresh, consumed with peel1 1.295931 0.970000 0.264555 0.353448 cucumbers vegetables Cucumber, with peel, raw 11205
93 Fresh1 1.213039 0.950000 0.242508 0.309655 lettuce, iceberg vegetables Lettuce, iceberg (includes crisphead types), raw 11252
94 Fresh1 1.038107 0.900000 0.352740 0.406868 onions vegetables Onions, raw 11282
98 Fresh1 2.471749 0.750000 0.319670 1.053526 turnip_greens vegetables Turnip greens, raw 11568
99 Fresh1 2.569235 0.840000 0.308647 0.944032 mustard_greens vegetables Mustard greens, raw 11270
100 Fresh1 0.564320 0.811301 0.264555 0.184017 potatoes vegetables Potatoes, flesh and skin, raw 11352
107 Fresh1 2.630838 1.160000 0.286601 0.650001 collard_greens vegetables Collards, raw 11161
108 Fresh1 2.139972 0.846575 0.275578 0.696606 green_beans vegetables Beans, snap, green, raw 11052
110 Fresh1 1.172248 0.458554 0.451948 1.155360 acorn, squash vegetables Squash, winter, acorn, raw 11482
111 Fresh1 2.277940 0.820000 0.264555 0.734926 Peppers, red vegetables Peppers, sweet, red, raw 11821
114 Fresh green cabbage1 0.579208 0.778797 0.330693 0.245944 cabbage vegetables Swamp cabbage, (skunk cabbage), raw 11503
146 Fresh1 0.918897 0.811301 0.440925 0.499400 sweet, potatoes vegetables Sweet potato leaves, raw 11505
149 Fresh1 1.639477 0.769500 0.396832 0.845480 summer, squash vegetables Squash, summer, scallop, raw 11475
153 Fresh1 1.311629 0.900000 0.275578 0.401618 radish vegetables Radishes, raw 11429
157 Fresh1 1.244737 0.714000 0.451948 0.787893 butternut, squash vegetables Squash, winter, butternut, raw 11485
158 Fresh1 2.235874 0.740753 0.319670 0.964886 avocados vegetables Avocados, raw, California 09038
161 Fresh1 2.807302 1.050000 0.286601 0.766262 kale vegetables Kale, raw 11233
165 Fresh1 2.213050 0.375309 0.385809 2.274967 artichoke vegetables Artichokes, (globe or french), raw 11007
167 Fresh1 3.213552 0.769474 0.352740 1.473146 okra vegetables Okra, raw 11278
168 Fresh1 1.410363 0.820000 0.264555 0.455022 green peppers vegetables Peppers, sweet, green, raw 11333
170 Fresh1 2.763553 1.060000 0.341717 0.890898 brussels, sprouts vegetables Brussels sprouts, raw 11098
171 Fresh1 2.690623 0.540000 0.363763 1.812497 corn, sweet vegetables Corn, sweet, white, raw 11900

Exercise 1.3. Read the food reports V2 documentation, then write a function called ndb_report() that requests a basic food report. The function should accept the NDB number as an argument and return the list of nutrients for the food.

Note that the report url is: https://api.nal.usda.gov/ndb/V2/reports

For example, for "09279" (raw plums) the first element of the returned list should be:

{u'group': u'Proximates',
 u'measures': [{u'eqv': 165.0,
   u'eunit': u'g',
   u'label': u'cup, sliced',
   u'qty': 1.0,
   u'value': u'143.93'},
  {u'eqv': 66.0,
   u'eunit': u'g',
   u'label': u'fruit (2-1/8" dia)',
   u'qty': 1.0,
   u'value': u'57.57'},
  {u'eqv': 151.0,
   u'eunit': u'g',
   u'label': u'NLEA serving',
   u'qty': 1.0,
   u'value': u'131.72'}],
 u'name': u'Water',
 u'nutrient_id': u'255',
 u'unit': u'g',
 u'value': u'87.23'}

Be sure to document and test your function.

In [11]:
requests_cache.install_cache("cache")
key = "KzsGUKXTRFCNq9WSYbTcwVi5FA7SjWzFC15sr7rO"

def ndb_report(number):
    """
    Requests a basic food report
    
    Argument: NDB number
    
    Return: A list of nutrients for the food
    
    """
    url = "https://api.nal.usda.gov/ndb/V2/reports"
    response = requests.get(url, params = {
            "ndbno": number,
            "api_key": key,
            "format": "json"
        })
    output = response.json()
    return output["foods"][0]['food']['nutrients']
In [12]:
number = "09279"
output = ndb_report(number)
output[0]
Out[12]:
{u'group': u'Proximates',
 u'measures': [{u'eqv': 165.0,
   u'eunit': u'g',
   u'label': u'cup, sliced',
   u'qty': 1.0,
   u'value': u'143.93'},
  {u'eqv': 66.0,
   u'eunit': u'g',
   u'label': u'fruit (2-1/8" dia)',
   u'qty': 1.0,
   u'value': u'57.57'},
  {u'eqv': 151.0,
   u'eunit': u'g',
   u'label': u'NLEA serving',
   u'qty': 1.0,
   u'value': u'131.72'}],
 u'name': u'Water',
 u'nutrient_id': u'255',
 u'unit': u'g',
 u'value': u'87.23'}
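Because the values in a report arrive as strings, a small helper can make single nutrients easier to pull out. The sketch below is illustrative only: nutrient_value is a hypothetical helper, and the sample list is a trimmed, hand-written version of the report structure shown above (the Energy entry uses the 46 kcal reported for raw plums later in this notebook).

```python
def nutrient_value(nutrients, name):
    """Return the value of the first nutrient with this name as a float, or None."""
    for nutrient in nutrients:
        if nutrient["name"] == name:
            return float(nutrient["value"])
    return None


# Trimmed, hand-written entries in the shape of the report for "09279".
sample_nutrients = [
    {"name": "Water", "nutrient_id": "255", "unit": "g", "value": "87.23"},
    {"name": "Energy", "unit": "kcal", "value": "46"},
]

print(nutrient_value(sample_nutrients, "Water"))
print(nutrient_value(sample_nutrients, "Energy"))
```

With the real API you would pass the list returned by ndb_report() instead of sample_nutrients.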

Exercise 1.4. Which foods provide the best combination of price, yield, and nutrition? You can use kilocalories as a measure of "nutrition" here, but a more detailed analysis is better. Use plots to support your analysis.

In [13]:
def nutri_search(number):
    """
    Create a dataframe for the output from the ndb_report function
    
    Argument: ndb number
    
    Return: dataframe
    """
    result = ndb_report(number)
    df_number = pd.DataFrame(result)
    df_number["ndbno"] = number
    return df_number

Step 1: Merge the fresh_final_df and the dataframe generated from the nutri_search function

In [14]:
nutri_list = [nutri_search(number) for number in fresh_final_df["ndbno"]] 
df_number = pd.concat(nutri_list)
df_number = df_number.loc[:, ["name", "value", "unit", "ndbno"]]
fresh_df = pd.merge(fresh_final_df, df_number, on = "ndbno", how = "inner")

Step 2: Select the rows that contain the kilocalorie information

In [15]:
df = fresh_df[fresh_df["name_y"].str.contains("Energy")]
In [16]:
df = df.rename(columns = {"value": "nutrition"})

# convert the energy values from strings to numbers (non-numeric values become NaN)
df["nutrition"] = pd.to_numeric(df["nutrition"], errors="coerce")

df
Out[16]:
form price_per_lb yield lb_per_cup price_per_cup food type name_x ndbno name_y nutrition unit
1 Fresh1 0.333412 0.520000 0.330693 0.212033 watermelon fruit Watermelon, raw 09326 Energy 30 kcal
34 Fresh1 0.535874 0.510000 0.374786 0.393800 cantaloupe fruit Melons, cantaloupe, raw 09181 Energy 34 kcal
67 Fresh1 1.377962 0.740000 0.407855 0.759471 tangerines fruit Tangerines, (mandarin oranges), raw 09218 Energy 53 kcal
100 Fresh1 2.358808 0.940000 0.319670 0.802171 strawberries fruit Strawberries, raw 09316 Energy 32 kcal
133 Fresh1 1.827416 0.940000 0.363763 0.707176 plums fruit Plums, raw 09279 Energy 46 kcal
166 Fresh1 1.035173 0.730000 0.407855 0.578357 oranges fruit Oranges, raw, California, valencias 09201 Energy 49 kcal
195 Fresh1 6.975811 0.960000 0.319670 2.322874 raspberries fruit Raspberries, raw 09302 Energy 52 kcal
228 Fresh1 2.173590 0.560000 0.341717 1.326342 pomegranate fruit Pomegranates, raw 09286 Energy 83 kcal
261 Fresh1 0.627662 0.510000 0.363763 0.447686 pineapple fruit Pineapple, raw, all varieties 09266 Energy 50 kcal
294 Fresh1 3.040072 0.930000 0.363763 1.189102 apricots fruit Apricots, raw 09021 Energy 48 kcal
327 Fresh1 0.796656 0.460000 0.374786 0.649077 honeydew fruit Melons, honeydew, raw 09184 Energy 36 kcal
360 Fresh1 1.298012 0.620000 0.308647 0.646174 papaya fruit Papayas, raw 09226 Energy 43 kcal
393 Fresh1 2.044683 0.760000 0.385809 1.037970 kiwi fruit Kiwifruit, green, raw 09148 Energy 61 kcal
426 Fresh1 3.592990 0.920000 0.341717 1.334548 cherries fruit Cherries, sour, red, raw 09063 Energy 50 kcal
459 Fresh1 0.566983 0.640000 0.330693 0.292965 bananas fruit Bananas, raw 09040 Energy 89 kcal
492 Fresh1 1.567515 0.900000 0.242508 0.422373 apples fruit Apples, raw, with skin 09003 Energy 52 kcal
525 Fresh1 1.591187 0.960000 0.341717 0.566390 peaches fruit Peaches, yellow, raw 09236 Energy 39 kcal
558 Fresh1 1.761148 0.910000 0.319670 0.618667 nectarines fruit Nectarines, raw 09191 Energy 44 kcal
591 Fresh1 1.461575 0.900000 0.363763 0.590740 pears fruit Pears, raw 09252 Energy 57 kcal
624 Fresh1 0.897802 0.490000 0.462971 0.848278 grapefruit fruit Grapefruit, raw, white, California 09117 Energy 37 kcal
652 Fresh1 5.774708 0.960000 0.319670 1.922919 blackberries fruit Blackberries, raw 09042 Energy 43 kcal
685 Fresh1 2.093827 0.960000 0.330693 0.721266 grapes fruit Grapes, muscadine, raw 09129 Energy 57 kcal
703 Fresh1 4.734622 0.950000 0.319670 1.593177 blueberries fruit Blueberries, raw 09050 Energy 57 kcal
736 Fresh1 1.377563 0.710000 0.363763 0.705783 mangoes fruit Mangos, raw 09176 Energy 60 kcal
769 Fresh1 3.213494 0.493835 0.396832 2.582272 asparagus vegetables Asparagus, raw 11011 Energy 20 kcal
802 Fresh, consumed with peel1 1.295931 0.970000 0.264555 0.353448 cucumbers vegetables Cucumber, with peel, raw 11205 Energy 15 kcal
835 Fresh1 1.213039 0.950000 0.242508 0.309655 lettuce, iceberg vegetables Lettuce, iceberg (includes crisphead types), raw 11252 Energy 14 kcal
868 Fresh1 1.038107 0.900000 0.352740 0.406868 onions vegetables Onions, raw 11282 Energy 40 kcal
901 Fresh1 2.471749 0.750000 0.319670 1.053526 turnip_greens vegetables Turnip greens, raw 11568 Energy 32 kcal
934 Fresh1 2.569235 0.840000 0.308647 0.944032 mustard_greens vegetables Mustard greens, raw 11270 Energy 27 kcal
967 Fresh1 0.564320 0.811301 0.264555 0.184017 potatoes vegetables Potatoes, flesh and skin, raw 11352 Energy 77 kcal
1000 Fresh1 2.630838 1.160000 0.286601 0.650001 collard_greens vegetables Collards, raw 11161 Energy 32 kcal
1033 Fresh1 2.139972 0.846575 0.275578 0.696606 green_beans vegetables Beans, snap, green, raw 11052 Energy 31 kcal
1066 Fresh1 1.172248 0.458554 0.451948 1.155360 acorn, squash vegetables Squash, winter, acorn, raw 11482 Energy 40 kcal
1095 Fresh1 2.277940 0.820000 0.264555 0.734926 Peppers, red vegetables Peppers, sweet, red, raw 11821 Energy 31 kcal
1128 Fresh green cabbage1 0.579208 0.778797 0.330693 0.245944 cabbage vegetables Swamp cabbage, (skunk cabbage), raw 11503 Energy 19 kcal
1154 Fresh1 0.918897 0.811301 0.440925 0.499400 sweet, potatoes vegetables Sweet potato leaves, raw 11505 Energy 42 kcal
1183 Fresh1 1.639477 0.769500 0.396832 0.845480 summer, squash vegetables Squash, summer, scallop, raw 11475 Energy 18 kcal
1216 Fresh1 1.311629 0.900000 0.275578 0.401618 radish vegetables Radishes, raw 11429 Energy 16 kcal
1249 Fresh1 1.244737 0.714000 0.451948 0.787893 butternut, squash vegetables Squash, winter, butternut, raw 11485 Energy 45 kcal
1282 Fresh1 2.235874 0.740753 0.319670 0.964886 avocados vegetables Avocados, raw, California 09038 Energy 167 kcal
1314 Fresh1 2.807302 1.050000 0.286601 0.766262 kale vegetables Kale, raw 11233 Energy 49 kcal
1347 Fresh1 2.213050 0.375309 0.385809 2.274967 artichoke vegetables Artichokes, (globe or french), raw 11007 Energy 47 kcal
1379 Fresh1 3.213552 0.769474 0.352740 1.473146 okra vegetables Okra, raw 11278 Energy 33 kcal
1412 Fresh1 1.410363 0.820000 0.264555 0.455022 green peppers vegetables Peppers, sweet, green, raw 11333 Energy 20 kcal
1445 Fresh1 2.763553 1.060000 0.341717 0.890898 brussels, sprouts vegetables Brussels sprouts, raw 11098 Energy 43 kcal
1478 Fresh1 2.690623 0.540000 0.363763 1.812497 corn, sweet vegetables Corn, sweet, white, raw 11900 Energy 86 kcal

Step 3: Find the correlation between the variables using scatter plots

In [17]:
fig, ax = plt.subplots(1, 1)

fig.suptitle('Correlation between Nutrition and Price for Fruit and Vegetables', fontsize=15)
# plot one series per food type so that each type gets exactly one legend entry
for name, group in df.groupby("type"):
    ax.plot(group['nutrition'], group['price_per_lb'], 'o', label = name)
ax.set(xlabel = 'nutrition', ylabel = 'price_per_lb')
ax.legend(loc = 4)

# plot a regression line 
x = df['nutrition']
y = df['price_per_lb']
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0] * x + fit[1], color='red')
ax.set_ylim([0, 8])

plt.show()
In [18]:
fig, ax = plt.subplots(1, 1)

fig.suptitle('Correlation between Nutrition and Yield for Fruit and Vegetables', fontsize=15)
# plot one series per food type so that each type gets exactly one legend entry
for name, group in df.groupby("type"):
    ax.plot(group['nutrition'], group['yield'], 'o', label = name)
ax.set(xlabel = 'nutrition', ylabel = 'yield')
ax.legend(loc = 4)

# plot a regression line 
x = df['nutrition']
y = df['yield']
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0] * x + fit[1], color='red')

plt.show()
In [19]:
fig, ax = plt.subplots(1, 1)

fig.suptitle('Correlation between Price and Yield for Fruit and Vegetables', fontsize=15)
# plot one series per food type so that each type gets exactly one legend entry
for name, group in df.groupby("type"):
    ax.plot(group['price_per_lb'], group['yield'], 'o', label = name)
ax.set(xlabel = 'price_per_lb', ylabel = 'yield')
ax.legend(loc = 4)

# plot a regression line 
x = df['price_per_lb']
y = df['yield']
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0] * x + fit[1], color='red')
ax.set_xlim([0, 8])

plt.show()

The scatter plots show no clear correlation between price and nutrition, or between nutrition and yield, for fruit and vegetables. However, price and yield appear to have a moderate positive correlation.
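One way to put a number on these visual impressions is Pearson's correlation coefficient. The sketch below computes it in plain Python; pearson_r is a hypothetical helper, and the sample values are the price_per_lb and yield of the first five fruits in the table above, so the result describes only that small subset and not the whole data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists (both must vary)."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# watermelon, cantaloupe, tangerines, strawberries, plums
price_per_lb = [0.333412, 0.535874, 1.377962, 2.358808, 1.827416]
yield_ = [0.52, 0.51, 0.74, 0.94, 0.94]

r_sample = pearson_r(price_per_lb, yield_)
print(round(r_sample, 2))
```

On the full data frame, df[["price_per_lb", "yield", "nutrition"]].corr() should give the same kind of coefficient for every pair of columns.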

Step 4: Use boxplots to illustrate the price distribution across nutrition and yield categories for fruit and vegetables.

In [20]:
df["nutrition_category"] = pd.cut(df["nutrition"], 3, labels = ["low", "medium", "high"])
In [21]:
df["price_category"] = pd.cut(df["price_per_lb"], 3, labels = ["low", "medium", "high"])
In [22]:
df["yield_category"] = pd.cut(df["yield"], 3, labels = ["low", "medium", "high"])
In [23]:
g = sns.FacetGrid(df, col="type", size=4, aspect=.7)
(g.map(sns.boxplot, "nutrition_category", "price_per_lb", "yield_category").despine(left=True).add_legend(title="yield_category"))  
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x119de6a50>

The majority of the fruits and vegetables are in the low nutrition category.

Boxplots Interpretation:

For fruits: the low-nutrition, low-yield group has a comparatively short box, which suggests that the prices of these fruits are close to one another.

The low-nutrition, medium-yield group also has a comparatively short box, so its prices are similarly close, although it has one outlier.

The low-nutrition, high-yield group has a tall box, which suggests that the prices of these fruits vary widely.

For vegetables: the low-nutrition groups with low, medium, and high yield all have comparatively tall boxes, which suggests that prices vary within each group. The boxes also suggest that prices differ between the yield groups.
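The low, medium, and high categories come from pd.cut() with three bins, which splits the observed range of a column into equal-width intervals. A rough pure-Python sketch of that idea follows; equal_width_category is a hypothetical helper, and because pd.cut() slightly widens the range at the lower edge, values that sit exactly on a boundary may be treated differently.

```python
import math


def equal_width_category(value, low, high, labels=("low", "medium", "high")):
    """Assign value to one of len(labels) equal-width bins spanning [low, high]."""
    width = (high - low) / float(len(labels))
    # bins are closed on the right, so a value on an edge goes to the lower bin
    index = int(math.ceil((value - low) / width)) - 1
    index = max(0, min(index, len(labels) - 1))
    return labels[index]


# Energy values in the table range from 14 kcal (iceberg lettuce)
# to 167 kcal (avocados), so each bin is 51 kcal wide.
print(equal_width_category(89, 14, 167))   # bananas
print(equal_width_category(167, 14, 167))  # avocados
```

Because one very high value stretches the range, most foods end up in the lowest bin.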

Step 5: Separate yield, nutrition, and price into three categories each, then use a swarm plot to find the best combination

In [24]:
sns.swarmplot(x="nutrition", y="yield_category", data=df, hue="price_category", size=10)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c042350>

As the plot above shows, the best combination is a food with high nutrition, low price, and medium yield. Avocados therefore provide the best combination of price, yield, and nutrition.

Step 6: Create a heatmap for the price, yield and nutrition

In [25]:
data_pivoted = df.pivot(index="yield", columns="nutrition", values="price_per_lb")
ax = sns.heatmap(data_pivoted, annot=True, fmt=".1f", cmap="YlGnBu", annot_kws={"size": 8})

ax.set_title('Heatmap for Price, Yield and Nutrition')

plt.show()

The heatmap above leads to the same conclusion: the food priced at about $2.2 per pound with relatively high yield and nutrition is the avocado.