Assignment 3

The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.

The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip file contains the fruit and vegetable files in the directories fruit and vegetables, respectively.

Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:

type food form price_per_lb yield lb_per_cup price_per_cup
fruit watermelon Fresh1 0.333412 0.52 0.330693 0.212033
fruit cantaloupe Fresh1 0.535874 0.51 0.374786 0.3938
vegetables onions Fresh1 1.03811 0.9 0.35274 0.406868
...

It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:

Ask questions and search the documentation/web to find the functions you need.

In [1]:
import os
import glob
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
In [2]:
def readfile(f, word):
    """
    read the file and extract the "Fresh" rows 

    Args: 
        file path and the key word(fresh)

    Returns:
        dataframe
    """
    df = pd.read_excel(f, header = 1)
    df_fresh = df.iloc[:,0].str.startswith(word)
    df = df.fillna(" ")
    df = df[df_fresh == True]
    name = os.path.basename(f)
    name = name[:-5]
    df['food'] = name
    df['type'] = typename
    df = df.drop(df.columns[[2, 5]], axis = 1)
    df = df[0:5]
    return df
In [3]:
def fixdf(foodlist):
    """
    Fix the dataframe return from the readfile function
    
    Args:
        a list
    
    Returns:
        dataframe
    """
    new_df = pd.concat(foodlist, axis = 0)
    new_df = new_df.drop(new_df.columns[[5, 6]], axis = 1)
    new_df = new_df.dropna()
    new_df = new_df.rename(columns= {new_df.columns[0]:"price_per_cup", new_df.columns[1]:"price_per_lb", new_df.columns[2]:"form", new_df.columns[3]:"yield", new_df.columns[4]:"lb_per_cup"})
    new_df = new_df[['type', 'food', 'form', 'price_per_lb', 'yield', 'lb_per_cup', 'price_per_cup']]
    return new_df
In [4]:
# Get the file path 
path = "/Users/Chloechen/Desktop/STA141B/assignment3_data/fruit"
# Get all the filenames
fruitfiles = glob.glob(os.path.join(path, "*.xlsx"))

fruit = []
fruitlist = []
word = "Fresh"
typename = "fruit"

# Use a forloop to append all the file data into a list
for f in fruitfiles:
    data = readfile(f, word)
    fruitlist.append(data)

fruitdf = fixdf(fruitlist)

Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.

Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?

You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.

In [5]:
# Get the file path 
vegepath = "/Users/Chloechen/Desktop/STA141B/assignment3_data/vegetables"
# Get all the filenames
vegefiles = glob.glob(os.path.join(vegepath, "*.xlsx"))

vegetable = []
vegelist = []
word = "Fresh"
typename = "vegetables"

# Use a forloop to append all the file data into a list
for f in vegefiles:
    data = readfile(f, word)
    vegelist.append(data)
    
vegedf = fixdf(vegelist)
In [6]:
print "As we can see from the table above, my codes do not produce the correct prices for tomatoes because the \"Fresh\" row only contains the word \"Fresh\" and there is no other data that can be extracted. The other files such as broccolli, carrots, cauliflower, celery, lettuce_romaine, mushrooms and spinach also have the same problem"
As we can see from the table above, my codes do not produce the correct prices for tomatoes because the "Fresh" row only contains the word "Fresh" and there is no other data that can be extracted. The other files such as broccolli, carrots, cauliflower, celery, lettuce_romaine, mushrooms and spinach also have the same problem

Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).

In [7]:
# eliminate the rows that do not contain any data
vegedf = vegedf.ix[vegedf['price_per_lb'] != ' ']
In [8]:
fooddf = pd.concat([fruitdf, vegedf], axis = 0)

# change all the data into numeric
str_col = fooddf.iloc[:, -4:]
for col in str_col:
    fooddf[col] = fooddf[col].convert_objects(convert_numeric=True)
    
str_col = fooddf.iloc[:, -4:]
str_col.dtypes
/Users/Chloechen/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:6: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
Out[8]:
price_per_lb     float64
yield            float64
lb_per_cup       float64
price_per_cup    float64
dtype: object
In [9]:
fooddf
Out[9]:
type food form price_per_lb yield lb_per_cup price_per_cup
1 fruit apples Fresh1 1.567515 0.900000 0.242508 0.422373
1 fruit apricots Fresh1 3.040072 0.930000 0.363763 1.189102
1 fruit bananas Fresh1 0.566983 0.640000 0.330693 0.292965
1 fruit blackberries Fresh1 5.774708 0.960000 0.319670 1.922919
1 fruit blueberries Fresh1 4.734622 0.950000 0.319670 1.593177
1 fruit cantaloupe Fresh1 0.535874 0.510000 0.374786 0.393800
1 fruit cherries Fresh1 3.592990 0.920000 0.341717 1.334548
1 fruit grapefruit Fresh1 0.897802 0.490000 0.462971 0.848278
1 fruit grapes Fresh1 2.093827 0.960000 0.330693 0.721266
1 fruit honeydew Fresh1 0.796656 0.460000 0.374786 0.649077
1 fruit kiwi Fresh1 2.044683 0.760000 0.385809 1.037970
1 fruit mangoes Fresh1 1.377563 0.710000 0.363763 0.705783
1 fruit nectarines Fresh1 1.761148 0.910000 0.319670 0.618667
1 fruit oranges Fresh1 1.035173 0.730000 0.407855 0.578357
1 fruit papaya Fresh1 1.298012 0.620000 0.308647 0.646174
1 fruit peaches Fresh1 1.591187 0.960000 0.341717 0.566390
1 fruit pears Fresh1 1.461575 0.900000 0.363763 0.590740
1 fruit pineapple Fresh1 0.627662 0.510000 0.363763 0.447686
1 fruit plums Fresh1 1.827416 0.940000 0.363763 0.707176
1 fruit pomegranate Fresh1 2.173590 0.560000 0.341717 1.326342
1 fruit raspberries Fresh1 6.975811 0.960000 0.319670 2.322874
1 fruit strawberries Fresh1 2.358808 0.940000 0.319670 0.802171
1 fruit tangerines Fresh1 1.377962 0.740000 0.407855 0.759471
1 fruit watermelon Fresh1 0.333412 0.520000 0.330693 0.212033
1 vegetables acorn_squash Fresh1 1.172248 0.458554 0.451948 1.155360
1 vegetables artichoke Fresh1 2.213050 0.375309 0.385809 2.274967
1 vegetables asparagus Fresh1 3.213494 0.493835 0.396832 2.582272
1 vegetables avocados Fresh1 2.235874 0.740753 0.319670 0.964886
1 vegetables brussels_sprouts Fresh1 2.763553 1.060000 0.341717 0.890898
1 vegetables butternut_squash Fresh1 1.244737 0.714000 0.451948 0.787893
1 vegetables cabbage Fresh green cabbage1 0.579208 0.778797 0.330693 0.245944
2 vegetables cabbage Fresh red cabbage1 1.056450 0.779107 0.330693 0.448412
1 vegetables collard_greens Fresh1 2.630838 1.160000 0.286601 0.650001
1 vegetables corn_sweet Fresh1 2.690623 0.540000 0.363763 1.812497
1 vegetables cucumbers Fresh, consumed with peel1 1.295931 0.970000 0.264555 0.353448
2 vegetables cucumbers Fresh, peeled1 1.295931 0.730000 0.264555 0.469650
1 vegetables green_beans Fresh1 2.139972 0.846575 0.275578 0.696606
1 vegetables green_peppers Fresh1 1.410363 0.820000 0.264555 0.455022
1 vegetables kale Fresh1 2.807302 1.050000 0.286601 0.766262
1 vegetables lettuce_iceberg Fresh1 1.213039 0.950000 0.242508 0.309655
1 vegetables mustard_greens Fresh1 2.569235 0.840000 0.308647 0.944032
1 vegetables okra Fresh1 3.213552 0.769474 0.352740 1.473146
1 vegetables onions Fresh1 1.038107 0.900000 0.352740 0.406868
1 vegetables potatoes Fresh1 0.564320 0.811301 0.264555 0.184017
1 vegetables radish Fresh1 1.311629 0.900000 0.275578 0.401618
1 vegetables red_peppers Fresh1 2.277940 0.820000 0.264555 0.734926
1 vegetables summer_squash Fresh1 1.639477 0.769500 0.396832 0.845480
1 vegetables sweet_potatoes Fresh1 0.918897 0.811301 0.440925 0.499400
1 vegetables turnip_greens Fresh1 2.471749 0.750000 0.319670 1.053526

Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.

  • What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?
  • How do the price distributions compare for fruit and vegetables?
  • Which foods are the best value for the price?
  • What's something surprising about this data set?
  • Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis.
In [10]:
maxprice = fruitdf["price_per_lb"].max()
minprice = fruitdf["price_per_lb"].min()
print "The most expensive fruit is %s.\n" %fruitdf.loc[fruitdf["price_per_lb"] == maxprice].food[1]
print "The least expensive fruit is %s.\n" %fruitdf.loc[fruitdf["price_per_lb"] == minprice].food[1]
The most expensive fruit is raspberries.

The least expensive fruit is watermelon.

In [11]:
import seaborn as sns

plot = sns.FacetGrid(fooddf, col = "type")
plot.map(sns.distplot, "price_per_lb")
plt.subplots_adjust(top=0.8)
plot.fig.suptitle('Price Distribution for Fruit and Vegetables')

plt.show()
In [12]:
print "The most common price_per_lb value for fruit is roughly between $0.25~$2 and the least common price_per_lb value is roughly between $4.5~$7.\nOn the other hand, the most common price_per_lb value for vegetables is between $0.5 and $1.5. In addition, the prices of the vegetables are more concentrated in a range compared to the prices of the fruit."
The most common price_per_lb value for fruit is roughly between $0.25~$2 and the least common price_per_lb value is roughly between $4.5~$7.
On the other hand, the most common price_per_lb value for vegetables is between $0.5 and $1.5. In addition, the prices of the vegetables are more concentrated in a range compared to the prices of the fruit.
In [13]:
fruitdf["price_yield_ratio"] = fruitdf["yield"]/fruitdf["price_per_lb"]
highestratio = fruitdf["price_yield_ratio"].max()
print "The definition for best value for the price: The food that has the highest price yield ratio.\nThe fruit that has the best value for price is %s.\n" %fruitdf.loc[fruitdf["price_yield_ratio"] == highestratio].food[1]
The definition for best value for the price: The food that has the highest price yield ratio.
The fruit that has the best value for price is watermelon.

In [14]:
vegedf["price_yield_ratio"] = vegedf["yield"]/vegedf["price_per_lb"]
highestratio = vegedf["price_yield_ratio"].max()
print "The vegetables that has the best value for price is %s.\n" %vegedf.loc[vegedf["price_yield_ratio"] == highestratio].food[1]
The vegetables that has the best value for price is potatoes.

In [15]:
fig, ax = plt.subplots(1, 1)

def scatter(group):
    plt.plot(group['price_per_lb'], group['price_per_cup'], 'o', label = group.name)

fig.suptitle('Correlation between prices measure in different units for Fruit and Vegetables', fontsize=15)
fooddf.groupby("type").apply(scatter)
ax.set(xlabel = "price_per_lb", ylabel = 'price_per_cup')
ax.legend(loc = 4)

# plot a regression line 
x = fooddf['price_per_lb']
y = fooddf['price_per_cup']
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0] * x + fit[1], color='red')


plt.show()
In [16]:
print "I would expect there is a very strong positive correlation between price_per_cup and price_per_lb. However, the graph contradicts my prediction by showing many points scatter away from the line."
I would expect there is a very strong positive correlation between price_per_cup and price_per_lb. However, the graph contradicts my prediction by showing many points scatter away from the line.
In [17]:
fooddf["price_yield_ratio"] = fooddf["yield"]/fooddf["price_per_lb"]

fooddf.boxplot("price_per_lb", "type")
plt.title("Boxplot of Price_per_lb for Fruit and Vegetables")
plt.suptitle("")

fooddf.boxplot("yield", "type")
plt.title("Boxplot of Yield for Fruit and Vegetables")
plt.suptitle("")


fooddf.boxplot("price_yield_ratio", "type")
plt.title("Boxplot of Price_yield_ratio for Fruit and Vegetables")
plt.suptitle("")


plt.show()
In [20]:
print "I use boxplots of price_per_lb and yield to compare fruit and vegetables.\nIn the Boxplot of Price_per_lb, both food have very similar median price. However, there are some outliers in the fruit category. In addition, the distibution is skewed right, which suggests that the most price observations are concentrated on the lower price scale.\nOn the other hand, the price distribution of vegetables is symmetric, which suggests that the prices are evenly split at the median price.\nThe median of the yield for both categories is also very close. The boxplot of yield for fruit is skewed left, which suggests that many fruits have higher yield rate. The one for vegetables is symmetric and therefore the yield rate is evenly split at the median yield. Therefore, the fruit should have higher return of value because most of them cost less and have higer yield, this has also proven on the boxplot of Price_yield_ratio since the boxplot for vegetables is more skewed right than fruits."
I use boxplots of price_per_lb and yield to compare fruit and vegetables.
In the Boxplot of Price_per_lb, both food have very similar median price. However, there are some outliers in the fruit category. In addition, the distibution is skewed right, which suggests that the most price observations are concentrated on the lower price scale.
On the other hand, the price distribution of vegetables is symmetric, which suggests that the prices are evenly split at the median price.
The median of the yield for both categories is also very close. The boxplot of yield for fruit is skewed left, which suggests that many fruits have higher yield rate. The one for vegetables is symmetric and therefore the yield rate is evenly split at the median yield. Therefore, the fruit should have higher return of value because most of them cost less and have higer yield, this has also proven on the boxplot of Price_yield_ratio since the boxplot for vegetables is more skewed right than fruits.
In [19]:
print "My hypothesis is that watermelon and potatoes have the best combination of price, yield, and nutrition."
My hypothesis is that watermelon and potatoes have the best combination of price, yield, and nutrition.