Assignment 3¶

The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.

The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip file contains the fruit and vegetable files in the directories fruit and vegetables, respectively.

Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:

type	food	form	price_per_lb	yield	lb_per_cup	price_per_cup
fruit	watermelon	Fresh1	0.333412	0.52	0.330693	0.212033
fruit	cantaloupe	Fresh1	0.535874	0.51	0.374786	0.3938
vegetables	onions	Fresh1	1.03811	0.9	0.35274	0.406868
...

It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:

str methods
os
os.path
pandas: read_excel(), concat(), .fillna(), .str, plotting methods

Ask questions and search the documentation/web to find the functions you need.

import os
import glob
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt

def readfile(f, word):
    """
    read the file and extract the "Fresh" rows 

    Args: 
        file path and the key word(fresh)

    Returns:
        dataframe
    """
    df = pd.read_excel(f, header = 1)
    df_fresh = df.iloc[:,0].str.startswith(word)
    df = df.fillna(" ")
    df = df[df_fresh == True]
    name = os.path.basename(f)
    name = name[:-5]
    df['food'] = name
    df['type'] = typename
    df = df.drop(df.columns[[2, 5]], axis = 1)
    df = df[0:5]
    return df

def fixdf(foodlist):
    """
    Fix the dataframe return from the readfile function
    
    Args:
        a list
    
    Returns:
        dataframe
    """
    new_df = pd.concat(foodlist, axis = 0)
    new_df = new_df.drop(new_df.columns[[5, 6]], axis = 1)
    new_df = new_df.dropna()
    new_df = new_df.rename(columns= {new_df.columns[0]:"price_per_cup", new_df.columns[1]:"price_per_lb", new_df.columns[2]:"form", new_df.columns[3]:"yield", new_df.columns[4]:"lb_per_cup"})
    new_df = new_df[['type', 'food', 'form', 'price_per_lb', 'yield', 'lb_per_cup', 'price_per_cup']]
    return new_df

# Get the file path 
path = "/Users/Chloechen/Desktop/STA141B/assignment3_data/fruit"
# Get all the filenames
fruitfiles = glob.glob(os.path.join(path, "*.xlsx"))

fruit = []
fruitlist = []
word = "Fresh"
typename = "fruit"

# Use a forloop to append all the file data into a list
for f in fruitfiles:
    data = readfile(f, word)
    fruitlist.append(data)

fruitdf = fixdf(fruitlist)

Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.

Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?

You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.

# Get the file path 
vegepath = "/Users/Chloechen/Desktop/STA141B/assignment3_data/vegetables"
# Get all the filenames
vegefiles = glob.glob(os.path.join(vegepath, "*.xlsx"))

vegetable = []
vegelist = []
word = "Fresh"
typename = "vegetables"

# Use a forloop to append all the file data into a list
for f in vegefiles:
    data = readfile(f, word)
    vegelist.append(data)
    
vegedf = fixdf(vegelist)

print "As we can see from the table above, my codes do not produce the correct prices for tomatoes because the \"Fresh\" row only contains the word \"Fresh\" and there is no other data that can be extracted. The other files such as broccolli, carrots, cauliflower, celery, lettuce_romaine, mushrooms and spinach also have the same problem"

As we can see from the table above, my codes do not produce the correct prices for tomatoes because the "Fresh" row only contains the word "Fresh" and there is no other data that can be extracted. The other files such as broccolli, carrots, cauliflower, celery, lettuce_romaine, mushrooms and spinach also have the same problem

Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).

# eliminate the rows that do not contain any data
vegedf = vegedf.ix[vegedf['price_per_lb'] != ' ']

fooddf = pd.concat([fruitdf, vegedf], axis = 0)

# change all the data into numeric
str_col = fooddf.iloc[:, -4:]
for col in str_col:
    fooddf[col] = fooddf[col].convert_objects(convert_numeric=True)
    
str_col = fooddf.iloc[:, -4:]
str_col.dtypes

/Users/Chloechen/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:6: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

price_per_lb     float64
yield            float64
lb_per_cup       float64
price_per_cup    float64
dtype: object

fooddf

Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.

What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?
How do the price distributions compare for fruit and vegetables?
Which foods are the best value for the price?
What's something surprising about this data set?
Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis.

maxprice = fruitdf["price_per_lb"].max()
minprice = fruitdf["price_per_lb"].min()
print "The most expensive fruit is %s.\n" %fruitdf.loc[fruitdf["price_per_lb"] == maxprice].food[1]
print "The least expensive fruit is %s.\n" %fruitdf.loc[fruitdf["price_per_lb"] == minprice].food[1]

The most expensive fruit is raspberries.

The least expensive fruit is watermelon.

import seaborn as sns

plot = sns.FacetGrid(fooddf, col = "type")
plot.map(sns.distplot, "price_per_lb")
plt.subplots_adjust(top=0.8)
plot.fig.suptitle('Price Distribution for Fruit and Vegetables')

plt.show()

print "The most common price_per_lb value for fruit is roughly between $0.25~$2 and the least common price_per_lb value is roughly between $4.5~$7.\nOn the other hand, the most common price_per_lb value for vegetables is between $0.5 and $1.5. In addition, the prices of the vegetables are more concentrated in a range compared to the prices of the fruit."

The most common price_per_lb value for fruit is roughly between $0.25~$2 and the least common price_per_lb value is roughly between $4.5~$7.
On the other hand, the most common price_per_lb value for vegetables is between $0.5 and $1.5. In addition, the prices of the vegetables are more concentrated in a range compared to the prices of the fruit.

fruitdf["price_yield_ratio"] = fruitdf["yield"]/fruitdf["price_per_lb"]
highestratio = fruitdf["price_yield_ratio"].max()
print "The definition for best value for the price: The food that has the highest price yield ratio.\nThe fruit that has the best value for price is %s.\n" %fruitdf.loc[fruitdf["price_yield_ratio"] == highestratio].food[1]

The definition for best value for the price: The food that has the highest price yield ratio.
The fruit that has the best value for price is watermelon.

vegedf["price_yield_ratio"] = vegedf["yield"]/vegedf["price_per_lb"]
highestratio = vegedf["price_yield_ratio"].max()
print "The vegetables that has the best value for price is %s.\n" %vegedf.loc[vegedf["price_yield_ratio"] == highestratio].food[1]

The vegetables that has the best value for price is potatoes.

fig, ax = plt.subplots(1, 1)

def scatter(group):
    plt.plot(group['price_per_lb'], group['price_per_cup'], 'o', label = group.name)

fig.suptitle('Correlation between prices measure in different units for Fruit and Vegetables', fontsize=15)
fooddf.groupby("type").apply(scatter)
ax.set(xlabel = "price_per_lb", ylabel = 'price_per_cup')
ax.legend(loc = 4)

# plot a regression line 
x = fooddf['price_per_lb']
y = fooddf['price_per_cup']
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0] * x + fit[1], color='red')


plt.show()

print "I would expect there is a very strong positive correlation between price_per_cup and price_per_lb. However, the graph contradicts my prediction by showing many points scatter away from the line."

I would expect there is a very strong positive correlation between price_per_cup and price_per_lb. However, the graph contradicts my prediction by showing many points scatter away from the line.

fooddf["price_yield_ratio"] = fooddf["yield"]/fooddf["price_per_lb"]

fooddf.boxplot("price_per_lb", "type")
plt.title("Boxplot of Price_per_lb for Fruit and Vegetables")
plt.suptitle("")

fooddf.boxplot("yield", "type")
plt.title("Boxplot of Yield for Fruit and Vegetables")
plt.suptitle("")


fooddf.boxplot("price_yield_ratio", "type")
plt.title("Boxplot of Price_yield_ratio for Fruit and Vegetables")
plt.suptitle("")


plt.show()

print "I use boxplots of price_per_lb and yield to compare fruit and vegetables.\nIn the Boxplot of Price_per_lb, both food have very similar median price. However, there are some outliers in the fruit category. In addition, the distibution is skewed right, which suggests that the most price observations are concentrated on the lower price scale.\nOn the other hand, the price distribution of vegetables is symmetric, which suggests that the prices are evenly split at the median price.\nThe median of the yield for both categories is also very close. The boxplot of yield for fruit is skewed left, which suggests that many fruits have higher yield rate. The one for vegetables is symmetric and therefore the yield rate is evenly split at the median yield. Therefore, the fruit should have higher return of value because most of them cost less and have higer yield, this has also proven on the boxplot of Price_yield_ratio since the boxplot for vegetables is more skewed right than fruits."

I use boxplots of price_per_lb and yield to compare fruit and vegetables.
In the Boxplot of Price_per_lb, both food have very similar median price. However, there are some outliers in the fruit category. In addition, the distibution is skewed right, which suggests that the most price observations are concentrated on the lower price scale.
On the other hand, the price distribution of vegetables is symmetric, which suggests that the prices are evenly split at the median price.
The median of the yield for both categories is also very close. The boxplot of yield for fruit is skewed left, which suggests that many fruits have higher yield rate. The one for vegetables is symmetric and therefore the yield rate is evenly split at the median yield. Therefore, the fruit should have higher return of value because most of them cost less and have higer yield, this has also proven on the boxplot of Price_yield_ratio since the boxplot for vegetables is more skewed right than fruits.

print "My hypothesis is that watermelon and potatoes have the best combination of price, yield, and nutrition."

My hypothesis is that watermelon and potatoes have the best combination of price, yield, and nutrition.

	type	food	form	price_per_lb	yield	lb_per_cup	price_per_cup
1	fruit	apples	Fresh1	1.567515	0.900000	0.242508	0.422373
1	fruit	apricots	Fresh1	3.040072	0.930000	0.363763	1.189102
1	fruit	bananas	Fresh1	0.566983	0.640000	0.330693	0.292965
1	fruit	blackberries	Fresh1	5.774708	0.960000	0.319670	1.922919
1	fruit	blueberries	Fresh1	4.734622	0.950000	0.319670	1.593177
1	fruit	cantaloupe	Fresh1	0.535874	0.510000	0.374786	0.393800
1	fruit	cherries	Fresh1	3.592990	0.920000	0.341717	1.334548
1	fruit	grapefruit	Fresh1	0.897802	0.490000	0.462971	0.848278
1	fruit	grapes	Fresh1	2.093827	0.960000	0.330693	0.721266
1	fruit	honeydew	Fresh1	0.796656	0.460000	0.374786	0.649077
1	fruit	kiwi	Fresh1	2.044683	0.760000	0.385809	1.037970
1	fruit	mangoes	Fresh1	1.377563	0.710000	0.363763	0.705783
1	fruit	nectarines	Fresh1	1.761148	0.910000	0.319670	0.618667
1	fruit	oranges	Fresh1	1.035173	0.730000	0.407855	0.578357
1	fruit	papaya	Fresh1	1.298012	0.620000	0.308647	0.646174
1	fruit	peaches	Fresh1	1.591187	0.960000	0.341717	0.566390
1	fruit	pears	Fresh1	1.461575	0.900000	0.363763	0.590740
1	fruit	pineapple	Fresh1	0.627662	0.510000	0.363763	0.447686
1	fruit	plums	Fresh1	1.827416	0.940000	0.363763	0.707176
1	fruit	pomegranate	Fresh1	2.173590	0.560000	0.341717	1.326342
1	fruit	raspberries	Fresh1	6.975811	0.960000	0.319670	2.322874
1	fruit	strawberries	Fresh1	2.358808	0.940000	0.319670	0.802171
1	fruit	tangerines	Fresh1	1.377962	0.740000	0.407855	0.759471
1	fruit	watermelon	Fresh1	0.333412	0.520000	0.330693	0.212033
1	vegetables	acorn_squash	Fresh1	1.172248	0.458554	0.451948	1.155360
1	vegetables	artichoke	Fresh1	2.213050	0.375309	0.385809	2.274967
1	vegetables	asparagus	Fresh1	3.213494	0.493835	0.396832	2.582272
1	vegetables	avocados	Fresh1	2.235874	0.740753	0.319670	0.964886
1	vegetables	brussels_sprouts	Fresh1	2.763553	1.060000	0.341717	0.890898
1	vegetables	butternut_squash	Fresh1	1.244737	0.714000	0.451948	0.787893
1	vegetables	cabbage	Fresh green cabbage1	0.579208	0.778797	0.330693	0.245944
2	vegetables	cabbage	Fresh red cabbage1	1.056450	0.779107	0.330693	0.448412
1	vegetables	collard_greens	Fresh1	2.630838	1.160000	0.286601	0.650001
1	vegetables	corn_sweet	Fresh1	2.690623	0.540000	0.363763	1.812497
1	vegetables	cucumbers	Fresh, consumed with peel1	1.295931	0.970000	0.264555	0.353448
2	vegetables	cucumbers	Fresh, peeled1	1.295931	0.730000	0.264555	0.469650
1	vegetables	green_beans	Fresh1	2.139972	0.846575	0.275578	0.696606
1	vegetables	green_peppers	Fresh1	1.410363	0.820000	0.264555	0.455022
1	vegetables	kale	Fresh1	2.807302	1.050000	0.286601	0.766262
1	vegetables	lettuce_iceberg	Fresh1	1.213039	0.950000	0.242508	0.309655
1	vegetables	mustard_greens	Fresh1	2.569235	0.840000	0.308647	0.944032
1	vegetables	okra	Fresh1	3.213552	0.769474	0.352740	1.473146
1	vegetables	onions	Fresh1	1.038107	0.900000	0.352740	0.406868
1	vegetables	potatoes	Fresh1	0.564320	0.811301	0.264555	0.184017
1	vegetables	radish	Fresh1	1.311629	0.900000	0.275578	0.401618
1	vegetables	red_peppers	Fresh1	2.277940	0.820000	0.264555	0.734926
1	vegetables	summer_squash	Fresh1	1.639477	0.769500	0.396832	0.845480
1	vegetables	sweet_potatoes	Fresh1	0.918897	0.811301	0.440925	0.499400
1	vegetables	turnip_greens	Fresh1	2.471749	0.750000	0.319670	1.053526