The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.
The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip file contains the fruit and vegetable files in the directories fruit and vegetables, respectively.
Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:
| type | food | form | price_per_lb | yield | lb_per_cup | price_per_cup |
|---|---|---|---|---|---|---|
| fruit | watermelon | Fresh1 | 0.333412 | 0.52 | 0.330693 | 0.212033 |
| fruit | cantaloupe | Fresh1 | 0.535874 | 0.51 | 0.374786 | 0.3938 |
| vegetables | onions | Fresh1 | 1.03811 | 0.9 | 0.35274 | 0.406868 |
| ... |
It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:
str methodsosos.pathread_excel(), concat(), .fillna(), .str, plotting methodsAsk questions and search the documentation/web to find the functions you need.
import os
import glob
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
def readfile(f, word):
"""
read the file and extract the "Fresh" rows
Args:
file path and the key word(fresh)
Returns:
dataframe
"""
df = pd.read_excel(f, header = 1)
df_fresh = df.iloc[:,0].str.startswith(word)
df = df.fillna(" ")
df = df[df_fresh == True]
name = os.path.basename(f)
name = name[:-5]
df['food'] = name
df['type'] = typename
df = df.drop(df.columns[[2, 5]], axis = 1)
df = df[0:5]
return df
def fixdf(foodlist):
"""
Fix the dataframe return from the readfile function
Args:
a list
Returns:
dataframe
"""
new_df = pd.concat(foodlist, axis = 0)
new_df = new_df.drop(new_df.columns[[5, 6]], axis = 1)
new_df = new_df.dropna()
new_df = new_df.rename(columns= {new_df.columns[0]:"price_per_cup", new_df.columns[1]:"price_per_lb", new_df.columns[2]:"form", new_df.columns[3]:"yield", new_df.columns[4]:"lb_per_cup"})
new_df = new_df[['type', 'food', 'form', 'price_per_lb', 'yield', 'lb_per_cup', 'price_per_cup']]
return new_df
# Get the file path
path = "/Users/Chloechen/Desktop/STA141B/assignment3_data/fruit"
# Get all the filenames
fruitfiles = glob.glob(os.path.join(path, "*.xlsx"))
fruit = []
fruitlist = []
word = "Fresh"
typename = "fruit"
# Use a forloop to append all the file data into a list
for f in fruitfiles:
data = readfile(f, word)
fruitlist.append(data)
fruitdf = fixdf(fruitlist)
Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.
Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?
You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.
# Get the file path
vegepath = "/Users/Chloechen/Desktop/STA141B/assignment3_data/vegetables"
# Get all the filenames
vegefiles = glob.glob(os.path.join(vegepath, "*.xlsx"))
vegetable = []
vegelist = []
word = "Fresh"
typename = "vegetables"
# Use a forloop to append all the file data into a list
for f in vegefiles:
data = readfile(f, word)
vegelist.append(data)
vegedf = fixdf(vegelist)
print "As we can see from the table above, my codes do not produce the correct prices for tomatoes because the \"Fresh\" row only contains the word \"Fresh\" and there is no other data that can be extracted. The other files such as broccolli, carrots, cauliflower, celery, lettuce_romaine, mushrooms and spinach also have the same problem"
Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).
# eliminate the rows that do not contain any data
vegedf = vegedf.ix[vegedf['price_per_lb'] != ' ']
fooddf = pd.concat([fruitdf, vegedf], axis = 0)
# change all the data into numeric
str_col = fooddf.iloc[:, -4:]
for col in str_col:
fooddf[col] = fooddf[col].convert_objects(convert_numeric=True)
str_col = fooddf.iloc[:, -4:]
str_col.dtypes
fooddf
Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.
maxprice = fruitdf["price_per_lb"].max()
minprice = fruitdf["price_per_lb"].min()
print "The most expensive fruit is %s.\n" %fruitdf.loc[fruitdf["price_per_lb"] == maxprice].food[1]
print "The least expensive fruit is %s.\n" %fruitdf.loc[fruitdf["price_per_lb"] == minprice].food[1]
import seaborn as sns
plot = sns.FacetGrid(fooddf, col = "type")
plot.map(sns.distplot, "price_per_lb")
plt.subplots_adjust(top=0.8)
plot.fig.suptitle('Price Distribution for Fruit and Vegetables')
plt.show()
print "The most common price_per_lb value for fruit is roughly between $0.25~$2 and the least common price_per_lb value is roughly between $4.5~$7.\nOn the other hand, the most common price_per_lb value for vegetables is between $0.5 and $1.5. In addition, the prices of the vegetables are more concentrated in a range compared to the prices of the fruit."
fruitdf["price_yield_ratio"] = fruitdf["yield"]/fruitdf["price_per_lb"]
highestratio = fruitdf["price_yield_ratio"].max()
print "The definition for best value for the price: The food that has the highest price yield ratio.\nThe fruit that has the best value for price is %s.\n" %fruitdf.loc[fruitdf["price_yield_ratio"] == highestratio].food[1]
vegedf["price_yield_ratio"] = vegedf["yield"]/vegedf["price_per_lb"]
highestratio = vegedf["price_yield_ratio"].max()
print "The vegetables that has the best value for price is %s.\n" %vegedf.loc[vegedf["price_yield_ratio"] == highestratio].food[1]
fig, ax = plt.subplots(1, 1)
def scatter(group):
plt.plot(group['price_per_lb'], group['price_per_cup'], 'o', label = group.name)
fig.suptitle('Correlation between prices measure in different units for Fruit and Vegetables', fontsize=15)
fooddf.groupby("type").apply(scatter)
ax.set(xlabel = "price_per_lb", ylabel = 'price_per_cup')
ax.legend(loc = 4)
# plot a regression line
x = fooddf['price_per_lb']
y = fooddf['price_per_cup']
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0] * x + fit[1], color='red')
plt.show()
print "I would expect there is a very strong positive correlation between price_per_cup and price_per_lb. However, the graph contradicts my prediction by showing many points scatter away from the line."
fooddf["price_yield_ratio"] = fooddf["yield"]/fooddf["price_per_lb"]
fooddf.boxplot("price_per_lb", "type")
plt.title("Boxplot of Price_per_lb for Fruit and Vegetables")
plt.suptitle("")
fooddf.boxplot("yield", "type")
plt.title("Boxplot of Yield for Fruit and Vegetables")
plt.suptitle("")
fooddf.boxplot("price_yield_ratio", "type")
plt.title("Boxplot of Price_yield_ratio for Fruit and Vegetables")
plt.suptitle("")
plt.show()
print "I use boxplots of price_per_lb and yield to compare fruit and vegetables.\nIn the Boxplot of Price_per_lb, both food have very similar median price. However, there are some outliers in the fruit category. In addition, the distibution is skewed right, which suggests that the most price observations are concentrated on the lower price scale.\nOn the other hand, the price distribution of vegetables is symmetric, which suggests that the prices are evenly split at the median price.\nThe median of the yield for both categories is also very close. The boxplot of yield for fruit is skewed left, which suggests that many fruits have higher yield rate. The one for vegetables is symmetric and therefore the yield rate is evenly split at the median yield. Therefore, the fruit should have higher return of value because most of them cost less and have higer yield, this has also proven on the boxplot of Price_yield_ratio since the boxplot for vegetables is more skewed right than fruits."
print "My hypothesis is that watermelon and potatoes have the best combination of price, yield, and nutrition."