Assignment 5

In this assignment, you'll scrape text from The California Aggie and then analyze the text.

The Aggie is organized by category into article lists. For example, there's a Campus News list, Arts & Culture list, and Sports list. Notice that each list has multiple pages, with a maximum of 15 articles per page.

The goal of exercises 1.1 - 1.3 is to scrape articles from the Aggie for analysis in exercise 1.4.

Exercise 1.1. Write a function that extracts all of the links to articles in an Aggie article list. The function should:

  • Have a parameter url for the URL of the article list.

  • Have a parameter page for the number of pages to fetch links from. The default should be 1.

  • Return a list of aricle URLs (each URL should be a string).

Test your function on 2-3 different categories to make sure it works.

Hints:

  • Be polite to The Aggie and save time by setting up requests_cache before you write your function.

  • Start by getting your function to work for just 1 page. Once that works, have your function call itself to get additional pages.

  • You can use lxml.html or BeautifulSoup to scrape HTML. Choose one and use it throughout the entire assignment.

In [1]:
import requests
import requests_cache

requests_cache.install_cache('demo_cache')

from bs4 import BeautifulSoup

import pandas as pd
In [2]:
def extract_link(url, num_page):
    """
    Extracts all of the links to articles in an Aggie article list.
    
    Argument: url, num_page
    
    Return url_page link
    """ 
    
    url_page = url + 'page/' + str(num_page) + '/'
    news = requests.get(url_page)
    news_soup = BeautifulSoup(news.content, "lxml")
    links = news_soup.findAll("a", {"class": "more-link"})
    url_list = [link.get('href') for link in links]
    return url_list
In [3]:
url = "https://theaggie.org/campus/"
num_page = 6
extract_link(url, num_page)
Out[3]:
['https://theaggie.org/2016/11/22/the-life-of-former-chancellor-linda-p-b-katehi-post-resignation/',
 'https://theaggie.org/2016/11/21/uc-davis-releases-2015-2016-annual-campus-travel-survey-results/',
 'https://theaggie.org/2016/11/21/plant-and-animal-sciences-at-uc-davis-rank-number-one-in-the-world/',
 'https://theaggie.org/2016/11/20/achieve-uc-program-encourages-students-to-apply-to-ucs/',
 'https://theaggie.org/2016/11/20/uc-transfer-application-deadline-extended/',
 'https://theaggie.org/2016/11/18/anti-diversity-posters-discovered-on-campus/',
 'https://theaggie.org/2016/11/17/s-p-e-a-k-community-bands-together-on-quad-to-protest-trump-presidency/',
 'https://theaggie.org/2016/11/17/university-of-california-among-largest-source-of-donations-to-clinton/',
 'https://theaggie.org/2016/11/17/this-week-in-senate-31/',
 'https://theaggie.org/2016/11/16/matthew-mcfadden-confirmed-as-new-interim-senator/',
 'https://theaggie.org/2016/11/15/library-hosts-workshops-to-shape-future-space/',
 'https://theaggie.org/2016/11/15/interim-provost-ken-burtis-announces-upcoming-renovations-for-chemistry-complex/',
 'https://theaggie.org/2016/11/15/asucd-senate-calls-for-more-transparency-in-chancellor-search/',
 'https://theaggie.org/2016/11/15/seven-out-of-10-uc-employees-face-food-insecurity-low-wages/',
 'https://theaggie.org/2016/11/14/uc-davis-faces-inspections-negligence-allegations-by-usda/']

Exercise 1.2. Write a function that extracts the title, text, and author of an Aggie article. The function should:

  • Have a parameter url for the URL of the article.

  • For the author, extract the "Written By" line that appears at the end of most articles. You don't have to extract the author's name from this line.

  • Return a dictionary with keys "url", "title", "text", and "author". The values for these should be the article url, title, text, and author, respectively.

For example, for this article your function should return something similar to this:

{
    'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
    'text': u'Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond." Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand. The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments "This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?" Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto. "It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city." Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation." ',
    'title': 'Project Toto aims to address questions regarding city finances',
    'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'
}

Hints:

  • The author line is always the last line of the last paragraph.

  • Python 2 displays some Unicode characters as \uXXXX. For instance, \u201c is a left-facing quotation mark. You can convert most of these to ASCII characters with the method call (on a string)

    .translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })

    If you're curious about these characters, you can look them up on this page, or read more about what Unicode is.

In [4]:
def extract(url):
    """
    extract information from the url
    
    Argument: url
    
    Return: a dictionary with necessary information
    """
    article = requests.get(url)
    news_soup = BeautifulSoup(article.content)
    contentnews = news_soup.find("div", attrs = {"itemprop": "articleBody"})
    content = contentnews.find_all("p")
    newscontent = "\n".join([con.text for con in content])
    news_split = newscontent.split("\n")
    author = news_split[-1]
    news = news_split[0:-1]
    news = "".join(news)
    news = news.translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })
    tit = news_soup.findAll("h1", {"class": "entry-title"})
    title = "".join([t.text for t in tit])
    article_dict = {"author": author, "text": news, "title": title, "url": url}
    return article_dict
In [5]:
url = "https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/"
extract(url)
/Users/Chloechen/anaconda/lib/python2.7/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 174 of the file /Users/Chloechen/anaconda/lib/python2.7/runpy.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
Out[5]:
{'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
 'text': u'Davis residents create financial model to make city\'s financial state more transparentTo increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond."Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand.The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments"This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?"Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto."It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city."Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation."',
 'title': u'Project Toto aims to address questions regarding city finances',
 'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'}

Exercise 1.3. Use your functions from exercises 1.1 and 1.2 to get a data frame of 60 Campus News articles and a data frame of 60 City News articles. Add a column to each that indicates the category, then combine them into one big data frame.

The "text" column of this data frame will be your corpus for natural language processing in exercise 1.4.

In [6]:
# get the first 4 pages to get the 60 articles
page = [i for i in range(1, 5, 1)]
In [7]:
# extract all 60 articles from the campus news page
article_campus = [extract_link("https://theaggie.org/campus/", num_page) for num_page in page]
url_output = [[extract(url) for url in link] for link in article_campus]
pd_campus = [pd.DataFrame(output) for output in url_output]
df_campus = pd.concat(pd_campus)
df_campus["category"] = "campus news"
df_campus.index = range(60)
In [8]:
# extract all 60 articles from the city news page
article_campus = [extract_link("https://theaggie.org/city/", num_page) for num_page in page]
url_output = [[extract(url) for url in link] for link in article_campus]
pd_city = [pd.DataFrame(output) for output in url_output]
df_city = pd.concat(pd_city)
df_city["category"] = "city news"
df_city.index = range(60)

# combine two dataframes
df = pd.concat([df_campus, df_city], ignore_index = True)
In [9]:
df.head()
Out[9]:
author text title url category
0 Written by: Kaitlyn Cheung — campus@theaggie.org Student protesters march from MU flagpole to M... UC Davis students participate in UC-wide #NoDA... https://theaggie.org/2017/02/17/uc-davis-stude... campus news
1 Written by: Jayashri Padmanabhan — campus@thea... Conference entails full day of speakers, panel... UC Davis holds first mental health conference https://theaggie.org/2017/02/17/uc-davis-holds... campus news
2 Written by: Demi Caceres — campus@theaggie.org Last week in SenateThe ASUCD Senate meeting wa... Last week in Senate https://theaggie.org/2017/02/16/last-week-in-s... campus news
3 Written by: Alyssa Vandenberg and Emilie DeFaz... Executive: Josh Dalavai and Adilla JamaludinIn... 2017 ASUCD Winter Elections — Meet the Candidates https://theaggie.org/2017/02/16/2017-asucd-win... campus news
4 Written by: Ivan Valenzuela — campus@theaggie.org New showcase provides opportunity for students... Shields Library hosts new exhibit for Davis ce... https://theaggie.org/2017/02/14/shields-librar... campus news
In [10]:
df.tail()
Out[10]:
author text title url category
115 Written By: Bianca Antunez — city@theaggie.org Election results are in; Davis community conce... Nov. 8 2016: An Election Day many may never fo... https://theaggie.org/2016/11/17/nov-8-2016-an-... city news
116 “Male on a motorized bike pulling a trailer, e... More turkeys, more tomfoolery, more accidental... Police Logs https://theaggie.org/2016/11/15/police-logs-3/ city news
117 Written By: Bianca Antunez – city@theaggie.org Participants line up for Thanksgiving 5k befor... Yolo Food Bank’s eighth Annual Running of the ... https://theaggie.org/2016/11/15/yolo-food-bank... city news
118 Written by: Raul Castellanos — city@theaggie.org Bernie Sanders visits Sacramento to rally for ... Return of the Bern https://theaggie.org/2016/11/15/return-of-the-... city news
119 Written by: Alana Joldersma –– city@theaggie.org Indoor facility will provide a cafeteria, clas... Construction of the All Student Center at Davi... https://theaggie.org/2016/11/14/construction-o... city news

Exercise 1.4. Use the Aggie corpus to answer the following questions. Use plots to support your analysis.

  • What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?

  • What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?

  • Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.

Hints:

  • The nltk book and scikit-learn documentation may be helpful here.

  • You can determine whether city articles are "near" campus articles from the similarity matrix or with k-nearest neighbors.

  • If you want, you can use the wordcloud package to plot a word cloud. To install the package, run

    conda install -c https://conda.anaconda.org/amueller wordcloud

    in a terminal. Word clouds look nice and are easy to read, but are less precise than bar plots.

In [11]:
import nltk
import string
import random
import itertools
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.probability import FreqDist
%matplotlib inline
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
In [12]:
# Use sources from the Youtube Channel

def nlp_test(text):
    """
    use natural language processing to remove and lemmatize text
    
    Argument: text
    
    Return: stemed text
    """
    
    stop_words = set(stopwords.words("english"))
    common = ["said", "also", "one", "would", "like", "new", "many", "make", "really", "think"]
    common_words = [x.decode('UTF8') for x in common]
    words = nltk.word_tokenize(text)
    # remove stop words and common words
    words = [wor.lower() for wor in words]
    filtered = [wor for wor in words if not wor in stop_words]
    filtered = [wor for wor in filtered if not wor in common_words]
    # word lemmatization for English words only
    lmtzr = WordNetLemmatizer()
    lem = [lmtzr.lemmatize(fil) for fil in filtered if fil.isalpha()]
    return lem
In [13]:
def graph_wordCloud(text):
    wordcloud = WordCloud(background_color='white', width=1200, height=1000, max_words=40).generate(text)
    return plt.imshow(wordcloud)
In [14]:
def topic_df(dataframe):
    # apply natural language processing to each row of the article dataframe
    manipulate_text = dataframe.apply(lambda row: nlp_test(row["text"]), axis=1)
    all_words = [w for w in manipulate_text]
    merged_all_words = list(itertools.chain.from_iterable(all_words))
    # Text Classification
    fd = nltk.FreqDist(merged_all_words)
    output = fd.most_common(25)
    pd_article = pd.DataFrame(output)
    pd_article.rename(columns = {list(pd_article)[0]: 'Topics', list(pd_article)[1]: 'Counts'}, inplace = True)
    return pd_article
In [15]:
def topic_graph(dataframe):
    # apply natural language processing to each row of the article dataframe
    manipulate_text = dataframe.apply(lambda row: nlp_test(row["text"]), axis=1)
    all_words = [w for w in manipulate_text]
    merged_all_words = list(itertools.chain.from_iterable(all_words))
    # Text Classification
    fd = nltk.FreqDist(merged_all_words)
    # Graph the text distribution
    fd.plot(40,cumulative=False)
    text_final = ' '.join(merged_all_words)
    graph_wordCloud(text_final)
In [16]:
def bargraph(df_topic):
    ax = df_topic.plot.bar(x='Topics', rot=0, title='Most Popular Topics', figsize=(15,10), fontsize=9)
    ax.set_ylabel("Counts")
    ax.set_xlabel("Topics")
    patches, labels = ax.get_legend_handles_labels()
    ax.legend(patches, labels, loc='best')

    for p in ax.patches:
        ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))

Below are the topics that the Aggie covers the most for the dataframe that contains city and campus articles

Here are the top 25 popular topics:

In [17]:
topic_graph(df)
df_topic = topic_df(df)
bargraph(df_topic)

Below are the topics that the Aggie covers the most for the dataframe that contains campus articles

In [18]:
topic_graph(df_campus)
campus_topic = topic_df(df_campus)
bargraph(campus_topic)

Below are the topics that the Aggie covers the most for the dataframe that contains city articles

In [19]:
topic_graph(df_city)
city_topic = topic_df(df_city)
bargraph(city_topic)

As we can see from the analysis above, the topics that Aggie covers the most are davis, student, uc, campus and people. If we separate the news based on campus and city, the campus news cover most about student, uc, davis, campus and university. The city news cover most about davis, city, community, people and year. Therefore, city articles typically cover different topics than campus articles.

Top 3 pairs of most similar articles Analysis

The rows and columns in the similarity matrix corresponds to the articles. Therefore, the top three entries that have the largest values/similarity scores correspond to the top 3 pairs of most similar articles.

In [20]:
vectorizer = TfidfVectorizer(tokenizer=nlp_test, stop_words='english',smooth_idf = True, norm = None)
tfs = vectorizer.fit_transform(df['text'])
sim_matrix = tfs.dot(tfs.T)
In [21]:
# convert a sparse matrix to a (2-d) NumPy matrix
array_text = sim_matrix.toarray()
array_text
Out[21]:
array([[  5941.05984939,    826.07177057,    233.42792606, ...,
           159.8441979 ,    334.55155792,    470.35232621],
       [   826.07177057,  26344.41501073,    305.69242507, ...,
           468.31942796,    395.87549826,   1351.39120406],
       [   233.42792606,    305.69242507,   5198.03530186, ...,
            41.25825987,    133.02183059,     85.59322572],
       ..., 
       [   159.8441979 ,    468.31942796,     41.25825987, ...,
          9228.89652249,    152.79087355,    226.5930069 ],
       [   334.55155792,    395.87549826,    133.02183059, ...,
           152.79087355,   6391.19270668,    164.44438809],
       [   470.35232621,   1351.39120406,     85.59322572, ...,
           226.5930069 ,    164.44438809,   4389.79752964]])
In [22]:
# convert the array into a list
array_list = array_text.tolist()
a_list = list(itertools.chain.from_iterable(array_list))
In [23]:
# return 3 maximum numbers in the list
top = sorted(a_list, reverse = True)[:3]
In [24]:
def return_title(df):
    """
    print out the top three similar articles for each dataframe
    
    Argument: dataframe
    
    Return: print out the title name
    """
    for num in top:
        var = np.nonzero(sim_matrix == num)
        print df["title"][var[1][0]]

Top 3 pairs of most similar articles

In [25]:
return_title(df_campus)
2017 ASUCD Winter Elections — Meet the Candidates
Russell Boulevard intramural fields withdrawn from 2017-2027 Long Range Development Plan
UC Davis study abroad secures $22,000 in grants from French government
In [26]:
return_title(df_city)
Davis stands with Muslim residents
Police Logs
Yolo County Library materials to be more widely available
In [27]:
import pandas as pd

def combine_text(df_campus, df_city, num):
    """
    combine the content from the campus news and the city news
    
    Argument: df_campus, df_city, num
    
    Return: the combined text
    """
    var = np.nonzero(sim_matrix == top[num])
    campus_text = df_campus["text"][var[1][0]]
    city_text = df_city["text"][var[1][0]]
    total_text = campus_text + city_text
    return total_text
In [28]:
text_output = [combine_text(df_campus, df_city, num) for num in range(0, len(top))]

Below are the graphs that illustrate the common words that are shared in the top 3 pairs of most similar articles Analysis

The bigger the fonts are, the more they appear in both articles

In [32]:
graph_wordCloud(text_output[0]) 
Out[32]:
<matplotlib.image.AxesImage at 0x119060750>

Common words are Student, Campus, Davis, year, muslim, community

In [34]:
graph_wordCloud(text_output[1]) 
Out[34]:
<matplotlib.image.AxesImage at 0x127320c50>

Common words are Davis, housing, student, field

In [35]:
graph_wordCloud(text_output[2]) 
Out[35]:
<matplotlib.image.AxesImage at 0x1273da0d0>

Common words are student, UC, Abroad, Davis

Analysis

Yes. I think this corpus is a good representative of the Aggie because the Aggie is the newspaper that serves the UC Davis campus and community. Therefore, the topics that the authors cover should be mostly related to the Davis campus and community, which is consistent with my analysis above. We can infer that the content that the Aggie covers is mostly related to the current news. However, the focus on the campus news is different from the city news.