How hard can it be? It should be easy with Python, right? Actually, it was not that trivial, so let's take a deeper look.
Why did I need to scrape 1000 news articles?
My goal was to create a neural network that can correctly classify the political ideology of a newspaper based on its news articles. As with most AI-related projects, the data plays a big role in the performance of the model. I wanted to use actual news articles, with the latest news, without annotating them manually. In this article, the focus is on the data-gathering part: finding online newspapers and scraping them.
How does one find news articles?
The task required me to gather not only news articles, but also data about their political ideology. A good place to start looking for this information is Wikipedia (or Wikidata, which contains the data from Wikipedia in a queryable form). Some newspapers have their political ideology recorded in Wikidata, so we can use the Wikidata Query Service to gather all newspapers that have a political ideology defined.
The Wikidata Query Service provides a SPARQL endpoint, which allows us to run queries on the Wikidata knowledge base. I won’t go into too much detail on knowledge bases and SPARQL, but in a nutshell, one can view a knowledge base as a big graph which connects objects to other objects via predicates (or properties).
We can use SPARQL to query the knowledge base and get all the information Wikidata has about newspapers. We will start with a simple query that returns all objects classified as newspapers.
[w.wiki/5W2i](w.wiki/5W2i "w.wiki/5W2i")
Some notes about the syntax: strings starting with `?` are variables, while `wdt:` and `wd:` are prefixes for the properties and values of objects. This query selects all objects that have the property `instance of` (P31) with the value `newspaper` (Q11032). As a result, we get 41300 newspapers.
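To make this concrete, here is a minimal sketch of how such a query can be run from Python against the public Wikidata SPARQL endpoint using `requests`. The article only links to the query itself, so the surrounding code, the user agent and the variable names are my own.

```python
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

# Select every item that is an "instance of" (P31) "newspaper" (Q11032).
newspaper_query = """
SELECT ?newspaper WHERE {
  ?newspaper wdt:P31 wd:Q11032 .
}
"""

response = requests.get(
    WIKIDATA_SPARQL,
    params={"query": newspaper_query, "format": "json"},
    headers={"User-Agent": "news-scraper-example/0.1"},  # Wikidata asks clients to identify themselves
)
response.raise_for_status()

newspapers = response.json()["results"]["bindings"]
print(len(newspapers))  # roughly 41300 at the time the article was written
```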
We can now expand this query to return all newspapers that have a link to a website. We now want all objects that not only satisfy the `instance of` property with the value `newspaper`, but also have the property `official website` with some value (saved in the variable `?link`).
[w.wiki/5W2k](w.wiki/5W2k "w.wiki/5W2k")
Now from 41300 newspapers, we go down to 7132 newspapers with links.
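For reference, the expanded query only adds one more triple pattern. The `official website` property is P856 in Wikidata (my identification; the article only names the property), and its value is bound to `?link`:

```python
# Same request as above, but now each newspaper must also have an
# "official website" (P856) value, which is bound to ?link.
newspapers_with_links_query = """
SELECT ?newspaper ?link WHERE {
  ?newspaper wdt:P31 wd:Q11032 .
  ?newspaper wdt:P856 ?link .
}
"""
```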
The last thing we need is the political ideology of those websites. The political ideology can be expressed by two properties: `political ideology` (P1142) and `political alignment` (P1387). We can create a separate query for each property and then take the union of both result sets.
[w.wiki/5W2u](w.wiki/5W2u "w.wiki/5W2u")
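A sketch of what that union looks like in SPARQL, run the same way as the first snippet; the label service line resolves `?ideology` into a human-readable `?ideologyLabel`:

```python
# Two group patterns combined with UNION: a newspaper qualifies if it has
# either "political ideology" (P1142) or "political alignment" (P1387).
ideology_query = """
SELECT ?newspaper ?link ?ideologyLabel WHERE {
  ?newspaper wdt:P31 wd:Q11032 .
  ?newspaper wdt:P856 ?link .
  { ?newspaper wdt:P1142 ?ideology . }
  UNION
  { ?newspaper wdt:P1387 ?ideology . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""
```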
The result of this query is 217 websites with a political ideology label. One issue with this data is that some labels have only one or two websites. This is a problem, because neural networks work best when the data is balanced (there is an equal number of examples for each label). To fix this, I took the top 10 labels and removed all other websites. This does reduce the number of websites to work with, but it guarantees that each label has a good amount of newspapers behind it. After this filtering, I was left with 83 websites. 17 of them were not reachable, so I removed them. On top of that, there were duplicates, and after removing those I was finally left with 44 news websites, each with a link and a political ideology label.
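The cleanup itself is straightforward. Here is a rough sketch of the idea, assuming the query results have been flattened into `(url, label)` pairs; the data structures and function names are my own, not taken from the original code.

```python
from collections import Counter

import requests


def keep_top_labels(sites, top_n=10):
    """Keep only sites whose ideology label is among the top_n most common ones."""
    counts = Counter(label for _, label in sites)
    top_labels = {label for label, _ in counts.most_common(top_n)}
    return [(url, label) for url, label in sites if label in top_labels]


def drop_duplicates_and_unreachable(sites, timeout=10):
    """Remove duplicate URLs and sites that do not answer a simple GET request."""
    seen, cleaned = set(), []
    for url, label in sites:
        if url in seen:
            continue
        seen.add(url)
        try:
            requests.get(url, timeout=timeout).raise_for_status()
        except requests.RequestException:
            continue
        cleaned.append((url, label))
    return cleaned
```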
But really, how do you scrape the articles?
Getting the links to the newspapers was not enough; I needed links to the actual news articles themselves. Of course, the home pages of the websites do contain links to a lot of news articles, but scraping them would require a separate scraper for each homepage (which will most likely be different for every news website).
Another way of approaching this problem is to use the forgotten [RSS](en.wikipedia.org/wiki/RSS "en.wikipedia.org/wiki/RSS") feed. For those of you who don't know (or remember) RSS feeds: an RSS feed is a simple XML file that contains the news articles (or other links) published by a given website. These feeds are updated periodically, which lets users follow the latest news as it appears.
This means that if a site provides an RSS feed, I can parse its contents and extract the links to the news articles. I had to check manually whether each website had an RSS feed: out of the 44 websites, 30 did. Most of them had an RSS button; for the others, the feed was accessible directly by adding "rss" to the URL.
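Extracting the article links from a feed takes only a few lines. The article does not say which parser it uses; `feedparser` is one common choice, so treat this as an illustrative sketch with a placeholder feed URL.

```python
import feedparser


def get_article_links(rss_url):
    """Return the article URLs advertised by an RSS feed."""
    feed = feedparser.parse(rss_url)
    return [entry.link for entry in feed.entries if "link" in entry]


# Example usage with a placeholder feed URL:
# links = get_article_links("https://example.com/rss")
```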
Actually parsing the articles
For scraping websites, the most widely used Python library is [BeautifulSoup](beautiful-soup-4.readthedocs.io/en/latest "beautiful-soup-4.readthedocs.io/en/latest"). However, getting the content of a page is not that simple. BeautifulSoup returns the DOM tree of the webpage and provides a simple interface for navigating its elements. This meant I had to check the structure of each website in order to know which HTML elements to take the content from. To my surprise, most of the websites had a similar structure: the main article content was kept inside an `<article>` tag, in the form of multiple `<p>` tags, one per paragraph. To parse these webpages, all I had to do was find the `<article>` tag and take the content of each `<p>` tag inside it.
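A minimal sketch of that `<article>`/`<p>` extraction with `requests` and BeautifulSoup; the function name and the timeout are my own choices.

```python
import requests
from bs4 import BeautifulSoup


def scrape_article_text(url):
    """Return the article body as plain text, or None if no <article> tag is found."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    article = soup.find("article")
    if article is None:
        return None

    paragraphs = article.find_all("p")
    return "\n".join(p.get_text(strip=True) for p in paragraphs)
```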
Out of the 30 websites, 19 had this `<article>`/`<p>` structure. Another 8 also had the content laid out in `<p>` tags, but this time inside a `<div>` tag with either a unique `class` or `id`. Parsing those pages required only a minimal change to the `<article>` parser: searching for `<div>` tags with a given `class` or `id` instead.
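The `<div>` variant only swaps out the container lookup; the actual `class` and `id` values differ per website, so they are left as parameters in this sketch.

```python
import requests
from bs4 import BeautifulSoup


def scrape_div_article_text(url, div_class=None, div_id=None):
    """Like scrape_article_text, but the <p> tags live inside a known <div>."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Build the attribute filter from whichever selector the site actually uses.
    attrs = {}
    if div_class:
        attrs["class"] = div_class
    if div_id:
        attrs["id"] = div_id

    container = soup.find("div", attrs=attrs)
    if container is None:
        return None

    return "\n".join(p.get_text(strip=True) for p in container.find_all("p"))
```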
The remaining 3 websites were a bit more difficult: they didn't have an easily parsable page structure, and it just wasn't worth the effort to build something custom for only 3 sites.
After the text was scraped, all that was left was to save the content to disk. I wanted each website to have its own directory, with the articles stored as plain-text files.
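A sketch of that on-disk layout, with a naming scheme of my own choosing: one directory per website, one numbered text file per article.

```python
from pathlib import Path


def save_article(base_dir, site_name, article_index, text):
    """Write one article to <base_dir>/<site_name>/<article_index>.txt."""
    site_dir = Path(base_dir) / site_name
    site_dir.mkdir(parents=True, exist_ok=True)
    (site_dir / f"{article_index}.txt").write_text(text, encoding="utf-8")
```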
Final results
Putting all the code together is simple: given a link to the RSS feed of a website, the code parses all the article links, scrapes their contents, and saves them in the correct directory.
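Stitched together with the hypothetical helpers sketched above, the per-site flow looks roughly like this (the real implementation is in the linked repository):

```python
def scrape_site(rss_url, site_name, base_dir="data"):
    """Scrape every article advertised in a site's RSS feed and save it to disk."""
    for index, link in enumerate(get_article_links(rss_url)):
        text = scrape_article_text(link)
        if text:
            save_article(base_dir, site_name, index, text)
```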
By using this code on the 23 websites, I managed to scrape 1006 articles. The code makes no assumptions about the language of the articles, which allows it to be used on any news website, regardless of language.
Of course, as with every project, this scraper could be improved. All the low-hanging fruit has been picked, and parsing the remaining suitable websites would require more effort in exploring and modelling their page structures.
At the end of the day, the 1006 articles were successfully used to train the model, which reached around 90% accuracy. The full code can be found on my GitHub: link [here](github.com/lyubolp/NewsArticleProfiling/blo.. "github.com/lyubolp/NewsArticleProfiling/blo..").