In this post we’re going to discuss how to scrape news articles with Python. This can be done using the handy newspaper package.
Introduction to Python’s newspaper package
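The newspaper package can be installed using pip. On Python 3, it is published on PyPI under the name newspaper3k (the import name is still newspaper):
pip install newspaper3k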
Once it’s installed, we can get started. newspaper can work either by scraping a single article from a given URL, or by finding the links on a webpage to other news articles. Let’s start with handling a single article. First, we import the Article class. Next, we use this class to download the content at our news article’s URL. Then, we use the parse method to parse the HTML. Lastly, we can print out the text of the article using .text.
Scraping a single article
from newspaper import Article
url = "https://www.bloomberg.com/news/articles/2020-08-01/apple-buys-startup-to-turn-iphones-into-payment-terminals?srnd=premium"
# download and parse article
article = Article(url)
article.download()
article.parse()
# print article text
print(article.text)
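A parsed article carries more than its body text. The sketch below gathers a few commonly used fields into a plain dictionary. It assumes download() and parse() have already been called; article_to_record is a hypothetical helper name, and the stand-in object at the bottom uses made-up data so the snippet runs without a network connection. Any of these fields may come back empty if extraction fails for a given page.

```python
from types import SimpleNamespace


def article_to_record(article):
    """Collect common fields from a parsed article into a plain dict.

    `article` is expected to be a newspaper Article on which download()
    and parse() have already been called. Any object exposing the same
    attributes works, which keeps the helper easy to try offline.
    """
    return {
        "title": article.title,
        "authors": list(article.authors),
        "publish_date": article.publish_date,  # a datetime, or None if not found
        "text": article.text,
    }


# Offline stand-in for a parsed Article (illustrative data, not real output)
fake_article = SimpleNamespace(
    title="Example headline",
    authors=["A. Writer"],
    publish_date=None,
    text="Body of the article.",
)
record = article_to_record(fake_article)
print(record["title"])
```

With a real article, you would call article_to_record(article) right after article.parse().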
It’s also possible to get other information about the article, such as links to images or videos embedded in the post.
# get list of image links
article.images
# get list of videos – empty in this case
article.movies
Downloading all the articles linked on a webpage
Now, let’s look at how we can get all the news articles linked on a webpage. We’ll do that using the newspaper.build function, like below. Then, we can extract the article URLs using the article_urls method.
import newspaper
site = newspaper.build("https://news.ycombinator.com/")
# get list of article URLs
site.article_urls()
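An aggregator page such as Hacker News links out to many different sites, so it can help to see which domains the URLs come from. Here is a minimal sketch using only the standard library. group_by_domain is a hypothetical helper, and the sample URLs are made up to stand in for the list returned by site.article_urls().

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain(urls):
    """Map each domain (the URL's netloc) to the URLs found under it."""
    grouped = defaultdict(list)
    for url in urls:
        domain = urlparse(url).netloc.lower()
        grouped[domain].append(url)
    return dict(grouped)


# Made-up URLs standing in for the output of site.article_urls()
sample_urls = [
    "https://example.com/news/a",
    "https://example.com/news/b",
    "https://blog.example.org/post/1",
]
by_domain = group_by_domain(sample_urls)
for domain, links in by_domain.items():
    print(domain, len(links))
```

In practice you would pass site.article_urls() in place of sample_urls.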
Using our object above, we can also get the contents of each of those articles. Here, all of the article objects are stored in the list, site.articles. For example, let’s get the first article’s contents.
site_article = site.articles[0]
site_article.download()
site_article.parse()
print(site_article.text)
Now, let’s modify our code to get the top ten articles:
top_articles = []
for index in range(10):
    article = site.articles[index]
    article.download()
    article.parse()
    top_articles.append(article)
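The loop above assumes the site has at least ten articles and that every download succeeds. A more defensive sketch follows. download_top is a hypothetical helper, and _FakeArticle is a stand-in class so the snippet can run offline. The broad except is deliberate for illustration; as far as I know, newspaper raises its own ArticleException from parse() when the download did not succeed, and you may prefer to catch that specifically.

```python
def download_top(articles, limit=10):
    """Download and parse up to `limit` articles, skipping any that fail.

    Slicing avoids an IndexError when fewer than `limit` articles exist.
    Returns (parsed_articles, failures), where failures is a list of
    (url, error message) pairs.
    """
    parsed, failed = [], []
    for article in articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose in this sketch
            failed.append((article.url, str(exc)))
            continue
        parsed.append(article)
    return parsed, failed


class _FakeArticle:
    """Offline stand-in that mimics the download()/parse() interface."""

    def __init__(self, url, ok=True):
        self.url = url
        self.ok = ok

    def download(self):
        pass

    def parse(self):
        if not self.ok:
            raise RuntimeError("download failed")


fakes = [_FakeArticle("https://example.com/a"),
         _FakeArticle("https://example.com/b", ok=False)]
parsed, failed = download_top(fakes, limit=10)
print(len(parsed), len(failed))
```

With a real source, the call would be download_top(site.articles, limit=10).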
Now, we can look at the text of any of these articles.
print(top_articles[0].text)
print(top_articles[3].text)
Warning!
One important note when using newspaper: by default, newspaper.build caches the article URLs it has already seen for a given site and leaves them out of later builds. For example, in the code below, we run newspaper.build two consecutive times and get different results. The second time we run it, the code returns only the newly added links.
site = newspaper.build("https://news.ycombinator.com/")
print(len(site.articles))
site = newspaper.build("https://news.ycombinator.com/")
print(len(site.articles))
This can be adjusted by adding an extra parameter to our function call, like below:
site = newspaper.build("https://news.ycombinator.com/", memoize_articles=False)
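If you turn memoization off, you may still want to skip links you have already processed. The sketch below shows the general idea with a plain set. It illustrates the concept only and is not how newspaper implements its cache internally; only_new is a hypothetical helper and the URLs are made up.

```python
def only_new(urls, seen):
    """Return the URLs not yet in `seen`, in order, and record them in `seen`."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


seen = set()
# First run: everything is new
first = only_new(["https://example.com/a", "https://example.com/b"], seen)
# Second run: only the link that was not seen before is returned
second = only_new(["https://example.com/b", "https://example.com/c"], seen)
print(first)
print(second)
```

To keep the set across separate runs of a script, you would need to save it to disk yourself, for example as a text file of URLs.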
How to get article summaries
The newspaper package also supports some NLP functionality. You can check this out by calling the nlp method.
article = top_articles[3]
article.nlp()
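Now, let’s look at the summary attribute, which nlp fills in with an automatically generated summary of the article. Note that it is an attribute rather than a method, so there are no parentheses.
article.summary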
You can also get a list of keywords from the article.
article.keywords
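The two NLP results can be collected in one step. This is a minimal sketch: summarize is a hypothetical helper, and _FakeParsedArticle is a stand-in with made-up values so the snippet runs offline. With a real article, nlp() must be called after download() and parse(). To the best of my knowledge it also depends on NLTK's punkt tokenizer data, so a LookupError on the first call usually means that data still needs to be downloaded.

```python
def summarize(article):
    """Run nlp() on a parsed article and return its summary and keywords."""
    article.nlp()  # fills in article.summary and article.keywords
    return {
        "summary": article.summary,
        "keywords": list(article.keywords),
    }


class _FakeParsedArticle:
    """Offline stand-in that mimics the nlp()/summary/keywords interface."""

    def __init__(self):
        self.summary = ""
        self.keywords = []

    def nlp(self):
        # Made-up values; a real Article computes these from its text
        self.summary = "Short summary."
        self.keywords = ["python", "scraping"]


result = summarize(_FakeParsedArticle())
print(result["summary"])
print(result["keywords"])
```

With the articles scraped earlier, the call would be summarize(top_articles[3]).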
How to get top trending Google keywords
newspaper has a couple of other cool features. For example, we can use it to easily pull the top trending searches on Google using the hot method.
newspaper.hot()
The package can also return a list of popular URLs, like below.
newspaper.popular_urls()
Conclusion
That’s all for now. In this post, we learned how to scrape news articles with Python. If you want to learn more about web scraping, check out my web scraping fundamentals course, co-created with 365 Data Science and available on Udemy, as well as their full program of courses (which includes mine).
Visit TheAutomatic.net Blog to download additional code: http://theautomatic.net/2020/08/05/how-to-scrape-news-articles-with-python/.
This material is from TheAutomatic.net and is being posted with its permission.