
November 01, 2022

Scraping Tech Blogs with Python

Web scraping is a great way to go through millions of technical blogs and create a database full of technical content.

Lakindu Hewawasam, EDITOR OF ENLEAR ACADEMY

An ideal technical blog consists of content its readers genuinely want to read. But finding a good topic is never easy: it requires a lot of searching and reading across many resources. Even if you come up with great ideas, there's no way of knowing how your readers will react, or whether your concept suits your target audience.

However, finding a topic for an article doesn't always have to be this difficult. You can use techniques like scraping to go through millions of technical blogs (in a few minutes) and create a database full of technical content. Later, you can use this data for predictive topic generation, or even compile analytical reports on what sort of content performs well over specific periods.

The uses of scraped blog data are almost limitless. Blog scrapers help you plan content quickly and efficiently, making it possible to come up with intuitive, exciting, and appealing topics in seconds. So, in this article, I will show you how to build your own blog scraper in five steps.

Getting Started With Scraping

Building a blog scraper is not technically challenging, and Python is a good fit because it offers third-party libraries that help parse DOM elements and store the scraped data. Therefore, this article will focus on building a simple technical blog scraper using Python.

Step 01 - Creating a Virtual Environment

Since the scraper relies on third-party dependencies, it's best to install them inside a virtual environment. Create one by executing the command below.

python3 -m venv venv

After executing the command, a new directory titled venv will be created in the project directory. Next, activate the virtual environment using the command shown below.

source venv/bin/activate

After executing the command, the environment's name will appear in your terminal prompt, indicating that the virtual environment has been activated successfully.

Figure: Activating the virtual environment

Step 02 - Installing the Required Libraries

After creating and activating your virtual environment, you need to install two third-party libraries. These libraries will help scrape data on web pages.

  1. requests: The requests library will be used to perform HTTP requests.

  2. beautifulsoup4: The Beautiful Soup library will be used to parse HTML and scrape information from web pages.

To install the two libraries, run the two commands displayed below.

python -m pip install requests
python -m pip install beautifulsoup4

After installing, you will see the output shown below.

Figure: Installing the required libraries

Step 03 - Analyzing the Blog to Scrape

You can now start on your scraping script. For demonstration purposes, this guide will show you how to implement a script that will scrape a Medium Publication.

First, it's essential to establish a reusable URL that can be used to scrape any Medium publication. Thankfully, Medium exposes an archive URL for every publication, which fetches a list of all the articles the publication has published since it was created. The generic URL for archived content is shown below.

https://medium.com/{{publicationHandle}}/archive/{{yyyy}}/{{mm}}/{{dd}}
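If you want to construct this URL programmatically for different publications and dates, a small helper like the hypothetical one below can format it; the function name and parameters are illustrative, not part of Medium's API.

# Illustrative helper: build a Medium archive URL for a publication handle
# and an optional year/month/day. The name and signature are assumptions.
def build_archive_url(publication_handle, year=None, month=None, day=None):
    url = f'https://medium.com/{publication_handle}/archive'
    for part in (year, month, day):
        if part is None:
            break
        url += f'/{int(part):02d}'
    return url

# build_archive_url('enlear-academy', 2022, 1)
# -> 'https://medium.com/enlear-academy/archive/2022/01'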

For example, you can view the archived content of Enlear Academy using the URL https://medium.com/enlear-academy/archive. It will display the output shown below.

Figure: Viewing the archived content of Enlear Academy

This blog scraper will use the generic archive URL to fetch a list of all the technical content published in the blog, and then collect attributes such as:

  1. Article Name

  2. Article Subtitle

  3. Claps

  4. Reading Time

You can extract the above information by inspecting the HTML for the Medium Content Card, as shown below.

Figure: Identifying CSS classes and HTML Elements to target

All Medium content cards are wrapped with a div containing CSS classes - streamItem streamItem--postPreview js-streamItem. Therefore, you can use Beautiful Soup to get a list of all div elements having the specified classes and extract the list of articles on the archive page.
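As a quick aside, the same lookup can also be expressed as a CSS selector through Beautiful Soup's select() method; the snippet below is an equivalent alternative to the find_all() query used in the next step, run against the same archive page used throughout this article.

from bs4 import BeautifulSoup
import requests

# fetch the archive page and select the story cards with a CSS selector
# (equivalent to matching the three classes with find_all)
response = requests.get('https://enlear.academy/archive/2022/01')
parsed_html = BeautifulSoup(response.text, 'html.parser')
stories = parsed_html.select('div.streamItem.streamItem--postPreview.js-streamItem')
print(len(stories))  # number of story cards found on the page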

Step 04 - Implementing the Scraper

Create a file titled scraper.py, where the code for the scraper will live. Initially, you must add the three imports shown below.

from bs4 import BeautifulSoup  # import BeautifulSoup
import requests  # import requests
import json  # import json for storing data

The requests library will be used to perform a GET request to the archives of the Medium publication.

# create request to the archive page
blog_archive_url = 'https://enlear.academy/archive/2022/01'
response = requests.get(blog_archive_url)

Then, the text returned by the response must be parsed into HTML using the HTML Parser of Beautiful Soup, as shown below:

# parse the response using the HTML parser of BeautifulSoup
parsedHtml = BeautifulSoup(response.text, 'html.parser')

Then, the stories can be queried by performing a DOM operation to fetch all the div elements that have the class list streamItem streamItem--postPreview js-streamItem.

# get a list of all divs having the classes "streamItem streamItem--postPreview js-streamItem" to get each story
stories = parsedHtml.find_all('div', class_='streamItem streamItem--postPreview js-streamItem')

Afterwards, we can iterate over each story in the stories array and obtain critical meta information such as article title and subtitle, number of claps, reading time, and URL.

formatted_stories = []

for story in stories:
    # Get the title of the story
    story_title = story.find('h3').text if story.find('h3') else 'N/A'

    # Get the subtitle of the story
    story_subtitle = story.find('h4').text if story.find('h4') else 'N/A'

    # Get the number of claps
    clap_button = story.find('button', class_='button button--chromeless u-baseColor--buttonNormal js-multirecommendCountButton u-disablePointerEvents')
    claps = 0
    if clap_button:
        # If the clap button has a DOM reference, obtain its text
        claps = clap_button.text

    # Get a reference to the card header containing author info
    author_header = story.find('div', class_='postMetaInline u-floatLeft u-sm-maxWidthFullWidth')

    # Access the reading time span element and get its title attribute
    reading_time = author_header.find('span', class_='readingTime')['title']

    # Get the "read more" reference
    read_more_ref = story.find('a', class_='button button--smaller button--chromeless u-baseColor--buttonNormal')
    url = read_more_ref['href'] if read_more_ref else 'N/A'

    # Add an object to formatted_stories
    formatted_stories.append({
        'title': story_title,
        'subtitle': story_subtitle,
        'claps': claps,
        'reading_time': reading_time,
        'url': url
    })

The above scraping script iterates over each story and performs 5 tasks:

  1. It obtains the article title by using the H3 element in the card.

  2. It obtains the article subtitle by using the H4 element in the card.

  3. It obtains the number of claps by using the clap button on the card. The script executes a query to find a button with the class list - button button--chromeless u-baseColor--buttonNormal js-multirecommendCountButton u-disablePointerEvents and then uses its text attribute to get a count of total claps for the article (a small helper for turning this display text into a number is sketched after this list).

  4. It obtains the reading time by accessing the card header. The card header can be accessed by performing a query to find the div element having the CSS class list - postMetaInline u-floatLeft u-sm-maxWidthFullWidth. Then, a subsequent query is done to find a span element with the class readingTime to obtain the reading time for the article.

  5. Finally, the script obtains the article URL by accessing the Read More section located on each card. The script uses the href attribute on the result and searches for the button elements with the class list: button button--smaller button--chromeless u-baseColor--buttonNormal.
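On the clap count: Medium displays it as a string such as "1.2K" rather than a plain integer, so the claps value collected by the script is text. If you want a numeric value for later analysis, a small hypothetical helper like the one below can normalize it; the suffix handling ("K" for thousands, "M" for millions) is an assumption about how Medium abbreviates counts.

# Hypothetical helper: convert a Medium clap-count string such as "1.2K"
# into an integer. Assumes "K" abbreviates thousands and "M" millions;
# adjust if the site formats counts differently.
def parse_clap_count(text):
    text = str(text).strip()
    if not text:
        return 0
    multiplier = 1
    if text.endswith('K'):
        multiplier, text = 1000, text[:-1]
    elif text.endswith('M'):
        multiplier, text = 1000000, text[:-1]
    try:
        return int(float(text) * multiplier)
    except ValueError:
        return 0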

After the DOM queries have obtained all the elements, the data is structured into a dictionary and pushed into the array named formatted_stories.

Finally, the array is serialized with json.dumps and written to a JSON file, as shown below.

with open('stories.json', 'w') as file:
    file.write(json.dumps(formatted_stories))

Step 05 - Viewing the Script in Action

After executing the script written in step 4, the following output is generated.

Figure: Viewing the scraped blog data

That's it. You have successfully implemented a blog scraper to scrape technical blogs on Medium using a simple Python script. In addition, you can improve the code and extract more data by targeting additional DOM elements and CSS classes.

Finally, you can push this data into a data lake on AWS and create analytical dashboards that identify the most (or least) preferred content based on clap count, helping you plan your next article.
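For example, here is a minimal sketch of uploading the generated stories.json to an S3 bucket with boto3; the bucket name and object key are placeholders, and it assumes boto3 is installed and AWS credentials are configured.

# Minimal sketch: upload the scraped data to S3 so it can feed a data lake.
# The bucket name and key below are placeholders, not real resources.
import boto3

s3 = boto3.client('s3')
s3.upload_file('stories.json', 'your-data-lake-bucket', 'medium/enlear-academy/2022-01/stories.json')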

Drawbacks of Using Blog Scrapers

Although blog scrapers help gather a vast amount of data for topic planning, there are two main drawbacks, since the scraper depends heavily on the user interface.

  1. You cannot get more data

    You only have access to the data available in the user interface (what you see is what you get).

  2. Changes to the UI

    If the site you are scraping makes significant UI changes, such as changing the CSS classes or the HTML elements, the code you implement for scraping will probably break as the elements cannot be identified anymore.

Therefore, it's essential to be mindful of these drawbacks when implementing a blog scraper.
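One way to soften the second drawback is to keep every CSS class list the scraper depends on in a single place, so a UI change only requires updating a few constants rather than every query. A minimal sketch, using the class lists from earlier in this article, could look like this:

# Keep the class lists the scraper relies on in one place, so a Medium UI
# change only requires updating these constants rather than every query.
STORY_CARD_CLASSES = 'streamItem streamItem--postPreview js-streamItem'
CLAP_BUTTON_CLASSES = ('button button--chromeless u-baseColor--buttonNormal '
                       'js-multirecommendCountButton u-disablePointerEvents')
AUTHOR_HEADER_CLASSES = 'postMetaInline u-floatLeft u-sm-maxWidthFullWidth'
READ_MORE_CLASSES = 'button button--smaller button--chromeless u-baseColor--buttonNormal'

def find_story_cards(parsed_html):
    # return the list of story card divs on a parsed archive page
    return parsed_html.find_all('div', class_=STORY_CARD_CLASSES)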

Conclusion

Blog scrapers significantly improve content creation and content planning. They help content managers plan content across several iterations with minimal effort and time. The code implemented in this article is available in my GitHub repository.

I hope that you have found this article helpful. Thank you for reading.
