An ideal technical blog consists of content that readers actually want to read. But finding a good topic is never easy: it requires a lot of searching and reading across many resources. Even if you come up with great ideas, there's no way of knowing how your readers will react, or whether your concept suits your target audience.
However, finding a topic for an article doesn't always have to be this difficult. You can use techniques like scraping to go through millions of technical blog posts in a few minutes and build a database of technical content. Later, you can use this data for predictive topic generation, or even compile analytical reports on what sort of content performs well over specific periods.
The uses of scraped blog data are practically limitless. Blog scrapers help you plan content quickly and efficiently, allowing you to come up with intuitive, exciting, and appealing topics in seconds. So, in this article, I will show you how to build your own blog scraper in five steps.
Getting Started With Scraping
Building a blog scraper is not technically challenging. Python is a good fit because it offers third-party libraries for parsing DOM elements and storing the scraped data. Therefore, this article will focus on building a simple technical blog scraper using Python.
Step 01 - Creating a Virtual Environment
Since the scraper relies on third-party dependencies, it's best to install them inside a virtual environment. Create one by executing the command below.
python3 -m venv venv
After executing the command, a new directory titled venv will be created in the project directory. Next, activate the virtual environment using the command shown below.
source venv/bin/activate
After executing the command, the environment's name will appear in your terminal prompt. This indicates that your virtual environment has been activated successfully.
Figure: Activating the virtual environment
Step 02 - Installing the Required Libraries
After creating and activating your virtual environment, you need to install two third-party libraries. These libraries will do the work of fetching and scraping the web pages.
requests: The requests library will be used to perform HTTP requests.
beautifulsoup4: The Beautiful Soup library will be used to scrape information from the fetched web pages.
To install the two libraries, run the two commands displayed below.
python -m pip install requests
python -m pip install beautifulsoup4
After installing, you will see the output shown below.
Figure: Installing the required libraries
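If you want to double-check that both packages landed inside the virtual environment, an optional sanity check is to import them and print their versions (a quick sketch, not part of the scraper itself):
# optional sanity check: confirm both libraries import correctly
import requests
import bs4

print(requests.__version__)
print(bs4.__version__)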
Step 03 - Analyzing the Blog to Scrape
You can now start on your scraping script. For demonstration purposes, this guide will show you how to implement a script that will scrape a Medium Publication.
First, it's essential to establish a reusable URL that can be used to scrape any Medium publication. Thankfully, Medium exposes an archive URL for every publication. It returns a list of all the articles the publication has published since it was created. The generic URL for archived content is shown below.
https://medium.com/{{publicationHandle}}/archive/{{yyyy}}/{{mm}}/{{dd}}
For example, you can view a list of all content published on Enlear Academy using the URL https://medium.com/enlear-academy/archive. It will display the output shown below.
Figure - Viewing the archived content of Enlear Academy
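If you plan to scrape several publications or several months, a small helper can build the archive URL from a publication handle and an optional date. This is a hypothetical convenience function, not part of any Medium API:
# hypothetical helper: build a Medium archive URL for a publication handle
# and an optional year/month/day
def build_archive_url(publication_handle, year=None, month=None, day=None):
    url = f'https://medium.com/{publication_handle}/archive'
    if year:
        url += f'/{year}'
        if month:
            url += f'/{month:02d}'
            if day:
                url += f'/{day:02d}'
    return url

print(build_archive_url('enlear-academy'))           # .../enlear-academy/archive
print(build_archive_url('enlear-academy', 2022, 1))  # .../enlear-academy/archive/2022/01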
This blog scraper will use the generic archive URL to fetch a list of all technical content published in the blog, and then collect attributes such as:
Article Name
Article Subtitle
Claps
Reading Time
You can extract the above information by inspecting the HTML for the Medium Content Card, as shown below.
Figure: Identifying CSS classes and HTML Elements to target
All Medium content cards are wrapped with a div containing the CSS classes streamItem streamItem--postPreview js-streamItem. Therefore, you can use Beautiful Soup to get a list of all div elements having the specified classes and extract the list of articles on the archive page.
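Before writing the full script, you can run a quick check to confirm that these cards are reachable. This is a sketch, assuming the archive page renders the cards with these classes server-side:
import requests
from bs4 import BeautifulSoup

# quick check: count the content cards on the archive page
html = requests.get('https://medium.com/enlear-academy/archive').text
parsed = BeautifulSoup(html, 'html.parser')
cards = parsed.find_all('div', class_='streamItem streamItem--postPreview js-streamItem')
print(f'Found {len(cards)} content cards')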
Step 04 - Implementing the Scraper
Create a file titled scraper.py where the code for the scraper will live. Initially, you must add the imports shown below.
from bs4 import BeautifulSoup  # HTML parsing
import requests                # HTTP requests
import json                    # writing the scraped data to a JSON file
The requests library will be used to perform a GET request to the archive of the Medium publication.
# create a request to the archive page
blog_archive_url = 'https://enlear.academy/archive/2022/01'
response = requests.get(blog_archive_url)
Then, the text returned by the response must be parsed into HTML using the HTML Parser of Beautiful Soup, as shown below:
# parse the response using Beautiful Soup's HTML parser
parsedHtml = BeautifulSoup(response.text, 'html.parser')
Then, the stories can be queried by performing a DOM operation to fetch all the div elements that have the class list streamItem streamItem--postPreview js-streamItem.
# get a list of all divs having the classes "streamItem streamItem--postPreview js-streamItem" to get each story
stories = parsedHtml.find_all('div', class_='streamItem streamItem--postPreview js-streamItem')
Afterwards, we can iterate over each story in the stories list and obtain key meta information such as the article title and subtitle, number of claps, reading time, and URL.
formatted_stories = []

for story in stories:
    # Get the title of the story
    story_title = story.find('h3').text if story.find('h3') else 'N/A'

    # Get the subtitle of the story
    story_subtitle = story.find('h4').text if story.find('h4') else 'N/A'

    # Get the number of claps
    clap_button = story.find('button', class_='button button--chromeless u-baseColor--buttonNormal js-multirecommendCountButton u-disablePointerEvents')
    claps = 0
    if clap_button:
        # If the clap button has a DOM reference, obtain its text
        claps = clap_button.text

    # Get a reference to the card header containing the author info
    author_header = story.find('div', class_='postMetaInline u-floatLeft u-sm-maxWidthFullWidth')

    # Access the reading time span element and get its title attribute
    reading_time = author_header.find('span', class_='readingTime')['title']

    # Get the "Read more" reference
    read_more_ref = story.find('a', class_='button button--smaller button--chromeless u-baseColor--buttonNormal')
    url = read_more_ref['href'] if read_more_ref else 'N/A'

    # Add an object to formatted_stories
    formatted_stories.append({
        'title': story_title,
        'subtitle': story_subtitle,
        'claps': claps,
        'reading_time': reading_time,
        'url': url
    })
The above scraping script iterates over each story and performs 5 tasks:
1. It obtains the article title by using the h3 element in the card.
2. It obtains the article subtitle by using the h4 element in the card.
3. It obtains the number of claps by using the clap button on the card. The script executes a query to find a button with the class list button button--chromeless u-baseColor--buttonNormal js-multirecommendCountButton u-disablePointerEvents and then uses its text attribute to get a count of total claps for the article.
4. It obtains the reading time by accessing the card header. The card header is found by querying for the div element with the CSS class list postMetaInline u-floatLeft u-sm-maxWidthFullWidth. Then, a subsequent query finds a span element with the class readingTime, and its title attribute holds the reading time for the article.
5. Finally, the script obtains the article URL by accessing the Read more link on each card. It searches for the anchor (a) element with the class list button button--smaller button--chromeless u-baseColor--buttonNormal and reads its href attribute.
After the DOM queries have obtained all the elements, the data is structured into a dictionary, which is then pushed into the list named formatted_stories.
Finally, the list is written into a JSON file using Python's built-in file handling, as shown below.
# write the scraped stories to a JSON file
with open('stories.json', 'w') as file:
    file.write(json.dumps(formatted_stories))
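As a quick follow-up (a sketch, not part of the original scraper), you can load stories.json back and rank the articles by claps. The handling of the 'K' suffix is an assumption about how Medium formats clap counts:
import json

def parse_claps(value):
    # claps are scraped as text and may carry a 'K' suffix (e.g. '1.2K')
    text = str(value).strip().upper()
    if text.endswith('K'):
        return int(float(text[:-1]) * 1000)
    return int(text) if text.isdigit() else 0

with open('stories.json') as file:
    stories = json.load(file)

# print the five most-clapped articles
for story in sorted(stories, key=lambda s: parse_claps(s['claps']), reverse=True)[:5]:
    print(f"{parse_claps(story['claps']):>6}  {story['title']}")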
Step 05 - Viewing the Script in Action
After executing the script written in step 4, the following output is generated.
Figure: Viewing the scraped blog data
That's it. You have successfully implemented a blog scraper to scrape technical blogs on Medium using a simple Python script. In addition, you can improve the code and extract more data by targeting additional DOM elements and CSS classes.
Finally, you can push this data into a data lake on AWS and create analytical dashboards that identify trending or least preferred content (based on clap count) to help plan your next article.
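As a rough sketch of that idea, the JSON file could be uploaded to S3, which is the usual entry point for an AWS data lake. This assumes boto3 is installed and configured with valid AWS credentials, and the bucket name below is a hypothetical placeholder:
import boto3

s3 = boto3.client('s3')

# upload the scraped data to S3 so downstream analytics tools can pick it up
s3.upload_file(
    Filename='stories.json',
    Bucket='my-blog-analytics-lake',  # hypothetical bucket name
    Key='medium/enlear-academy/2022-01/stories.json'
)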
Drawbacks of Using Blog Scrapers
Although blog scrapers help gather a vast amount of data for topic planning, there are two main drawbacks, both stemming from the heavy dependence on the site's user interface.
You cannot get more data
You only have access to the data available in the user interface (what you see is what you get).
Changes to the UI
If the site you are scraping makes significant UI changes, such as renaming CSS classes or restructuring HTML elements, the scraping code will probably break because the elements can no longer be identified.
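One way to soften this (a sketch, using a hypothetical helper rather than a standard Beautiful Soup API) is to wrap every DOM query so a missing element falls back to a default value instead of crashing the whole run:
def safe_text(parent, tag, class_=None, default='N/A'):
    # return the element's text if the query matches, otherwise a default value
    element = parent.find(tag, class_=class_) if parent else None
    return element.text.strip() if element else default

# usage inside the scraping loop:
# story_title = safe_text(story, 'h3')
# story_subtitle = safe_text(story, 'h4')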
Therefore, it's essential to be mindful of these drawbacks when implementing a blog scraper.
Conclusion
Blog scrapers significantly improve content creation and content planning. They help content managers plan content across several iterations with minimal effort and time. The code implemented in this article is available in my GitHub repository.
I hope that you have found this article helpful. Thank you for reading.