How to create a bot to extract content from reddit with BeautifulSoup?
What is a RESTful API?
To create a bot we need an API. But what is an API?
API stands for Application Programming Interface, a software intermediary that allows two applications to talk to each other.
Here, the two sides are the user who is using the bot and reddit, which provides the content.
There are many types of APIs to choose from, depending on the need. In this scenario, we will use a RESTful API.
A RESTful web service (also called a RESTful web API) is a web service implemented using HTTP and the principles of REST. It is a collection of resources, with four defined aspects:
- the base URI for the web service, such as http://example.com/resources/
- the Internet media type of the data supported by the web service. This is often JSON or XML but can be any other valid Internet media type providing that it is a valid hypertext standard.
- the set of operations supported by the web service using HTTP methods (e.g., GET, PUT, POST, or DELETE).
- The API must be hypertext driven.
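As a rough illustration of the first three aspects, the sketch below builds (but does not send) a request with Python’s standard library. The items/42 resource path and the Accept header value are assumptions made up for this example; only the base URI comes from the list above.

```python
from urllib.request import Request

# Base URI from the list above; the resource path is a made-up example.
BASE_URI = "http://example.com/resources/"
resource_uri = BASE_URI + "items/42"

# The HTTP method names the operation; the Accept header names the media type.
req = Request(resource_uri, method="GET", headers={"Accept": "application/json"})

print(req.get_method())          # the operation requested
print(req.full_url)              # the resource being addressed
print(req.get_header("Accept"))  # the media type the client asks for
```

Nothing is sent over the network here; the Request object only describes what a client would ask for.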
To create our bot, we will use Flask. Flask is a web framework written in Python. It is well suited to lightweight web applications.
Our agenda: Extracting the titles of the top 3 posts from a specified subreddit
Before writing some code in Python, let’s explore the reddit page and associated URI.
Exploring the site shows that the following link lists the top posts of a specified subreddit:
https://www.reddit.com/r/<specified-subreddit>/top
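A minimal sketch of filling in this URI pattern is shown here. The helper name build_top_url is hypothetical, and escaping the subreddit name with quote() is an added precaution that the original code does not take.

```python
from urllib.parse import quote


def build_top_url(subreddit: str) -> str:
    """Return the top-posts URL for a subreddit (hypothetical helper)."""
    # quote() escapes characters that are not safe inside a URL path segment.
    return "https://www.reddit.com/r/" + quote(subreddit, safe="") + "/top"


print(build_top_url("python"))
```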
Now we have to extract the text from the website and fetch the titles from the posts. For this we will be using BeautifulSoup. It is a Python package for parsing documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
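To make the idea of a parse tree concrete, here is a small sketch that parses an inline HTML string instead of a live page. The markup is invented for illustration and is much simpler than a real reddit page; it also uses Python’s built-in "html.parser" so that it runs without installing lxml.

```python
from bs4 import BeautifulSoup

# A tiny stand-in page; real reddit markup is far larger and differs in detail.
html = """
<html><body>
  <div class="post"><h3>First post</h3></div>
  <div class="post"><h3>Second post</h3></div>
</body></html>
"""

# "html.parser" ships with Python; the article itself uses the "lxml" parser.
soup = BeautifulSoup(html, "html.parser")

# Walk the parse tree: collect every <h3> element and reduce it to its text.
titles = [h3.get_text() for h3 in soup.find_all("h3")]
print(titles)
```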
Challenges
While implementing the web scraper with BeautifulSoup, I observed that reddit blocked some bot requests when I tried to fetch the page content.
The blocked requests are specified in this file.
If you send the request from Python, you may get an error such as requests.exceptions.ConnectionError.
So I considered switching to PRAW, the Python Reddit API Wrapper. But after analyzing the above file, I found that the webpage can still be fetched and parsed with the ‘lxml’ parser. So I implemented my own API with a few lines of code and minimal dependencies.
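One way to keep a blocked request from crashing the bot is to catch the exception around the fetch. The sketch below is an assumption about how that could look, not the author’s code: fetch_page is a hypothetical helper, the 10-second timeout is an arbitrary choice, and the get parameter exists only so the function can be exercised without touching the network.

```python
import requests


def fetch_page(url, get=requests.get):
    """Return the page body as bytes, or None if the connection fails."""
    # A browser-like User-Agent, the same header value the article's code sends.
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = get(url, headers=headers, timeout=10)
    except requests.exceptions.ConnectionError:
        # The request was refused or dropped; let the caller decide what to do.
        return None
    return response.content
```

A caller would check for None before handing the bytes to BeautifulSoup.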
Application Design
1. A URL is provided for a GET request that returns the required results.
URL format:
http://localhost:80/top_posts?subreddit=<specified-subreddit>
- localhost:80 - the app is served on localhost, port 80, for testing on your local machine (binding to port 80 requires root permissions)
- top_posts - the path that distinguishes this API from others
- ?subreddit - the query parameter that names the subreddit whose posts are requested
2. The subreddit value is extracted from the query string and concatenated with the reddit URL for fetching top posts.
3. The response content is parsed with the ‘lxml’ parser.
4. To get the post titles, I analyzed the HTML of the reddit page and found that the titles carry the “_eYtD2XCVieq6emjKBH3m” class. So, the text of the first three elements with this class is fetched.
5. The titles are returned in JSON format in response to the GET request.
For example:
curl -X GET "http://localhost:80/top_posts?subreddit=<specified-subreddit>"
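The parts of the URL format from step 1 can be pulled apart with the standard library, as in this sketch. The subreddit name "python" is only a placeholder for <specified-subreddit>.

```python
from urllib.parse import urlsplit, parse_qs

# "python" stands in for <specified-subreddit>.
url = "http://localhost:80/top_posts?subreddit=python"

parts = urlsplit(url)
host_and_port = parts.netloc       # where the app is served
path = parts.path                  # which API is being called
query = parse_qs(parts.query)      # query parameters as {name: [values]}
subreddit = query["subreddit"][0]  # the value the bot will look up

print(host_and_port, path, subreddit)
```

Inside the Flask app this parsing is done by the framework, which exposes the query parameters through request.args.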
How to code this design?
First, import the required packages/libraries.
from flask import Flask, jsonify, request
import requests
from bs4 import BeautifulSoup
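Initialize the Flask app for development on localhost.
app = Flask(__name__)

# Keep this block at the bottom of the file, after the routes are defined.
if __name__ == '__main__':
    # Binding to port 80 usually requires root permissions.
    app.run(debug=True, host='localhost', port=80)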
Next, to route GET requests for a URL and read the arguments from the query, we need a route decorator and a function that returns the response.
@app.route('/top_posts', methods=['GET'])
def query():
    # Read the ?subreddit=<name> query parameter
    subreddit = request.args.get('subreddit')
    num = 3  # number of titles to return
    ourFirstThreeTitles = getFirstThreeTitles(subreddit, num)
    ourResponse = {"response": {"code": 200, "message": ourFirstThreeTitles}}
    return jsonify(ourResponse)
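The route can be tried out without starting a server or contacting reddit by using Flask’s built-in test client. In the sketch below, getFirstThreeTitles is replaced by a stub that returns placeholder titles; the stub and its output are assumptions for illustration only.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def getFirstThreeTitles(subreddit, num):
    # Stub for illustration only: placeholder titles, no network access.
    return [f"{subreddit} title {i + 1}" for i in range(num)]


@app.route("/top_posts", methods=["GET"])
def query():
    subreddit = request.args.get("subreddit")
    titles = getFirstThreeTitles(subreddit, 3)
    return jsonify({"response": {"code": 200, "message": titles}})


# The test client issues the GET request in-process, so no server is started.
client = app.test_client()
reply = client.get("/top_posts?subreddit=python")
payload = reply.get_json()
print(payload)
```

The printed payload has the same shape as the JSON the real endpoint returns: a "response" object holding a code and the list of titles.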
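Now we implement the function that gets the first three titles from the subreddit’s top posts.
def getFirstThreeTitles(subreddit, num):
    URL = "https://www.reddit.com/r/" + subreddit + "/top"
    # Send a browser-like User-Agent, since reddit blocks some bot requests
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(URL, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')
    # Select the title elements once, then keep the text of the first `num`
    # (slicing avoids an IndexError when fewer than `num` titles are found)
    elements = soup.select('._eYtD2XCVieq6emjKBH3m')
    titles = [element.get_text() for element in elements[:num]]
    return titles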
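The parsing step can be checked offline by separating it from the network call. This sketch assumes a hypothetical helper, extract_titles, and feeds it made-up markup; it uses Python’s built-in "html.parser" instead of lxml so that it runs with fewer dependencies, and it slices the matches so that a page with fewer titles than requested does not raise an error.

```python
from bs4 import BeautifulSoup

TITLE_CLASS = "_eYtD2XCVieq6emjKBH3m"  # class name reported in the article


def extract_titles(html, num):
    """Return the text of the first `num` elements carrying TITLE_CLASS."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text() for el in soup.select("." + TITLE_CLASS)[:num]]


# Made-up markup with only two matching elements.
sample = (
    '<h3 class="_eYtD2XCVieq6emjKBH3m">Post A</h3>'
    '<p class="other">not a title</p>'
    '<h3 class="_eYtD2XCVieq6emjKBH3m">Post B</h3>'
)
print(extract_titles(sample, 3))
```

Because class names like this one are generated by the site, the selector may stop matching whenever reddit changes its markup.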
Combining all code snippets into one, we get this:
Output:
Refer to this link for the detailed implementation of this project.