Scraping with BeautifulSoup
BeautifulSoup is a handy library for web scraping that’s mature, easy to use and feature complete.
It can be regarded as jQuery’s equivalent in the Python world.
In this post we’re going to scrape the front page of wooptoo.com
and output a clean, JSON version of it.
The library can also handle DOM manipulation, i.e. adding elements to the HTML document, but that’s beyond the
scope of this article.
from bs4 import BeautifulSoup
import requests
resp = requests.get('http://wooptoo.com/')
page = BeautifulSoup(resp.content, 'html.parser')
The bread and butter of scraping with BS is the find_all method.
The select method works similarly, but it takes a CSS selector à la jQuery.
Their output will be identical in this case.
posts = page.find_all(attrs={'class':'post'})
_posts = page.select('.post')
posts[0] is _posts[0]
> True
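To try the comparison without hitting the network, here is a minimal sketch that runs both methods against a small inline document. The markup is hypothetical and only stands in for the real front page; the class names are assumptions borrowed from the selectors used in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real front page.
html = """
<div class="post"><h2 class="post-title"><a href="/a/">First</a></h2></div>
<div class="post featured"><h2 class="post-title"><a href="/b/">Second</a></h2></div>
<div class="sidebar">Not a post</div>
"""
page = BeautifulSoup(html, 'html.parser')

by_attrs = page.find_all(attrs={'class': 'post'})
by_class = page.find_all(class_='post')  # keyword shorthand for the same filter
by_css = page.select('.post')

# Each query matches both posts, including the one carrying an extra class.
print(len(by_attrs), len(by_class), len(by_css))  # 2 2 2

# The results are the very same Tag objects from the tree, not copies.
same_objects = all(a is b for a, b in zip(by_attrs, by_css))
print(same_objects)  # True
```

Note the trailing underscore in class_: class is a reserved word in Python, so the keyword form needs it.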
One catch to be aware of is that BS works with its own bs4 data structures rather than plain Python ones.
A list of posts comes back as a bs4.element.ResultSet, which is a subclass of the built-in list,
and each individual entry is a bs4.element.Tag.
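A quick way to see what these types are is to inspect them directly. The snippet below is a small sketch on a hypothetical one-element document; it shows that the result set behaves like a list and that a Tag exposes its attributes through dictionary-style access.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Hypothetical single-post document, just enough to inspect the types.
page = BeautifulSoup('<p class="post" id="p1">Hello</p>', 'html.parser')
posts = page.find_all(class_='post')

# ResultSet is a list subclass, so indexing, len() and slicing all work.
print(isinstance(posts, ResultSet), isinstance(posts, list))  # True True

first = posts[0]
print(isinstance(first, Tag))  # True

# Attributes are read like dictionary keys; .get() returns None when absent.
print(first['id'])           # p1
print(first.get('missing'))  # None

# class is a multi-valued attribute, so it comes back as a list.
print(first['class'])        # ['post']
```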
The find_all method also accepts compiled regular expressions as filters, so we can match HTML elements by pattern.
This enables us to fetch all the posts from 2014, for example:
import re
d2014 = page.find_all('time', {'datetime': re.compile('^2014')})
[p.parent.parent for p in d2014]
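The p.parent.parent step depends on how deeply the time element is nested on the real page. As a hedged alternative, the sketch below uses find_parent to climb to the enclosing post whatever the depth. The markup, including the article and header elements, is an assumption made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the nesting on the real page may differ.
html = """
<article class="post"><header><time datetime="2014-03-01">1 Mar 2014</time></header></article>
<article class="post"><header><time datetime="2015-01-12">12 Jan 2015</time></header></article>
<article class="post"><header><time datetime="2014-11-20">20 Nov 2014</time></header></article>
"""
page = BeautifulSoup(html, 'html.parser')

# The compiled pattern is matched against each datetime attribute value.
d2014 = page.find_all('time', {'datetime': re.compile(r'^2014')})
dates = [t['datetime'] for t in d2014]
print(dates)  # ['2014-03-01', '2014-11-20']

# find_parent walks upwards until it finds an ancestor with class "post".
posts_2014 = [t.find_parent(class_='post') for t in d2014]
print([p.name for p in posts_2014])  # ['article', 'article']
```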
We can select the child of an element either using chained calls to find or using the select_one method.
Both will only fetch the first match.
titles = [p.find(class_='post-title').find('a').text for p in posts]
titles = [p.select_one('.post-title a').text for p in posts]
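Both find and select_one return None when nothing matches, so calling .text on the result raises an AttributeError for any post that lacks a title link. Here is a minimal sketch of a defensive version; title_of is a hypothetical helper, and the markup is made up to include one post without a title.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second post has no title link.
html = """
<div class="post"><h2 class="post-title"><a href="/a/">First</a></h2></div>
<div class="post"><p>No title here</p></div>
"""
page = BeautifulSoup(html, 'html.parser')
posts = page.select('.post')


def title_of(post):
    """Return the post's title text, or None if the link is absent."""
    link = post.select_one('.post-title a')
    return link.text if link is not None else None


titles = [title_of(p) for p in posts]
print(titles)  # ['First', None]
```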
Putting it all together:
import itertools
import json
_titles = [p.select_one('.post-title a') for p in posts]
titles = [t.text for t in _titles]
urls = [u.get('href') for u in _titles]
datetimes = [p.find('time').get('datetime') for p in posts]
tags = [t.select_one('.meta-tags a').text for t in posts]
summaries = [s.find(class_='post-summary').text.strip() for s in posts]
_posts = zip(titles, urls, datetimes, tags, summaries, itertools.count(1))
_keys = ('title', 'url', 'datetime', 'tags', 'summary', 'number')
output = [dict(zip(_keys, p)) for p in _posts]
json.dumps(output)
In Python 3, _posts is a zip iterator, which can be iterated over only once,
as opposed to the _titles list, which can be reused freely.
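The single-pass behaviour is easy to trip over, so here is a short sketch using plain lists (the values are made up). Wrapping the zip in list() is one way to keep the rows around for reuse.

```python
# Made-up sample data standing in for the scraped fields.
titles = ['First', 'Second']
urls = ['/a/', '/b/']

pairs = zip(titles, urls)

first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing left to yield

print(first_pass)   # [('First', '/a/'), ('Second', '/b/')]
print(second_pass)  # []

# Materialising the zip up front gives a reusable list of rows.
rows = list(zip(titles, urls))
print(len(rows), len(rows))  # 2 2
```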
The BeautifulSoup library offers much more than the example shown here.
It allows for things like walking the DOM tree in a JavaScript-esque manner (page.body.footer.p),
fetching child nodes, parents, siblings on the same level, and much more.
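As a small taste of that navigation API, the sketch below walks a hypothetical document by attribute access, then moves to a parent, a sibling and the direct children. The element names and ids are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only to demonstrate tree navigation.
html = """<html><body>
<main><p id="one">One</p><p id="two">Two</p></main>
<footer><p>Footer text</p></footer>
</body></html>"""
page = BeautifulSoup(html, 'html.parser')

# Dotted access returns the first matching descendant at each step.
footer_text = page.body.footer.p.text
print(footer_text)  # Footer text

one = page.find(id='one')

# Moving up to the parent and sideways to the next sibling.
print(one.parent.name)                   # main
print(one.find_next_sibling('p')['id'])  # two

# recursive=False restricts the search to direct children.
children = [c.name for c in page.body.find_all(recursive=False)]
print(children)  # ['main', 'footer']
```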
Read more
- CSS Selector Syntax
- BeautifulSoup Docs
- Source code files for this post