BeautifulSoup is a handy web scraping library that's mature, easy to use and feature-complete. It can be regarded as jQuery's equivalent in the Python world. In this post we're going to scrape the front page of wooptoo.com and output a clean JSON version of it. The library can also handle DOM manipulation, i.e. adding elements to the HTML document, but that's beyond the scope of this article.

from bs4 import BeautifulSoup
import requests

resp = requests.get('http://wooptoo.com/')
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
page = BeautifulSoup(resp.content, 'html.parser')

The bread and butter of scraping with BS is the find_all method. select works similarly but it uses the CSS selector syntax à la jQuery. Their output will be identical in this case.

posts = page.find_all(attrs={'class':'post'})
_posts = page.select('.post')

posts[0] is _posts[0]
> True
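
To try this without hitting the network, here is a minimal sketch that runs both calls against an inline snippet. The markup is made up for illustration and is only assumed to resemble the real front page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="post"><h2 class="post-title"><a href="/a">First</a></h2></div>
<div class="post featured"><h2 class="post-title"><a href="/b">Second</a></h2></div>
<div class="sidebar">Not a post</div>
"""
soup = BeautifulSoup(html, 'html.parser')

by_attrs = soup.find_all(attrs={'class': 'post'})
by_css = soup.select('.post')

# Both calls hand back the very same Tag objects, in document order.
same = len(by_attrs) == len(by_css) and all(
    a is b for a, b in zip(by_attrs, by_css)
)
```

Note that an element with several classes, such as "post featured", is matched by both calls, while "post-title" is not, since class matching works on whole class names.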

One catch to be aware of is that BS returns its own bs4 data structures rather than plain Python ones. A list of posts is a bs4.element.ResultSet, which subclasses the built-in list, and each individual entry is a bs4.element.Tag.
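
A quick way to check what comes back is isinstance. The sketch below uses a tiny made-up document rather than the real page:

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Hypothetical two-element document, just to inspect the return types.
soup = BeautifulSoup('<p class="post">one</p><p class="post">two</p>', 'html.parser')
found = soup.find_all(class_='post')

is_result_set = isinstance(found, ResultSet)
is_list = isinstance(found, list)         # ResultSet subclasses list
first_is_tag = isinstance(found[0], Tag)  # entries are Tag objects, not strings

# Since it is a list, indexing, slicing and comprehensions all work as usual.
texts = [tag.text for tag in found]
```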

The find_all method also lets us select HTML elements using compiled regular expressions from Python's re module, matched against attribute values. This enables us to fetch all the posts from 2014, for example:

import re

d2014 = page.find_all('time', {'datetime': re.compile('^2014')})
[p.parent.parent for p in d2014]
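
The same filter can be sketched against inline markup. The HTML here is hypothetical, and find_parent is used in place of .parent.parent as one possible way to avoid depending on the exact nesting depth:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure may differ.
html = """
<article class="post"><header><time datetime="2014-03-01">Mar 2014</time></header></article>
<article class="post"><header><time datetime="2015-07-12">Jul 2015</time></header></article>
<article class="post"><header><time datetime="2014-11-30">Nov 2014</time></header></article>
"""
soup = BeautifulSoup(html, 'html.parser')

# Keep only <time> elements whose datetime attribute starts with 2014.
times_2014 = soup.find_all('time', {'datetime': re.compile(r'^2014')})
dates = [t['datetime'] for t in times_2014]

# Walk up to the nearest ancestor carrying the "post" class.
posts_2014 = [t.find_parent(class_='post') for t in times_2014]
```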

We can select the child of an element either using chained calls to find or using the select_one method. Both will only fetch the first match.

titles = [p.find(class_='post-title').find('a').text for p in posts]
titles = [p.select_one('.post-title a').text for p in posts]
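
Both find and select_one return None when nothing matches, so a post without a title link would raise an AttributeError in the one-liners above. A minimal sketch of a guard follows; title_of is a hypothetical helper and the markup is made up:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second post has no title link.
html = """
<div class="post"><h2 class="post-title"><a href="/a">First</a><a href="/x">Extra</a></h2></div>
<div class="post"><p>No title here</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')
posts = soup.find_all(class_='post')

def title_of(post):
    """Return the text of the first title link, or None if there is none."""
    link = post.select_one('.post-title a')
    return link.text if link is not None else None

# Only the first matching link per post is used; "Extra" is ignored.
titles = [title_of(p) for p in posts]
```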

Putting it all together:

import itertools
import json

_titles = [p.select_one('.post-title a') for p in posts]
titles = [t.text for t in _titles]
urls = [u.get('href') for u in _titles]
datetimes = [p.find('time').get('datetime') for p in posts]
tags = [t.select_one('.meta-tags a').text for t in posts]
summaries = [s.find(class_='post-summary').text.strip() for s in posts]

_posts = zip(titles, urls, datetimes, tags, summaries, itertools.count(1))
_keys = ('title', 'url', 'datetime', 'tags', 'summary', 'number')
output = [dict(zip(_keys, p)) for p in _posts]

json.dumps(output)

In Python 3, zip returns a lazy iterator, so _posts can be iterated over only once, as opposed to the _titles list, which can be reused as often as needed.
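
The single-pass behaviour is easy to see in isolation. This sketch uses placeholder values instead of scraped data:

```python
import itertools

# Placeholder values, not taken from the real page.
titles = ['First', 'Second']
urls = ['/a', '/b']

rows = zip(titles, urls, itertools.count(1))
first_pass = list(rows)   # consumes the iterator
second_pass = list(rows)  # nothing left to yield

# Wrapping zip in list() up front gives a reusable sequence.
rows_list = list(zip(titles, urls, itertools.count(1)))
```

zip stops at the shortest input, which is why the endless itertools.count(1) is safe to use as a running number.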

The BeautifulSoup library offers far more than the example provided here. It allows for things like walking the DOM tree in a JavaScript-esque manner (page.body.footer.p), fetching child nodes, parents, siblings on the same level, and much more.
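
As a rough illustration of that kind of navigation, here is a sketch on a small made-up document (not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical document, kept on one line so there are no whitespace nodes.
html = "<html><body><main><p>one</p><p>two</p></main><footer><p>contact</p></footer></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Attribute access returns the first descendant tag with that name.
footer_text = soup.body.footer.p.text

first = soup.main.p
second = first.find_next_sibling('p')  # next <p> on the same level
parent_name = first.parent.name        # the enclosing element

# Direct children of <body>, in document order.
child_names = [child.name for child in soup.body.children]
```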
