Modules: parsing html and xml pages

pattern

The pattern module can be used to parse HTML and XML pages. The pattern.web module is a submodule of pattern which can be used to parse HTML Document Object Model (DOM), strip HTML tags, and

from pattern import web

The HTML DOM parser in pattern.web is very easy to use. Use the Element() function to parse the text from the webpage. For example, if we want to parse the text from google.com by script, we would

r = requests.get("http://www.google.com")
dom = web.Element(r.text)
dom.by_tag('script')

This returns a list of nested Elements with a given tag name (e.g. script, graph, meta, div, etc). The typical use is to use a for loop to loop through this list. e.g.

url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
r = requests.get(url, params=params)

dom = web.Element(r.text)
for movie in dom.by_tag('td.title'):
    title = movie.by_tag('a')[0].content
    rating = movie.by_tag('span.value')[0].content
    print title, rating

BeautifulSoup

The BeautifulSoup module is similar to parsing HTML and XML pages using pattern. To import the module, use

from BeautifulSoup import BeautifulSoup

To extra the movie titles and ratings as above in the pattern module, use

url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
r = requests.get(url, params=params)

dom = web.Element(r.text)
for movie in bs.findAll('td', 'title'):
    title = movie.find('a').contents[0]
    rating = movie.find('span', 'value').contents[0]
    print title, rating

sgmllib

The sgmllib module is a simple SGML (Standard Generalized Mark-up Language) parser. The goal of the module is to process HTML code. The main function is SGMLParser which parses HTML code into 8 kinds of data:

Name Definition Example SGMLParser function
start tag an HTML tag that starts a block <html>, <head>, <body> start_tagname, do_tagname
end tag an HTML tag that ends a block </html>, </head>, </body> end_tagname
character reference an escaped character referenced by its decimal   handle_charref
entity reference an HTML entity \& copy; handle_entityref
comment an HTML comment comments inclosed in <|– … –> handle_comment
processing instruction an HTML processing instruction instruction enclosed in <? … > handle_pi
declaration an HTML declaration DOCTYPE enclosed in <! … > handle_dec1
text data a block of text anything that does not fall into the other 7 categories handle_data

Modules: parsing JSON objects

json

The json module can be used to parse JSON (JavaScript Object Notation) objects to a Python dictionary.

import json

The json.loads()function can be used to parse by line

path = 'myjsonfile.txt'
mydict = [json.loads(line) for line in open(path)]

Modules to manipulate regular expressions and string manipulation

re

This module is used to manipulate regular expressions. The function sub() searches a string (s) and replaces all the words that match the regular expression.

import re
re.sub(regex, replacementword, stringtosearch)
re.search()
re.compile()

Useful regular expressions in Python

Expression Interpretation
^ matches the beginning of a string
$ matches the end of a string
\b matches a word boundary
\d matches any numeric digit
\D matches any non-numeric digit
x? matches an optional x character
x* matches x zero or more times
x+ matches x one or more times
x{n, m} matches x at least n times, but not more than m times
(a | b | c) matches either a, b or c

string

  • string.punctuation()
  • string.strip()
  • string.replace()
  • string.translate()

collections

The collections module contains functions to expand upon Python’s built in data types (list, tuple, dict, etc).

collections class Interpretation
namedtuple() a class similar to a tuple with named fields
deque a class similar to a list that can quickly append and pop (on either end)
Counter a subclass of dict that can be used for counting hashable objects
OrderedDict a subclass of dict that remembers the order entries were added
defaultdict a subclass of dict that allows for missing values

datetime

The datetime module can be used to print out the date and time. the function now() will print out the current day and time. Some attributes inclue year, month, day, hour, minute, second.

import datetime 
now = datetime.now()
print '%s/%s/%s' % (now.month, now.day, now.year)