Friday, February 16, 2007

How to use Python and Beautiful Soup to screen scrape the links from a Google Blog search

I wanted to organize the results of a Google Blog search and was planning to use pyGoogle and the Google SOAP search API, but found out that Google has stopped giving out keys. (See this article.) An alternative is to screen scrape the HTML. I searched for Python screen scraping and found Beautiful Soup. It works well, much better than my previous attempts at writing my own screen scraping code. Here is how to do it. (Note: This works for the Google Blog search, but not for the Google web search, where I get an HTTP Error 403: Forbidden. Probably because this is a script? Maybe I need to investigate the Google AJAX search API.)

UPDATE (1/23/07): I was right about the 403 Error. Google's terms of service do not allow automated queries. There is a way around this, but I don't want to promote bad behavior. Alternatively, running the automated search on Yahoo and Dogpile seems to work just fine. Here is an interesting comment about why Google got rid of the SOAP search API.
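
For reference, the 403 mentioned above reaches a script as an HTTPError exception raised by urlopen. Below is a minimal sketch of catching it and reporting the status code instead of letting the traceback end the script. The fetch helper and its opener parameter are hypothetical names introduced only for this illustration, and the try/except import is there so the sketch runs on both Python 2 and Python 3. It does nothing to get around the terms of service; it only makes the failure easier to read.

```python
try:
    # Python 3 locations
    from urllib.request import urlopen
    from urllib.error import HTTPError
except ImportError:
    # Python 2 location
    from urllib2 import urlopen, HTTPError


def fetch(url, opener=urlopen):
    """Return (status_code, body); body is None when the server refuses.

    `opener` is a hypothetical hook so the function can be exercised
    without touching the network.
    """
    try:
        response = opener(url)
    except HTTPError as err:
        # For example, err.code is 403 when the request is forbidden.
        return err.code, None
    return response.getcode(), response.read()
```

Calling fetch on a URL that rejects scripts would then return a pair such as (403, None), which the caller can check before handing anything to Beautiful Soup.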

Steps:
1. Go to http://www.crummy.com/software/BeautifulSoup/
2. At the bottom of the page under the "Download Beautiful Soup" heading, click tarball.
3. Save the file.
4. Uncompress the contents to c:\temp
5. Open a "cmd.exe" shell
6. Type "cd c:\temp"
7. Type "cd BeautifulSoup-3.0.3"
8. Type "python setup.py install"
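
A quick way to confirm that the install step worked is to check whether the module can be imported. This is only a sketch: is_importable is a hypothetical helper name, and the module name BeautifulSoup is the one used by the 3.x series installed above.

```python
def is_importable(module_name):
    """Return True if `module_name` can be imported, False otherwise."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


if is_importable('BeautifulSoup'):
    print('Beautiful Soup is installed')
else:
    print('Beautiful Soup is not installed')
```

If the second message appears, the most likely cause is that "python setup.py install" was run with a different Python than the one running this check.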

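Create this file and run it:

# Python 2 script using Beautiful Soup 3 (installed in the steps above)
from BeautifulSoup import BeautifulSoup
import re
import urllib2

# Fetch the Google Blog search results page for the query "python"
url = 'http://blogsearch.google.com/blogsearch?q=python'
response = urllib2.urlopen(url)
html = response.read()

# Parse the page and keep only the <a> tags whose id starts with "p-"
soup = BeautifulSoup(html)
links = soup.findAll('a', id=re.compile("^p-"))
for link in links:
    print link['href']

Your results should look something like this:
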
http://www.ejb.com/video/15767/Tiger_vs_leopard_vs_python.html
http://linuxtoday.com/news_story.php3?ltsn=2007-01-09-013-26-RV-DV
http://rootprompt.org/article.php3?article=10577
http://www.pythonware.com/daily/116834941049045108
http://python-advocacy.blogspot.com/2007/01/personal-schedule-application-for.html
http://packages.gentoo.org/ebuilds/?imaging-1.1.6
http://www.jobsite.co.uk/cgi-bin/vacdetails.pl?selection=921394846&src=rss_jbe
http://programming.reddit.com/goto?rss=true&id=xm17
http://www.totaljobs.com/JobSearch/JobDetails.aspx?JobId=26841462&Keywords=&AndOr=0&Sort=2&JobType1=20&Rate=180&RateType=4&LTxt=nn14+2je%2C+Kettering&Radius=40&LIds1=fi,E,Q,d,CFV,CFu,CF7,CGy,CHX,CHY,CIB,CII,CIS,CJL,CJM,CM3,CNm,COB,COh,CPx,CQQ,CSf,CTB,CTd,CVS,CWR,CWn,CW6,CW_,CXi,CXs&LIds2=c9,G,s,2,9,BL&LIds6=Drc,B,I,g,o,p,q,r,Ih,Ij,N9,Pb,b6,il&From=%2FJobSearch%2FAdvancedJobSearch.aspx&DCMP=R_RS_XML_XML_090107
http://www.gossamer-threads.com/lists/python/dev/541010
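
To see what the id filter in the script is doing without hitting the network, here is a rough equivalent that uses only the standard library's HTMLParser in place of Beautiful Soup. It is a sketch, not a replacement for the script: the ResultLinkParser class, the extract_result_links function, and the sample HTML are all made up for illustration. The only thing taken from the script is the rule that result links are <a> tags whose id starts with "p-".

```python
import re

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class ResultLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose id starts with "p-"."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        link_id = attrs.get('id') or ''
        href = attrs.get('href')
        # re.match anchors at the start, like the "^p-" pattern in the script
        if re.match('p-', link_id) and href:
            self.hrefs.append(href)


def extract_result_links(html):
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Invented sample markup; the real results page is far messier.
SAMPLE_HTML = '''
<a id="p-1" href="http://example.com/first-post">First</a>
<a id="nav-next" href="http://example.com/next">Next</a>
<a href="http://example.com/no-id">No id</a>
<a id="p-2" href="http://example.com/second-post">Second</a>
'''

for href in extract_result_links(SAMPLE_HTML):
    print(href)
```

Only the two links with ids "p-1" and "p-2" are printed; the navigation link and the link without an id are skipped.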

2 comments:

Anonymous said...

I got the result below:
Traceback (most recent call last):
  File "blogsearch.py", line 11, in <module>
    soup = BeautifulSoup(html)
  File "C:\Python26\Lib\site-packages\BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "C:\Python26\Lib\site-packages\BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "C:\Python26\Lib\site-packages\BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "C:\Python26\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "C:\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 53, column 9411

Anonymous said...

You need to uninstall the previous version of Beautiful Soup and use version 3.0.7a. There is a known bug in the latest version.