Google.py

Sat, 22 Dec 2007 08:29:35 GMT
by pdp

Back when Google Hacking was the hot topic of the day, I wrote a simple Python script for scraping results from Google's search pages. Unlike other Google search scripts, mine does not rely on any APIs, so you may find it useful. Moreover, we will be referring to this script in the coming days, as we have some very interesting stuff to talk about here at GNUCITIZEN.

Use the script responsibly and keep Google's Terms of Service in mind before doing anything drastic. The command-line interface is configured to crawl up to 5 result pages; you can change that limit at the source level or when you embed the script into your own code.

[/files/2007/12/google.py](/files/2007/12/google.py)

Let me know if you run into any problems. The script is self-documented and quite simple, so you should have no trouble getting started with it.
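
If you just want to see the general approach, here is a minimal sketch in modern Python 3 (the script itself predates it and targets Python 2's urllib2), not the script itself: fetch each results page with a browser-like User-Agent and pull the links out with a regular expression. The search URL parameters and the pattern are illustrative and will need updating whenever Google changes its markup.

```python
import re
import sys
import urllib.parse
import urllib.request

USER_AGENT = 'Mozilla/5.0 (compatible; google.py sketch)'
MAX_PAGES = 5  # mirrors the command-line default of 5 pages

def search(query, pages=MAX_PAGES):
    """Yield result URLs scraped from Google's HTML results pages."""
    for page in range(pages):
        params = urllib.parse.urlencode({'q': query, 'start': page * 10})
        request = urllib.request.Request(
            'https://www.google.com/search?' + params,
            headers={'User-Agent': USER_AGENT})
        with urllib.request.urlopen(request) as response:
            html = response.read().decode('utf-8', errors='replace')
        # illustrative pattern; adjust whenever Google changes its markup
        for url in re.findall(r'<a href="(https?://[^"]+)"', html):
            yield url

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit('usage: google.py query')
    for url in search(sys.argv[1]):
        print(url)
```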

Archived Comments

Garrett Gee
Many thanks for this Python module. I recently started a project that needs to leverage search engine results, and I was very sad to see Google dropping their SOAP interface in favor of the AJAX one.
Willy
If you put this in an infinite loop you can mess with Google Trends, although the results only change for whatever region you're connected through. I didn't know how to change the user agent from Python when I wrote mine, though, so I went through another site that returns results from Google; they don't check.
pdp
Willy, the Google.py code shows how to change the user agent. You can just grab the Get object and use it in your projects.
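For illustration, a tiny fetcher in that spirit; this is a sketch with a made-up interface, not google.py's actual Get object:

```python
import urllib.request

class Get:
    """Minimal fetcher with a configurable User-Agent (illustrative only)."""
    def __init__(self, agent='Mozilla/5.0 (compatible; MyProject/1.0)'):
        self.agent = agent

    def __call__(self, url):
        request = urllib.request.Request(
            url, headers={'User-Agent': self.agent})
        with urllib.request.urlopen(request) as response:
            return response.read()

# reuse one instance across your own project
get = Get(agent='Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Firefox/102.0')
html = get('https://example.com/')
```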
zaggy

```
C:\shared>google.py define:lol
# define:lol
# crawls the first 5 pages only (50 results max)
```

Did I miss something?
Willy
I noticed that in the script. I wish I'd known it when I was writing mine. It still worked to just search through someone else's site that didn't check the user agent, but knowing that would have saved me about an hour of messing around trying to get past the Forbidden error.
GooHackle
First of all, congratulations for this site. Now, about this post: back in the hot Google Hacking days I wrote a similar script, but in Perl. Some months ago I rewrote it in PHP and published an online version of this "google parser", named GooParser, to test a couple of things. You can try it here: Google Parser. Cheers, GooHackle
pdp
10x for sharing. I will have a look.
barbsie
Hi, I also wrote a Google scraper in Ruby. During my tests I had the impression that Google checks more than just the agent, so on top of randomizing agents I also added random delays between page fetches and cookie resets. Oh... and also the use of a proxy (so Privoxy with Tor can be used)...
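For illustration, those evasion tricks sketched in Python (barbsie's scraper was Ruby; the agent strings, delay bounds, and proxy address are arbitrary examples):

```python
import random
import time
import urllib.request

AGENTS = [  # sample strings only; use whatever pool you like
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20100101 Firefox/4.0',
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.30 Safari/534.30',
]

def cautious_fetch(url, proxy=None):
    """Fetch with a random delay, a random agent, and no sticky cookies."""
    time.sleep(random.uniform(2, 10))  # random delay between page fetches
    handlers = []
    if proxy:  # e.g. 'http://127.0.0.1:8118' for Privoxy in front of Tor
        handlers.append(urllib.request.ProxyHandler(
            {'http': proxy, 'https': proxy}))
    # a fresh opener per request means no cookies carry over between pages
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [('User-Agent', random.choice(AGENTS))]
    with opener.open(url) as response:
        return response.read()
```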
pdp
yep, that's a good enough solution to evade Google's bot detection mechanisms. I was planning to implement all of this in the next release of MET (Massive Enumeration Toolset), but I never found the time for it. I hope the project is not dead just yet.
barbsie
interesting :). I'm working on something similar together with someone else. Basically it combines DNS enumeration with some Google scraping and whois stuff. It's a major rewrite of the old dnsenum thingy. If you are interested in exchanging ideas, drop me a mail...
pdp
might be a good idea to collaborate.
vatts
Hey, why do I run the code and get just this?

```
[vatts@VattServ Desktop]$ python google.py
usage: google.py query
[vatts@VattServ Desktop]$ python google.py linux
# linux
# crawls the first 5 pages only (50 results max)
[vatts@VattServ Desktop]$
```
pdp
the code is old! perhaps you need to change the regexes slightly...
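For illustration, a quick way to tell whether the fetch or the parse is what broke; the header string and the pattern here are illustrative, not google.py's own:

```python
import re
import urllib.request

request = urllib.request.Request(
    'https://www.google.com/search?q=linux',
    headers={'User-Agent': 'Mozilla/5.0 (compatible; debug)'})
with urllib.request.urlopen(request) as response:
    html = response.read().decode('utf-8', errors='replace')

print(len(html))  # a tiny page or an HTTP error means the fetch itself failed
# no matches on a full-sized page means the pattern has gone stale
print(re.findall(r'<a href="(https?://[^"]+)"', html)[:3])
```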