Sat, 06 Oct 2007 11:39:00 GMT
by pdp

I am learning to love Windows. Yes, it has its own problems and can sometimes be a bit insecure, but in general it just works well, and every single piece of it is reusable and scriptable. Today, I quickly wrote a Google search scraper. It runs straight from the command line.

Pay attention to the size of the script and the clarity of the code. Moreover, it is well integrated with the host's setup and connection settings (proxies, SOCKS, etc.). You cannot do that even with Python. Thanks, Microsoft.


There are two ways you can run the script. If your default scripting engine is cscript, then you can just type: google.js **some query here**. If this is not the case, then you either need to be explicit, like this: cscript //nologo google.js **some query here**, or make cscript the default, like this: cscript //H:CScript. Whatever you do, the code will run and will work flawlessly.
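
To make the command-line handling concrete, here is a minimal sketch of how the words typed after google.js could be joined into one query and turned into a paged request URL. The function name buildSearchUrl and the base parameter are hypothetical; the real search endpoint is not shown here, so the caller has to supply it.

```javascript
// Hypothetical helper: join command-line words into one query and build
// the request URL for a given result offset. `base` is an assumed
// parameter holding the search endpoint up to and including "q=".
function buildSearchUrl(base, words, pos) {
	var query = words.join(' ');
	// The first page has no offset; later pages append "&start=<offset>".
	return base + escape(query) + (pos != 0 ? '&start=' + pos : '');
}

// Under Windows Script Host the words would come from WScript.Arguments:
//   var words = [];
//   for (var i = 0; i < WScript.Arguments.length; i++) {
//       words.push(WScript.Arguments(i));
//   }
```

escape() is used because the listing in the comments uses it. For queries with non-ASCII characters, encodeURIComponent is probably the safer choice, since escape() emits %uXXXX sequences that servers generally do not expect.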

I enjoy it when things are plain and easy, or just simply clever.

Archived Comments

Stephen Bloch
When running your script I'm getting an "Access is denied" message.
The error code is 80070005 and the source is msxml3.dll.
cscript //H:CScript
No output for me. When I debug, it seems there is no <a> tag in the returned document, but when I sniff the traffic I can see answers from Google. Furthermore, the script sends 2 requests (q=xxx, q=xxx&start=10). I wrote a similar script in Perl to find all web servers crawled by Google for a specific domain (ie: I did not use the WAP version. Is there any reason to use it (faster, smaller, ...)?
Hi, no output when using no proxy. But when passing through Paros, results are returned. Maybe an issue with Accept-Encoding: gzip,deflate...
strange, it should work flawlessly... are you sure that your system proxy settings are set correctly?
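if (WScript.Arguments.length == 0) {
	// No query given: print usage and stop.
	WScript.Echo('usage: ' + WScript.ScriptName + ' <query>');
	WScript.Echo('       ' + WScript.ScriptName + ' ext:js');
	WScript.Echo('Google Search');
	WScript.Echo('by Petko D. Petkov (pdp) GNUCITIZEN (');
	WScript.Quit(1);
} else {
	// Join all command-line words into a single query string.
	var tmp = [];

	for (var i = 0; i < WScript.Arguments.length; i++) {
		tmp.push(WScript.Arguments(i));
	}

	var query = tmp.join(' ');
}

var pos = 0;
var doc = WScript.CreateObject('MSXML2.DOMDocument');
var xhr = WScript.CreateObject('Microsoft.XMLHTTP');
var resp;

doc.async = false;
doc.validateOnParse = false;

do {
	var lns = [];

	// The search URL prefix is missing from this copy; only '' remains.
	xhr.open('GET', '' + escape(query) + (pos != 0 ? '&start=' + pos : ''), false);
	//xhr.setRequestHeader("Accept-Encoding", "text");
	xhr.send();
	resp = xhr.responseText;
	doc.loadXML(resp);
	//doc.load('' + escape(query) + (pos != 0 ? '&start=' + pos : ''));
	//WScript.Echo('XML:' + xhr.responseText);
	var as = doc.getElementsByTagName('a');

	// Collect the target URL of every /gwt/ link on the page.
	for (var i = 0; i < as.length; i++) {
		var href = as[i].getAttribute('href');
		// Anchors without an href return null; skip them.
		var match = href ? href.match(/^\/gwt\/.*?u=(.*?)$/) : null;
		if (match) {
			var ln = unescape(match[1]);
			lns.push(ln);
		}
	}

	// Stop when this page repeats the links of the previous one.
	if (pns && pns.sort().join() == lns.sort().join()) {
		break;
	}

	for (var i = 0; i < lns.length; i++) {
		WScript.Echo(lns[i]);
	}

	var pns = lns;

	pos += 10;
} while (lns); // lns is always truthy; the loop ends at the break above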
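
The two core steps of the listing above, pulling the target URL out of a /gwt/ link and deciding when to stop paging, can be sketched as standalone functions. The names extractTarget and samePage are hypothetical, and the sample href in the comment is made up for illustration; this is only one way to read the logic, not necessarily how the original script is organised.

```javascript
// Pull the target URL out of an href matched by the listing's regular
// expression, e.g. "/gwt/n?u=http%3A%2F%2Fexample.com%2F" (illustrative).
// Returns null when the href is missing or does not match.
function extractTarget(href) {
	if (!href) {
		return null;
	}
	var match = href.match(/^\/gwt\/.*?u=(.*?)$/);
	return match ? unescape(match[1]) : null;
}

// True when two result pages yielded the same links, regardless of order.
// Copies are sorted so the callers' arrays are left untouched.
function samePage(prev, curr) {
	if (!prev) {
		return false; // no previous page yet
	}
	return prev.slice().sort().join() == curr.slice().sort().join();
}
```

Comparing consecutive pages is a simple way to detect the end of the results without knowing the total count: once the offset runs past the last page and the same links come back, or two pages in a row are empty, the loop stops.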
Yes, it sounds like MSXML2.DOMDocument cannot decompress gzip-encoded responses. If I disable HTTP/1.1 in the Internet Explorer settings, it works better, since in that case there is no gzip encoding support. Nevertheless, for each page Google returns, the DTD is checked. This represents 31 documents downloaded...
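
On the DTD point, one thing that might be worth trying is the resolveExternals property of the MSXML DOMDocument, which controls whether external definitions such as a DTD are resolved at parse time. The sketch below is untested against this script, and the helper name configureDocument is hypothetical.

```javascript
// Apply parse settings to an MSXML DOMDocument-like object and return it.
function configureDocument(doc) {
	doc.async = false;            // load synchronously, as in the listing
	doc.validateOnParse = false;  // do not validate against the DTD
	doc.resolveExternals = false; // assumption: avoids fetching the external DTD
	return doc;
}

// Under Windows Script Host:
//   var doc = configureDocument(WScript.CreateObject('MSXML2.DOMDocument'));
```

A likely trade-off: with externals unresolved, entities that are declared only in the DTD (for example &nbsp; in XHTML) may cause the parse to fail, so this may or may not help depending on the markup returned.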
nori, cheers for that.
albert arul prakash
Working fine for me. It's lovely.
same here. No output when not using any proxy
I don't know. it works really well for me. Check if cscript is your default scripting engine. If not, just follow the steps described above.
followed all three steps man, still, it doesn't work. Strange.