Yahoo Site Explorer Spider

Sun, 15 Jul 2007 21:47:10 GMT
by pdp

Here you go. On this page you will find a POC of a client-side spider built on top of the Yahoo Site Explorer PageData service, which you can read more about on this page.

I've been talking about client-side spiders for quite some time now, over here and here, and I even came up with a POC based on Yahoo Pipes for my OWASP presentation "Advanced Web Hacking Revealed", which you can find over there.

Web spiders in themselves are nothing new. They have been with us for quite some time now and there is no point in discussing what they can do. However, spiders are the first step towards a successful web attack. Obviously, in order to find the weaknesses within a web application, we first have to enumerate all of its entry points, and this is where we launch spiders. Sometimes spiders are semi-automatic or completely automatic and may carry attack payloads.

There are plenty of worm-like spiders around the Web, but most of them require server-side support. Fortunately for us, or perhaps not, this is no longer the case when it comes to AJAX technologies and the fast-developing world of Web2.0. Today, it is possible to write spiders that are completely client-side based, i.e. written entirely in JavaScript.

"But how is that possible? I though JavaScript cannot access pages outside of the current origin. Is that a browser bug?" Nope! This is not a browser bug. It is a feature of the Web. In my case, I am using Yahoo to provide me with an index of resources crowed by theirs and Google's spiders. This index is provided as a JSON service. Here is a description of what the service does:

> The Page Data service allows you to retrieve information about the subpages in a domain or beneath a path that exist within the Yahoo! index. (Yahoo Developer)
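To give you an idea of how simple this is, here is roughly what a request to the Page Data service looks like from JavaScript. Since `<script>` elements are not restricted by the same-origin policy, the JSON response (wrapped in a callback of our choosing) lands straight in the page without any server-side proxy. The endpoint, the `appid` value and the parameter names below follow my reading of the Yahoo Developer documentation, so treat the snippet as an illustration rather than the canonical code:

```javascript
// Illustrative sketch of a Page Data request via dynamic <script> injection.
// Endpoint and parameters (appid, query, results, output, callback) are
// assumptions based on the Yahoo Developer documentation, not guaranteed.
var pageDataCounter = 0;

function pageData(site, handler) {
    var callbackName = 'pd_' + (pageDataCounter++);

    // the service calls window[callbackName](json) once the script loads
    window[callbackName] = function (response) {
        handler((response.ResultSet && response.ResultSet.Result) || []);
        window[callbackName] = undefined; // clean up the temporary global
    };

    var script = document.createElement('script');
    script.src = 'http://search.yahooapis.com/SiteExplorerService/V1/pageData' +
                 '?appid=YahooDemo' +
                 '&query=' + encodeURIComponent(site) +
                 '&results=100' +
                 '&output=json' +
                 '&callback=' + callbackName;
    document.getElementsByTagName('head')[0].appendChild(script);
}

// list everything Yahoo has indexed beneath www.example.com
pageData('http://www.example.com', function (results) {
    for (var i = 0; i < results.length; i++) {
        console.log(results[i].Url);
    }
});
```

A single call like this can already return up to 100 indexed URLs, and none of the requests ever touch the target site itself; everything comes out of Yahoo's index.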

"This is great but how can attackers use this service?" Well the most obvious way to make use of Yahoo Site Explorer service is in the situations where attackers want to find bugs in other sites in real time. Billy Hoffman and I have presented several real life scenarios how XSS vulnerabilities can be found almost automatically from withing the client without the support from a server side technology. This is somewhat dangerous because worms can be written entirely with client-side languages such as JavaScript or ActionScrpt, which way cooler than using boring general purpose languages.

You see, worms are often quite stupid in nature. They propagate either too fast or too slow. Very often they are static and attack from specific IP ranges. During the first stage, we are able to see a rise of a particular type of traffic that originates from a particular geographical region. In order to stop further propagation, we can simply block the malicious traffic based on the worm's signature. Game over for the worm. The good guys win!

The spider that I wrote is anything but malicious. It just spiders. However, keep in mind that it would take no time to equip it with the latest client-side and server-side exploits. So, here is the spider's source code:

[/files/2007/07/spider.js](/files/2007/07/spider.js)
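If you just want the gist without reading through the file, the core logic boils down to something like the sketch below. This is a simplified, hypothetical reconstruction that reuses the `pageData()` helper from the previous snippet, not the code behind the link above: every URL returned by the Page Data service is reported and then fed back into the service, until a configurable depth limit is reached.

```javascript
// Simplified sketch of the idea, not the actual spider.js.
// Relies on the pageData() helper defined earlier.
function Spider(options) {
    this.maxDepth = options.depth || 2; // how many levels deep to recurse
    this.onPage   = options.onPage;     // callback fired for every discovered URL
    this.seen     = {};                 // avoid querying the same URL twice
}

Spider.prototype.crawl = function (url, depth) {
    depth = depth || 0;

    if (depth > this.maxDepth || this.seen[url]) {
        return;
    }
    this.seen[url] = true;

    var self = this;
    pageData(url, function (results) {
        for (var i = 0; i < results.length; i++) {
            var found = results[i].Url;
            self.onPage(found, depth);
            self.crawl(found, depth + 1); // recurse into the discovered resource
        }
    });
};
```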

_and this is how I use it:_

[/files/2007/07/spider-init.js](/files/2007/07/spider-init.js)
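Again in simplified form (the real spider-init.js may well differ), initialising the sketch above boils down to this:

```javascript
// Hypothetical initialisation matching the Spider sketch above.
var spider = new Spider({
    depth: 2,
    onPage: function (url, depth) {
        // do whatever you like with each discovered resource;
        // here we simply print it
        console.log(depth + ': ' + url);
    }
});

spider.crawl('http://www.example.com');
```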

Keep in mind that this spider is ultra fast. It needs only a few requests to obtain the entire directory structure of the targeted website, since all the actual crawling has already been done by Yahoo. You can launch the POC from here.

Archived Comments

David Kierznowski
Nice proof of concept. I agree, the speed of this spider and the depth option is awesome! Only a technical question, but can we call this a web spider as it doesn't really spider the site but rather fetches pages already indexed by Yahoo?
pdp
Interesting question. I am not quite sure, I must say. From my perspective, it is still a spider. However, what name do you think would suit a tool like this one best?
David Kierznowski
Yahoo calls it "Site Explorer" :) However, it's interesting to me that the service supports a ping and update feature, so maybe web spider is just fine.
pdp
You are right. Although I seriously doubt that Yahoo will crawl the pinged website right away (more likely it will schedule it for some later point), attackers who are interested in particular sites can ping for updates constantly, thus ensuring that Yahoo has the latest index.
Adrian Pastor
Nice one pdp. I'm really impressed by the speed of the results! Couldn't believe it! Damn you, you got me excited about this, even though I have other exciting toys to play with on my todo list :) Regarding the name, I would call it a "passive spider", since it doesn't actually visit the target site, which makes it more ideal from the cracker's perspective.
pdp
heh, I am working on another one which allows you to perform XSS scans right from the browser.
Antonio Vivaldi
I only got excited until I read the code. I need a JS spider that actually HITS the site (not one that fetches pages from a pre-spidered list of Yahoo's). The Web 2.0 world needs a new JS spider to handle link-checking, perf stats and a host of other things. AJAX components and DHTML screw every spider that is not based on JS and a DOM. Where can we go? Has anyone written a true JS and DOM based spider (like a FF plug-in) that actually spiders???
pdp
It is not possible to construct a true AJAX spider. The Same Origin Policy applies.