Google Web Scraper
by Sapphire (July 14, 2005)
As I mentioned yesterday, I started playing with Mambo in May and immediately noticed some security issues on the site hosting the installation: namely, a nonexistent authorized user showing up in AwStats. I realized the timing might be a coincidence, but I backed off from Mambo until I had time to do some more research. (I’ve been in the thick of finding new hosts, getting everything up and running, and other general maintenance lately. Bleh.)
Then I noticed little issues on other domains, such as hits to a landing page that doesn’t exist in any way, shape or form. I began to doubt Mambo was an issue, after all. Having completed my research, I now think Google was the culprit. I’ll explain why, but first let me give you some background.
Graywolf’s article on the Google Web Accelerator is one of the best explanations of what it is, what it’s meant to do, and what it’s really likely to do. I’ll try to summarize: the Accelerator is basically scraping your site to get pages to hand people. It’s very likely they will scrape items you don’t want anyone seeing and waste your bandwidth sharing them with folks. The potential is a disastrous invasion of your privacy, but the reality has yet to be seen.
I think I’m seeing it. The weird things I’ve been seeing in my logs led me to think someone was scraping my site, and the Accelerator is a scraper that started up just a few weeks before I first saw problems.
Could it possibly be something else? Yes, always. There’s so much going on with websites that it’s hard to pin down issues. But Graywolf also raised the possibility that the Accelerator will screw up your HTML in the scraping, and I’ve noticed lately that my cached pages in Google show a very strange rendition of my CSS that looks terrible. So Google’s doing something wrong.
Here are some tips for blocking the GWA bots altogether, although I find myself wondering: will Google penalize sites that block them?
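
For the curious, one approach that was widely passed around was to refuse any request carrying an `X-moz: prefetch` header, which the Accelerator was reported to send with its prefetch requests. Here is a minimal sketch of that check in Python. It is my own illustration, not taken from the tips above; the function names are made up, and it assumes the prefetch requests really do carry that header.

```python
def is_prefetch_request(headers):
    """Return True if the request headers carry an 'X-moz: prefetch' marker.

    Assumption: the Accelerator's prefetch requests include this header.
    Header names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "x-moz" and value.strip().lower() == "prefetch":
            return True
    return False


def status_for(headers):
    """Hypothetical helper: 403 (Forbidden) for prefetches, 200 otherwise."""
    return 403 if is_prefetch_request(headers) else 200


if __name__ == "__main__":
    print(status_for({"X-moz": "prefetch"}))
    print(status_for({"User-Agent": "Mozilla/5.0"}))
```

Note that this only turns away prefetching. A page someone actually clicks on through the Accelerator would still be served, so it is a partial block at best.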

