How to recognize a scraper site

by Sapphire (November 17, 2005)

While cleaning up my outbound links on directory sites and article blogs, I realized: I’m not sure I know a scraper site from a plain old crappy site. I’m not sure there is a definite way to tell. But I need to get better at guessing. So I started searching on Google for “how to recognize a scraper site” or “identifying scraper site” and lots of other variations, and found nothing but scraper tools. You know how people are always yelling that AdSense promotes scraper sites and scrapers go to the top of the Google SERPs? Well, I tried Dogpile, and suddenly the right results were coming out on top.

Here’s VileSilencer’s forum putting together a list of scraper sites with some tips for recognizing them on your own. Looking at this, I’m starting to think my confusion comes from the terms “scraper” and “spam” being used interchangeably. I was thinking of scraper sites as sites using scraper technology, but a lot of the ones in this list could be put together with simple HTML, for all I know. They’re just spammy.

Most frustrating thing of all - they all have Google PR.

One of the ones on VileSilencer’s thread has PR3 and is a classic example of crap. AdSense everywhere, keywords bolded for the spider’s benefit, plenty of keyword stuffing. I would have thought of it more as “Adsense spam” than a scraper, per se, but I think most of us would recognize it as a site we don’t want to give outbounds, and that’s the point.

Another one, however, also with PR3, looks quite nice. Just one AdSense unit instead of the normal 3 parked where you almost can’t avoid clicking them. :D But click one of the links at the bottom, and you get a page of a kajillion truncated articles, and lots of them have links at the end called “external links”, because they’re taking you to the site that got scraped.

One thing I realized today is, there’s a reason you can’t easily tell a scraper from someone making a portal of RSS feeds: there is no difference. According to this article on plagiarism, RSS feeds are the mainstay of scrapers (and now you know why I generally serve my posts in a truncated fashion and force you to click here to read the rest).

I guess the bottom line is, there’s nothing illegal about re-heating and serving up other people’s RSS feeds if you properly credit them. But it’s still spam, for both visitors and webmasters. Why? For visitors, it often puts one more click between them and the article they really want to read. Even when it doesn’t - when your entire article is posted on someone else’s site - the article is generally coming to them on an obnoxious site with popups and AdSense crap instead of your nice, navigable, visitor-friendly site. And for webmasters, the problem’s obvious: it puts a click between visitors and us, and keeps some visitors from ever reaching us.

There’s another site on VileSilencer’s list that looks crappy but innocent enough at first, with just two AdSense units, PR3 and a nice enough layout. What’s wrong with it? When you click one of the links within the article content, it takes you to a page filled with truncated articles and “read more about” links. Now, what I’m wondering is: what if the articles just contained contextual links to other complete articles, and at no point did you have to click to “read more”? Would that be a genuine site? That’s a real question. Anybody know? I’m guessing so, because that’s a legitimate way to keep people navigating around your site (and help them find more specifically what they need).

Reading further in the VileSilencer thread, I realized I’d approved some car and real estate sites before noticing there seemed to be billions of them coming from the same people, which raised my suspicions. Now, again, these aren’t what I think of as scraper sites - just really bad SEO, totally useless to visitors. I think I nixed them all, but I’ll be going through again soon to check.

Dan (VileSilencer admin) refers to one PR2 site as a perfect example of the type of site you don’t want to link to. As he notes, it may or may not be a “scraper site” per se, but it’s easily recognized as crap:

“The worst thing about it is, the ads outweigh the level of the content. I have no qualms with adsense. It’s a system we all use, but when you get a page whose content has 900px worth of ads in height (in 4 separate google ads) and only this one paragraph as content

The generic term “computer law” refers to a collection of laws aimed at protecting the computer and data transmitted via computer. The definitions stated in the law for “computer” and “computerized subject” are broad as their intention is to make secure a wide variety of issues that justify legal protection.

for the entire page. You know you’ve got a scraper, or a purely made for adsense site. It’s 1 of the 2.”
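For anyone who would rather jot these signs down than eyeball them every time, here is a rough Python sketch of the patterns mentioned so far: ads outweighing content, more than the normal 3 ad units, and pages made up mostly of truncated articles. Everything in it is my own assumption for illustration: the `PageSignals` fields, the `warning_signs` function and especially the thresholds are guesses, not anything VileSilencer or Google publishes. You fill in the numbers by hand after looking at a page; it doesn’t fetch or parse anything.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSignals:
    """Observations about one page, collected by hand (hypothetical fields)."""
    ad_units: int            # number of ad blocks on the page
    ad_height_px: int        # total vertical space the ads take up
    content_words: int       # words of actual article text
    truncated_articles: int  # excerpts ending in "read more" / "external links"
    full_articles: int       # articles shown in full


def warning_signs(page: PageSignals) -> List[str]:
    """Return the warning signs a page trips. All thresholds are guesses."""
    signs = []
    # Dan's example: 900px of ads against a single paragraph of content.
    if page.ad_height_px >= 600 and page.content_words < 150:
        signs.append("ads outweigh content")
    # More than "the normal 3" ad units.
    if page.ad_units > 3:
        signs.append("more ad units than usual")
    # A page that is almost entirely excerpts of other people's articles.
    total = page.truncated_articles + page.full_articles
    if total > 0 and page.truncated_articles / total > 0.8:
        signs.append("mostly truncated articles")
    return signs


if __name__ == "__main__":
    # Roughly the PR2 page Dan describes: 4 ads, 900px, one short paragraph.
    dans_example = PageSignals(
        ad_units=4, ad_height_px=900, content_words=50,
        truncated_articles=0, full_articles=1,
    )
    for sign in warning_signs(dans_example):
        print(sign)
```

A page that trips none of these can still be junk, and an honest RSS portal could trip the last one, so treat the output as a prompt to look closer rather than a verdict.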

I’m still fairly confused, but beginning to see some patterns.
