Friday, May 31, 2013

Are Scraping Websites Destructive to Bloggers and Genealogy?

In one of my last posts, I explored the reasons why I believe that "scraping," in the form I found it, may not be a violation of copyright law. There are still other considerations that apply to this type of activity. The key issue with this type of website is content. But first, a review and an analysis to try to differentiate scraping from legitimate activities.

There are three characteristics that define a "scraping" website:

  • The scraping website copies and republishes the exact content of other websites
  • The scraping website contains no original content or value
  • The scraping website provides no unique organization or benefit to the user

The initial difficulty here is differentiating between "scraping," as an activity, and news aggregators or readers. Essentially, what a reader does and what a scraper does are identical. The main, and most important, difference is that a reader or aggregator acts on the voluntary and intentional choices of a user who benefits from the aggregation. Commonly, a reader or aggregator will use an RSS feed subscription that has to be initiated by a specific user of the reader or aggregator program. In contrast, the scraper website collects content from sites determined by the programmer of the website, not for the programmer's personal use, but for the purpose of attracting "hits" to promote advertising on the scraping website and thereby benefit the programmer.

At first glance, you may think that the key difference between the two types of activities is whether permission was obtained from the target websites, meaning the sites aggregated by the reader's user and the sites scraped by the website scraper. This is not the case. I can use any number of aggregator programs to "watch" specific websites for changes in content, with or without the permission or knowledge of the target website. So we must look elsewhere to determine the reason why we would accept the actions of a reader or news aggregator and decry the actions of a website scraper.

I think the key difference is the venue where the scraped or aggregated content is used or displayed. If I subscribe to a website through a reader or aggregator, the resulting content is displayed for my personal use in my own program. As I noted already, a website scraper has an entirely different motivation. The content is displayed openly on the Web and is used to attract "hits" to the scraping website with the expectation that the hits will generate income. But then this point raises another issue. Aren't all of the websites that are monetized doing exactly the same thing? Don't those of us who have advertising on our blogs or websites hope that the clicks on the ads will generate some form of income? If I go on Google+ or Facebook and give people a link to a website I suggest, as is done in many technology groups on the Web, aren't I doing the same thing as a scraping website? It would be nice to believe that the contributors with links to suggested websites were motivated by altruistic desires to better mankind, but their participation is more likely monetarily based. They would like to attract more traffic to their own websites and thereby promote their own economic well-being. But again, we see this activity as "valid" when done in a certain context and other, similar activities as invalid.

So what is it about a website scraper that is objectionable? If a monetary motive is commonly assumed for most web content and the scrapers aren't doing anything much different from other legitimate activities, why do we think scraping is bad and the other activities are "good" or "acceptable"?

I need to get back to the idea of permission. Do I need your permission to provide a link to your website? Obviously not. Do I need your permission to summarize or quote some of your content? Again, obviously not. But the second question depends on the extent of my quote or copy. This gets back into the issue of fair use in copyright law, which I will leave alone for now. Does the issue with scraping websites, then, depend only on the fact that the scraping is done without the permission or even the knowledge of the target site? It doesn't appear to me that permission is the issue. For example, if my site is included in an article talking about the forty best blogging sites, am I going to be upset? Not likely.

So why is there outrage about scraping? If the scraping website uses more than a link to the target sites, there may well be copyright issues. If the scraping website includes a substantial portion of the content of the target sites, again, there may be copyright issues. But the real issue here is something a lot less obvious.

Scraping websites are more like spam and graffiti than they are like legitimate websites containing content. At the core, they are destructive rather than constructive. They take up time and space on the Internet without adding anything of value in return. It is the lack of original content that is the issue. You can't claim freedom of speech when there is no speech. In other words, they are worse than the people who drop flyers and business cards in my front yard, because they are using my content to promote their own purposes without either adding value or providing a meaningful service. They are essentially spam. In addition to wasting people's time with an unwanted website, spam and scrapers both eat up a lot of network bandwidth. In that way, too, scraping falls into the category of being destructive rather than constructive.


  1. A valid point of view on a subject that could affect all of us genealogy bloggers. Thanks for posting an interesting and informative article.

  2. According to your differentiation between aggregator and scraper, Google News falls into scraping, right? The user of Google News doesn't subscribe to feeds privately. The content is provided as a means to generate ad revenue. (And I suspect a portion of the material isn't from RSS feeds, but collected via Google search bots.)

    1. Well, the difference is not that great. It is a difficult area. There is a continuum of websites and it is very difficult to draw the line between what is a scraping site and what is not. Hence the post and my opinion.