Problems with Web Scraping

Web scraping is usually easy to get started with, especially on a small scale. However, as you try to scale it up, it becomes exponentially more difficult. Scraping 10,000 records can easily be done with a simple web scraper script in any programming language, but as you try to scrape millions of pages, you need to architect and build features into your web scraping script that allow you to scale, maintain, and unblock your scrapers.

DataHen Till solves the following problems:

Scaling your scraper

Scraping millions or even billions of records requires much more pre-planning. It's not as simple as running your existing web scraper script on a machine with more CPU and RAM.
You need to think through questions such as:

  • How to log massive volumes of HTTP requests.
  • How to troubleshoot HTTP requests when they fail at scale.
  • How to minimize bandwidth usage.
  • How to rotate proxy IPs.
  • How to handle anti-scraping measures.
  • What happens when a scraper fails.
  • How to resume scrapers after they are fixed.
  • etc.

Till provides a plug-and-play way to make your web scrapers scalable and maintainable, following the best practices we use at DataHen to make web scraping a pleasant experience.
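For example, if your scraper uses an HTTP client that supports proxies, pointing it at a running Till instance is a one-line configuration change. Here is a minimal sketch in Python, assuming Till is listening locally (the host and port below are assumptions; use the address your own Till instance listens on):

```python
import requests

# Assumed address of a locally running Till instance; adjust to your setup.
TILL_PROXY = "http://localhost:2933"

proxies = {"http": TILL_PROXY, "https": TILL_PROXY}

# The scraper logic itself is unchanged; only the proxy setting is added.
# Note: if Till intercepts HTTPS traffic, you may also need to trust its
# CA certificate (or relax TLS verification while testing).
response = requests.get("https://example.com/products", proxies=proxies)
print(response.status_code)
```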

Blocked scraper

As you scale up the number of requests, the target websites will often detect your scraper and try to block your requests with CAPTCHAs, throttling, or outright denial.

Till helps you avoid being detected as a web scraper by making your scraper identify itself as a real web browser. It does this by generating random user-agent headers and rotating the proxy IPs (that you supply) on every HTTP request.
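The underlying idea looks roughly like this (a simplified sketch, not Till's actual implementation; the user-agent strings and proxy addresses are placeholders):

```python
import random

# Placeholder values; Till maintains its own pool of browser user-agents,
# and the proxy IPs come from the list you supply.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...",
]
PROXY_IPS = ["10.0.0.1:8888", "10.0.0.2:8888"]

def randomized_request_settings():
    """Pick a fresh browser identity and exit IP for each outgoing request."""
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxy": random.choice(PROXY_IPS),
    }
```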

Till also makes it easy for you to troubleshoot why the target website blocks your scraper.

Scraper Maintenance

Maintaining high-scale scrapers is challenging due to the massive volume of requests and interactions between your scrapers and the target websites. For smooth operation, you need to think through how you will maintain your scrapers on an ongoing basis.

You need to know how to raise and triage errors as they occur in your scrapers. Not all web scraping errors should be treated equally: some are ignorable, and some are urgent. You therefore need to work out the details of your "development-deployment-maintenance" process.

Till solves this by logging all your HTTP requests and categorizing them as successes (2XX statuses) or failures (non-2XX statuses). Till also provides a Web UI for analyzing the request history and making sense of what happened during your scraping process.
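Conceptually, the categorization is a simple split on the response status code (a sketch only; Till records this for you automatically):

```python
def categorize(status_code: int) -> str:
    # 2XX responses count as successes; everything else is a failure.
    return "success" if 200 <= status_code < 300 else "failure"
```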

Till makes scraper maintenance even easier by assigning each request a unique Global ID (GID) derived from the request's URL, method, body, etc. You can then use this GID to troubleshoot exactly where your scraper went wrong.
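Conceptually, a GID is a deterministic fingerprint of the request: the same request always produces the same ID. A minimal sketch of the idea (the exact fields and hash function Till uses may differ):

```python
import hashlib

def global_id(method: str, url: str, body: bytes = b"") -> str:
    """Derive a stable ID from the request's identifying parts."""
    h = hashlib.sha256()
    for part in (method.upper().encode(), url.encode(), body):
        h.update(part)
    return h.hexdigest()

# Identical requests map to the same GID, so a GID from your logs
# can be used to look up that exact request's history later.
gid = global_id("GET", "https://example.com/products?page=1")
```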

Postmortem analysis & reproducibility

One of the biggest difficulties facing any web scraper developer is dealing with scraping failures. Your scraper fails when fetching or parsing certain URLs, but when you look at the target website and those URLs yourself, everything seems fine. How do you troubleshoot something that has already happened? How do you reproduce that failed scrape so that you can fix the issue?

Till stores all HTTP requests and responses (including the response body/content) in a local cache. If your scraper encounters an error at any time, you can use the request's GID (the Global ID that Till assigns to every request) to find the request, the actual response, and its content in the cache. In this way, you can analyze what went wrong with that particular request.
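A postmortem lookup might then conceptually work like this (a sketch assuming a simple key-value cache keyed by GID; Till's actual storage layout may differ):

```python
def postmortem(cache: dict, gid: str) -> None:
    """Inspect what the target server actually returned for a failed request."""
    entry = cache.get(gid)
    if entry is None:
        print(f"No cached response for GID {gid}")
        return
    print("Status:", entry["status"])
    print("Headers:", entry["headers"])
    print("Body snippet:", entry["body"][:500])  # the content your parser saw
```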

Starting over from scratch when a scraper fails

Websites change all the time, and without notice. Imagine running your web scraper for a week and then, somewhere along the way, it suddenly fails. Frustratingly, once you've fixed the scraper, there is a high chance that you'll need to start over from scratch. On top of this come further consequences, such as lost time and additional charges for proxy usage, bandwidth, storage, VM costs, etc.

Till solves this by allowing you to replay your scrapers without actually resending the HTTP requests to the target server.
Till does this by assigning each HTTP request a unique Global ID (GID) generated from the request's URL, method, headers, etc. It then stores all HTTP responses in the cache under their GIDs.

When you restart your scraper, the scraping process can go blazingly fast because Till now serves the cached versions of the HTTP responses. All of this happens without any code changes to your existing web scraper.
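The replay logic boils down to a cache-or-fetch decision per request, keyed by GID (again a simplified sketch of the idea, not Till's actual code):

```python
def fetch_with_cache(cache: dict, gid: str, send_request):
    """Serve a cached response when one exists; otherwise hit the target server."""
    if gid in cache:
        return cache[gid]          # replay: no network round trip needed
    response = send_request()      # first run: fetch from the target website
    cache[gid] = response          # store for future replays
    return response
```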