Documentation
DataHen Till is a companion tool to your existing web scraper that instantly makes it scalable and maintainable, with minimal code changes to your scraper. It integrates with any scraper in 5 minutes.
Till was architected to follow the best practices that DataHen has accumulated over years of scraping at massive scale.
Till works as a man-in-the-middle (MITM) proxy that listens for incoming HTTP(S) requests and forwards them to the target server as needed. While doing so, it enhances each request to avoid detection by anti-scraping systems. It also logs and caches the responses to make your scraper maintainable and scalable.
Connect your scraper to Till via the proxy protocol, which is commonly supported by HTTP libraries in any programming language.
Your scraper will then continue to run as-is and it will become more scalable and maintainable.
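As a minimal sketch of what connecting through the proxy protocol can look like, the Python snippet below routes requests through Till using only the standard library. The address `http://localhost:2933` is an assumption (use whatever host and port your Till instance listens on), and `till_proxies` and `build_till_opener` are hypothetical helper names. Because Till acts as a MITM proxy for HTTPS, your HTTP client will usually also need to trust Till's CA certificate; that step is not shown here.

```python
import urllib.request

# Assumed address of a locally running Till instance; adjust to your setup.
TILL_PROXY_URL = "http://localhost:2933"


def till_proxies(proxy_url: str = TILL_PROXY_URL) -> dict:
    """Return a proxy mapping that routes HTTP and HTTPS traffic through Till."""
    return {"http": proxy_url, "https": proxy_url}


def build_till_opener(proxy_url: str = TILL_PROXY_URL) -> urllib.request.OpenerDirector:
    """Build a urllib opener whose requests go through the Till proxy."""
    handler = urllib.request.ProxyHandler(till_proxies(proxy_url))
    return urllib.request.build_opener(handler)


# Usage (requires a running Till instance, so it is left commented out):
# opener = build_till_opener()
# with opener.open("http://example.com", timeout=10) as response:
#     print(response.status)
```

The rest of the scraper stays unchanged; only the proxy setting differs. Most third-party HTTP libraries accept a similar proxy mapping.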
The following Till features are intended to help you easily scale, maintain, and unblock your existing scrapers without many code changes on your part.
There is no need to build cookie management logic into your scraper code. Till can store the cookies for you so that you can easily reuse them on subsequent requests.
Till automatically generates a random user-agent on every request. You can identify your scraper as a desktop browser or a mobile browser, or override it with your own custom user-agent.
Supply a list of proxy IPs, and Till will randomly use them on every request. This saves you the time of setting up a separate proxy rotation service.
Your scraper can selectively reuse the same user-agent, proxy IP, and cookie jar for multiple requests. This allows you to easily group your requests by workflow and helps you avoid detection by anti-scraping systems.
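The idea of grouping requests under one reusable identity can be modeled roughly as follows. This is a conceptual sketch only, not Till's implementation: `StickySession`, `SessionStore`, and the caller-chosen session ID are hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass, field
from http.cookiejar import CookieJar


@dataclass
class StickySession:
    """Identity reused by every request in the same group (hypothetical model)."""
    user_agent: str
    proxy: str
    cookies: CookieJar = field(default_factory=CookieJar)


class SessionStore:
    """Maps a caller-chosen session ID to one fixed identity."""

    def __init__(self, user_agents, proxies, rng=None):
        self._user_agents = list(user_agents)
        self._proxies = list(proxies)
        self._rng = rng or random.Random()
        self._sessions = {}

    def get(self, session_id: str) -> StickySession:
        # The first use of an ID picks a random identity; later uses reuse it.
        if session_id not in self._sessions:
            self._sessions[session_id] = StickySession(
                user_agent=self._rng.choice(self._user_agents),
                proxy=self._rng.choice(self._proxies),
            )
        return self._sessions[session_id]
```

Requests that share a session ID then present the same user-agent, proxy IP, and cookies, while requests with a different ID get an independent identity.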
Till logs each request as either successful (2XX status code) or failed (non-2XX status code). This allows you to easily troubleshoot your scraper later.
The Till UI helps you make sense of the HTTP request history and troubleshoot what happened during a scraping session.
Till caches all of your HTTP responses (and their contents) so that, as needed, your web scraper can reuse the cache without making another HTTP request to the target server.
You can choose whether to use a particular cached response by specifying how fresh the cache must be for Till to serve it. For example, if Till holds cached content that is 1 week old, but your web scraper only wants content that is at most 1 day old, Till will not serve the week-old entry. It will only serve cached content that is 1 day old or newer.
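The freshness rule described above amounts to a simple age comparison. The sketch below illustrates that rule only; `is_fresh_enough` is a hypothetical helper, and how Till actually stores timestamps or receives the freshness setting is not covered here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh_enough(
    cached_at: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if a cached response is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - cached_at <= max_age


# The example from the text: a week-old entry and a scraper that
# accepts content up to 1 day old.
now = datetime(2024, 1, 8, tzinfo=timezone.utc)
week_old = now - timedelta(weeks=1)
serve_from_cache = is_fresh_enough(week_old, timedelta(days=1), now=now)
# serve_from_cache is False, so the week-old entry would not be served.
```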
Till uses the DataHen Scrape Platform's convention of marking every unique request with a signature (we call this the Global ID, or GID for short). Think of it as a checksum of the actual request.
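To make the checksum analogy concrete, here is one way a request signature could be computed. This is purely illustrative: the fields included, the canonical form, and the use of SHA-256 are assumptions, and the output will not match the GIDs that Till actually generates.

```python
import hashlib
import json


def request_signature(method, url, headers=None, body=b""):
    """Return a stable hex digest for a request (illustrative, not Till's GID)."""
    canonical = json.dumps(
        {
            "method": method.upper(),
            "url": url,
            # Lowercase and sort header names so ordering and case do not matter.
            "headers": sorted((k.lower(), v) for k, v in (headers or {}).items()),
            "body": hashlib.sha256(body).hexdigest(),
        },
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two requests with the same method, URL, headers, and body produce the same signature, which is what makes the value usable as a lookup key for logs and cache entries.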
Any time your scraper sends a request through Till, Till returns a response with the header X-DH-GID, which contains the GID. This GID allows you to easily troubleshoot requests when you need to look up specific requests in the log or contents in the cache.
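A scraper can record the GID alongside its own logs so that a request is easy to find later. The helper below is a small sketch that reads the X-DH-GID header from any mapping of response headers; `extract_gid` is a hypothetical name, and the sample value in the usage comment is made up.

```python
from typing import Mapping, Optional

GID_HEADER = "X-DH-GID"


def extract_gid(headers: Mapping[str, str]) -> Optional[str]:
    """Return the GID from response headers, or None if it is absent."""
    wanted = GID_HEADER.lower()
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == wanted:
            return value
    return None


# Usage with a plain dict standing in for real response headers:
# extract_gid({"Content-Type": "text/html", "x-dh-gid": "example-gid"})
```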
Till allows you to intercept any request based on any pattern and serve a different response. This is useful for scaling your scrapers, bypassing third-party analytics such as Google Analytics, and so on.
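Conceptually, pattern-based interception is a lookup from URL patterns to canned responses. The sketch below models that idea in Python; it is not Till's configuration format, and `InterceptRule`, `CannedResponse`, and `match_interceptor` are hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CannedResponse:
    """Response served instead of contacting the target server."""
    status: int
    body: bytes = b""


@dataclass
class InterceptRule:
    pattern: re.Pattern
    response: CannedResponse


def match_interceptor(url: str, rules: List[InterceptRule]) -> Optional[CannedResponse]:
    """Return the canned response of the first rule whose pattern matches the URL."""
    for rule in rules:
        if rule.pattern.search(url):
            return rule.response
    return None


# Example rule: answer analytics requests with an empty 204 response.
rules = [InterceptRule(re.compile(r"google-analytics\.com"), CannedResponse(204))]
```

Requests that match no rule return `None`, meaning they would be forwarded to the target server as usual.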