HTTP Caching
Till can cache the HTTP responses for requests that pass through it from your scraper to the target server.
This is very beneficial for your scrapers in the following ways:
Note: When enabled, HTTP caching works out of the box and requires few code changes to your scraper. Your scraper immediately benefits from cached responses being served. You can also use the cache in a more granular way; to do so, read the advanced usage section.
Whenever a new request is sent through Till, it will first check if there is an existing cached response. If it doesn't exist, Till then forwards the request to the target server and then saves the response in the cache store.
The next time an identical request passes through Till, the cached response is served if one exists.
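The lookup flow described above can be sketched as a small read-through cache. This is only an illustration under stated assumptions, not Till's implementation: the function name `fetch_through_cache`, the in-memory dictionary used as the cache store, and the `fetch` callable standing in for the target server are all hypothetical.

```python
from typing import Callable, Dict


def fetch_through_cache(
    key: str,
    cache: Dict[str, bytes],
    fetch: Callable[[], bytes],
) -> bytes:
    """Return the cached response for `key`, or fetch it, store it, and return it.

    `key` stands in for the request signature, `cache` for the cache store,
    and `fetch` for the real request to the target server (all hypothetical).
    """
    if key in cache:
        # Cache hit: the target server is not contacted.
        return cache[key]
    # Cache miss: forward the request, then save the response for next time.
    response = fetch()
    cache[key] = response
    return response
```

Calling the function twice with the same key invokes `fetch` only once; the second call is answered from the dictionary.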
Whenever an HTTP request is sent through Till, Till generates a signature that uniquely identifies the request. This signature is called a GID (Global ID). Till then uses the GID to store and read content in the cache store.
The request signature is made up of the following:
The GID is useful for troubleshooting and maintaining your scrapers, as it lets you pinpoint and trace how specific requests behave throughout your scrapers' activity.
The following are some examples of the GIDs that are generated:
URL | GID |
---|---|
https://fetchtest.datahen.com | fetchtest.datahen.com-f4494a940b4f72fc42de128a8e227b34 |
https://fetchtest.datahen.com/echo/request | fetchtest.datahen.com-144a91f641d36c08dade39f739b05d31 |
https://www.datahen.com | www.datahen.com-86b31b8ca33241b881c3e71d4a210e6b |
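Judging only from the examples in the table above, a GID looks like the request's host, a hyphen, and a 32-character hexadecimal signature. The helper below is a sketch based on that observation; `split_gid` is a hypothetical name, and the format is inferred from the examples rather than taken from a specification. It splits on the last hyphen because host names can themselves contain hyphens.

```python
from typing import Tuple


def split_gid(gid: str) -> Tuple[str, str]:
    """Split a GID into (host, signature), assuming the 'host-signature' shape.

    The shape is inferred from the example GIDs; it is not a documented format.
    """
    host, separator, signature = gid.rpartition("-")
    if not separator or not host or not signature:
        raise ValueError(f"unexpected GID shape: {gid!r}")
    return host, signature
```

For example, `split_gid("www.datahen.com-86b31b8ca33241b881c3e71d4a210e6b")` yields the host `www.datahen.com` and the signature part separately, which can help when grouping log lines by host.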
For every HTTP request sent through Till, the response includes an X-Dh-Gid header that tells you the GID of the request.
The following example uses curl to see this response header:
$ curl 'https://fetchtest.datahen.com/echo/request' -kv --proxy http://localhost:2933
...
< X-Dh-Gid: fetchtest.datahen.com-144a91f641d36c08dade39f739b05d31
...
By default, Till only caches and serves successful responses. This is useful because scrapers typically only care about successful responses, not failed ones. Failed responses are still needed to detect issues with your scrapers, which is why Till does not cache them by default.
The following is the list of HTTP status codes that Till considers successful:
Note: If you would like Till to cache and serve all responses (both successful and failed ones), please read the Serving Failed Responses section.
Once HTTP caching is enabled in the configuration, your scraper immediately benefits from it, with few changes to your scraper code.
Your request can specify freshness criteria that determine whether an existing cached response is fresh enough to be served. If it is not, Till makes a real request to the target server and serves the latest response.
Note: By default, Till only caches and serves successful responses. If you would like Till to cache and serve all responses (successes and failures), please look at the Serving failed responses section.
In this example we are going to send an identical request three times:
Let's first send a request to demonstrate that the response is not yet served from the cache.
Send a request to the following URL using curl:
$ curl 'https://fetchtest.datahen.com/echo/request' -kv --proxy http://localhost:2933
...
< X-Dh-Gid: fetchtest.datahen.com-144a91f641d36c08dade39f739b05d31
<
GET /echo/request HTTP/2.0
Host: fetchtest.datahen.com
Accept: */*
Accept-Encoding: gzip
Cdn-Loop: cloudflare
Cf-Connecting-Ip: 142.252.79.231
Cf-Ipcountry: US
Cf-Ray: 67764d97980c2536-SJC
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103 Safari/537.36
X-Forwarded-For: 142.252.79.231
X-Forwarded-Proto: https
The above response headers do not contain the X-Dh-Cache-Created-At header, which means that this was a cache miss.
There is another header called X-Dh-Gid, which is the Global ID (the unique signature of this request). You can read more about it in the GID (Global ID) section.
Next, send the same request again. This time, Till serves the response from the cache.
$ curl 'https://fetchtest.datahen.com/echo/request' -kv --proxy http://localhost:2933
...
< X-Dh-Cache-Created-At: 2021-07-31 14:01:50.162934 +0300 +03
< X-Dh-Gid: fetchtest.datahen.com-144a91f641d36c08dade39f739b05d31
<
GET /echo/request HTTP/2.0
Host: fetchtest.datahen.com
Accept: */*
Accept-Encoding: gzip
Cdn-Loop: cloudflare
Cf-Connecting-Ip: 142.252.79.208
Cf-Ipcountry: US
Cf-Ray: 677654bbeb710256-SJC
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809 Safari/537.36
X-Forwarded-For: 142.252.79.208
X-Forwarded-Proto: https
The above response headers now show X-Dh-Cache-Created-At, which tells you when this cached response was created. You can learn more about this header in the Cache Freshness section.
Let's send the same request again, but this time bypass the cache.
In this example, we'll set the X-DH-Cache-Freshness: now header. This tells Till to ignore the cache and make a real request to the target server.
$ curl 'https://fetchtest.datahen.com/echo/request' -H 'X-Dh-Cache-Freshness: now' -kv --proxy http://localhost:2933
...
< X-Dh-Gid: fetchtest.datahen.com-144a91f641d36c08dade39f739b05d31
<
GET /echo/request HTTP/2.0
Host: fetchtest.datahen.com
Accept: */*
Accept-Encoding: gzip
Cdn-Loop: cloudflare
Cf-Connecting-Ip: 207.90.15.132
Cf-Ipcountry: US
Cf-Ray: 677679870f1203f2-LIS
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/606.1.36 (KHTML, like Gecko) Version/12.0.3 Safari/606.1.36
X-Forwarded-For: 207.90.15.132
X-Forwarded-Proto: https
Note that the above no longer contains the X-Dh-Cache-Created-At header. This means that Till correctly followed our instruction not to serve from the cache.
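In scraper code, the same hit-or-miss check from the three requests above can be made by looking for the X-Dh-Cache-Created-At header on the response. The sketch below assumes the response headers are available as a plain mapping of names to values; the function names are hypothetical. Header names are compared case-insensitively, since HTTP header names are not case-sensitive.

```python
from typing import Mapping, Optional

CACHE_CREATED_AT = "x-dh-cache-created-at"


def cache_created_at(headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-Dh-Cache-Created-At value, or None when the header is absent."""
    for name, value in headers.items():
        if name.lower() == CACHE_CREATED_AT:
            return value
    return None


def is_cache_hit(headers: Mapping[str, str]) -> bool:
    """A response counts as a cache hit when the created-at header is present."""
    return cache_created_at(headers) is not None
```

Applied to the three responses above, `is_cache_hit` would be false for the first, true for the second, and false for the third.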
Till allows more fine-grained control of the HTTP caching feature, which helps make your scraper scalable and maintainable.
When Till detects that a cached response exists, it tries to determine whether this cached response is fresh enough to be served on that request. This is called the "freshness criteria".
Then, whenever a cached response is served, Till adds an X-DH-Cache-Created-At header to the HTTP response.
Internally, Till uses the freshness criteria, and compares it with this created-at value to determine whether the cache should be served or not.
To use the freshness criteria, you can specify the default freshness in the configuration.
You can also override the freshness criteria per request by specifying the X-DH-Cache-Freshness header in an HTTP request.
HTTP Header | Value |
---|---|
X-DH-Cache-Freshness | Determines whether a cached response is fresh enough to be served for an HTTP request. Allowed values: now, minute, hour, day, week, fortnight, month, year, or any. |
The following examples demonstrate the behavior of the freshness criteria:
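As an illustration of the per-request override, the sketch below builds a request that carries the X-DH-Cache-Freshness header using Python's standard `urllib.request` module. The helper name `build_request` is hypothetical, and the code only constructs the request; sending it through Till (for instance via a proxy handler pointing at the proxy address used in the curl examples) is left out.

```python
import urllib.request

# Allowed values, as listed in the table above.
ALLOWED_FRESHNESS = frozenset(
    {"now", "minute", "hour", "day", "week", "fortnight", "month", "year", "any"}
)


def build_request(url: str, freshness: str) -> urllib.request.Request:
    """Build a request that asks for the given cache freshness.

    Rejects values that are not in the documented list so that typos are
    caught before the request is sent.
    """
    if freshness not in ALLOWED_FRESHNESS:
        raise ValueError(f"unsupported freshness value: {freshness!r}")
    return urllib.request.Request(url, headers={"X-DH-Cache-Freshness": freshness})
```

Note that `urllib.request` normalizes the capitalization of header names it stores; this is harmless because HTTP header names are case-insensitive.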
Freshness | Last cached response | Serve this cache? |
---|---|---|
now | 1 second ago | No |
now | 1 hour ago | No |
now | 1 year ago | No |
any | 1 second ago | Yes |
any | 1 hour ago | Yes |
any | 1 year ago | Yes |
minute | 1 second ago | Yes |
minute | 1 hour ago | No |
minute | 1 year ago | No |
day | 1 second ago | Yes |
day | 1 hour ago | Yes |
day | 1 year ago | No |
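The table above can be read as a simple rule: `now` never serves the cache, `any` always does, and the other values serve it only when the cached response is no older than the named period. The sketch below encodes that reading. It is not Till's code, and several details are assumptions: a month is taken as 30 days, a year as 365 days, and a response whose age exactly equals the period is treated as fresh.

```python
from datetime import timedelta

# Period lengths; "month" and "year" are approximations chosen for this sketch.
FRESHNESS_WINDOWS = {
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "fortnight": timedelta(weeks=2),
    "month": timedelta(days=30),   # assumption
    "year": timedelta(days=365),   # assumption
}


def should_serve_cache(freshness: str, age: timedelta) -> bool:
    """Decide whether a cached response of the given age is fresh enough."""
    if freshness == "now":
        return False  # always make a real request
    if freshness == "any":
        return True   # any cached response is acceptable
    if freshness not in FRESHNESS_WINDOWS:
        raise ValueError(f"unsupported freshness value: {freshness!r}")
    return age <= FRESHNESS_WINDOWS[freshness]
```

With these definitions, a response cached one hour ago is served for `day` but not for `minute`, which matches the rows in the table.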
By default, Till only caches and serves successful responses, unless you specifically configure it to also serve failed responses.
To do this, you can set the default value with cache.serve-failures in the configuration.
You can also override this setting per request by specifying the X-DH-Cache-Serve-Failures header in an HTTP request.
HTTP Header | Value |
---|---|
X-DH-Cache-Serve-Failures | Allowed values: true or false. |
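The precedence described above (a per-request header overrides the configured default) can be sketched as follows. The function name is hypothetical, and how Till itself treats a value other than true or false is not stated here; this sketch chooses to raise an error in that case.

```python
from typing import Mapping

SERVE_FAILURES_HEADER = "x-dh-cache-serve-failures"


def effective_serve_failures(config_default: bool, headers: Mapping[str, str]) -> bool:
    """Resolve whether failed responses may be served from the cache.

    The request header wins when present; otherwise the configured default
    (cache.serve-failures) applies.
    """
    for name, value in headers.items():
        if name.lower() == SERVE_FAILURES_HEADER:
            normalized = value.strip().lower()
            if normalized == "true":
                return True
            if normalized == "false":
                return False
            # Assumption of this sketch: reject anything else.
            raise ValueError(f"unsupported header value: {value!r}")
    return config_default
```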
Note: HTTP Caching is a Premium feature. If you've already upgraded your plan, you can restart Till and it will be turned on.
The following are the configuration options that you can set:
Configuration | Value |
---|---|
cache.disabled | Allowed values: true or false (default false). |
cache.serve-failures | Allowed values: true or false (default false). |
cache.freshness | Default freshness for cached responses. Allowed values: now, minute, hour, day, week, fortnight, month, year, or any (default any). |
cache.ttl | Time-to-live for the cache records. Allowed values: minute, hour, day, week, fortnight, month, year, or forever (default week). |
Note: Till stores the cache data inside your Data Directory on your local disk. You can change the TTL setting to save disk space: the shorter the TTL, the less disk space is used.
If you have already created a config file, you can add the cache settings like so:
# Cache settings
cache:
  # Disable the cache feature.
  # Defaults to false.
  disabled: false
  # TTL (Time To Live). How long a cache record will be allowed to live before it gets deleted.
  # Defaults to "week".
  ttl: "week"
  # Specifies by default how fresh a cache hit must be.
  # Defaults to "any".
  freshness: "any"
  # Specifies if Till should serve cached responses of failed HTTP requests (non 2XX statuses).
  # Defaults to false.
  serve-failures: false
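Before restarting Till with new settings, it can help to sanity-check the values against the allowed values and defaults listed in the configuration table. The sketch below does this for a dictionary that represents the `cache:` mapping after the config file has been parsed; the function name `resolve_cache_config` is hypothetical and this check is not part of Till.

```python
from typing import Any, Dict

DEFAULTS: Dict[str, Any] = {
    "disabled": False,
    "serve-failures": False,
    "freshness": "any",
    "ttl": "week",
}
PERIODS = ("minute", "hour", "day", "week", "fortnight", "month", "year")
ALLOWED_FRESHNESS = ("now",) + PERIODS + ("any",)
ALLOWED_TTL = PERIODS + ("forever",)


def resolve_cache_config(user: Dict[str, Any]) -> Dict[str, Any]:
    """Merge user settings over the documented defaults and validate them."""
    unknown = set(user) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown cache settings: {sorted(unknown)}")
    config = {**DEFAULTS, **user}
    for key in ("disabled", "serve-failures"):
        if not isinstance(config[key], bool):
            raise ValueError(f"cache.{key} must be true or false")
    if config["freshness"] not in ALLOWED_FRESHNESS:
        raise ValueError(f"unsupported cache.freshness: {config['freshness']!r}")
    if config["ttl"] not in ALLOWED_TTL:
        raise ValueError(f"unsupported cache.ttl: {config['ttl']!r}")
    return config
```

For example, `resolve_cache_config({"ttl": "day"})` returns the defaults with only the TTL changed, while `resolve_cache_config({"ttl": "now"})` raises an error because `now` is a freshness value, not a TTL value.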
Now, you just need to verify that your Till configuration is working.
To verify this, follow the example in the basic usage section.