HTTP Caching

Till can cache the HTTP responses to requests that pass through it between your scraper and the target server.

This benefits your scrapers in the following ways:

  • Makes your scrapers more scalable
  • Lets you replay your scrapers
  • Saves bandwidth and other costs

Note: When enabled, HTTP Caching works out of the box, with few code changes to your scraper. Your scraper benefits immediately from cached responses. You can also control the cache in a more granular way; to do so, please read the Advanced Usage section.

How it works

HTTP Caching Flowchart

Whenever a new request is sent through Till, Till first checks whether a cached response already exists. If none exists, Till forwards the request to the target server and then saves the response in the cache store.

The next time an identical request passes through Till and a cached response exists, the cached response is served.
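The flow described above can be sketched as a toy model. This is only an illustration of the check-then-fetch-then-store pattern, not Till's implementation: the class name, the in-memory dictionary, and the `fetch` callback are all hypothetical, and the sketch ignores freshness and the successful-response rule described later.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, bytes]  # (status code, body)


class CachingProxySketch:
    """Toy model: look up the cache first, otherwise fetch and store."""

    def __init__(self, fetch: Callable[[str], Response]) -> None:
        # `fetch` stands in for the real request to the target server.
        self._fetch = fetch
        # A plain dict stands in for the cache store (an assumption).
        self._store: Dict[str, Response] = {}

    def handle(self, request_key: str) -> Tuple[Response, bool]:
        """Return (response, served_from_cache) for one request."""
        cached = self._store.get(request_key)
        if cached is not None:
            return cached, True  # cache hit: the target server is not contacted
        response = self._fetch(request_key)  # cache miss: forward the request
        self._store[request_key] = response  # save it for identical requests
        return response, False
```

In this model, the first call to `handle` with a given key reports a miss and the second call with the same key reports a hit without invoking `fetch` again.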

Global ID (GID)

Whenever an HTTP request is sent through Till, Till generates a signature that uniquely identifies the request. This signature is called a GID (Global ID). Till then uses this GID to store and read content in the cache store.

The request signature is made up of the following:

  • Method
  • URL
  • Header
  • Body
  • Cookies

The GID is useful for troubleshooting and maintaining your scrapers, because it lets you pinpoint and trace how a specific request behaves throughout your scrapers' activity.
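To make the idea of a request signature concrete, here is a minimal sketch that hashes the listed request parts into a single identifier. It is not Till's actual algorithm, which this page does not specify: the function name `sketch_gid`, the way the parts are joined, and the use of MD5 (chosen only because it also produces 32 hexadecimal characters) are assumptions, so its output will not match real GIDs.

```python
import hashlib
from typing import Mapping, Optional
from urllib.parse import urlsplit


def sketch_gid(
    method: str,
    url: str,
    headers: Optional[Mapping[str, str]] = None,
    body: bytes = b"",
    cookies: Optional[Mapping[str, str]] = None,
) -> str:
    """Illustrative signature only; NOT the algorithm Till uses."""
    parts = [method.upper(), url]
    # Sort headers and cookies so the same request always hashes the same way.
    parts += [f"header:{k}={v}" for k, v in sorted((k.lower(), v) for k, v in (headers or {}).items())]
    parts += [f"cookie:{k}={v}" for k, v in sorted((cookies or {}).items())]
    payload = "\n".join(parts).encode("utf-8") + b"\n" + body
    digest = hashlib.md5(payload).hexdigest()  # hash choice is an assumption
    # Prefix with the host, giving a "<host>-<hex digest>" shaped identifier.
    return f"{urlsplit(url).netloc}-{digest}"
```

The point of the sketch is the property, not the exact bytes: identical requests map to the same identifier, and changing any signed part (for example the method) yields a different one.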

The following are some examples of the GIDs that are generated:

URL                                          GID
https://fetchtest.datahen.com                fetchtest.datahen.com-f4494a940b4f72fc42de128a8e227b34
https://fetchtest.datahen.com/echo/request   fetchtest.datahen.com-144a91f641d36c08dade39f739b05d31
https://www.datahen.com                      www.datahen.com-86b31b8ca33241b881c3e71d4a210e6b

For every HTTP request sent through Till, the response includes an X-Dh-Gid header that tells you the GID of the request.
The following example uses curl to show this response header:

$ curl 'https://fetchtest.datahen.com/echo/request' -kv --proxy http://localhost:2933 
...
< X-Dh-Gid: fetchtest.datahen.com-144a91f641d36c08dade39f739b05d31
...

Serving Successful Responses

By default, Till only caches and serves successful responses. This is useful because scrapers typically care only about successful responses. Failed responses, on the other hand, are needed to detect issues with your scrapers, which is why Till does not cache them by default.

The following HTTP status codes are considered successful by Till:

  • 2XX
  • 3XX
  • 404
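The list above can be expressed as a small predicate. This is a sketch under the assumption that "2XX" and "3XX" mean the full 200-299 and 300-399 ranges; the function name is hypothetical and is not part of Till.

```python
def is_successful_for_till(status_code: int) -> bool:
    """Return True for the statuses listed above: 2XX, 3XX, and 404."""
    return 200 <= status_code <= 399 or status_code == 404
```

Under this reading, a 404 counts as successful and is cached by default, while other 4XX and all 5XX statuses are treated as failures.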

Note: If you would like Till to cache and serve all responses (both successful and failed ones), please read the Serving Failed Responses section.

Basic Usage

Once HTTP caching is enabled in the configuration, your scraper benefits from it immediately, with few changes to your scraper code.

Your request can specify a freshness criterion that determines whether an existing cached response is fresh enough to be served. If it is not, Till makes a real request to the target server and serves the latest response.

Note: By default, Till only caches and serves successful responses. If you would like Till to cache and serve all responses (successes and failures), please read the Serving Failed Responses section.

Example

In this example we are going to send an identical request three times:

  1. Send a request (Cache Miss)
  2. Send the same request again (Cache Hit)
  3. Ignoring the cache (Cache Miss)

1. Send a request (Cache Miss)

Let's first send a request to demonstrate that, initially, the response is not served from the cache.

Send a request to the following URL using curl:

$ curl 'https://fetchtest.datahen.com/echo/request' -kv --proxy http://localhost:2933
...
< X-Dh-Gid: fetchtest.datahen.com-144a91f641d36c08dade39f739b05d31
<
GET /echo/request HTTP/2.0
Host: fetchtest.datahen.com
Accept: */*
Accept-Encoding: gzip
Cdn-Loop: cloudflare
Cf-Connecting-Ip: 142.252.79.231
Cf-Ipcountry: US
Cf-Ray: 67764d97980c2536-SJC
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103 Safari/537.36
X-Forwarded-For: 142.252.79.231
X-Forwarded-Proto: https

The above response headers do not contain the X-Dh-Cache-Created-At header, which means that this was a cache miss.

There is another header, X-Dh-Gid, which is the Global ID (the unique signature of this request). You can read more about it in the Global ID (GID) section.

2. Send the same request again (Cache Hit)

Next, send the same request again. This time, Till serves the response from the cache.

$ curl 'https://fetchtest.datahen.com/echo/request' -kv --proxy http://localhost:2933
...
< X-Dh-Cache-Created-At: 2021-07-31 14:01:50.162934 +0300 +03
< X-Dh-Gid: fetchtest.datahen.com-144a91f641d36c08dade39f739b05d31
<
GET /echo/request HTTP/2.0
Host: fetchtest.datahen.com
Accept: */*
Accept-Encoding: gzip
Cdn-Loop: cloudflare
Cf-Connecting-Ip: 142.252.79.208
Cf-Ipcountry: US
Cf-Ray: 677654bbeb710256-SJC
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809 Safari/537.36
X-Forwarded-For: 142.252.79.208
X-Forwarded-Proto: https

The above response headers now include X-Dh-Cache-Created-At, which shows when this cached response was created. You can learn more about this header in the Cache Freshness section.

3. Ignoring the cache (Cache Miss)

Let's send the same request again, but this time tell Till to ignore the cache.

In this example, we set the X-DH-Cache-Freshness: now header, which tells Till to ignore the cache and make a real request to the target server.

$ curl 'https://fetchtest.datahen.com/echo/request' -H 'X-Dh-Cache-Freshness: now' -kv --proxy http://localhost:2933
...
< X-Dh-Gid: fetchtest.datahen.com-144a91f641d36c08dade39f739b05d31
<
GET /echo/request HTTP/2.0
Host: fetchtest.datahen.com
Accept: */*
Accept-Encoding: gzip
Cdn-Loop: cloudflare
Cf-Connecting-Ip: 207.90.15.132
Cf-Ipcountry: US
Cf-Ray: 677679870f1203f2-LIS
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/606.1.36 (KHTML, like Gecko) Version/12.0.3 Safari/606.1.36
X-Forwarded-For: 207.90.15.132
X-Forwarded-Proto: https

Note that the above response no longer contains the X-Dh-Cache-Created-At header. This means that Till followed our instruction not to serve from the cache.
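The same three steps can be driven from scraper code instead of curl. The following is a rough sketch using only the Python standard library, assuming Till is listening on http://localhost:2933 as in the curl commands; the helper names are hypothetical, and disabling certificate verification simply mirrors curl's -k flag.

```python
import ssl
import urllib.request
from typing import Dict, Mapping, Optional, Tuple

TILL_PROXY = "http://localhost:2933"  # same proxy address as the curl examples


def cache_created_at(headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-Dh-Cache-Created-At value, or None if it is absent (cache miss)."""
    for name, value in headers.items():
        if name.lower() == "x-dh-cache-created-at":  # header names are case-insensitive
            return value
    return None


def fetch_via_till(url: str, freshness: Optional[str] = None) -> Tuple[int, Dict[str, str], bytes]:
    """Send a GET through Till and return (status, response headers, body)."""
    context = ssl.create_default_context()
    context.check_hostname = False       # equivalent of curl -k:
    context.verify_mode = ssl.CERT_NONE  # skip certificate verification
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": TILL_PROXY, "https": TILL_PROXY}),
        urllib.request.HTTPSHandler(context=context),
    )
    request = urllib.request.Request(url)
    if freshness is not None:
        request.add_header("X-DH-Cache-Freshness", freshness)
    # Note: urllib raises HTTPError for 4XX/5XX statuses instead of returning them.
    with opener.open(request) as response:
        return response.status, dict(response.headers.items()), response.read()
```

With Till running, calling `fetch_via_till(url)` twice and then `fetch_via_till(url, freshness="now")`, and passing each returned header dictionary to `cache_created_at`, should reproduce the miss, hit, miss sequence shown above.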

Advanced Usage

Till allows more fine-grained control over the HTTP caching feature, which helps keep your scraper scalable and maintainable.

Cache Freshness

When Till detects that a cached response exists, it determines whether the cached response is fresh enough to be served for that request. The rule it applies is called the "freshness criteria".

Whenever a cached response is served, Till adds an X-DH-Cache-Created-At header to the HTTP response.

Internally, Till compares the freshness criteria with this created-at value to determine whether the cached response should be served.

To use the freshness criteria, you can specify the default freshness in the configuration.
You can also override the freshness criteria per request by specifying the X-DH-Cache-Freshness header in an HTTP request.

HTTP Header            Value
X-DH-Cache-Freshness   Determines whether a cached response is fresh enough to be served for an HTTP request. Allowed values: now, minute, hour, day, week, fortnight, month, year, or any.

Freshness Behavior

The following examples demonstrate the behavior of the freshness criteria:

Freshness   Last cached response   Serve this cache?
now         1 second ago           No
now         1 hour ago             No
now         1 year ago             No
any         1 second ago           Yes
any         1 hour ago             Yes
any         1 year ago             Yes
minute      1 second ago           Yes
minute      1 hour ago             No
minute      1 year ago             No
day         1 second ago           Yes
day         1 hour ago             Yes
day         1 year ago             No
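The table can be reproduced with a short comparison between the freshness value and the age of the cached response. This is a sketch of the rule as the table describes it, not Till's code: the function name is hypothetical, and the exact lengths used for month (30 days) and year (365 days), as well as treating the boundary as inclusive, are assumptions.

```python
from datetime import datetime, timedelta

# Maximum age allowed by each freshness value.
FRESHNESS_WINDOWS = {
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "fortnight": timedelta(weeks=2),
    "month": timedelta(days=30),   # assumption: 30-day month
    "year": timedelta(days=365),   # assumption: 365-day year
}


def should_serve_cache(freshness: str, created_at: datetime, now: datetime) -> bool:
    """Decide whether a cached response created at `created_at` may be served."""
    if freshness == "now":
        return False  # always bypass the cache
    if freshness == "any":
        return True   # any cached response is acceptable
    if freshness not in FRESHNESS_WINDOWS:
        raise ValueError(f"unknown freshness value: {freshness!r}")
    return now - created_at <= FRESHNESS_WINDOWS[freshness]
```

For example, with freshness set to day, a response cached one hour ago is served, while one cached a year ago triggers a real request.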

Serving Failed Responses

By default, Till only caches and serves successful responses, unless you specifically configure it to also serve failed responses.

To do this, set the default value with cache.serve-failures in the configuration.
You can also override this setting per request by specifying the X-DH-Cache-Serve-Failures header in an HTTP request.

HTTP Header                 Value
X-DH-Cache-Serve-Failures   Allowed values: true or false.
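When a scraper sets these per-request overrides in code, a small helper can keep the header names and allowed values in one place. The following is a sketch only; the function name is hypothetical, and the allowed values are copied from the header tables in this section.

```python
from typing import Dict, Optional

FRESHNESS_VALUES = ("now", "minute", "hour", "day", "week", "fortnight", "month", "year", "any")


def till_cache_headers(
    freshness: Optional[str] = None,
    serve_failures: Optional[bool] = None,
) -> Dict[str, str]:
    """Build the per-request cache override headers; omitted options use Till's configured defaults."""
    headers: Dict[str, str] = {}
    if freshness is not None:
        if freshness not in FRESHNESS_VALUES:
            raise ValueError(f"unsupported freshness value: {freshness!r}")
        headers["X-DH-Cache-Freshness"] = freshness
    if serve_failures is not None:
        # The header expects the literal strings "true" or "false".
        headers["X-DH-Cache-Serve-Failures"] = "true" if serve_failures else "false"
    return headers
```

The returned dictionary can be merged into the headers of whichever HTTP client the scraper already uses.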

Configuration

Note: HTTP Caching is a Premium feature. If you've already upgraded your plan, you can restart Till and it will be turned on.

The following are the configuration options that you can set:

Configuration          Value
cache.disabled         Allowed values: true or false (default false).
cache.serve-failures   Allowed values: true or false (default false).
cache.freshness        Default freshness for cached responses. Allowed values: now, minute, hour, day, week, fortnight, month, year, or any (default any).
cache.ttl              Time-to-live for the cache records. Allowed values: minute, hour, day, week, fortnight, month, year, or forever (default week).

Note: Till stores the cache data inside your Data Directory on your local disk. You can lower the TTL setting to save disk space: the shorter the TTL, the less disk space is used.

Configuration Steps

Step 1: Configure Till

Once you have created a config file, you can add the cache settings like so:

# Cache settings
cache:
  # Disable the cache feature.
  # Defaults to false.
  disabled: false

  # TTL (Time To Live). How long a cache record will be allowed to live before it gets deleted.
  # Defaults to "week".
  ttl: "week"

  # Default freshness that a cached response must meet to be served (a Cache Hit).
  # Defaults to "any".
  freshness: "any"

  # Specifies if Till should serve cached responses of failed HTTP requests (non 2XX statuses)
  # Defaults to false.
  serve-failures: false
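Before restarting Till, it can help to sanity-check the cache settings against the allowed values listed in the Configuration table. The following is a sketch of such a check, not a tool that ships with Till: the function name is hypothetical, and it expects the contents of the cache: section already loaded into a Python dictionary (loading the YAML file itself is left out).

```python
from typing import Any, Dict

# Allowed values and defaults, copied from the Configuration table.
FRESHNESS_VALUES = {"now", "minute", "hour", "day", "week", "fortnight", "month", "year", "any"}
TTL_VALUES = {"minute", "hour", "day", "week", "fortnight", "month", "year", "forever"}
DEFAULTS: Dict[str, Any] = {
    "disabled": False,
    "serve-failures": False,
    "freshness": "any",
    "ttl": "week",
}


def normalize_cache_config(cache: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in defaults and reject options or values the table does not list."""
    unknown = set(cache) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown cache options: {sorted(unknown)}")
    merged = {**DEFAULTS, **cache}
    for key in ("disabled", "serve-failures"):
        if not isinstance(merged[key], bool):
            raise ValueError(f"cache.{key} must be true or false")
    if merged["freshness"] not in FRESHNESS_VALUES:
        raise ValueError(f"invalid cache.freshness: {merged['freshness']!r}")
    if merged["ttl"] not in TTL_VALUES:
        raise ValueError(f"invalid cache.ttl: {merged['ttl']!r}")
    return merged
```

Note that now is valid for freshness but not for ttl, and forever is valid for ttl but not for freshness, which is an easy pair to mix up.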

Step 2: Verify Till

Now, you just need to verify that your Till configuration is working.

To verify this, follow the example in the Basic Usage section.