Integrating Till with Go Colly

Till can be integrated with your Go Colly scraper with only a few code changes.

Please follow the steps below.

Step 1: Install Till

Follow the Till installation instructions.

Step 2: Modify your Colly project

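Next, modify your existing Colly project so that its requests go through the Till proxy (http://localhost:2933 in this example).

The following example script sets the proxy on the collector's transport and adds a Till header to every request: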

// Tutorial from https://github.com/gocolly/colly/blob/master/_examples/basic/basic.go
// modified to integrate with Till
package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
	"net/url"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

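	// Integration with Till: route every request through the Till proxy
	proxyURL, err := url.Parse("http://localhost:2933")
	if err != nil {
		log.Fatal(err)
	}
	tillTransport := &http.Transport{
		Proxy: http.ProxyURL(proxyURL),
		// Disable TLS certificate verification so that HTTPS responses
		// passing through Till are accepted. Use this only with a proxy you trust.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}
	c.WithTransport(tillTransport)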

	// Add custom headers to tell Till what to do
	c.OnRequest(func(req *colly.Request) {
		// Add the header to force a Cache Miss on Till
		req.Headers.Add("X-DH-Cache-Freshness", "now")
	})

	// Call the callback on every <a> element that has an href attribute
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}

Note: To see a working example, you can visit this link.

Step 3: Run your Colly project

Next, run your Colly project as you normally would, for example with "go run ." from the project directory.

Note: If you don't have an existing Go Colly project to try with Till, you can try our working example here.

Step 4: Verify that it works

Visit the Till UI at http://localhost:2980/requests and confirm that your new requests appear in the request log.

Request Log UI