Integrating Till with Colly
Integration Steps:
Till can be integrated with your Go Colly scraper with only minor code changes.
Follow the steps below.
Follow the instructions to install Till.
Next, you need to modify your existing Colly project to integrate with Till.
The following is an example script:
// Tutorial from https://github.com/gocolly/colly/blob/master/_examples/basic/basic.go
// modified to integrate with Till
package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
	"net/url"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// Integration with Till
	proxyUrl, err := url.Parse("http://localhost:2933")
	if err != nil {
		log.Fatal(err)
	}
	tilltransport := http.Transport{
		Proxy:           http.ProxyURL(proxyUrl),
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}
	c.WithTransport(&tilltransport)

	// Add custom headers to tell Till what to do
	c.OnRequest(func(req *colly.Request) {
		// Add the header to force a Cache Miss on Till
		req.Headers.Add("X-DH-Cache-Freshness", "now")
	})

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
Note: To see a working example, you can visit this link.
Next, run your Colly project like you normally would.
Note: If you don't have an existing Go Colly project to try with Till, you can try our working example here.
Open the Till UI at http://localhost:2980/requests to confirm that your new requests appear.