Integrating Till with Kimurai
Integration Steps:
Integrating Till with Kimurai
Integration Steps:
Till can be easily integrated with your Kimurai scraper without much code changes.
Please follow the steps below.
Follow the instructions to install Till
Next, you need to modify your existing Kimurai Scraper to integrate with Till.
The following is an example script:
# Code example from https://github.com/gitter-badger/kimurai
require 'kimurai'
class GithubSpider < Kimurai::Base
  @name = "github_spider"
  @engine = :mechanize
  @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
  @config = {
    
    # Integrate with Till
    proxy: "localhost:2933:http",
    ignore_ssl_errors: true,
    # IMPORTANT: Custom headers in Kimurai only works on :mechanize and :poltergeist_phantomjs drivers 
    # (Selenium don't allow to set/get headers)
    headers: { 
      # Add the header to force a Cache Miss on Till
      "X-DH-Cache-Freshness" => "now",
    },
  }
  def parse(response, url:, data: {})
    response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
    end
    if next_page = response.at_xpath("//a[@class='next_page']")
      request_to :parse, url: absolute_url(next_page[:href], base: url)
    end
  end
  def parse_repo_page(response, url:, data: {})
    item = {}
    item[:owner] = response.xpath("//h1//a[@rel='author']").text
    item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
    item[:repo_url] = url
    item[:description] = response.xpath("//span[@itemprop='about']").text.squish
    item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
    item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish
    item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish
    item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish
    item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text
    save_to "results.json", item, format: :pretty_json
  end
end
GithubSpider.crawl!
Note: To see a working example, you can visit this link.
Next, run your Kimurai Scraper like you normally would.
Note: If you don't have an existing Kimurai Scraper to try with Till, you can try our working example here.
Visit the Till UI at http://localhost:2980/requests to see that your new requests are shown.

Getting Started
How To Use
Integrations
Python
Node.js
Go
Ruby