xcrawler

A multi-threaded, open source web crawler

Features

  • Use multiple threads to visit web pages
  • Extract web page data using XPath expressions or CSS selectors
  • Extract urls from a web page and visit the extracted urls
  • Write extracted data to an output file
  • Set HTTP session parameters such as cookies, SSL certificates, and proxies
  • Set HTTP request parameters such as headers, body, and authentication
  • Download files from urls
  • Supports Python 2 and Python 3

Installation

pip install xcrawler

When installing the lxml library on Windows you may encounter "Microsoft Visual C++ is required" errors.
To install the lxml library on Windows:
  1. Download and install the Microsoft Windows SDK.
  2. Click the Start Menu, then search for and open the command prompt:
    • For Python 2.6, 2.7, 3.0, 3.1, 3.2: CMD Shell
    • For Python 3.3, 3.4: Windows SDK 7.1 Command Prompt
  3. Install lxml:
setenv /x86 /release && SET DISTUTILS_USE_SDK=1 && set STATICBUILD=true && pip install lxml

Usage

Data and urls are extracted from a web page by a page scraper.
To extract data and urls from a web page, override the following PageScraper methods (a combined sketch follows this list):
  • extract: returns the data extracted from a web page
  • visit: returns the next Pages to be visited
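
A minimal sketch of a scraper implementing both methods, modeled on the examples further below; the selectors "//h1/text()" and "a.next" are placeholders:

from xcrawler import Page, PageScraper


class MyScraper(PageScraper):
    def extract(self, page):
        # Return the data scraped from the current page
        return page.xpath("//h1/text()")

    def visit(self, page):
        # Return the next Pages for the crawler to visit
        hrefs = page.css_attr("a.next", "href")
        urls = page.to_urls(hrefs)
        return [Page(url, MyScraper()) for url in urls]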

A crawler can be configured before it crawls web pages. Configurable settings include:
* the number of threads used to visit web pages
* the name of the output file
* the request timeout
To run the crawler, call crawler.run(). A combined configuration sketch follows:
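This sketch uses only the settings listed above; the start url, file name, and values are illustrative, and a PageScraper subclass such as those in the examples below would normally be passed in place of the bare PageScraper:

from xcrawler import XCrawler, Page, PageScraper

start_pages = [ Page("http://example.com", PageScraper()) ]
crawler = XCrawler(start_pages)
crawler.config.number_of_threads = 3                   # threads used to visit pages
crawler.config.output_file_name = "example_output.csv" # file for extracted data
crawler.config.request_timeout = (5, 5)                # request timeout
crawler.run()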

Examples of how to use xcrawler can be found at: https://github.com/cardsurf/xcrawler/tree/master/examples

XPath Example

from xcrawler import XCrawler, Page, PageScraper


class Scraper(PageScraper):
    def extract(self, page):
        # Extract question titles with an XPath expression
        topics = page.xpath("//a[@class='question-hyperlink']/text()")
        return topics


start_pages = [ Page("http://stackoverflow.com/questions/16622802/center-image-within-div", Scraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "stackoverflow_example_crawler_output.csv"
crawler.run()

CSS Example

from xcrawler import XCrawler, Page, PageScraper


class StackOverflowItem:
    def __init__(self):
        self.title = None
        self.votes = None
        self.tags = None
        self.url = None


class UrlsScraper(PageScraper):
    def visit(self, page):
        # Extract question links and return them as new Pages to be crawled
        hrefs = page.css_attr(".question-summary h3 a", "href")
        urls = page.to_urls(hrefs)
        return [Page(url, QuestionScraper()) for url in urls]


class QuestionScraper(PageScraper):
    def extract(self, page):
        # Scrape the title, votes, first tag and url of a single question
        item = StackOverflowItem()
        item.title = page.css_text("h1 a")[0]
        item.votes = page.css_text(".question .vote-count-post")[0].strip()
        item.tags = page.css_text(".question .post-tag")[0]
        item.url = page.url
        return item


start_pages = [ Page("http://stackoverflow.com/questions?sort=votes", UrlsScraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "stackoverflow_css_crawler_output.csv"
crawler.config.number_of_threads = 3
crawler.run()

File Example

from xcrawler import XCrawler, Page, PageScraper


class WikimediaItem:
    def __init__(self):
        self.name = None
        self.base64 = None


class EncodedScraper(PageScraper):
    def extract(self, page):
        # Find the link to the full-size file and download its contents
        url = page.xpath("//div[@class='fullImageLink']/a/@href")[0]
        item = WikimediaItem()
        item.name = url.split("/")[-1]
        item.base64 = page.file(url)
        return item


start_pages = [ Page("https://commons.wikimedia.org/wiki/File:Records.svg", EncodedScraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "wikimedia_file_example_output.csv"
crawler.run()
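
The base64 attribute above suggests that page.file returns the downloaded file as a base64-encoded string. If so, Python's standard base64 module can decode it back into a file; a small helper sketch (the function name is illustrative):

import base64

def save_decoded_file(name, base64_content):
    # Decode a base64 string such as WikimediaItem.base64 and write it to disk
    with open(name, "wb") as output_file:
        output_file.write(base64.b64decode(base64_content))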

Session Example

from xcrawler import XCrawler, Page, PageScraper
from requests.auth import HTTPBasicAuth


class Scraper(PageScraper):
    def extract(self, page):
        return str(page)


start_pages = [ Page("http://192.168.1.1/", Scraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "router_session_example_output.csv"
crawler.config.session.headers = {"User-Agent": "Custom User Agent",
                                  "Accept-Language": "fr"}
crawler.config.session.auth = HTTPBasicAuth('admin', 'admin')
crawler.run()
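
The feature list also mentions cookies, SSL certificates, and proxies. Assuming config.session behaves like a requests Session (the HTTPBasicAuth import above points that way), they could be set in the same manner; the addresses and paths below are placeholders:

crawler.config.session.proxies = {"http": "http://10.10.1.10:3128"}  # route requests through a proxy
crawler.config.session.verify = "/path/to/ca_bundle.pem"             # custom SSL certificate bundle
crawler.config.session.cookies.update({"session_id": "abc123"})      # session-wide cookies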

Request Example

from xcrawler import XCrawler, Page, PageScraper


class Scraper(PageScraper):
    def extract(self, page):
        return str(page)


start_page = Page("http://192.168.5.5", Scraper())
start_page.request.cookies = {"theme": "classic"}  # cookies sent with this page's request only
crawler = XCrawler([start_page])
crawler.config.request_timeout = (5, 5)  # a (connect, read) timeout tuple, as in the requests library
crawler.config.output_file_name = "router_request_example_output.csv"
crawler.run()
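
The feature list also mentions per-request headers, body, and authentication. Only cookies is shown above; if the page's request object mirrors the fields of a requests Request (an assumption, not confirmed by this README), headers could presumably be set the same way:

# Hypothetical: assumes start_page.request exposes a requests-style headers field
start_page.request.headers = {"Referer": "http://192.168.5.5/"}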

Documentation

For more information about xcrawler see the source code and Python docstrings: https://github.com/cardsurf/xcrawler
The documentation can also be accessed at runtime with Python's built-in help function:
>>> import xcrawler
>>> help(xcrawler.Config)
    # Information about the Config class
>>> help(xcrawler.PageScraper.extract)
    # Information about the extract method of the PageScraper class

Licence

GNU GPL v2.0