Command-line Utility for Web Scraping


Keywords
web, scrape, crawl, selenium, tor, xpath, regex, downloader, command, shell, repl
License
MIT
Install
pip install scr==0.10.1

Documentation

SCR

Command-line Utility for Web Scraping

GitHub Workflow Status (branch) Coverage Status Supported Versions PyPI - Python Version Lines of code PyPI - License

Core Features

  • Extract web content based on XPath-, Regex-, (Javascript-), and Python Format Expressions
  • Crawls through complex graphs of webpages using expressive match chains and forwarding rules
  • Selenium support, explicitly also for headless mode and for the Tor Browser
  • REPL mode for quick and dirty jobs / debugging larger commands
  • dd style Command-line Interface
  • Multithreaded downloads with optional progress output on the console
  • Interactive modes for rejecting false matches, adjusting filenames etc.

Setup

SCR can be installed from pypi.org using

pip install scr

Selenium drivers for Firefox/Tor (geckodriver), and chrome/chromium (chromedriver) can be installed e.g. using

scr selinstall=firefox

(and later updated e.g. using scr selupdate=chrome). You still need to have the browser installed, though.

Examples

Download and enumerate all images from a website into the current working directory:

scr url=google.com cx='//img/@src' cl csf="img_{ci}{fe}"

Open up a REPL, remote controlling a Firefox Browser using selenium

scr repl sel=firefox url="https://en.wikipedia.org/wiki/web_scraping"
scr> 'cx=//span[text()="Edit"]/../../@id' cjs="document.getElementById(cx).children[0].click()"
scr> exit

Interactively scroll through top reddit posts, following the 'next page' buttons:

scr url=old.reddit.com dx='//span[@class="next-button"]/a/@href' cx='//div[contains(@class,"entry")]//a[contains(@class,"title")]/text()' din mt=0

Downloading first 3 pdfs and first 5 gifs from a site, use a headless selenium tor browser for the fetch:

scr url=https://dtc.ucsf.edu/learning-library/resource-materials/ cx=//@href cr0='.*\.pdf$' cr1='.*\.gif$' cl csf='{fn}' cin=1 cimax0=3 cimax1=5 sel=tor selh

Options List

scr [OPTIONS]

    Matching chains are evaluated in the following order, skipping unspecified steps:
    xpath -> regex -> (javascript) -> python format string

    Content to Write out:
        cx=<xpath>            xpath for content matching
        cr=<regex>            regex for content matching
        cjs=<js string>       javascript to execute on the page, format args are available as js variables (selenium only)
        cf=<format string>    content format string (args: <cr capture groups>, xmatch, rmatch, di, ci)
        cxs=<int|empty>       also match siblings of xpath match in parents up to this number of levels (empty means any, default: 0)
        cmm=<bool>            allow multiple content matches in one document instead of picking the first (defaults to true)
        cimin=<int>           initial content index, each successful match gets one index
        cimax=<int>           max content index, matching stops here
        cicont=<bool>         don't reset the content index for each document
        csf=<format string>   save content to file at the path resulting from the format string, empty to enable
        cwf=<format string>   format to write to file. defaults to "{c}"
        cpf=<format string>   print the result of this format string for each content, empty to disable
                              defaults to "{c}\n" if cpf, csf and cfc are unspecified
        cshf=<format string>  execute a shell command resulting from the given format string
        cshif=<format string> format for the stdin of cshf
        cshp=<bool>           print the output of the shell commands on stdout and stderr
        cfc=<chain spec>      forward content match as a virtual document
        cff=<format string>   format of the virtual document forwarded to the cfc chains. defaults to "{c}"
        csin=<bool>            give a prompt to edit the save path for a file
        cin=<bool>            give a prompt to ignore a potential content match
        cl=<bool>             treat content match as a link to the actual content
        cesc=<string>         escape sequence to terminate content in cin mode, defaults to "<END>"
        cenc=<encoding>       default encoding to assume that content is in
        cfenc=<encoding>      encoding to always assume that content is in, even if http(s) says differently

    Labels to give each matched content (mostly useful for the filename in csf):
        lx=<xpath>           xpath for label matching
        lr=<regex>           regex for label matching
        ljs=<js string>      javascript to execute on the page, format args are available as js variables (selenium only)
        lf=<format string>   label format string
        lxs=<int|empty>      also match siblings of xpath match in parents up to this number of levels (empty means any, default: 0)
        lic=<bool>           match for the label within the content match instead of the hole document
        las=<bool>           allow slashes in labels
        lmm=<bool>           allow multiple label matches in one document instead of picking the first (for all content matches)
        lam=<bool>           allow missing label (default is to skip content if no label is found)
        lfd=<format string>  default label format string to use if there's no match
        lin=<bool>           give a prompt to edit the generated label

    Further documents to scan referenced in already found ones:
        dx=<xpath>           xpath for document matching
        dr=<regex>           regex for document matching
        djs=<js string>      javascript to execute on the page, format args are available as js variables (selenium only)
        df=<format string>   document format string
        dxs=<int|empty>      also match siblings of xpath match in parents up to this number of levels (empty means any, default: 0)
        dimin=<int>          initial document index, each successful match gets one index
        dimax=<int>          max document index, matching stops here
        dmm=<bool>           allow multiple document matches in one document instead of picking the first
        din=<bool>           give a prompt to ignore a potential document match
        denc=<encoding>      default document encoding to use for following documents, default is utf-8
        dfenc=<encoding>     force document encoding for following documents, even if http(s) says differently
        dsch=<scheme>        default scheme for urls derived from following documents, defaults to "https"
        dpsch=<bool>         use the parent documents scheme if available, defaults to true unless dsch is specified
        dfsch=<scheme>       force this scheme for urls derived from following documents
        doc=<chain spec>     chains that matched documents should apply to, default is the same chain
        dd=<duplication>     whether to allow document duplication (default: unique, values: allowed, nonrecursive, unique)
    Initial Documents:
        url=<url>            fetch a document from a url, derived document matches are (relative) urls
        file=<path>          fetch a document from a file, derived documents matches are (relative) file pathes
        rfile=<path>         fetch a document from a file, derived documents matches are urls

    Other:
        selstrat=<strategy>  matching strategy for selenium (default: plain, values: anymatch, plain, interactive, deduplicate)
        seldl=<dl strategy>  download strategy for selenium (default: external, values: external, internal, fetch)
        owf=<bool>           allow to overwrite existing files, defaults to true

    Format Args:
        Named arguments for <format string> arguments.
        Some only become available later in the pipeline (e.g. {cm} is not available inside cf).

        {cx}                 content xpath match
        {cr}                 content regex match, equal to {cx} if cr is unspecified
        <cr capture groups>  the named regex capture groups (?P<name>...) from cr are available as {name},
                             the unnamed ones (...) as {cg<unnamed capture group number>}
        {cf}                 content after applying cf
        {cjs}                output of cjs
        {cm}                 final content match after link normalization (cl) and user interaction (cin)
        {c}                  content, downloaded from cm in case of cl, otherwise equal to cm

        {lx}                 label xpath match
        {lr}                 label regex match, equal to {lx} if lr is unspecified
        <lr capture groups>  the named regex capture groups (?P<name>...) from cr are available as {name},
                             the unnamed ones (...) as {lg<unnamed capture group number>}
        {lf}                 label after applying lf
        {ljs}                output of ljs
        {l}                  final label after user interaction (lin)

        {dx}                 document link xpath match
        {dr}                 document link regex match, equal to {dx} if dr is unspecified
        <dr capture groups>  the named regex capture groups (?P<name>...) from dr are available as {name},
                             the unnamed ones (...) as {dg<unnamed capture group number>}
        {df}                 document link after applying df
        {djs}                output of djs
        {d}                  final document link after user interaction (din)

        {di}                 document index
        {ci}                 content index
        {dl}                 document link (inside df, this refers to the parent document)
        {cenc}               content encoding, deduced while respecting cenc and cfenc
        {cesc}               escape sequence for separating content, can be overwritten using cesc
        {chain}              id of the match chain that generated this content

        {fn}                 filename from the url of a cm with cl
        {fb}                 basename component of {fn} (extension stripped away)
        {fe}                 extension component of {fn}, including the dot (empty string if there is no extension)


    Chain Syntax:
        Any option above can restrict the matching chains is should apply to using opt<chainspec>=<value>.
        Use "-" for ranges, "," for multiple specifications, and "^" to except the following chains.
        Examples:
            lf1,3-5=foo        sets "lf" to "foo" for chains 1, 3, 4 and 5.
            lf2-^4=bar         sets "lf" to "bar" for all chains larger than or equal to 2, except chain 4

    Miscellaneous:
        help                   prints this help
        selinstall=<browser>   installs selenium driver for the specified browser in the directory of this script
        seluninstall=<browser> uninstalls selenium driver for the specified browser in the directory of this script
        selupdate=<browser>    updates (or installs) the local selenium driver for the specified browser
        version                print version information

    Global Options:
        timeout=<float>        seconds before a web request timeouts (default 30)
        bfs=<bool>             traverse the matched documents in breadth first order instead of depth first
        v=<verbosity>          output verbosity levels (default: warn, values: info, warn, error)
        ua=<string>            user agent to pass in the html header for url GETs
        uar=<bool>             use a rangom user agent
        selkeep=<bool>         keep selenium instance alive after the command finished
        cookiefile=<path>      path to a netscape cookie file. cookies are passed along for url GETs
        sel=<browser|empty>    use selenium to load urls into an interactive browser session
                               (empty means firefox, default: disabled, values: tor, chrome, firefox, disabled)
        selh=<bool>            use selenium in headless mode, implies sel
        tbdir=<path>           root directory of the tor browser installation, implies sel=tor
                               (default: environment variable TOR_BROWSER_DIR)
        mt=<int>               maximum threads for background downloads, 0 to disable. defaults to cpu core count
        repl=<bool>            accept commands in a read eval print loop
        exit=<bool>            exit the repl (with the result of the current command)