Ciur is a scraper layer based on a DSL for extracting data.
pip install ciur==0.2.0
Ciur is a scraper layer in development.
Ciur is a library because it has less black magic than a framework.
It moves all scraper-related code into a separate layer.
If you are annoyed by spaghetti code, SQL inside PHP, and inline CSS inside HTML, then you will also be annoyed by XPath/CSS code inside a crawler.
Samples of bad code.
Ciur gives a taste of lasagna code by enforcing encapsulation for the scraping layer.
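To make the separation concrete, here is a minimal sketch that uses only the Python standard library, not Ciur itself. The names `RULES`, `scrape`, and `crawl` are hypothetical and chosen for illustration, and the paths are simplified because `xml.etree.ElementTree` supports only a small XPath subset. The point is that the crawler never contains a selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical scraping layer: every selector lives here, not in the crawler.
# ElementTree supports only a small XPath subset, so the paths are simplified.
RULES = {
    "name": ".//h1",
    "paragraph": ".//p",
}


def scrape(markup, rules):
    """Return a dict with the text of the first match for each rule."""
    tree = ET.fromstring(markup)
    return {field: tree.findtext(path) for field, path in rules.items()}


def crawl(pages):
    """Hypothetical crawler: iterates over pages and delegates extraction."""
    return [scrape(markup, RULES) for markup in pages]


# A small well-formed page used as inline test data (no network access).
PAGE = "<html><body><h1>Example Domain</h1><p>Illustrative text.</p></body></html>"
print(crawl([PAGE]))
```

Changing what is extracted then means editing `RULES` only; `crawl` stays untouched.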
Ciur is Romanian for Sieve.
It fulfils the same purpose in the sense of being a "device for separating wanted elements from unwanted material".
This means that the code may be included in proprietary software without any additional restrictions.
Please see LICENSE.
Ciur uses its own DSL, for example:
$ cat python-ciur/tests/ciur.d/example.org.ciur
root `/html/body` +1
    name `.//h1/text()` +1
    paragraph `.//p/text()` +1
$ ciur --url "http://example.org" --rules="example.org.ciur"
{
    "root": {
        "name": "Example Domain",
        "paragraph": "This domain is established to be used for illustrative
                      examples in documents. You may use this
                      domain in examples without prior coordination or
                      asking for permission."
    }
}
>>> import ciur
>>> from ciur.shortcuts import pretty_parse_from_url
>>> with ciur.open_file("example.org.ciur", __file__) as f:
...     print pretty_parse_from_url(
...         f,
...         "http://example.org"
...     )
{
    "root": {
        "name": "Example Domain",
        "paragraph": "This domain is established to be used for illustrative examples in documents. You may use this\n domain in examples without prior coordination or asking for permission."
    }
}
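The way the three rules map onto the nested output can be sketched with the standard library alone. This is not Ciur's implementation: the tuple layout of `RULES` and the function `apply_rule` are hypothetical, the paths are simplified because `xml.etree.ElementTree` has no `text()` step, and reading `+1` as "exactly one match is required" is an assumption made for the sketch.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of the three rules:
# (field name, path, expected match count, child rules).
# The absolute /html/body becomes ./body relative to the <html> root element.
RULES = ("root", "./body", 1, [
    ("name", ".//h1", 1, []),
    ("paragraph", ".//p", 1, []),
])


def apply_rule(context, rule):
    """Evaluate one rule against an element and return {field: value}."""
    field, path, expected, children = rule
    matches = context.findall(path)
    # Assumption: "+1" is read here as "exactly one match is required".
    if len(matches) != expected:
        raise ValueError("%s: expected %d match(es), got %d"
                         % (field, expected, len(matches)))
    values = []
    for element in matches:
        if children:
            # Child rules are evaluated relative to the parent's match.
            node = {}
            for child in children:
                node.update(apply_rule(element, child))
            values.append(node)
        else:
            values.append(element.text)
    # A single expected match is unwrapped; otherwise the list is kept.
    return {field: values[0] if expected == 1 else values}


# Inline test data standing in for the fetched page (no network access).
PAGE = "<html><body><div><h1>Example Domain</h1><p>Some text.</p></div></body></html>"
result = apply_rule(ET.fromstring(PAGE), RULES)
print(json.dumps(result, indent=4))
```

Because child rules run against the element matched by their parent, the relative paths under `root` only see content inside `<body>`, which is what produces the nested `root` object in the output.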
Install virtualenv
$ sudo virtualenv -p python2 /opt/python-env/ciur_env/
[sudo] password for ada:
Running virtualenv with interpreter /usr/bin/python2
New python executable in /opt/python-env/ciur_env/bin/python2
Also creating executable in /opt/python-env/ciur_env/bin/python
Installing setuptools, pip, wheel...done.
Install ciur in virtualenv
$ sudo /opt/python-env/ciur_env/bin/pip install git+https://bitbucket.org/ada/python-ciur.git#egg=ciur
...
Successfully installed cffi-1.4.2 ciur-0.1.2 cryptography-1.1.2
cssselect-0.9.1 enum34-1.1.2 html5lib-0.9999999 idna-2.0 ipaddress-1.0.16
lxml-3.5.0 ndg-httpsclient-0.4.0 pdfminer-20140328 pyOpenSSL-0.15.1
pyasn1-0.1.9 pycparser-2.14 pyparsing-2.0.7 python-dateutil-2.4.2
requests-2.9.1 six-1.10.0
...
demo on cloud9
build documentation on readthedocs
http://lxml.de/lxmlhtml.html#parsing-html
.cssselect(expr):
.base_url: