ftw.crawler

Installation

The easiest way to install ftw.crawler is to create a buildout that
contains the configuration, pulls in the egg using zc.recipe.egg, and
creates a script in the bin/ directory that directly launches the crawler
with the respective configuration as an argument:
- First, create a configuration file for the crawler. You can base your
  configuration on ftw/crawler/tests/assets/basic_config.py by copying it
  to your buildout and adapting it as needed. Make sure to configure at
  least the tika and solr URLs to point to the correct locations of the
  respective services, and to adapt the sites list to your needs.
- Create a buildout config that installs ftw.crawler using zc.recipe.egg:

  crawler.cfg:

  ```ini
  [buildout]
  parts +=
      crawler
      crawl-foo-org

  [crawler]
  recipe = zc.recipe.egg
  eggs = ftw.crawler
  ```
- Further, define a buildout section that creates a bin/crawl-foo-org
  script, which will call bin/crawl foo_org_config.py using absolute paths
  (for easier use from cron jobs):

  ```ini
  [crawl-foo-org]
  recipe = collective.recipe.scriptgen
  cmd = ${buildout:bin-directory}/crawl
  arguments = ${buildout:directory}/foo_org_config.py
      --tika http://localhost:9998/
      --solr http://localhost:8983/solr
  ```

  (The --tika and --solr command line arguments are optional; they can also
  be set in the configuration file. If given, the command line arguments
  take precedence over any parameters in the config file.)
- Add a buildout config that downloads and configures a Tika JAXRS server:

  tika-server.cfg:

  ```ini
  [buildout]
  parts +=
      supervisor
      tika-server-download
      tika-server

  [supervisor]
  recipe = collective.recipe.supervisor
  plugins = superlance
  port = 8091
  user = supervisor
  password = admin
  programs =
      10 tika-server (stopasgroup=true) ${buildout:bin-directory}/tika-server true your_os_user

  [tika-server-download]
  recipe = hexagonit.recipe.download
  url = http://repo1.maven.org/maven2/org/apache/tika/tika-server/1.5/tika-server-1.5.jar
  md5sum = 0f70548f233ead7c299bf7bc73bfec26
  download-only = true
  filename = tika-server.jar

  [tika-server]
  port = 9998
  recipe = collective.recipe.scriptgen
  cmd = java
  arguments = -jar ${tika-server-download:destination}/${tika-server-download:filename} --port ${:port}
  ```

  Modify your_os_user and the supervisor and Tika ports as needed.
- Finally, add a bootstrap.py and create the buildout.cfg that pulls all of
  the above together:

  buildout.cfg:

  ```ini
  [buildout]
  extensions = mr.developer
  extends =
      tika-server.cfg
      crawler.cfg
  ```
- Bootstrap and run buildout:

  ```shell
  python bootstrap.py
  bin/buildout
  ```
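The first step above asks for a crawler configuration file. As a rough
orientation, a minimal config might look like the sketch below. The import
path and the parameter names here are assumptions based on the shipped
example; copy ftw/crawler/tests/assets/basic_config.py and adapt it rather
than starting from this sketch.

```python
# Hypothetical sketch of a crawler config file (e.g. foo_org_config.py).
# Import path and parameter names are assumptions -- consult the shipped
# example at ftw/crawler/tests/assets/basic_config.py for the real ones.
from ftw.crawler.configuration import Config, Site

CONFIG = Config(
    sites=[
        # One Site entry per site to crawl.
        Site('http://www.foo.org/'),
    ],
    # Service URLs; these can also be supplied via --tika / --solr.
    tika='http://localhost:9998/',
    solr='http://localhost:8983/solr',
)
```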
Running the crawler

If you created the bin/crawl-foo-org script with the buildout described
above, that's all you need to run the crawler:

- Make sure Tika and Solr are running
- Run bin/crawl-foo-org (with either a relative or absolute path; the
  working directory doesn't matter, so it can easily be called from a
  cron job)
Running bin/crawl directly

The bin/crawl-foo-org script is just a tiny wrapper that calls the
bin/crawl script (generated by ftw.crawler's setuptools console_script
entry point) with the absolute path to the configuration file as the only
argument. Any other arguments passed to the bin/crawl-foo-org script will
be forwarded to bin/crawl.

Therefore, running bin/crawl-foo-org [args] is equivalent to running
bin/crawl foo_org_config.py [args].
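The wrapper's behavior can be sketched in a few lines of Python. This is
illustrative only: the real script is generated by
collective.recipe.scriptgen, and the paths below are placeholders.

```python
import subprocess

# Placeholder absolute paths; collective.recipe.scriptgen bakes the real
# buildout paths into the generated bin/crawl-foo-org script.
CRAWL_SCRIPT = '/path/to/buildout/bin/crawl'
CONFIG_PATH = '/path/to/buildout/foo_org_config.py'


def build_command(extra_args):
    """Return the argv that bin/crawl-foo-org effectively executes:
    bin/crawl <absolute config path> [forwarded args]."""
    return [CRAWL_SCRIPT, CONFIG_PATH] + list(extra_args)


def run(extra_args):
    """Invoke bin/crawl with the config path prepended (needs a real
    buildout to actually work)."""
    return subprocess.call(build_command(extra_args))
```

So `run(['--tika', 'http://localhost:9998/'])` would behave like calling
bin/crawl foo_org_config.py --tika http://localhost:9998/ directly.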
Provide known sitemap URLs in site configs

If you know the sitemap URL, you can configure one or more sitemap URLs
statically:

```python
Site('http://example.org/foo/',
     sitemap_urls=['http://example.org/foo/the_sitemap.xml'])
```
Configure site ID for purging

In order for purging to work smoothly, it is recommended to configure a
crawler site ID. Make sure that each site ID is unique per Solr core!
Candidate documents for purging will be identified by this crawler site ID.

```python
Site('http://example.org/',
     crawler_site_id='example.org-news')
```

Be aware that your Solr core must provide a string field crawler_site_id.
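To illustrate what such a purge amounts to on the Solr side, the sketch
below issues a standard Solr delete-by-query for one crawler site ID. This
is an assumption about how one could purge manually, not ftw.crawler's
actual implementation, and the core URL is a placeholder.

```python
import json
from urllib.request import Request, urlopen

SOLR_CORE = 'http://localhost:8983/solr'  # placeholder core URL


def build_purge_request(crawler_site_id, solr_core=SOLR_CORE):
    """Build a Solr JSON update request deleting every document that
    carries the given crawler_site_id."""
    body = json.dumps(
        {'delete': {'query': 'crawler_site_id:"%s"' % crawler_site_id}})
    return Request(
        solr_core + '/update?commit=true',
        data=body.encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )


def purge(crawler_site_id):
    # Actually send the request (requires a running Solr instance).
    return urlopen(build_purge_request(crawler_site_id))
```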
Indexing only a particular URL
If you only want to index a particular URL, pass that URL as the first
argument to bin/crawl-foo-org
. The crawler will then only fetch and index
that specific URL.
Slack notifications

ftw.crawler supports Slack notifications. These notifications can be used
to monitor the crawler for possible errors while crawling.

To enable Slack notifications for your environment, do the following:

- Install ftw.crawler with the slack extra.
- Set the SLACK_TOKEN and SLACK_CHANNEL parameters in your crawler
  config, or
- use the --slacktoken and --slackchannel command line arguments when
  calling the bin/crawl script.

To generate a valid Slack token for your integration, create a new bot in
your Slack team. After you have created the bot, Slack will automatically
generate a valid token for it, which you can then use for your integration.
You can also generate a test token to try out your integration, but don't
forget to create a bot before your application goes to production!
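As a quick, standalone way to check that a token and channel work before
wiring them into the crawler config, you can post a message through Slack's
chat.postMessage Web API yourself. This is a sketch for verification, not
how ftw.crawler sends its notifications internally.

```python
import json
from urllib.request import Request, urlopen


def build_slack_message(token, channel, text):
    """Build a request against Slack's chat.postMessage Web API endpoint,
    authenticated with a bot token."""
    return Request(
        'https://slack.com/api/chat.postMessage',
        data=json.dumps({'channel': channel, 'text': text}).encode('utf-8'),
        headers={
            'Content-Type': 'application/json; charset=utf-8',
            'Authorization': 'Bearer ' + token,
        },
    )


def send_test_message(token, channel):
    # Requires network access and a valid bot token.
    return urlopen(build_slack_message(token, channel,
                                       'ftw.crawler test message'))
```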
Development

To start hacking on ftw.crawler, use the development.cfg buildout:

```shell
ln -s development.cfg buildout.cfg
python bootstrap.py
bin/buildout
```

This will build a Tika JAXRS server and a Solr instance for you. The Solr
configuration is set up to be compatible with the testing / example
configuration at ftw/crawler/tests/assets/basic_config.py.

To run the crawler against the example configuration:

```shell
bin/tika-server
bin/solr-instance fg
bin/crawl ftw/crawler/tests/assets/basic_config.py
```
Links
- Github: https://github.com/4teamwork/ftw.crawler
- Issues: https://github.com/4teamwork/ftw.crawler/issues
- PyPI: http://pypi.python.org/pypi/ftw.crawler
- Continuous integration: https://jenkins.4teamwork.ch/search?q=ftw.crawler
Copyright
This package is copyright by 4teamwork.
ftw.crawler
is licensed under GNU General Public License, version 2.