sitemapgenerator

Create an XML sitemap by crawling a website.


License

MIT

Install

pip install sitemapgenerator==0.1.0

Documentation

Creates an XML sitemap of a domain.

Requires Python 3.

Install

pip install sitemapgenerator

Usage

usage: sitemapgenerator [-h] [-f FILE] [-t THROTTLE] [-l LIMIT] [-q] domain

Generate an XML sitemap for a domain

positional arguments:
  domain                domain to crawl

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  write the xml to a file
  -t THROTTLE, --throttle THROTTLE
                        max time in secs to wait between requesting URLs
  -l LIMIT, --limit LIMIT
                        max number of URLs to crawl
  -q, --quiet
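
To illustrate how the -t and -l options might interact, here is a minimal sketch of a throttled, limited fetch loop. It is not the package's own code: crawl_with_throttle, fetch, and sleep are hypothetical names, and the random delay of up to THROTTLE seconds is an assumption based on the "max time" wording in the help text.

```python
import random
import time


def crawl_with_throttle(urls, fetch, throttle=0.0, limit=None, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing up to `throttle` seconds between calls.

    Hypothetical helper: `throttle` mirrors -t and `limit` mirrors -l, but the
    real crawler may apply them differently.
    """
    results = []
    for index, url in enumerate(urls):
        # Stop once `limit` URLs have been requested.
        if limit is not None and index >= limit:
            break
        # Wait a random time, at most `throttle` seconds, before every
        # request except the first (assumed behaviour).
        if index > 0 and throttle > 0:
            sleep(random.uniform(0, throttle))
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # str.upper stands in for a real HTTP request.
    print(crawl_with_throttle(["/a", "/b", "/c"], str.upper, throttle=0.1, limit=2))
```

Passing sleep as a parameter keeps the loop easy to test without real delays.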

Example Usage

$ sitemapgenerator -f site.xml -l 1 devopsreactions.tumblr.com

crawling homepage
crawling /post/146054449345/ops-report-three-out-of-five-app-servers#notes
crawled 2 URLs
wrote sitemap to /tmp/site.xml
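
To check a generated file, the URLs can be read back with the standard library. This is a sketch that assumes the output follows the sitemaps.org protocol (a urlset root containing url/loc elements in the 0.9 namespace); the README does not confirm the exact format, and list_sitemap_urls is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol; assumed, not confirmed,
# to be what sitemapgenerator writes.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def list_sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter("{%s}loc" % SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    # Illustrative sample document, not real tool output.
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://example.com/</loc></url>"
        "<url><loc>http://example.com/about</loc></url>"
        "</urlset>"
    )
    for url in list_sitemap_urls(sample):
        print(url)
```

For a file on disk, read its contents and pass them to list_sitemap_urls.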

Development

Setup

Set up virtualenv

pyenv install 3.5.0
pyenv local 3.5.0
pyvenv env
source env/bin/activate

Install requirements

pip install -r requirements.txt

Update requirements

pip install -r requirements-to-freeze.txt --upgrade
pip freeze > requirements.txt

Tests

py.test tests -q

TODO

  • normalize URLs to remove duplicates
    • strip fragments (hashes) from the end of URLs (e.g. /some/url/#respond)
    • strip trailing slashes from URLs
  • add option to create sitemap of:
    • external URLs
    • non-HTML URLs on the same domain
  • refactor code
    • convert class methods to static methods where possible
    • create a single getter method for the Crawler class links and remove the extra get_* methods
  • add concurrency (eventlet/gevent)
  • add progress bar to CLI
  • add support for Python 2
    • add tox tests for different python versions
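
As a possible starting point for the URL normalization item above, the fragment and trailing slash can be dropped with urllib.parse. This is only a sketch: normalize_url is a hypothetical name, it is not part of the package, and it ignores other sources of duplicates such as letter case or query parameter order.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Drop the fragment and any trailing slash from the path of a URL."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Keep the root path as "/" so "http://host" and "http://host/"
    # collapse to the same value.
    path = path.rstrip("/") or "/"
    # An empty fragment component removes the "#..." suffix.
    return urlunsplit((scheme, netloc, path, query, ""))


if __name__ == "__main__":
    seen = {
        normalize_url(u)
        for u in ["/some/url/#respond", "/some/url", "/some/url/"]
    }
    print(seen)
```

Storing the normalized form in a set, as in the usage above, lets the three variants count as a single URL.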