panther

Simple crawling and extraction in Python.


License
Other


Read about it at http://python-panther.org!

Get it?  It's a panther.

Panther is a very simple Python scraping library with an emphasis on rapid development, ease of use, and cute panthers. This package is still in a very early development stage but, hey, it works!

Installation

pip install panther

How to use

Panther exposes two main methods, pounce() and prowl().

pounce() takes two objects -- a URL (or list of URLs) to check and a CSS/XPath selector (or list of selectors) to extract, e.g.:

import panther

# Grab the links to the top 125 subreddits.
url = "http://www.redditlist.com/"
links = panther.pounce(url, "#yw2 td:nth-child(2) a")
# Append "gilded" to each link's href to get that subreddit's gilded page.
urls = [a.get('href') + "gilded" for a in links]
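To make the selector step concrete, here is a minimal sketch of selector-based extraction that accepts either one selector or a list of them. It is not Panther's implementation: extract() and as_list() are hypothetical helpers, the page is an inline well-formed snippet rather than a fetched URL, and it uses the limited XPath subset of the standard library's xml.etree.ElementTree instead of lxml and cssselect.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a fetched page. Real pages are rarely well-formed
# XML, which is why a real scraper would use a forgiving parser such as lxml.
PAGE = """
<html><body>
  <table id="yw2">
    <tr><td>1</td><td><a href="/r/funny/">funny</a></td></tr>
    <tr><td>2</td><td><a href="/r/pics/">pics</a></td></tr>
  </table>
</body></html>
""".strip()


def as_list(value):
    """Accept a single item or a list/tuple of items; always return a list."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def extract(markup, selectors):
    """Return every element matched by any of the given XPath selectors."""
    root = ET.fromstring(markup)
    matches = []
    for selector in as_list(selectors):
        matches.extend(root.findall(selector))
    return matches


# Roughly the XPath counterpart of the CSS selector "#yw2 td:nth-child(2) a".
links = extract(PAGE, ".//table[@id='yw2']//td[2]/a")
urls = [a.get("href") + "gilded" for a in links]
print(urls)
```

Normalizing the argument with as_list() is one simple way to support the "single value or list" calling style described above without duplicating the extraction loop.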

prowl() takes those same two objects, as well as a third object -- another CSS/XPath selector (or list of selectors). If that third selector matches any links, prowl() crawls the URLs they point to as well, e.g.:

import panther

url = "http://dcurt.is/the-fight"
selectors = [".article_title a", ".num"]
next_button = "#readnext a"

# result.get(selector) returns the elements that selector matched on a page.
for result in panther.prowl(url, selectors, next_button):
    print("%s %s" % (result.get(selectors[0])[0].text, result.get(selectors[1])[0].text))
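The crawl-and-follow pattern that prowl() describes can be sketched roughly as follows. This is an illustration under stated assumptions, not Panther's code: crawl() and the in-memory PAGES dict are hypothetical, the result shape (a dict keyed by selector) is inferred from the example above, and the visited-URL set is a design choice of this sketch rather than documented Panther behavior. Selectors use the XPath subset supported by the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "site": URL -> well-formed markup.
# A real crawler would fetch each URL over HTTP instead.
PAGES = {
    "/one": '<html><body><h1 class="title">One</h1>'
            '<div id="readnext"><a href="/two">next</a></div></body></html>',
    "/two": '<html><body><h1 class="title">Two</h1>'
            '<div id="readnext"><a href="/one">next</a></div></body></html>',
}


def crawl(start_url, selectors, next_selector, fetch=PAGES.get):
    """Yield one {selector: [elements]} dict per page, following next links."""
    queue = [start_url]
    seen = set()
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue  # skip pages already visited so link cycles terminate
        seen.add(url)
        markup = fetch(url)
        if markup is None:
            continue  # unknown URL: nothing to extract
        root = ET.fromstring(markup)
        yield {selector: root.findall(selector) for selector in selectors}
        # Queue every URL that the "next" selector points to.
        for link in root.findall(next_selector):
            href = link.get("href")
            if href:
                queue.append(href)


selectors = [".//h1[@class='title']"]
next_button = ".//div[@id='readnext']/a"
titles = [result[selectors[0]][0].text
          for result in crawl("/one", selectors, next_button)]
print(titles)
```

Writing crawl() as a generator lets the caller process each page as soon as it is parsed, which matches the for-loop usage shown for prowl().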

Check out the examples folder for, well, examples.

Dependencies

  • cssselect
  • lxml
  • requests