Protego is a pure-Python robots.txt
parser with support for modern conventions.
To install Protego, simply use pip:

```shell
pip install protego
```

Then you can start using Protego right away:
```python
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m  # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
```
Using Protego with Requests:
```python
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
```
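If your crawler visits many pages on the same site, it is wasteful to re-download and re-parse `robots.txt` for every request. Below is a minimal sketch of a per-domain cache built on Protego and Requests; the `RobotsCache` class, its `allowed` method, and the "missing robots.txt means everything is allowed" policy are assumptions introduced here, not part of Protego's API:

```python
from urllib.parse import urlsplit, urlunsplit

import requests
from protego import Protego


class RobotsCache:
    """Hypothetical helper: one parsed robots.txt per scheme+host."""

    def __init__(self, user_agent="mybot"):
        self.user_agent = user_agent
        self._parsers = {}  # (scheme, netloc) -> Protego instance

    def _parser_for(self, url):
        parts = urlsplit(url)
        key = (parts.scheme, parts.netloc)
        if key not in self._parsers:
            robots_url = urlunsplit(
                (parts.scheme, parts.netloc, "/robots.txt", "", "")
            )
            response = requests.get(robots_url, timeout=10)
            # Assumption: treat a missing robots.txt as "everything allowed".
            body = response.text if response.status_code == 200 else ""
            self._parsers[key] = Protego.parse(body)
        return self._parsers[key]

    def allowed(self, url):
        return self._parser_for(url).can_fetch(url, self.user_agent)


cache = RobotsCache(user_agent="mybot")
print(cache.allowed("https://google.com/search/about"))
```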
The following table compares Protego to the most popular robots.txt
parsers implemented in Python or featuring Python bindings:
|                         | Protego | RobotFileParser | Reppy | Robotexclusionrulesparser |
|-------------------------|---------|-----------------|-------|---------------------------|
| Implementation language | Python  | Python          | C++   | Python                    |
| Reference specification | Google  | Martijn Koster's 1996 draft |  |             |
| Wildcard support        | ✓       |                 | ✓     | ✓                         |
| Length-based precedence | ✓       |                 | ✓     |                           |
| Performance             |         | +40%            | +1300% | -25%                     |
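The precedence row describes how conflicting rules are resolved: Protego, following Google's approach, lets the longest (most specific) matching rule win, while parsers that follow the 1996 draft, such as RobotFileParser, apply the first rule that matches. The sketch below illustrates the difference on a toy `robots.txt`; `mybot` and `example.com` are placeholders:

```python
from urllib.robotparser import RobotFileParser

from protego import Protego

robotstxt = """
User-agent: *
Disallow: /
Allow: /about
"""

# Protego: the longer rule ("Allow: /about") outranks "Disallow: /".
rp = Protego.parse(robotstxt)
print(rp.can_fetch("http://example.com/about", "mybot"))  # True

# RobotFileParser: the first matching rule ("Disallow: /") wins.
rfp = RobotFileParser()
rfp.parse(robotstxt.splitlines())
print(rfp.can_fetch("mybot", "http://example.com/about"))  # False
```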
Class `protego.Protego`:

Properties:

- `sitemaps` {`list_iterator`} A list of sitemaps specified in `robots.txt`.
- `preferred_host` {string} Preferred host specified in `robots.txt`.

Methods:

- `parse(robotstxt_body)` Parse `robots.txt` and return a new instance of `protego.Protego`.
- `can_fetch(url, user_agent)` Return `True` if the user agent can fetch the URL, otherwise return `False`.
- `crawl_delay(user_agent)` Return the crawl delay specified for the user agent as a float. If nothing is specified, return `None`.
- `request_rate(user_agent)` Return the request rate specified for the user agent as a named tuple `RequestRate(requests, seconds, start_time, end_time)`. If nothing is specified, return `None`.
- `visit_time(user_agent)` Return the visit time specified for the user agent as a named tuple `VisitTime(start_time, end_time)`. If nothing is specified, return `None`.
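As a final illustration, these methods can be combined to drive a polite fetch loop. This is a minimal sketch, not something prescribed by Protego: the `polite_delay` helper, the 1-second default, and the policy of taking the larger of the Crawl-delay and the Request-rate interval are all assumptions:

```python
import time

import requests
from protego import Protego

USER_AGENT = "mybot"  # assumption: your crawler's user agent token


def polite_delay(rp, user_agent, default=1.0):
    """Hypothetical helper: pick a delay honoring Crawl-delay and Request-rate."""
    candidates = [default]
    delay = rp.crawl_delay(user_agent)
    if delay is not None:
        candidates.append(delay)
    rate = rp.request_rate(user_agent)
    if rate is not None:
        candidates.append(rate.seconds / rate.requests)
    return max(candidates)


robotstxt = requests.get("https://example.com/robots.txt").text
rp = Protego.parse(robotstxt)
urls = ["https://example.com/", "https://example.com/about"]

for url in urls:
    if not rp.can_fetch(url, USER_AGENT):
        continue  # skip URLs the site disallows
    requests.get(url, headers={"User-Agent": USER_AGENT})
    time.sleep(polite_delay(rp, USER_AGENT))
```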