A package for scraping webpages asynchronously using aiohttp and asyncio. It includes error handling to work around common issues such as sites blocking you after many requests in a short period.


Keywords
scraping, async, requests
License
MIT
Install
pip install async-scrape==0.1.18

Documentation

Async-scrape

Perform web scraping asynchronously

Async-scrape is a package that uses asyncio and aiohttp to scrape websites, with useful features built in.

Features

  • Breaks - pauses scraping when a website consistently blocks your requests
  • Rate limit - slows down scraping to prevent being blocked (a configuration sketch follows this list)
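
Both features are configured through the AsyncScrape constructor arguments shown in the usage example further down. Below is a minimal sketch of a conservative setup, assuming only the argument names from that example (attempt_limit, rest_between_attempts, rest_wait, acceptable_error_limit) and that rest_wait is a wait time in seconds:

from async_scrape import AsyncScrape

# Conservative settings for a site that blocks aggressively:
# limit retries, pause between attempt cycles, and tolerate
# at most 50 errored requests before giving up.
cautious_scrape = AsyncScrape(
    post_process_func=lambda html, resp, **kwargs: html,  # keep the raw html
    post_process_kwargs={},
    attempt_limit=3,
    rest_between_attempts=True,
    rest_wait=120,
    acceptable_error_limit=50
)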

Installation

Async-scrape requires C++ Build tools v15+ to run.

pip install async-scrape

How to use it

# Create an instance
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,  # function applied to each response
    post_process_kwargs={},          # extra kwargs passed to that function
    fetch_error_handler=None,        # optional handler for fetch errors
    use_proxy=False,                 # route requests through a proxy
    proxy=None,                      # proxy address, if used
    pac_url=None,                    # proxy auto-config URL, if used
    acceptable_error_limit=100,      # number of errors tolerated
    attempt_limit=5,                 # maximum number of attempts
    rest_between_attempts=True,      # pause when requests keep failing
    rest_wait=60                     # how long to rest between attempts
)

urls = [
    "https://www.google.com",
    "https://www.bing.com",
]

resps = async_Scrape.scrape_all(urls)

The response object is a list of dicts in the format:

{
    "url": url,                # url of the request
    "func_resp": func_resp,    # response from the post-processing function
    "status": resp.status,     # http status
    "error": None              # any error encountered
}
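
Since each entry carries the url, status, and error keys shown above, the results can be filtered after the scrape. A minimal sketch, assuming resps is the list returned by scrape_all in the example above:

# Split results using the keys documented above
successes = [r for r in resps if r["error"] is None and r["status"] == 200]
failures = [r for r in resps if r["error"] is not None or r["status"] != 200]

for r in failures:
    print(f"{r['url']} failed with status {r['status']}: {r['error']}")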

License

MIT

Free Software, Hell Yeah!