Web scraping API to outsource tons of GET & XPath work to cloud computing


Keywords
umihico, minigun, web, scraping, requests, lxml, proxy, threading, multiprocessing, aws, cloud, lambda, crawler, crawling, scraping-api, scraping-framework, scraping-python, web-scraping
License
MIT
Install
pip install minigun==0.1.8

Documentation

minigun-requests

Web scraping API to outsource tons of GET & XPath work to cloud computing


Features

  • The back end processes your requests at 1,000-200,000 rounds per minute, like a minigun's rate of fire.
  • Concurrency scales automatically by design so that requests finish within 5 minutes regardless of the amount.

Performance Examples

  • 6,911 requests to get all stock prices from www.nasdaq.com in 72 seconds
  • 34,453 requests to get all available rental names in Tokyo from www.suumo.jp in 201 seconds
  • 500,000 requests to get new question titles from www.stackoverflow.com in 151 seconds
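
These figures line up with the rate-of-fire claim above; here is a quick sanity check of the implied throughput (plain arithmetic on the numbers listed, nothing assumed):

# Throughput implied by the examples above, in requests per minute
for reqs, secs in [(6911, 72), (34453, 201), (500000, 151)]:
    print(f"{reqs} requests / {secs}s = {reqs / secs * 60:,.0f} req/min")
# -> ~5,759, ~10,284 and ~198,675 req/min, within the quoted
#    1,000-200,000 rounds-per-minute range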

Getting Started

Installing

pip install minigun

Running a quick test with the trial account

import minigun
urls = [
    "https://www.xxx.com/pages/1",
    "https://www.xxx.com/pages/2",
    "https://www.xxx.com/pages/3",
]
scraping_xpaths = [
    "//div[@id='xxx']",
    "//div[@id='yyy']",
]
result = minigun.requests(urls, scraping_xpaths, email='trial', password='trial')

# If you abort while waiting, use the get_output_from_url function to retrieve the result
result = minigun.get_output_from_url("http://minigun.umihi.co/DISPLAYED_NUMBERS.txt")
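
The return value is a nested dictionary (see "When you get error from result" below). A minimal sketch of reading it, assuming it is keyed first by URL and then by XPath (the exact nesting is an assumption inferred from these docs):

# Assumed shape: {url: {xpath: matches_or_"error"}}
for url, per_xpath in result.items():
    for xpath, value in per_xpath.items():
        if value == "error":
            print(f"{url}: {xpath} failed validation")
        else:
            print(f"{url}: {xpath} -> {value}")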

1 cent = 100 requests! From $3

 If you are sure your arguments work well and you want to make more requests, go to the PayPal page and top up. After payment, PayPal's instant payment notification immediately triggers registration and the top-up function. You can then replace the trial arguments with your PayPal email address and the password you set.

import minigun
minigun.requests(urls, scraping_xpaths, email='YOUR PAYPAL EMAIL', password='YOUR PASSWORD')
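
For a rough sense of cost at the rate above, 1 cent buys 100 requests, so dollars scale linearly with request count (plain arithmetic, using one of the performance examples as the workload):

# Cost arithmetic at 1 cent per 100 requests
requests_needed = 34453                 # e.g. the suumo.jp example above
cost_usd = requests_needed / 100 / 100  # 100 requests per cent, 100 cents per dollar
print(f"~${cost_usd:.2f}")              # -> ~$3.45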

Advanced Usage

Can I check how much balance I have left?

This command will print your balance.

import minigun
minigun.get_left_balance(email="YOUR PAYPAL EMAIL", password="YOUR PASSWORD")

Can I change my password?

You can set or change your password only when you top up. Only the newest password works.

When you get error from result

 If you get the nested dictionary output correctly but some values are "error", it means one of the "validation_xpaths" kept returning False on the parsed HTML, even after many retries with IP rotation. "validation_xpaths" is an optional argument; by default it is generated from "scraping_xpaths" like this.

validation_xpath = "boolean(" + scraping_xpath + ")"
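
Applied to the getting-started example, the defaults would presumably expand like this (a sketch; the list comprehension is an assumption about how the back end builds the list):

# One default validation XPath per scraping XPath (assumed construction)
scraping_xpaths = ["//div[@id='xxx']", "//div[@id='yyy']"]
validation_xpaths = ["boolean(" + xp + ")" for xp in scraping_xpaths]
# -> ["boolean(//div[@id='xxx'])", "boolean(//div[@id='yyy'])"]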

With the default validation_xpaths, "error" means "one of the scraping_xpaths couldn't find any elements." That is what is happening in the back end. Please check the URL and make sure every scraping_xpath picks at least one element from the page. If the element you want is not always present, you need to customize validation_xpaths.

 Why are validation_xpaths necessary? Across tons of requests, the responses are not always what you want: busy pages, IP blocks, and unrelated responses from proxy servers. "validation_xpaths" are used to detect such unwanted responses so that the process can retry with another IP. This is a common problem in web scraping (some websites block you even if your access rate is slow).
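
As a rough local illustration of that retry loop (a sketch only, not the actual back end; fetch_validated, the timeout, and max_retries are made up for the example):

import requests
from lxml import html

def fetch_validated(url, scraping_xpaths, validation_xpaths, max_retries=5):
    # Accept a response only when every validation XPath evaluates to True
    # on the parsed page; otherwise retry.
    for _ in range(max_retries):
        tree = html.fromstring(requests.get(url, timeout=10).content)
        if all(tree.xpath(vx) for vx in validation_xpaths):
            return {sx: tree.xpath(sx) for sx in scraping_xpaths}
        # The real service would rotate to another proxy IP here.
    return "error"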

Examples of "validation_xpaths"

 Best practice is to keep validation_xpaths simple: specify only elements that definitely exist and are unique to the page you want, and that do not appear in busy/blocked/unrelated responses. For example, if you are scraping a personal profile page, "Name" sounds definite, but "email" and "LinkedIn" sound optional. More special-case examples are below:

# Case1: scraping_xpaths are weak and highly likely to match any response
scraping_xpaths = ['//title', ]
# fine if you only want titles, but not useful for kicking unwanted responses out
validation_xpaths = ["boolean(//*[@id='something_unique'])", ]
# specify something which doesn't exist in busy/wrong/blocked/unknown responses

# Case2: unsure whether the url (page) exists or not
# you can still scrape a 404 page if the content is HTML; telling the back end
# that 404 is an expected response stops the retrying
validation_xpaths = ["boolean(//*[@id='something_unique_when_200']|//*[@id='something_unique_when_404'])", ]
# use "|" as "or"

# Case3: the error page is quite similar to a normal response
validation_xpaths = ["not(//*[@id='busy_page_unique_element'])", ]
# use the "not" function to detect an element which appears only in the error response
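
Customized validation_xpaths would then be passed alongside scraping_xpaths (a minimal sketch; that validation_xpaths is the keyword argument's name is an assumption based on the naming in these docs):

import minigun
# Assumption: validation_xpaths is accepted as a keyword argument and
# overrides the boolean() defaults described above
result = minigun.requests(urls, scraping_xpaths,
                          validation_xpaths=validation_xpaths,
                          email='trial', password='trial')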

Issues

  • JavaScript-rendered dynamic pages are not supported.
  • CPUs are scalable, but there are only about 2,000-5,000 proxy servers so far; their bandwidth and IP reputations are the biggest speed limit.
  • AWS Lambda limits the payload to 6 MB, so the number of URLs in one API request is limited.
  • Sometimes results are partially "error" even when the url and validation_xpaths are correct. This happens because validation uses a statistical approach: it gives up and returns False when "good" proxy servers fail to GET many times. Proxy servers are rated "good" when they have succeeded many times in previous requests, so when the good ones all get banned at the same time, the back end judges wrongly.

Contributing

  • Help with improving my English is greatly appreciated
  • Feel free to tell me about features you want and errors you are facing