A package of quick-use utilities


Keywords
crawler, distributed, ip, proxies, proxy, proxy-provider, proxypool, python, redis, spider, spoon
License
BSD-3-Clause
Install
pip install yhb==0.1.9

Documentation

Spoon - A package for building site-specific proxy pools.

Spoon is a library for building a distributed proxy pool for each site you assign.
It runs on Python 3 only.

Install

Simply run pip install spoonproxy, or clone the repo and add it to your PYTHONPATH.

Run

Spoon-server

Make sure Redis is running. The default configuration is "host:localhost, port:6379", and you can change the Redis connection settings.
As in example.py in spoon_server/example, you can assign several different proxy providers.

from spoon_server.proxy.fetcher import Fetcher
from spoon_server.main.proxy_pipe import ProxyPipe
from spoon_server.proxy.kuai_provider import KuaiProvider
from spoon_server.proxy.xici_provider import XiciProvider
from spoon_server.database.redis_config import RedisConfig
from spoon_server.main.checker import CheckerBaidu

def main_run():
    redis = RedisConfig("127.0.0.1", 21009)
    p1 = ProxyPipe(url_prefix="https://www.baidu.com",
                   fetcher=Fetcher(use_default=False),
                   database=redis,
                   checker=CheckerBaidu()).set_fetcher([KuaiProvider()]).add_fetcher([XiciProvider()])
    p1.start()


if __name__ == '__main__':
    main_run()
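
Before starting the pipe, it can help to confirm that something is listening on the Redis host and port you configured. The following is a minimal sketch using only the standard library; the helper name is_port_open is an illustrative assumption and not part of spoon_server. A successful TCP connect only shows that a service is listening, not that it is Redis.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Same host and port as the RedisConfig in the example above.
    print(is_port_open("127.0.0.1", 21009))
```

If this prints False, start Redis or adjust the host and port passed to RedisConfig before calling p1.start().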

With a custom checker, you can validate the response for a given site more precisely.

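import re

from spoon_server.main.checker import Checker  # assumed to live next to CheckerBaidu


class CheckerBaidu(Checker):
    def checker_func(self, html=None):
        # A missing response counts as a failed check.
        if html is None:
            return False
        if isinstance(html, bytes):
            html = html.decode('utf-8')
        # Match Baidu's page title ("Baidu it, and you'll know").
        return bool(re.search(r".*百度一下,你就知道.*", html))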
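
The same check can be tried outside of Spoon when writing a checker for another site. The sketch below is a standalone function, not part of spoon_server; the name make_checker, its pattern argument, and the example title are illustrative assumptions.

```python
import re


def make_checker(pattern):
    """Build a function that reports whether a response body matches pattern."""
    compiled = re.compile(pattern)

    def checker_func(html=None):
        # No response at all means the proxy failed the check.
        if html is None:
            return False
        # Accept raw bytes as well as text; undecodable bytes are replaced.
        if isinstance(html, bytes):
            html = html.decode("utf-8", errors="replace")
        return compiled.search(html) is not None

    return checker_func


# Hypothetical target: a page whose title is "Example Domain".
check_example = make_checker(r"<title>Example Domain</title>")
```

The body of checker_func could then be moved into a Checker subclass like the one above once the pattern reliably separates real pages from error or block pages.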

As shown in spoon_server/example/example_multi.py, you can use multiple processes to run several queues that fetch and validate proxies.
You can also assign different providers to different URLs.
The default proxy providers are listed below; you can also write your own.

name description
WebProvider Get proxy from http api
FileProvider Get proxy from file
GouProvider http://www.goubanjia.com
KuaiProvider http://www.kuaidaili.com
SixProvider http://m.66ip.cn
UsProvider https://www.us-proxy.org
WuyouProvider http://www.data5u.com
XiciProvider http://www.xicidaili.com
IP181Provider http://www.ip181.com
XunProvider http://www.xdaili.cn
PlpProvider https://list.proxylistplus.com
IP3366Provider http://www.ip3366.net
BusyProvider https://proxy.coderbusy.com
NianProvider http://www.nianshao.me
PdbProvider http://proxydb.net
ZdayeProvider http://ip.zdaye.com
YaoProvider http://www.httpsdaili.com/
FeilongProvider http://www.feilongip.com/
IP31Provider https://31f.cn/http-proxy/
XiaohexiaProvider http://www.xiaohexia.cn/
CoolProvider https://www.cool-proxy.net/
NNtimeProvider http://nntime.com/
ListendeProvider https://www.proxy-listen.de/
IhuanProvider https://ip.ihuan.me/
IphaiProvider http://www.iphai.com/
MimvpProvider(@NeedCaptcha) https://proxy.mimvp.com/
GPProvider(@NeedProxy if you're in China) http://www.gatherproxy.com
FPLProvider(@NeedProxy if you're in China) https://free-proxy-list.net
SSLProvider(@NeedProxy if you're in China) https://www.sslproxies.org
NordProvider(@NeedProxy if you're in China) https://nordvpn.com
PremProvider(@NeedProxy if you're in China) https://premproxy.com
YouProvider(@Deprecated) http://www.youdaili.net
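
Whatever the source, a provider has to turn a page or file into ip:port strings. The sketch below shows only that parsing step with the standard library. It does not use Spoon's provider base class, whose interface is not shown here, and the name extract_proxies is an illustrative assumption.

```python
import re

# Dotted-quad address followed by a port of up to five digits.
PROXY_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b")


def extract_proxies(text):
    """Return unique 'ip:port' strings found in text, in order of first appearance."""
    seen = set()
    result = []
    for ip, port in PROXY_RE.findall(text):
        # Drop matches whose octets or port are out of range.
        octets_ok = all(0 <= int(part) <= 255 for part in ip.split("."))
        port_ok = 0 < int(port) <= 65535
        if not (octets_ok and port_ok):
            continue
        proxy = f"{ip}:{port}"
        if proxy not in seen:
            seen.add(proxy)
            result.append(proxy)
    return result
```

A custom provider could call a function like this on the text it downloads or reads from disk, and hand the resulting list to the pool.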

Spoon-web

A simple Django web API demo. You can use any web server and write your own API.
Run python manager.py runserver **.**.**.**:*****
The simple APIs include:

name description
http://127.0.0.1:21010/api/v1/get_keys Get all keys from redis
http://127.0.0.1:21010/api/v1/fetchone_from?target=www.google.com&filter=65 Get one useful proxy.
target: the specific url
filter: number of successful revalidations
http://127.0.0.1:21010/api/v1/fetchall_from?target=www.google.com&filter=65 Get all useful proxies.
http://127.0.0.1:21010/api/v1/fetch_hundred_recent?target=www.baidu.com&filter=5 Get recently added proxies with a full score.
target: the specific url
filter: time in seconds
http://127.0.0.1:21010/api/v1/fetch_stale?num=100 Get recent proxies that have not been checked.
num: the specific number of proxies you want
http://127.0.0.1:21010/api/v1/fetch_recent?target=www.baidu.com Get recent proxies that were successfully validated.
target: the specific url
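
A client for these endpoints can be sketched with the standard library alone. The base URL and endpoint names below come from the list above; the helper names build_url and call_api are illustrative assumptions, and because the response format is not documented here, the body is returned as raw text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Address of the demo server as used in the endpoint list above.
API_BASE = "http://127.0.0.1:21010/api/v1"


def build_url(endpoint, **params):
    """Build the full URL for an endpoint, appending query parameters if any."""
    query = urlencode(params)
    url = f"{API_BASE}/{endpoint}"
    return f"{url}?{query}" if query else url


def call_api(endpoint, timeout=5, **params):
    """Call an endpoint and return the raw response body as text."""
    with urlopen(build_url(endpoint, **params), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With the web demo running, call_api("fetchone_from", target="www.google.com", filter=65) would return whatever that endpoint sends back.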