winspider 0.0.11 on PyPI

winspider

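winspider is a crawler-sharing platform built on pyspider. It adds some extensions to pyspider, so the pyspider tutorials still apply, and any crawler developed with the pyspider framework runs in winspider unchanged.

winspider provides classes and methods that make crawler development simpler. For example, to collect data that is only available when logged in, you no longer need to analyze the site's login logic: call self.crawl_with_chrome_cookies, which sends the request with Chrome's cookies. All you have to do is log in to the site with Chrome beforehand.

Beyond that, winspider's main goal is crawler sharing: you can look for the crawler you need on WinSpider, or share your own.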

Installation

pip install winspider

Here is a simple example:

from pyspider.libs.base_handler import *
from winspider import HandlerMixin, ChromeObj, ResponseInspector

chrome_obj = ChromeObj()
# A checkpoint is a (rule, message) tuple; this rule matches Amazon's CAPTCHA form.
checkpoint1 = (
    lambda response: response.etree.xpath("*//form[@action='/errors/validateCaptcha']/div/div/div/div/img"), 'CAPTCHA encountered')


class AmzHandler(BaseHandler, HandlerMixin):
    crawl_config = {
        'headers': {
            'User-Agent': chrome_obj.useragent
        }
    }

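    @every(minutes=3)
    def on_start(self):
        # Each .txt file in the project's configuration folder lists one ASIN per line.
        for f in self.walk_files(['.txt']):
            with open(f, 'r') as file:
                for asin in file:
                    asin = asin.strip()
                    if not asin:
                        continue  # skip blank lines
                    self.crawl_with_chrome_cookies(
                        'https://www.amazon.com/dp/%s' % asin,
                        callback=self.details_page,
                        validate_cert=False
                    )
            # Move the processed file to the recycle directory.
            self.recycle_file(f)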

    @config(age=60 * 60)
    @ResponseInspector(checkpoint1)
    def details_page(self, response):
        return {
            'title': ''.join(response.etree.xpath(".//span[@id='productTitle']/text()")).strip()
        }

This crawler scrapes product pages on Amazon (US); the only field it collects is the product title.

It relies on three winspider classes: ChromeObj, HandlerMixin, and ResponseInspector.

ChromeObj

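Looks up the cookies and User-Agent of the Chrome browser.

get_cookies(url)

Returns the cookies for url.

useragent

Returns the current Chrome User-Agent. The User-Agent must first be saved with winspider; otherwise an empty string is returned.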
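Because useragent can be an empty string, a handler may want a fallback before putting it into crawl_config. The following is a minimal sketch of that idea, not part of winspider: resolve_user_agent and FALLBACK_USER_AGENT are hypothetical names, and the fallback string is only a placeholder.

```python
# Hypothetical fallback value; substitute any User-Agent string you prefer.
FALLBACK_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def resolve_user_agent(saved_user_agent, fallback=FALLBACK_USER_AGENT):
    """Return the saved Chrome User-Agent, or the fallback if it is empty."""
    # ChromeObj.useragent is documented to return an empty string when
    # nothing has been saved, so empty or whitespace-only counts as missing.
    saved = (saved_user_agent or "").strip()
    return saved if saved else fallback


# In a real handler the argument would be ChromeObj().useragent;
# an empty string is passed here to show the fallback path.
headers = {"User-Agent": resolve_user_agent("")}
```

In a handler, the resulting headers dict could be used as the 'headers' entry of crawl_config, in place of reading chrome_obj.useragent directly.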

HandlerMixin

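A mixin that extends pyspider.BaseHandler with methods suited to winspider.

During initialization, HandlerMixin creates a configuration folder named [project_name] under [My Documents] > winspider, where [project_name] is the crawler's (unique) name.

folder

Returns the path of the crawler's configuration folder.

crawl_with_chrome_cookies(url, **kwargs)

Sends the request with Chrome's cookies. The parameters are the same as for pyspider's self.crawl method; see the pyspider documentation.

remove_file(file)

Deletes a file from the configuration folder.

recycle_file(file)

Recycles a file from the configuration folder by moving it to the recycle directory.

walk_files(suffixes=['.txt'])

Iterates over the files in the configuration folder whose extension is in suffixes. Mainly used to parse those files and generate new crawl tasks.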
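The walk-then-recycle pattern described above can be illustrated without winspider. The sketch below is a standalone approximation, not winspider's implementation: the two functions take the folder as an explicit argument, and it is an assumption that files are matched non-recursively and that the recycle directory sits inside the configuration folder.

```python
import shutil
import tempfile
from pathlib import Path


def walk_files(folder, suffixes=(".txt",)):
    """Yield files directly inside `folder` whose suffix is in `suffixes`."""
    # sorted() materializes the listing, so moving files while iterating is safe.
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path


def recycle_file(folder, file):
    """Move `file` into `folder/recycle`, creating that directory if needed."""
    recycle_dir = Path(folder) / "recycle"
    recycle_dir.mkdir(exist_ok=True)
    target = recycle_dir / Path(file).name
    shutil.move(str(file), str(target))
    return target


# Demonstration in a throwaway directory standing in for the configuration folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "asins.txt").write_text("B000000001\n\nB000000002\n")
    (Path(tmp) / "notes.md").write_text("ignored")

    asins = []
    for f in walk_files(tmp):
        # One task per non-blank line, then retire the input file.
        asins += [line.strip() for line in f.read_text().splitlines() if line.strip()]
        recycle_file(tmp, f)

    remaining = [p.name for p in walk_files(tmp)]
    recycled = [p.name for p in (Path(tmp) / "recycle").iterdir()]
```

Moving each input file out of the way after it has been parsed means a later run of the same loop does not pick it up again.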

ResponseInspector

Used as a decorator to check whether the returned Response is normal.

checkpoint1 = (lambda response: response.etree.xpath("xpath1"), 'message 1')
checkpoint2 = (lambda response: response.etree.xpath("xpath2"), 'message 2')

...

@ResponseInspector(checkpoint1, checkpoint2)
def details_page(self, response):
    ...

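Each checkpoint is a tuple. The first item is an anonymous function that supplies the check rule; the second is a message, which may be an empty string. When the message is not empty, a dialog box alerts the user; the dialog does not reappear within 5 minutes.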
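To make the checkpoint mechanism concrete, here is a minimal sketch of a decorator that behaves roughly as described. It is not winspider's code: InspectorSketch, notify, cooldown, and clock are hypothetical names, a plain callback stands in for the dialog box, and skipping the callback when a rule matches is an assumption. As in the Amazon example, a truthy rule result is taken to mean the page is abnormal.

```python
import functools
import time


class InspectorSketch:
    """Hypothetical stand-in for a checkpoint decorator; not winspider's code."""

    def __init__(self, *checkpoints, notify=print, cooldown=300, clock=time.monotonic):
        self.checkpoints = checkpoints  # (rule, message) tuples
        self.notify = notify            # stands in for the dialog box
        self.cooldown = cooldown        # seconds; 300 matches the 5-minute window
        self.clock = clock
        self._last_notified = {}

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(handler, response):
            for rule, message in self.checkpoints:
                if rule(response):  # truthy result: the page is abnormal
                    self._maybe_notify(message)
                    return None  # assumption: skip the callback on failure
            return func(handler, response)
        return wrapper

    def _maybe_notify(self, message):
        if not message:
            return  # an empty message means "check silently"
        now = self.clock()
        last = self._last_notified.get(message)
        if last is None or now - last >= self.cooldown:
            self._last_notified[message] = now
            self.notify(message)


# Demonstration with plain strings standing in for Response objects.
shown = []
captcha = (lambda response: "captcha" in response, "CAPTCHA encountered")


@InspectorSketch(captcha, notify=shown.append)
def details_page(handler, response):
    return {"title": response}


ok = details_page(None, "Product page")
blocked_first = details_page(None, "captcha page")
blocked_second = details_page(None, "captcha page")
```

In the demonstration the second blocked call falls inside the cooldown window, so the message is recorded only once.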

Notes

winspider runs only on Windows.
