cmip

一个高效的信息处理库。

安装

pip install -U cmip

用法

1. 动态渲染异步爬虫

example:

from cmip.web import web_scraping
import asyncio
urls = [
        "https://baidu.com",
        "https://qq.com",
        # ...More URL
    ]
asyncio.run(web_scraping(urls, output_path="output", max_concurrent_tasks=10, save_image=True, min_img_size=200))

参数含义：

urls	网页链接（包含协议头）
output_path	输出路径
max_concurrent_tasks	最大同时执行任务数，根据自身机器资源和网络情况调整
save_image	是否保存图片
min_img_size	当图片小于这个值时不爬取

2. 计算网页结构相似度

example:

from cmip.web import html, simhash_array, hamming_distance_array
import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

html_text_baidu = requests.get("https://baidu.com", headers=headers).text
html_text_example_com = requests.get("https://example.com", headers=headers).text
html_text_example_org = requests.get("https://example.org", headers=headers).text
dom_tree_baidu = html.dom_tree(html_text_baidu)
dom_tree_example_com = html.dom_tree(html_text_example_com)
dom_tree_example_org = html.dom_tree(html_text_example_org)
simhash_baidu = simhash_array(dom_tree_baidu)
simhash_example_com = simhash_array(dom_tree_example_com)
simhash_example_org = simhash_array(dom_tree_example_org)

print("Similarity:", hamming_distance_array([simhash_example_com], [simhash_baidu, simhash_example_org]))

一般而言，当网页距离在4以内时，网页结构较为相似。

cmip
Release 0.0.3

Release 0.0.3

0.0.1

0.0.2

0.0.3

0.0.4

0.0.5

0.0.6

0.0.7

0.0.9

0.0.8

Documentation

cmip

安装

用法

1. 动态渲染异步爬虫

2. 计算网页结构相似度

Stats

Development practices

Releases

Contributors

cmip Release 0.0.3

Release 0.0.3 Toggle Dropdown 0.0.1 0.0.2 0.0.3 0.0.4 0.0.5 0.0.6 0.0.7 0.0.9 0.0.8

Documentation

cmip

安装

用法

1. 动态渲染异步爬虫

2. 计算网页结构相似度

Stats

Development practices

Releases

Contributors

cmip
Release 0.0.3

Release 0.0.3

0.0.1

0.0.2

0.0.3

0.0.4

0.0.5

0.0.6

0.0.7

0.0.9

0.0.8