Provides a universal solution for crawler platforms. Read more: https://github.com/ClericPy/uniparser.


Keywords
requests, crawler, parser, tools, universal, lxml, beautifulsoup, bs4, jsonpath, udf, toml, yaml, css, xml
License
MIT
Install
pip install uniparser==3.0.2

Documentation


Provides a universal solution for crawler platforms. Python 3.6+ is required.

Install

pip install uniparser -U

Why?

  1. Reduce the amount of code duplicated across many similar crawlers & parsers. Don't Repeat Yourself.
  2. Make the parsing process of different parsers persistent: rules serialize to JSON and back.
  3. Separate the parsing process from downloading (see the sketch after this list).
  4. Provide a universal solution for crawler platforms.
  5. Bring the common string-parsing tools together under one interface.
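
A minimal sketch of point 3: because parse rules operate on plain strings, they can run on a literal JSON body with no HTTP request involved. This assumes the CrawlerRule / Uniparser API shown in the Quick Start below, and that the context argument of parse is optional when no udf rule needs it.

from uniparser import Uniparser, CrawlerRule

uni = Uniparser()
rule = CrawlerRule(
    'offline-demo',
    {'url': 'http://httpbin.org/get', 'method': 'get'},  # kept for URL matching; never requested here
    [
        ['objectpath', 'JSON.url', ''],  # extract the "url" field from the JSON body
        ['python', 'getitem', '[:4]'],   # slice the scheme: 'http'
    ],
    'https?://httpbin.org/get',
)
body = '{"url": "http://httpbin.org/get"}'
print(uni.parse(body, rule))  # -> 'http'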

Quick Start

# -*- coding: utf-8 -*-
import requests
from uniparser import Uniparser, CrawlerRule, HostRules
from urllib.parse import urlparse


def test_default_usage():

    # prepare for storage
    uni = Uniparser()
    storage = {}
    test_url = 'http://httpbin.org/get'
    crawler_rule = CrawlerRule(
        'test',
        {
            'url': 'http://httpbin.org/get',
            'method': 'get'
        },
        [
            ['objectpath', 'JSON.url', ''],
            ['python', 'getitem', '[:4]'],
            ['udf', '(context.url, input_object)', ''],
        ],
        'https?://httpbin.org/get',
    )
    host = urlparse(test_url).netloc
    hrs = HostRules(host=host)
    hrs.add(crawler_rule)
    json_string = hrs.to_json()
    # print(json_string)
    assert json_string == '{"host": "httpbin.org", "rules": [{"name": "test", "parse_rules": [["objectpath", "JSON.url", ""], ["python", "getitem", "[:4]"], ["udf", "(context.url, input_object)", ""]], "request_args": {"url": "http://httpbin.org/get", "method": "get"}, "regex": "https?://httpbin.org/get"}]}'
    # add the HostRules to storage; in production this could be Redis
    storage[hrs['host']] = json_string
    # ============================================
    # start to crawl
    # 1. set an example url
    test_url1 = test_url
    # 2. find the HostRules
    json_string = storage.get(host)
    # 3. HostRules init: load from json
    hrs = HostRules.from_json(json_string)
    # print(crawler_rule)
    # 4. now search / match the url with existing rules
    rule = hrs.search(test_url1)
    assert rule == {
        'name': 'test',
        'parse_rules': [['objectpath', 'JSON.url', ''],
                        ['python', 'getitem', '[:4]'],
                        ['udf', '(context.url, input_object)', '']],
        'request_args': {
            'url': 'http://httpbin.org/get',
            'method': 'get'
        },
        'regex': 'https?://httpbin.org/get'
    }
    assert rule == hrs.match(test_url1)
    # 5. download using the rule's request_args
    resp = requests.request(**rule['request_args'])
    # 6. parse with the rule's parse_rules
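    # the three parse_rules run as a chain: 'objectpath' extracts JSON.url from
    # the body, 'python getitem' slices its first four chars ('http'), and the
    # 'udf' returns (context.url, input_object) -- the tuple asserted below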
    result = uni.parse(resp.text, rule, context=resp)
    # print(result)
    assert result == ('http://httpbin.org/get', 'http')
    # ===================== while search failed =====================
    test_url2 = 'http://notmatch.com'
    rule = hrs.search(test_url2)
    assert rule is None


if __name__ == "__main__":
    test_default_usage()
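
The Redis comment in the example above can be made concrete. Below is a sketch, assuming a local Redis server and the redis-py client (pip install redis); the 'rules:<host>' key layout and the save_rules / crawl helpers are illustrative, not part of uniparser itself.

import redis
import requests
from uniparser import Uniparser, HostRules
from urllib.parse import urlparse

r = redis.Redis()  # connects to localhost:6379 by default
uni = Uniparser()


def save_rules(hrs):
    # store each HostRules object as JSON, keyed by its host
    r.set('rules:%s' % hrs['host'], hrs.to_json())


def crawl(url):
    # look up the HostRules for this url's host, then match, download, parse
    json_string = r.get('rules:%s' % urlparse(url).netloc)
    if not json_string:
        return None
    hrs = HostRules.from_json(json_string.decode('utf-8'))
    rule = hrs.search(url)
    if rule is None:
        return None
    resp = requests.request(**rule['request_args'])
    return uni.parse(resp.text, rule, context=resp)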

More Usage

Talk is cheap, code == doc. ^_^

See the examples: test_parsers.py

Online Web UI for testing is coming soon...

Uniparser Test Console Demo

  1. pip install bottle uniparser
  2. python webui_bottle.py
  3. open http://127.0.0.1:8080/ in a browser

(screenshots: 1.png, 2.png)

TODO

  • Release to pypi.org
  • Add GitHub Actions for testing the package
  • Web UI for testing rules
  • Complete the documentation