Extracts content from html using rules.

Textminer extracts values, lists and dicts from text. It works on all text formats and is heavily used on html pages.

Given a piece of html:

<div id="value1">111</div>
<div id="value2">222</div>

The usual way of extracting the two values "111" and "222" is:

start1 = html.find('<div id="value1">')
if start1 == -1:
    end1 = 0
    value1 = None
else:
    start1 += len('<div id="value1">')
    end1 = html.find('</div>', start1)
    value1 = html[start1:end1]
    value1 = int(value1)
# Continue searching from where the first value ended.
start2 = html.find('<div id="value2">', end1)
if start2 == -1:
    value2 = None
else:
    start2 += len('<div id="value2">')
    end2 = html.find('</div>', start2)
    value2 = html[start2:end2]
    value2 = int(value2)

The textminer way of doing the same thing is:

import textminer

rule = '''
dict:
- key: value1
  prefix: <div id="value1">
  suffix: </div>
  type: int
- key: value2
  prefix: <div id="value2">
  suffix: </div>
  type: int
'''
results = textminer.extract(html, rule)

Textminer uses yaml to define rules, which is much clearer and more expressive than hand-written string searching. It also lets you write complex rules for hierarchical extraction (see below).
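
To make the rule semantics concrete, here is a minimal sketch of what a single prefix/suffix/type rule amounts to in plain Python. The helper name extract_between is hypothetical and is not part of textminer; the sketch only illustrates the idea and is not the library's actual implementation.

```python
def extract_between(text, prefix, suffix, convert=None, start=0):
    """Return (value, next_pos) for the text between prefix and suffix.

    Hypothetical helper for illustration only; not textminer's API.
    value is None when the prefix or the suffix cannot be found.
    """
    begin = text.find(prefix, start)
    if begin == -1:
        return None, start
    begin += len(prefix)
    end = text.find(suffix, begin)
    if end == -1:
        return None, start
    value = text[begin:end]
    if convert is not None:
        value = convert(value)
    return value, end + len(suffix)


html = '<div id="value1">111</div>\n<div id="value2">222</div>'
value1, pos = extract_between(html, '<div id="value1">', '</div>', int)
# The second search starts where the first match ended.
value2, pos = extract_between(html, '<div id="value2">', '</div>', int, pos)
print(value1, value2)
```

Each entry of the yaml rule carries the same three pieces of information as one call above: a prefix, a suffix and an optional type conversion.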


Installation

pip install textminer


Basic Usage

Extract a single value from html

import textminer

html = '<html><body><div>abc</div></body></html>'
rule = '''
value:
  prefix: <div>
  suffix: </div>
'''
result = textminer.extract(html, rule)
# result == 'abc'

Extract a list from html

import textminer

html = '''
<ul>
<li>aaa</li>
<li>bbb</li>
<li>ccc</li>
</ul>
'''
rule = '''
list:
  prefix: <li>
  suffix: </li>
'''
result = textminer.extract(html, rule)
# result == ['aaa', 'bbb', 'ccc']

Extract a dict from html

import textminer

html = '''
<div id="code">001</div>
<div id="value">123</div>
'''
rule = '''
dict:
- key: code
  prefix: <div id="code">
  suffix: </div>
- key: value
  prefix: <div id="value">
  suffix: </div>
'''
result = textminer.extract(html, rule)
# result == {'code': '001', 'value': '123'}

Note that the fields in the rule should be in the order they appear in the html.
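
The sketch below illustrates why the order matters, assuming each field is searched for starting where the previous field ended (as in the hand-written version near the top of this page). The function scan_fields is hypothetical and only models that assumption; it is not textminer's implementation.

```python
def scan_fields(text, fields):
    """Extract fields in one forward pass over the text.

    Hypothetical illustration only. fields is a list of
    (key, prefix, suffix) tuples; a field that is not found maps to None.
    """
    result = {}
    pos = 0
    for key, prefix, suffix in fields:
        begin = text.find(prefix, pos)
        if begin == -1:
            result[key] = None
            continue
        begin += len(prefix)
        end = text.find(suffix, begin)
        if end == -1:
            result[key] = None
            continue
        result[key] = text[begin:end]
        pos = end + len(suffix)  # the next field is searched from here
    return result


html = '<div id="code">001</div>\n<div id="value">123</div>'
in_order = scan_fields(html, [
    ('code', '<div id="code">', '</div>'),
    ('value', '<div id="value">', '</div>'),
])
out_of_order = scan_fields(html, [
    ('value', '<div id="value">', '</div>'),
    ('code', '<div id="code">', '</div>'),
])
print(in_order)      # both fields are found
print(out_of_order)  # 'code' is None: the cursor is already past it
```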

Hierarchical extraction

The real power of textminer lies in hierarchical extraction: rules can be nested inside one another.

import textminer

html = '''
<h1>Test Page</h1>
<table>
<tr><td>001</td><td>123</td></tr>
<tr><td>002</td><td>321</td></tr>
</table>
'''
rule = '''
dict:
- key: title
  prefix: <h1>
  suffix: </h1>
- key: items
  prefix: <table>
  suffix: </table>
  list:
    prefix: <tr>
    suffix: </tr>
    dict:
    - key: id
      prefix: <td>
      suffix: </td>
    - key: value
      prefix: <td>
      suffix: </td>
      type: int
'''
result = textminer.extract(html, rule)
# result == {
#     'title': 'Test Page',
#     'items': [
#         {'id': '001', 'value': 123},
#         {'id': '002', 'value': 321}
#     ]
# }
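
For comparison, here is roughly what the same nested extraction looks like when written by hand. This is a sketch under assumptions: the helpers between and all_between are hypothetical, the sample html is made up to match the expected result, and none of this is textminer's own code.

```python
def between(text, prefix, suffix, pos=0):
    """Return (inner_text, next_pos), or (None, pos) when there is no match."""
    begin = text.find(prefix, pos)
    if begin == -1:
        return None, pos
    begin += len(prefix)
    end = text.find(suffix, begin)
    if end == -1:
        return None, pos
    return text[begin:end], end + len(suffix)


def all_between(text, prefix, suffix):
    """Return every non-overlapping prefix...suffix match, in document order."""
    items, pos = [], 0
    while True:
        inner, next_pos = between(text, prefix, suffix, pos)
        if inner is None:
            return items
        items.append(inner)
        pos = next_pos


html = ('<h1>Test Page</h1><table>'
        '<tr><td>001</td><td>123</td></tr>'
        '<tr><td>002</td><td>321</td></tr>'
        '</table>')

# Outer level: the title and the table body.
title, pos = between(html, '<h1>', '</h1>')
table, _ = between(html, '<table>', '</table>', pos)

# Inner level: one dict per table row.
items = []
for row in all_between(table, '<tr>', '</tr>'):
    row_id, cell_pos = between(row, '<td>', '</td>')
    value, _ = between(row, '<td>', '</td>', cell_pos)
    items.append({'id': row_id, 'value': int(value)})

result = {'title': title, 'items': items}
print(result)
```

Every level of nesting adds another loop and another cursor to the hand-written version, whereas in a rule it is only one more level of indentation.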

Extract from a url

Since textminer is heavily used on web pages, it provides a utility function, extract_from_url, that downloads the html and extracts from it. This saves you a few lines of code.

import textminer

rule = '''
value:
  prefix: <title>
  suffix: </title>
'''
textminer.extract_from_url('http://www.google.com/', rule)
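
The lines this saves are roughly the download and decode steps. The sketch below shows them using only the standard library. It is an illustration, not textminer's implementation: fetch_and_extract is a hypothetical name, and the charset handling is an assumption about what a reasonable downloader would do.

```python
from urllib.request import urlopen


def fetch_and_extract(url, prefix, suffix, timeout=10):
    """Download a page and return the text between prefix and suffix.

    Hypothetical stand-in for illustration only.
    Returns None when the prefix or the suffix is not found.
    """
    with urlopen(url, timeout=timeout) as response:
        # Fall back to utf-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or 'utf-8'
        html = response.read().decode(charset, errors='replace')
    begin = html.find(prefix)
    if begin == -1:
        return None
    begin += len(prefix)
    end = html.find(suffix, begin)
    return html[begin:end] if end != -1 else None


# A data: URL keeps the demo offline; an http(s) URL is handled the same way.
title = fetch_and_extract('data:text/html,<title>Demo</title>',
                          '<title>', '</title>')
print(title)
```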

Advanced Usage


import textminer

html = '<html><body><div>1<b>2</b>3</div></body></html>'
# "type" can be a list of conversions, applied in order.
rule = '''
value:
  prefix: <div>
  suffix: </div>
  type:
  - strip_html
  - float
  - eval('value / 100')
'''
result = textminer.extract(html, rule)
# result == 1.23
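
The three steps in this example behave like a small conversion pipeline. Here is a sketch of that idea in plain Python, assuming the steps run in the listed order and each receives the previous step's output. The names strip_html and apply_types below are local stand-ins written for this sketch, not textminer functions.

```python
import re


def strip_html(value):
    """Remove anything that looks like a tag (naive regex, fine for a demo)."""
    return re.sub(r'<[^>]+>', '', value)


def apply_types(value, steps):
    """Apply each conversion in order, feeding the result to the next step."""
    for step in steps:
        value = step(value)
    return value


raw = '1<b>2</b>3'  # the text between <div> and </div> in the example
result = apply_types(raw, [strip_html, float, lambda value: value / 100])
print(result)  # 1.23
```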

Regular expressions for prefix & suffix

Regular expressions are denoted by "/" before and after the string.

import textminer

html = '<html><body><div sessionId="123456789">aaa</div></body></html>'
rule = '''
value:
  prefix: /<div sessionId="\\d+">/
  suffix: </div>
'''
result = textminer.extract(html, rule)
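
One way to picture the "/" convention is the sketch below, which treats a pattern wrapped in slashes as a regular expression and anything else as a literal string. The helper find_after is hypothetical, and how textminer actually parses these patterns is not shown in this document.

```python
import re


def find_after(text, pattern, pos=0):
    """Return the index just past the first match of pattern, or -1.

    A pattern written as /.../ is treated as a regular expression;
    anything else is matched literally. Hypothetical helper.
    """
    if len(pattern) >= 2 and pattern.startswith('/') and pattern.endswith('/'):
        match = re.compile(pattern[1:-1]).search(text, pos)
        return match.end() if match else -1
    index = text.find(pattern, pos)
    return index + len(pattern) if index != -1 else -1


html = '<html><body><div sessionId="123456789">aaa</div></body></html>'
begin = find_after(html, r'/<div sessionId="\d+">/')
end = html.find('</div>', begin)
value = html[begin:end] if begin != -1 and end != -1 else None
print(value)  # aaa
```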

Using rules of other formats

Yaml is perfect for the rules, but textminer also supports json strings and raw python dicts. Pass fmt='json' for a json rule, or fmt=None for a dict.

import textminer

html = '<html><body><div>123</div></body></html>'

python_rule = {'value': {'prefix': '<body>', 'suffix': '</body>'}}
result = textminer.extract(html, python_rule, fmt=None)

json_rule = '{"value": {"prefix": "<body>", "suffix": "</body>"}}'
result = textminer.extract(html, json_rule, fmt='json')
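
The two rules above describe the same structure. A quick check with the standard json module, independent of textminer, shows that the json text parses to exactly the python dict:

```python
import json

python_rule = {'value': {'prefix': '<body>', 'suffix': '</body>'}}
json_rule = '{"value": {"prefix": "<body>", "suffix": "</body>"}}'

# Parsing the json text yields the same nested dict as the python literal.
parsed = json.loads(json_rule)
print(parsed == python_rule)  # True
```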

Python3 Support

Textminer is tested under python 2.7 and python 3.3.


Mengchen LEE: Google Plus, LinkedIn