fxml

Describing html/xml elements using function syntax (FPath).


Keywords
web
License
MIT
Install
pip install fxml==1.0.0

Documentation

Fxml

Describing html/xml elements using function syntax (FPath).

Fxml takes a different approach to describing specific regions of a given xml/html document, retrieving all the interesting entities at once.

Fxml is meant to handle all xpath scenarios while adding a level of abstraction over edge cases where xpath syntax would be tricky to use. It is possible to catch html elements using regexes or conditions.

It is possible to use either the XMLParser from the lxml package or the HTMLParser from the python standard library. However, the XMLParser is considerably faster.

You can mix function conditions with regex patterns on element names/attributes, which permits some interesting approaches for extracting content from html.

It is possible to write rules to avoid grabbing data from specific elements, so that only the interesting data is collected. That improves performance in some situations and makes it easier to group together important chunks of information.
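As a rough analogue of the exclude-rule idea above, written with python's built-in HTMLParser rather than fxml's API, the sketch below collects text inside a target element while skipping excluded subtrees. The class name and the tag/class choices are illustrative only, not part of fxml.

```python
# Sketch only: collect text from <div class="relevancy-block"> while
# skipping any <span> whose class matches 'star' or 'inactive'.
# This mimics fxml's exclude_rule concept; it is not fxml itself.
from html.parser import HTMLParser
import re

class ExcludingCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inside = 0      # element depth inside a target div
        self.skipping = 0    # element depth inside an excluded span
        self.words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.skipping:
            self.skipping += 1
        elif tag == 'span' and re.search('star|inactive', attrs.get('class', '')):
            self.skipping = 1
        elif tag == 'div' and attrs.get('class') == 'relevancy-block':
            self.inside += 1
        elif self.inside:
            self.inside += 1

    def handle_endtag(self, tag):
        if self.skipping:
            self.skipping -= 1
        elif self.inside:
            self.inside -= 1

    def handle_data(self, data):
        if self.inside and not self.skipping:
            self.words.extend(data.split())

parser = ExcludingCollector()
parser.feed('<div class="relevancy-block">home <span class="star">*</span> house</div>')
print(parser.words)  # -> ['home', 'house']
```

The excluded span's text ('*') never reaches the result; fxml expresses the same intent declaratively through exclude_rule.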

Basic example

This example shows how to avoid catching data from a given set of elements.

from fxml import Pattern as P, Regex as R, Text as T
from fxml import HtmlParser as Html
import requests

req = requests.get('http://www.thesaurus.com/browse/home?s=t')
pattern = P('body', words=T('div', ('class', 'relevancy-block'),
    exclude_rule=P('span', ('class', R('star|inactive')))))

html = Html(pattern)
dom  = html.feed(req.text)

print('Synonyms:%s' % ' '.join(dom.words[0].strip()))
print('Antonyms:%s' % ' '.join(dom.words[1].strip()))

Which would print something like:

Synonyms:central family familiar household local national native homely homey inland 
...

Antonyms:hut dormitory apartment dwelling palace resort shelter house hospital cottage farm 
condominium place trailer residence mansion cabin villa dump hangout habitation asylum cave 
...

The example below extracts quotes and authors from 'http://quotes.toscrape.com/'.

from fxml import Pattern as P, Regex as R, Text as T
from fxml import Html
import requests

req = requests.get('http://quotes.toscrape.com/')

pattern = P('div', ('class', 'quote'),
    quotes=T(R('span|small'), ('class', R('text|author'))))

html = Html(pattern)
dom  = html.feed(req.text)

for indi in dom.quotes:
    print(indi.text())

Would output:


“The world as we have created it is a process of our thinking. 
It cannot be changed without changing our thinking.”
Albert Einstein

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
J.K. Rowling

.
.
.

You describe your patterns and name them; the named attributes become lists of html entities that matched the defined patterns.

This other example better shows the approach for catching elements with FPath syntax. It extracts all pinned repositories from a GitHub profile.

from fxml import Pattern as P, Text as T
from fxml import Html
import requests

req = requests.get('https://www.github.com/iogf/')

pattern = P('li', projects=T('div', ('class', 'pinned-repo-item-content')))

html = Html(pattern)
dom  = html.feed(req.text)

for indi in dom.projects:
    print('Project:%s\n' % '\n'.join(indi.strip()))

Which would output something like:

Project:lax
A pythonic way of writting latex.
Python
163
4

Project:untwisted
A library for asynchronous programming in python.
Python
27
7

.
.
.

Broken html

Fxml handles broken html as well, e.g. when closing tags are missing. The following example shows how it understands some broken html; this one uses python's default HTMLParser.

from fxml import Pattern as P
from fxml import HtmlParser

data = """
<html>

<body>
<a> text0 <b> text1 <c> text2 </a>
<d> text3 </d>
</body>
</html>
"""

pattern = P('body', 
    a=P('a', 
        b=P('b', 
            c=P('c'))),
    d=P('d'))

html = HtmlParser(pattern)
dom = html.feed(data)

print('A:', dom.a[0].text())
print('B:', dom.b[0].text())
print('C:', dom.c[0].text())
print('D:', dom.d[0].text())

Would output:

[tau@sigma demo]$ python ex5.py 
A:  text0  text1  text2 
B:  text1  text2 
C:  text2 
D:  text3 

Writing complex patterns

The example below uses the Condition wrapper to capture all 'b' elements whose value attribute is less than 10.

from fxml import Pattern as P
from fxml import Condition as C
from fxml import Html

data = """
<html>

<body>
<a>
<b value="2"> text0 </b>
<b value="10"> text0 </b>
<b value="5"> text0 </b>
</a>

<d value="11">  text2 </d>
</body>
</html>
"""

# It is possible to use the wrapper C to determine
# whether a tag attribute should be matched; it gives
# an interesting way to write down conditions.
pattern = P('a', 
    elems=P('b', ('value', C(lambda x: int(x) < 10))))

html = Html(pattern)
dom = html.feed(data)

print('Captured elements:', dom.elems)

Would output:

Captured elements: <b value="2" ></b><b value="5" ></b>
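For comparison, the same lambda-condition idea can be sketched with python's built-in HTMLParser. This is not fxml's implementation; the class name is illustrative.

```python
# Sketch only: keep <b> tags whose 'value' attribute satisfies a
# user-supplied predicate, here int(value) < 10, mimicking fxml's
# Condition wrapper.
from html.parser import HTMLParser

class CondFilter(HTMLParser):
    def __init__(self, cond):
        super().__init__()
        self.cond = cond      # predicate applied to the attribute value
        self.matched = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'b' and 'value' in attrs and self.cond(attrs['value']):
            self.matched.append(attrs['value'])

parser = CondFilter(lambda x: int(x) < 10)
parser.feed('<b value="2"></b><b value="10"></b><b value="5"></b>')
print(parser.matched)  # -> ['2', '5']
```

Fxml's C wrapper lets you attach such a predicate directly to the attribute inside the pattern, instead of writing the parser plumbing by hand.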

Install

Note: it runs on python3 only.

pip install -r requirements.txt
pip install fxml