HTML Table Extractor

Note: This is a re-release of yuanxu-li's html-table-extractor. It exists only because I waited too long for an upstream release that fixes the incorrect dependency (with the original version 1.4.0, pipenv refuses to install a newer version of BeautifulSoup). I have kept changes to a minimum: adding this notice, fixing setup.py to make it PyPI-friendly, and changing the PyPI package name.

HTML Table Extractor is a Python library that uses Beautiful Soup to extract data from complicated and messy HTML tables.

Important links

Repository: https://github.com/yuanxu-li/html-table-extractor
Issues: https://github.com/yuanxu-li/html-table-extractor/issues

Installation

pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor

Usage

Example 1 - Simple

1	2
3	4

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

return_list() returns the parsed rows. Under Python 2 they display as follows (Python 3 omits the u prefix):

[[u'1', u'2'], [u'3', u'4']]

Example 2 - Transformer

1	2
3	4

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()

With transformer=int, each cell is converted to an integer, so return_list() returns:

[[1, 2], [3, 4]]
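
Passing int works for this table, but int raises ValueError on cells such as 'n/a' or '1.5'. A more forgiving callable can be used in its place. The sketch below assumes the transformer is simply called with each cell's text; to_number is a hypothetical helper, not part of the library.

```python
def to_number(text):
    """Hypothetical transformer: int if possible, else float, else the stripped text."""
    cleaned = text.strip()
    for convert in (int, float):
        try:
            return convert(cleaned)
        except ValueError:
            continue
    # Neither conversion worked, so keep the cell as text.
    return cleaned


cells = ["1", " 2.5 ", "n/a", ""]
print([to_number(c) for c in cells])
# [1, 2.5, 'n/a', '']
```

If the constructor accepts any callable, as the transformer=int example suggests, this would be passed as Extractor(table_doc, transformer=to_number).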

Example 3 - Pass BS4 Tag

1	2
3	4

from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()

Only the table with id 'wanted' is parsed, so return_list() returns:

[[u'1', u'2'], [u'3', u'4']]

Example 4 - Complex

1	2	3
	4
5

from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

Cells that span several rows or columns are repeated in every position they cover, so return_list() returns:

[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]
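
The expansion behind this output can be sketched without the library. In the minimal version below, expand_spans and the (text, rowspan, colspan) tuple format are illustrative assumptions, not the library's actual implementation.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid."""
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip columns already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the text into every position the cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    height = max(r for r, _ in grid) + 1
    width = max(c for _, c in grid) + 1
    # Positions that no cell covers are left as empty strings.
    return [[grid.get((r, c), "") for c in range(width)] for r in range(height)]


# The table from Example 4, written as (text, rowspan, colspan) tuples.
rows = [
    [("1", 2, 1), ("2", 1, 1), ("3", 1, 1)],
    [("4", 1, 2)],
    [("5", 1, 3)],
]
print(expand_spans(rows))
# [['1', '2', '3'], ['1', '4', '4'], ['5', '5', '5']]
```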

Example 5 - Conflicted

1	2	3
	4
5

from html_table_extractor.extractor import Extractor
table_doc = """
<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

The overlapping spans are resolved as shown in the list that return_list() returns:

[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
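
In this table, cell 4 (colspan=2) in the second row reaches into the third column, which cell 3 (rowspan=3) already occupies. The output keeps 3 there, so the earlier cell wins. The sketch below locates such overlaps. find_conflicts and its tuple input are hypothetical, and the placement rule (start at the first free column, earlier cell keeps the position) is inferred from the output above rather than taken from the library's source.

```python
def find_conflicts(rows):
    """Return (row, col, kept, dropped) for every position claimed twice."""
    owner = {}      # (row, col) -> text of the first cell to claim it
    conflicts = []
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Start at the first column not yet claimed in this row.
            while (r, c) in owner:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    pos = (r + dr, c + dc)
                    if pos in owner:
                        # An earlier cell already holds this position.
                        conflicts.append((pos[0], pos[1], owner[pos], text))
                    else:
                        owner[pos] = text
            c += colspan
    return conflicts


# The table from Example 5, written as (text, rowspan, colspan) tuples.
rows = [
    [("1", 2, 1), ("2", 1, 1), ("3", 3, 1)],
    [("4", 1, 2)],
    [("5", 1, 2)],
]
print(find_conflicts(rows))
# [(1, 2, '3', '4')]
```

The single reported conflict is at row 1, column 2 (zero-based), matching the position where the output shows 3 rather than 4.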

Example 6 - Write to file

1	2
3	4

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')

This creates a new CSV file named output.csv in the given directory, containing:

1,2
3,4
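
When a different delimiter or destination is needed, the list from return_list() can be written with Python's standard csv module instead. This is a sketch using only the standard library; rows_to_csv is a hypothetical helper, and the rows below stand in for the parsed output.

```python
import csv
import io


def rows_to_csv(rows, delimiter=","):
    """Serialize a list of rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["1", "2"], ["3", "4"]]  # stand-in for extractor.return_list()
print(rows_to_csv(rows), end="")
# 1,2
# 3,4
```

To save the result, write the returned text to a file opened with open(path, "w", newline="").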

Team

@yuanxu-li

Errors/Bugs

If something is not working correctly, or if you have suggestions for improvement, please report it on the Issues page listed under Important links.

Copyright

Third-party copyright in this distribution is noted where applicable.

isaacto-html-table-extractor
Release 1.4.0.1
