Purify HTML by filtering tags and classes
pip install --upgrade purehtml
from purehtml import purify_html_files
from pathlib import Path
html_root = Path(__file__).parent / "samples"
html_paths = list(html_root.glob("*.html"))
html_path_and_purified_content_list = purify_html_files(html_paths)
for item in html_path_and_purified_content_list:
    html_path = item["html_path"]
    purified_content = item["purified_content"]
    print(html_path)
    print(purified_content)
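The loop above suggests that each result is a dict with "html_path" and "purified_content" keys. As a minimal sketch under that assumption, the hypothetical helper below (summarize_results is not part of purehtml) maps each path to the length of its purified content, which may help spot pages that were purified down to almost nothing:

```python
from pathlib import Path


def summarize_results(results: list[dict]) -> dict[str, int]:
    """Map each HTML path to the character count of its purified content.

    Hypothetical helper, not part of purehtml. It assumes every item is a
    dict with "html_path" and "purified_content" keys, as in the loop above.
    """
    summary = {}
    for item in results:
        path = str(item["html_path"])
        summary[path] = len(item["purified_content"])
    return summary


# Stand-in data shaped like the items iterated over in the loop above
sample_results = [
    {"html_path": Path("samples/a.html"), "purified_content": "<p>Hello</p>"},
    {"html_path": Path("samples/b.html"), "purified_content": ""},
]
print(summarize_results(sample_results))
```

In real use, the list returned by purify_html_files(html_paths) would replace sample_results.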
purify_html_str(html_str: str)
purify_html_file(html_path: Union[Path, str])
purify_html_files(html_paths: list[Union[Path, str]])
Here are the params of purify_html_files():

- verbose: bool (default False)
  - True: Output to console
  - False: No output to console
- output_format: str (default "html")
  - "html": Output HTML format (.html.pure)
  - "markdown": Output markdown format (.md)
- keep_href: bool (default False)
  - True: Keep href in <a> tags, and keep src in <img> tags
    - This is useful for detailed information retrieval
  - False: Do not keep href and src
- keep_format_tags: bool (default True)
  - True: Keep format tags
    - such as: <sub>, <sup>, <b>, <strong>, <em>, <a>, <i>, <u>, <mark>, <del>, <cite>, <blockquote>
    - This is useful for rendering HTML
  - False: Remove format tags
- keep_group_tags: bool (default True)
  - True: Keep group tags: <div>, <section>, <details>
    - This is useful for hierarchical processing, such as grouping texts in RAG
  - False: Remove group tags
- math_style: str (default "latex")
  - "latex": Convert math tag to latex string
    - This is useful for LLM and RAG
  - "latex_in_tag": Wrap above latex string with tag
    - <div> for block, <span> for inline
    - This is useful for hierarchical processing
  - "html": Keep math formulas in mathml format
    - This is useful for rendering HTML
Hierarchical:
results = purify_html_files(
    html_paths,
    verbose=False,
    output_format="html",
    keep_href=False,
    keep_format_tags=False,
    keep_group_tags=True,
    math_style="latex_in_tag",
)
Flat:
results = purify_html_files(
    html_paths,
    verbose=False,
    output_format="html",
    keep_href=False,
    keep_format_tags=False,
    keep_group_tags=False,  # <--
    math_style="latex",  # <--
)
With links:
results = purify_html_files(
    html_paths,
    verbose=False,
    output_format="html",
    keep_href=True,  # <--
    keep_format_tags=True,  # <--
    keep_group_tags=True,  # <--
    math_style="html",  # <--
)
Without links: (This is the default config in dev)
results = purify_html_files(
    html_paths,
    verbose=False,
    output_format="html",
    keep_href=False,  # <--
    keep_format_tags=True,
    keep_group_tags=True,
    math_style="html",
)
Without links and without format tags:
results = purify_html_files(
    html_paths,
    verbose=False,
    output_format="html",
    keep_href=False,
    keep_format_tags=False,  # <--
    keep_group_tags=True,
    math_style="html",
)
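The five configurations above differ only in four keyword arguments, so they can be kept in one table and selected by name. This is a sketch rather than anything purehtml provides: the preset names, PRESETS, build_options, and purify_with_preset are all hypothetical, and only the keyword values are taken from the calls above.

```python
# Keyword values copied from the five example calls above.
# The preset names themselves are hypothetical labels.
PRESETS = {
    "hierarchical": dict(keep_href=False, keep_format_tags=False,
                         keep_group_tags=True, math_style="latex_in_tag"),
    "flat": dict(keep_href=False, keep_format_tags=False,
                 keep_group_tags=False, math_style="latex"),
    "with_links": dict(keep_href=True, keep_format_tags=True,
                       keep_group_tags=True, math_style="html"),
    "without_links": dict(keep_href=False, keep_format_tags=True,
                          keep_group_tags=True, math_style="html"),
    "no_format_tags": dict(keep_href=False, keep_format_tags=False,
                           keep_group_tags=True, math_style="html"),
}


def build_options(preset: str, **overrides) -> dict:
    """Merge the shared settings, a named preset, and any overrides."""
    base = {"verbose": False, "output_format": "html"}
    return {**base, **PRESETS[preset], **overrides}


def purify_with_preset(html_paths, preset: str, **overrides):
    """Call purify_html_files() with a named preset (hypothetical wrapper)."""
    # Imported lazily so the presets can be inspected without purehtml installed
    from purehtml import purify_html_files

    return purify_html_files(html_paths, **build_options(preset, **overrides))


print(build_options("flat"))
```

A call such as purify_with_preset(html_paths, "hierarchical", output_format="markdown") would then reuse a preset while changing a single setting.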