frenchtext

NLP library to process French text.

In this early pre-release version, the library provides:

datasets to train business-oriented French text models
a character normalization pipeline tailored for French text

Install

pip install frenchtext

Dependencies

Licence

Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0

How to use

The detailed documentation for each module is available through the menu on the left side of this page.

You will find below an overview of the library.

French datasets

Data sources

The text content of the main French websites in the domain of finance and business (plus Wikipedia) was extracted in September 2019 using nlptextdoc.

This extraction was done as "politely" as possible:

extract only freely and publicly available content
respect the robots.txt directives of each website (pages forbidden for indexing, maximum extraction rate)
detect when websites use tools to prevent indexing (like Datadome) and abort the crawl
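
The robots.txt rule listed above can be illustrated with Python's standard library. This is only a sketch of the idea and not the code used by nlptextdoc: the user agent name, the robots.txt content and the URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string (no network access needed)
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

USER_AGENT = "example-crawler"  # hypothetical crawler name


def may_crawl(url):
    """Return True if robots.txt allows this user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


allowed = may_crawl("https://www.example.com/articles/1.html")
forbidden = may_crawl("https://www.example.com/private/account.html")
delay_seconds = parser.crawl_delay(USER_AGENT)  # minimum pause between two requests
```

A polite crawler would skip every URL for which may_crawl returns False and wait at least delay_seconds between two requests to the same site.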

IMPORTANT: The original authors of the websites own the copyright on all text blocks in this dataset.

To be able to link each text block to its original author, we track the origin URL of each text block throughout the whole process.

YOU CAN'T REUSE THE TEXT BLOCKS FOR ANY PURPOSE EXCEPT TRAINING A NATURAL LANGUAGE PROCESSING MODEL.

See the new European copyright rules: "European Parliament approves new copyright rules for the internet".

"The directive aims to make it easier for copyrighted material to be used freely through text and data mining, thereby removing a significant competitive disadvantage that European researchers currently face."

=> 131 websites and 2 564 755 HTML pages

Data preparation

The text blocks were then:

deduplicated to keep only distinct text blocks for each website (forgetting part of the original document structure),
tagged (but not filtered) by language (using https://fasttext.cc/docs/en/language-identification.html),
grouped in categories according to the main theme of the original website,
split in Pandas dataframes of size < 2 GB.

=> 10 categories: 'Assurance', 'Banque', 'Bourse', 'Comparateur', 'Crédit', 'Forum', 'Institution', 'Presse', 'SiteInfo', 'Wikipedia'

In each dataframe, the text blocks were additionally SHUFFLED IN A RANDOM ORDER to make it very difficult to reconstruct the original articles (a safety measure to help protect the authors' copyright).
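
As a rough illustration of the deduplication and shuffling steps, here is a minimal pandas sketch on a toy dataframe. It is not the actual preparation code: the sample rows and the fixed random seed are assumptions made for the example.

```python
import pandas as pd

# Toy text blocks; the column names mirror the dataset files shown later on this page
blocks = pd.DataFrame({
    "Website": [1, 1, 1, 2, 2],
    "Text": ["Mon devis", "Mon devis", "Nos agences", "Mon devis", "Contact"],
})

# Keep only distinct text blocks within each website
distinct = blocks.drop_duplicates(subset=["Website", "Text"])

# Shuffle the rows so that the original document order is lost
# (random_state is fixed here only to make the example reproducible)
shuffled = distinct.sample(frac=1, random_state=0).reset_index(drop=True)
```

Note that the same text ("Mon devis") is kept once per website, since deduplication is done website by website.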

The results of this second step can be downloaded to the config.datasets directory, as dataframes serialized in the Feather format, in files named according to the 'DatasetFile' column of the datasets table.

=> 19 dataset files: 'assurance', 'banque', 'bourse', 'comparateur', 'crédit', 'forum', 'institution', 'presse-1', 'presse-2', 'presse-3', 'presse-4', 'presse-5', 'presse-6', 'siteinfo', 'wikipedia-1', 'wikipedia-2', 'wikipedia-3', 'wikipedia-4', 'wikipedia-5'

Dataset size

The number of words in each text block was computed using the default French tokenizer from spaCy v2.1.

This business-oriented dataset contains about 2 billion French words.

Here is a summary of the number of words contributed by each category, in millions:

Assurance : 12
Banque : 20
Bourse : 26
Comparateur : 20
Crédit : 1
Forum : 152
Institution : 4
Presse : 963
SiteInfo : 78
Wikipedia : 727
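
The figures above can be checked with a few lines of Python. The numbers are copied from the summary; the variable names are only illustrative.

```python
# Words per category, in millions, copied from the summary above
words_millions = {
    "Assurance": 12, "Banque": 20, "Bourse": 26, "Comparateur": 20,
    "Crédit": 1, "Forum": 152, "Institution": 4, "Presse": 963,
    "SiteInfo": 78, "Wikipedia": 727,
}

total_millions = sum(words_millions.values())  # 2003, i.e. about 2 billion words

# Share of the dataset contributed by the two largest categories
press_and_wikipedia = words_millions["Presse"] + words_millions["Wikipedia"]
press_and_wikipedia_share = press_and_wikipedia / total_millions
```

Press and Wikipedia together account for roughly 84 % of the words, so the purely business categories are a small part of the total.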

Dataset files

from frenchtext.core import *
from frenchtext.datasets import *

List available dataset files :

datasetfiles = list_dataset_files()
datasetfiles

['assurance',
 'banque',
 'bourse',
 'comparateur',
 'crédit',
 'forum',
 'institution',
 'presse-1',
 'presse-2',
 'presse-3',
 'presse-4',
 'presse-5',
 'presse-6',
 'siteinfo',
 'wikipedia-1',
 'wikipedia-2',
 'wikipedia-3',
 'wikipedia-4',
 'wikipedia-5']

Source websites and number of words in each dataset file :

datasetsdf = list_datasets()
datasetsdf[["DatasetFile","Url","Pages","Words"]].iloc[80:100]

	DatasetFile	Url	Pages	Words
80	comparateur	https://www.panorabanques.com/	4341	2584038
81	crédit	https://www.cetelem.fr/	274	157191
82	crédit	https://www.cofidis.fr/	347	243904
83	crédit	https://www.cofinoga.fr/	413	86796
84	crédit	https://www.sofinco.fr/	916	597221
85	crédit	https://www.younited-credit.com/	1341	665115
86	forum	https://droit-finances.commentcamarche.com/	96450	56120562
87	forum	http://forum.doctissimo.fr/famille/argent-budg...	26981	61020453
88	forum	http://forum.doctissimo.fr/viepratique/finance...	5745	4962230
89	forum	http://forum.doctissimo.fr/viepratique/Impots/...	2338	1422143
90	forum	https://forum.lesarnaques.com/assurance-automo...	3530	3085101
91	forum	https://forum.lesarnaques.com/banque/	6206	5766116
92	forum	https://www.60millions-mag.com/forum/	3692	2222882
93	forum	https://www.boursorama.com/patrimoine/forum/	13020	10497065
94	forum	https://www.cbanque.com/forums/	12098	7702002
95	institution	https://acpr.banque-france.fr/	470	51397
96	institution	https://www.banque-france.fr/	728	75101
97	institution	https://www.ffa-assurance.fr/	301	146499
98	institution	https://www.economie.gouv.fr/	2720	159663
99	institution	https://www.impots.gouv.fr/portail/	1631	653735

Download dataset files

download_dataset_file("assurance")

Downloading dataset file : assurance (17 MB)

download_all_datasets()

Downloading dataset file : assurance (17 MB)
Downloading dataset file : banque (28 MB)
Downloading dataset file : bourse (38 MB)
Downloading dataset file : comparateur (28 MB)
Downloading dataset file : crédit (2 MB)
Downloading dataset file : forum (220 MB)
Downloading dataset file : institution (5 MB)
Downloading dataset file : presse-1 (218 MB)
Downloading dataset file : presse-2 (196 MB)
Downloading dataset file : presse-3 (190 MB)
Downloading dataset file : presse-4 (234 MB)
Downloading dataset file : presse-5 (269 MB)
Downloading dataset file : presse-6 (334 MB)
Downloading dataset file : siteinfo (116 MB)
Downloading dataset file : wikipedia-1 (131 MB)
Downloading dataset file : wikipedia-2 (182 MB)
Downloading dataset file : wikipedia-3 (263 MB)
Downloading dataset file : wikipedia-4 (269 MB)
Downloading dataset file : wikipedia-5 (267 MB)

You can change the local directory where the dataset files are downloaded :

config.datasets

PosixPath('/home/laurent/.frenchtext/datasets')

config["datasets_path"] = "/tmp/datasets"
config.datasets.mkdir(parents=True, exist_ok=True)

config.datasets

PosixPath('/tmp/datasets')

Read dataset files

datasetdf = read_dataset_file("assurance")
datasetdf

Loaded dataframe for dataset assurance : 563613 text blocks

	Website	DocId	DocEltType	DocEltCmd	NestingLevel	Text	Lang	Words	Unique
0	11	22332	ListItem	Text	2	5 tournages catastrophe pour un assureur	fr	6	True
1	74	710	Section	Start	1	Tout connaitre sur la nouvelle formation post-...	fr	7	True
2	11	12082	TextBlock	Text	1	Votre Agent Mandataire AXA - Civry Marie Claud...	?	18	True
3	87	461	TextBlock	Text	4	60 ans et 4 mois	fr	5	True
4	7	200	TextBlock	Text	1	Mon devis sur mesure	fr	4	True
...	...	...	...	...	...	...	...	...	...
563608	138	255	Section	Start	2	Les autres pouvoirs de police	fr	5	True
563609	11	19483	TextBlock	Text	1	Yves Nicolau assurance Laon	?	4	True
563610	106	1644	ListItem	Text	3	Evènements sportifs	fr	2	True
563611	58	4155	Section	Start	1	Agence Groupama Chalon	?	3	True
563612	10	150	TextBlock	Text	2	Nos agences d'assurance Aviva à OYONNAX sont h...	fr	26	True

563613 rows × 9 columns

Access text blocks in dataset files

Filter and iterate over the rows of a dataset file :

rowsiterator = get_rows_from_datasetdf(datasetdf, minwords=None, maxwords=5, lang="?")
show_first_rows(rowsiterator,10)

12 - COORDONNEES
41 - 01 30 41 67 33
49 - Dmitriy G.
57 - Les atouts du Multisupport CONFIANCE
74 - 01XXL meribel hiver
76 - Garantie en cas de vol
87 - Par AXA, le 01/08/2016
96 - mgr@enderby.eu
127 - 18 place De Strasbourg
131 - Saint Gaudens

Filter and iterate over the text blocks of a full dataset (across multiple files) :

textiterator = get_textblocks_from_dataset("Assurance", minwords=None, maxwords=10, lang="fr")
show_first_textblocks(textiterator,skip=2000,count=10)

Loaded dataframe for dataset assurance : 563613 text blocks
2001 - Rééquipement à neuf à vie
2002 - Définition Conducteur secondaire- Lexique
2003 - Comment éviter les fraudes
2004 - Comment demander un remboursement santé - GENERALI
2005 - Simulateur pour connaître les obligations de votre accord de branche
2006 - Complémentaire Epargne retraite des indépendants et TNS - Malakoff Médéric
2007 - Experts-Comptables, découvrez la mission épargne salariale
2008 - Vous n’êtes pas encore client :
2009 - Actualités (Page 6) | ameli.fr | Pharmacien
2010 - Dépression : quelle prise en charge ? - Matmut

Access a specific row :

get_text_from_rowindex(datasetdf,100)

'Les inondations de plaine : débordement de cours d’eau avec une durée d’immersion longue (prévisibles plusieurs jours ou heures à l’avance).'

Find text blocks with a specific char or substring :

find_textblocks_with_chars(datasetdf,"rétroviseur",count=20,ctxsize=15)

350594     ore dans notre rétroviseur gauche lorsque 
149029     de glace ? Les rétroviseurs ainsi que les 
51349      ace. Quant aux rétroviseurs, ils le sont d
310354     vant, arrière, rétroviseurs et vitres laté
489866    \naussi dans le rétroviseur pour ne pas se 
364550     ôté ou sous le rétroviseur intérieur de vo
560539     tionnement des rétroviseurs.              
560700     é (pare-brise, rétroviseurs…),            
223621     riorations des rétroviseurs et des phares.
543903     es miroirs des rétroviseurs lorsqu’ils peu
502075      logo dans son rétroviseur et par un signa
53237      vous cassez le rétroviseur d’une voiture. 
310456      éraflures, un rétroviseur abîmé, ou un au
375158     ant, moteur de rétroviseurs…              
539914     nt et arrière, rétroviseurs intérieurs et 
171367     t utilisez vos rétroviseurs               
485058      ainsi que les rétroviseurs ne sont pas ga
277390     ant, moteur de rétroviseurs...            
20222      sont offerts : rétroviseurs électriques, c
317634     res, y compris rétroviseurs et feux       
Name: Text, dtype: object

find_textblocks_with_chars(datasetdf,64257,count=10,wrap=True)

175413    x besoins de diversi[ﬁ]cation des placements
337398    e 30 villes ont béné[ﬁ]cié de ces animations
265114    nt règlementaire et [ﬁ]nancier, nous accompa
74267          La Fondation a [ﬁ]nancé depuis 2009, l’
424584    tion de l’équilibre [ﬁ]nancier des régimes d
219195    d, Jérôme Powell con[ﬁ]rmera que, dans l’att
489511    s besoins de diversi[ﬁ]cation de la clientèl
517563    si en présence d’un [ﬁ]nancement par crédit,
479694    nt règlementaire et [ﬁ]nancier, La Mondiale 
252202    n de disponibilités [ﬁ]nancières mais aussi,
Name: Text, dtype: object
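
The integer 64257 used above is a Unicode code point. The sketch below relies only on Python's standard library to show which character it designates and how a compatibility normalization would expand it. It does not use frenchtext, and the sample sentence is taken from the output above.

```python
import unicodedata

code = 64257
char = chr(code)                      # the single-character ligature "ﬁ"
name = unicodedata.name(char)         # 'LATIN SMALL LIGATURE FI'
hex_code = f"U+{code:04X}"            # 'U+FB01', the usual code point notation

text = "besoins de diversi" + char + "cation des placements"
position = text.find(char)            # index of the ligature in the text

# NFKC compatibility normalization spells the ligature out as two letters
expanded = unicodedata.normalize("NFKC", text)
```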

Track the source URL for each text block

Optionally, download and read the urls file to track the origin of each text block:

urlsdf = read_urls_file()
urlsdf.head()

Loaded datasets urls : 2668787 urls

	Website	DocId	DocUrl	Words	fr	en	?	%fr	%en	%?
0	4	1	https://www.afer.fr/	573.0	524.0	3.0	46.0	0.914485	0.005236	0.080279
1	4	2	https://www.afer.fr/afer/adhesion/	74.0	74.0	0.0	0.0	1.000000	0.000000	0.000000
2	4	3	https://www.afer.fr/afer/adhesion/adherent-ass...	475.0	457.0	5.0	13.0	0.962105	0.010526	0.027368
3	4	4	https://www.afer.fr/afer/adhesion/adherer-assu...	519.0	519.0	0.0	0.0	1.000000	0.000000	0.000000
4	4	5	https://www.afer.fr/afer/adhesion/parrainage-a...	355.0	345.0	0.0	10.0	0.971831	0.000000	0.028169

get_text_from_rowindex(datasetdf,100)

'Les inondations de plaine : débordement de cours d’eau avec une durée d’immersion longue (prévisibles plusieurs jours ou heures à l’avance).'

get_url_from_rowindex(datasetdf, 100)

'https://www.maif.fr/conseils-prevention/risques-majeurs/inondation.html'
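
Both dataframes share the Website and DocId columns, so one plausible way to attach a URL to every text block is a pandas merge on these two columns. The sketch below shows this idea on toy data; it is an assumption about how the lookup could be done, not a description of get_url_from_rowindex. The texts and the last URL are invented.

```python
import pandas as pd

# Toy stand-ins for a dataset dataframe and for the urls dataframe
textblocks = pd.DataFrame({
    "Website": [4, 4, 7],
    "DocId": [1, 2, 200],
    "Text": ["Bienvenue", "Adhérer", "Mon devis sur mesure"],
})
urls = pd.DataFrame({
    "Website": [4, 4, 7],
    "DocId": [1, 2, 200],
    "DocUrl": [
        "https://www.afer.fr/",
        "https://www.afer.fr/afer/adhesion/",
        "https://www.example.com/devis",  # invented URL
    ],
})

# A left join keeps every text block, even if its URL is missing
with_urls = textblocks.merge(urls, on=["Website", "DocId"], how="left")


def url_for_row(df, rowindex):
    """Return the origin URL of the text block stored at the given row index."""
    return df.loc[rowindex, "DocUrl"]


second_url = url_for_row(with_urls, 1)
```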

Characters normalization pipeline

Motivation

French datasets often contain several thousand distinct Unicode characters.

Characters stats in Wikipedia dataset :

35.6 billion chars
13 502 distinct Unicode chars

Characters stats in Business dataset :

27.5 billion chars
3 763 distinct Unicode chars

We need to reduce the number of distinct characters fed to our natural language processing applications, for three reasons :

chars considered by the user as visually equivalent will often produce a different application behavior : this is a huge problem for the user experience
with so many chars, the designer of the NLP application will not be able to reason about all possible combinations : this could harm the explainability of the system
this huge number of distinct characters adds a significant amount of complexity that the NLP models will have to deal with

Characters stats in Wikipedia dataset :

Only 1316 chars more frequent than 1 in 100 million
99.9987 % of Wikipedia chars would be preserved if we only kept the frequent chars

Characters stats in Business dataset :

Only 531 chars more frequent than 1 in 100 million
99.9996 % of Business chars would be preserved if we only kept the frequent chars
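
Statistics of this kind can be computed with a simple character counter. The sketch below is a minimal illustration on a toy string, not the code that produced the numbers above: the function name and the 5 % threshold are assumptions chosen to keep the example small.

```python
from collections import Counter


def char_stats(text, min_frequency):
    """Return (distinct chars, frequent chars, share of text kept with frequent chars only)."""
    counts = Counter(text)
    total = sum(counts.values())
    frequent = {c for c, n in counts.items() if n / total > min_frequency}
    kept = sum(n for c, n in counts.items() if c in frequent)
    return len(counts), len(frequent), kept / total


# 100 chars: 60 'a', 39 'b' and a single rare 'œ'
sample = "a" * 60 + "b" * 39 + "\u0153"
distinct, frequent, preserved = char_stats(sample, min_frequency=0.05)
```

On this toy sample, dropping the rare character keeps 99 % of the text while reducing the alphabet from 3 to 2 chars. The real datasets show the same effect at a much larger scale.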

We can be smarter than that and replace rare chars with equivalent (or mostly equivalent) more frequent chars to preserve a maximum of information.

Target characters set

After a detailed study of all the frequent chars, the goal is to design a normalization pipeline which can retain as much information as possible while greatly reducing the number of distinct chars.

We saw before that it is possible to preserve 99.9996% of the original chars while keeping only about 500 distinct chars (531 in the Business dataset). By being clever and replacing equivalent chars, we can divide this number by 2 and still retain the same amount of information.

It may then be useful to limit the number of distinct characters after normalization to 255 distinct characters :

if needed, french text chars can then be encoded with a single byte
the list of supported chars can be memorized by NLP application developers and users
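
The single-byte idea can be sketched as follows. The miniature charset is hypothetical: its first five entries follow the order of the charset table shown further down this page, and the remaining ones are invented. The encode and decode helpers are illustrative and are not part of frenchtext.

```python
# Hypothetical miniature charset: the position in the list is the one-byte code
charset = ["\0", " ", "\n", "\t", ",", "'", "a", "e", "l", "u"]
assert len(charset) <= 256  # every code must fit in a single byte

char_to_code = {c: i for i, c in enumerate(charset)}


def encode(text):
    """Encode normalized text as one byte per char (raises KeyError on unsupported chars)."""
    return bytes(char_to_code[c] for c in text)


def decode(data):
    """Decode bytes produced by encode back to text."""
    return "".join(charset[b] for b in data)


encoded = encode("l'eau")
decoded = decode(encoded)
```

Code 0 is left unused by regular text here, which matches the idea of reserving it as an end-of-string marker.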

from frenchtext.core import *
from frenchtext.chars import *

255 supported characters after normalization (plus code 0, reserved for end of string):

import pandas as pd
dfcharsnorm = pd.read_csv(chardatadir / "charset-fr.csv", sep=";")
dfcharsnorm

	FrCode	Category	SubCategory	Code	Char	CharName	CountBusiness
0	0	separator	control	0	NaN	Reserved - End of string	0
1	1	separator	space	32		Space	88494564
2	2	separator	space	10	\n	Char 10	9588147
3	3	separator	space	9	\t	Char 9	1522053
4	4	separator	punctuation	44	,	Comma	286106887
...	...	...	...	...	...	...	...
251	251	emoticon	object	9792	♀	Female Sign	515
252	252	emoticon	object	127881	🎉	Party Popper	356
253	253	emoticon	object	9997	✍	Writing Hand	157
254	254	emoticon	object	9993	✉	Envelope	55
255	255	emoticon	object	10013	✝	Latin Cross	22

256 rows × 7 columns

The table below shows the number of chars in each category (after normalization) per 100 million characters :

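dfblocks = dfcharsnorm.groupby(by=["Category","SubCategory"]).agg({"Char":["count","sum"],"CountBusiness":"sum"})
# Scale to occurrences per 100 million chars (27577304956 appears to be the total char count of the Business dataset)
dfblocks["CountBusiness"] = (dfblocks["CountBusiness"] / 27577304956 * 100000000).astype(int)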
dfblocks

		Char		CountBusiness
		count	sum	sum
Category	SubCategory
emoticon	hand	12	💪👉👍👏🙏🙌👇👊👎👌✌✊	42
	head	28	🙂😉😀😂😁😊🙁😅😍😃😡🤣😄🤔😎😭👹😱😜😋🤩🙄😆😛🤪😢😇🤦	233
	object	16	⚠🔴🔥🏆⚽💡🚨💥⚡♫♂♀🎉✍✉✝	60
letter	digit	10	0123549876	3271115
	encoding	3	Ã�	249
	greek	2	λπ	2
	latin-fr	84	abcdefghijklmnopqrstuvwxyzàâäçèéêëîïôöùûüÿABCD...	91437146
	latin-other	25	áãåćčėğıíìńñóòõøšşßúÁÅŠÚŽ	712
	other	5	_&@\#	40814
separator	control	0	0	0
	punctuation	23	,'.-:/")(?!»«\|…;[]}{•¿¡	4684722
	space	3	\n\t	361183
symbol	currency	6	€$¤£¥¢	21099
	math	14	=>+<^~×≤÷≥±≠∞√	50056
	shape	15	*✓⇒♥¦→★¯↓❌❐†↑←↔	7954
	sign	3	©®™	1754
	unit	6	%°§µØ‰	102213

Normalization pipeline overview

The normalization pipeline applies the following 14 steps, which are explained and illustrated in the sections below.

Fix encoding errors
- fix windows1252 text read as iso8859-1
- fix utf8 text read as windows1252
- fix windows1252 text read as utf8
- merge Unicode combining chars
- ignore control chars
Remove display attributes
- replace latin letter symbols
- replace latin letter ligatures
- replace latin number symbols
Normalize visually equivalent chars
- replace equivalent chars
- replace cyrillic and greek chars looking like latin letters
Encode infrequent chars while losing a little bit of information
- replace infrequent latin letters with diacritics
- replace infrequent chars from other scripts
- replace infrequent symbols
- ignore remaining chars with no glyph
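
Some of the first steps listed above can be approximated with Python's standard codecs and the unicodedata module. The sketch below illustrates the underlying idea only: the helper functions are hypothetical, they are not the library's implementation, and they work on isolated fragments, whereas a real pipeline has to detect such fragments inside arbitrary text.

```python
import unicodedata


def fix_cp1252_read_as_latin1(text):
    """Re-decode chars 0x80-0x9F: control chars in iso8859-1, printable chars in windows1252."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few bytes (0x81 for example) are undefined in windows1252
        out.append(ch)
    return "".join(out)


def fix_utf8_read_as_cp1252(fragment):
    """Recover a char whose UTF-8 bytes were wrongly decoded as windows1252."""
    return fragment.encode("cp1252").decode("utf-8")


step1 = fix_cp1252_read_as_latin1("l`\x9cuvre \x93belle\x94 \x85")
step2 = fix_utf8_read_as_cp1252("\u00e2\u201a\u00ac")        # mojibake for the euro sign
step4 = unicodedata.normalize("NFC", "e\u0301nie\u0300me")   # merge combining accents
```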

The statistics below count the number of chars normalized per 1 million chars in 4 distinct parts of the French datasets: business websites, forums, news, Wikipedia.

The first line of the table below shows that:

in 1 million chars extracted from forum pages (raw user input), about 41.8 chars are encoding errors (windows1252 read as iso8859-1)
in 1 million chars extracted from Wikipedia (curated content), only 0.006 chars are encoding errors

These numbers show that character normalization matters much more in real-world applications than in academic papers based on clean Wikipedia text.

normstats = pd.read_csv(chardatadir / "stats" / "normalization.total.stats.csv")
normstats[["Transform","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]]

	Transform	FreqBusiness	FreqForum	FreqPresse	FreqWikipedia
0	Fix encoding errors : windows1252 read as iso8...	0.510560	41.818746	0.813485	0.006025
1	Fix encoding errors : utf8 read as windows1252	0.126815	0.058024	0.072456	0.001037
2	Fix encoding errors : windows1252 read as utf8	0.000000	0.000000	0.019315	0.000000
3	Merge Unicode combining chars	2.811983	0.432638	0.568146	0.000140
4	Ignore control chars	6.450737	349.052995	6.454367	4.118586
5	Replace latin letter symbols	0.019360	0.039701	0.297372	0.150550
6	Replace latin letter ligatures	6.603815	6.541480	10.097290	17.204422
7	Replace latin number symbols	2.528338	4.162482	2.560933	0.429792
8	Normalize equivalent chars	814.327384	1248.410777	684.333730	242.391239
9	Replace cyrillic and greek chars looking like ...	0.062432	0.760424	0.491996	7.479907
10	Replace infrequent chars : latin letters with ...	0.063782	0.078384	0.099106	9.124948
11	Replace infrequent chars : other scripts	0.085694	0.468776	1.192548	16.612142
12	Replace infrequent chars : symbols	0.139271	0.159821	0.399064	0.073566
13	Replace infrequent chars : chars to ignore	0.018910	0.044282	0.021320	0.016423

Chars that most frequently replace equivalent characters:

replacestats = pd.read_csv(chardatadir / "stats" / "normalization.layer8.stats.csv")
replacestats[["Char","CharName","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]].head(20)

	Char	CharName	FreqBusiness	FreqForum	FreqPresse	FreqWikipedia
0	'	Apostrophe	486.034805	160.264219	376.104982	134.658673
1		Space	310.411117	1082.845985	288.635983	87.877649
2	-	Hyphen-Minus	14.431203	2.903761	12.828203	16.223154
3	«	Left-Pointing Double Angle Quotation Mark	1.429478	0.680513	3.002426	0.559632
4	»	Right-Pointing Double Angle Quotation Mark	1.323524	0.533926	2.461880	0.544134
5	\|	Vertical Line	0.003452	0.001018	0.005488	0.875894
6	•	Bullet	0.204104	0.243295	0.189664	0.543237
7	.	Full Stop	0.059280	0.078893	0.856230	0.069278
8	"	Quotation Mark	0.085093	0.023413	0.011504	0.292385
9	:	Colon	0.000150	0.000509	0.000053	0.169047
10	°	Degree Sign	0.148726	0.181199	0.014618	0.078302
11	é	Latin Small Letter E With Acute	0.001651	0.006108	0.003166	0.101114
12	←	Leftwards Arrow	0.000000	0.000000	0.000158	0.047194
13	=	Equals Sign	0.004802	0.029012	0.000686	0.041589
14	→	Rightwards Arrow	0.026113	0.002545	0.034302	0.015862
15	d	Latin Small Letter D	0.000000	0.024940	0.000000	0.036405
16	<	Less-Than Sign	0.004202	0.142007	0.001267	0.024073
17	,	Comma	0.006453	0.101288	0.004538	0.022756
18	↓	Downwards Arrow	0.007504	0.001527	0.011188	0.021888
19	★	Black Star	0.001351	0.013743	0.022006	0.011686

For example, here is the list of all Unicode chars which will be projected to a regular apostrophe:

replacechars = pd.read_csv(chardatadir / "normalizedchars.csv", sep=';')
replacechars[replacechars["NormChar"]=="'"][["Code","Char","CharName"]]

	Code	Char	CharName
23	96	`	Grave Accent
24	180	´	Acute Accent
25	697	ʹ	Modifier Letter Prime
26	699	ʻ	Modifier Letter Turned Comma
27	700	ʼ	Modifier Letter Apostrophe
28	702	ʾ	Modifier Letter Right Half Ring
29	703	ʿ	Modifier Letter Left Half Ring
30	712	ˈ	Modifier Letter Vertical Line
31	714	ˊ	Modifier Letter Acute Accent
32	715	ˋ	Modifier Letter Grave Accent
33	729	˙	Dot Above
34	8216	‘	Left Single Quotation Mark
35	8217	’	Right Single Quotation Mark
36	8219	‛	Single High-Reversed-9 Quotation Mark
37	8223	‟	Double High-Reversed-9 Quotation Mark
38	8242	′	Prime
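
A projection like this one can be expressed with Python's built-in str.translate. The code points are copied from the table above. The helper function is only an illustrative sketch and is not how the library applies the mapping.

```python
# Code points listed in the table above, all projected to a plain apostrophe (code 39)
APOSTROPHE_LIKE = [96, 180, 697, 699, 700, 702, 703, 712, 714, 715, 729,
                   8216, 8217, 8219, 8223, 8242]

# str.translate expects a dict mapping code points to replacement strings
to_apostrophe = {code: "'" for code in APOSTROPHE_LIKE}


def normalize_apostrophes(text):
    """Replace every apostrophe look-alike with the regular apostrophe."""
    return text.translate(to_apostrophe)


normalized = normalize_apostrophes("l\u2019oeuvre d`art")
```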

Frequency of characters from other scripts (Chinese, Arabic, Cyrillic ...):

scriptsstats = pd.read_csv(chardatadir / "stats" / "normalization.layer11.stats.csv")
scriptsstats[["CharFamily","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]]

	CharFamily	FreqBusiness	FreqForum	FreqPresse	FreqWikipedia
0	ChineseJapaneseKorean	0.012456	0.177127	0.194677	4.059173
1	Arabic	0.012306	0.026467	0.460280	3.140120
2	Cyrillic	0.024462	0.166438	0.237159	3.118961
3	Greek	0.016058	0.022904	0.031347	2.423996
4	Hebrew	0.000150	0.000000	0.184914	1.132155
5	Other	0.000750	0.029012	0.004063	0.800871
6	Indian	0.000750	0.037665	0.033458	0.737955
7	Phonetic	0.002401	0.001527	0.001636	0.298579
8	Latin	0.013507	0.006108	0.007283	0.269377
9	Math	0.001801	0.000509	0.000528	0.240707
10	LaoThai	0.000000	0.001018	0.033194	0.217867
11	Armenian	0.001051	0.000000	0.004011	0.172382

Normalization pipeline API

Initialize a text normalizer :

%time norm = TextNormalizer()
norm

CPU times: user 1.83 s, sys: 15.6 ms, total: 1.84 s
Wall time: 2 s





1 - Fix encoding errors : windows1252 read as iso8859-1
2 - Fix encoding errors : utf8 read as windows1252
3 - Fix encoding errors :  windows1252 read as utf8
4 - Merge Unicode combining chars
5 - Ignore control chars
6 - Replace latin letter symbols
7 - Replace latin letter ligatures
8 - Replace latin number symbols
9 - Normalize equivalent chars
10 - Replace cyrillic and greek chars looking like latin letters
11 - Replace infrequent chars : latin letters with diacritics
12 - Replace infrequent chars : other scripts
13 - Replace infrequent chars : symbols
14 - Replace infrequent chars : chars to ignore

Normalize text :

teststring = chr(127995)+"① l`"+chr(156)+"uv"+chr(127)+"re est¨ "+chr(147)+"belle"+chr(148)+"¸ Ã  Â½ â‚¬ énième â€° "+chr(133)+" ⁽🇪ﬃc🇦ce⁾ ！"
teststring

'🏻① l`\x9cuv\x7fre est¨ \x93belle\x94¸ Ã  Â½ â‚¬ énième â€° \x85 ⁽🇪ﬃc🇦ce⁾ ！'

result = norm(teststring)
result

(1) l'oeuvre est «belle», Ã  1/2 € énième ‰ … (EfficAce) !

Describe the changes applied by the normalization pipeline :

print(result.describeChanges())

Fix encoding errors : windows1252 read as iso8859-1
 < 🏻① l` [�] uv�re est¨  [�] belle [�] ¸ Ã  Â½ â‚¬ énième â€°  [�]  ⁽🇪ﬃc🇦ce⁾ ！
 < 🏻① l` [œ] uv�re est¨  [“] belle [”] ¸ Ã  Â½ â‚¬ énième â€°  […]  ⁽🇪ﬃc🇦ce⁾ ！
Fix encoding errors : utf8 read as windows1252
 < 🏻① l`œuv�re est¨ “belle”¸ Ã   [Â½]   [â‚¬]  énième  [â€°]  … ⁽🇪ﬃc🇦ce⁾ ！
 < 🏻① l`œuv�re est¨ “belle”¸ Ã   [½_]   [€__]  énième  [‰__]  … ⁽🇪ﬃc🇦ce⁾ ！
Merge Unicode combining chars
 < 🏻① l`œuv�re est¨ “belle”¸ Ã  ½ €  [é] ni [è] me ‰ … ⁽🇪ﬃc🇦ce⁾ ！
 < 🏻① l`œuv�re est¨ “belle”¸ Ã  ½ €  [é_] ni [è_] me ‰ … ⁽🇪ﬃc🇦ce⁾ ！
Ignore control chars
 <  [🏻] ① l`œuv [�] re est [¨]  “belle”¸ Ã  ½ € énième ‰ … ⁽🇪ﬃc🇦ce⁾ ！
 <  [_] ① l`œuv [_] re est [_]  “belle”¸ Ã  ½ € énième ‰ … ⁽🇪ﬃc🇦ce⁾ ！
Replace latin letter symbols
 < ① l`œuvre est “belle”¸ Ã  ½ € énième ‰ … ⁽ [🇪] ﬃc [🇦] ce⁾ ！
 < ① l`œuvre est “belle”¸ Ã  ½ € énième ‰ … ⁽ [E] ﬃc [A] ce⁾ ！
Replace latin letter ligatures
 < ① l` [œ ] uvre est “belle”¸ Ã  ½ € énième ‰ … ⁽E [ﬃ  ] cAce⁾ ！
 < ① l` [oe] uvre est “belle”¸ Ã  ½ € énième ‰ … ⁽E [ffi] cAce⁾ ！
Replace latin number symbols
 <  [①  ]  l`oeuvre est “belle”¸ Ã   [½  ]  € énième ‰ … ⁽EfficAce⁾ ！
 <  [(1)]  l`oeuvre est “belle”¸ Ã   [1/2]  € énième ‰ … ⁽EfficAce⁾ ！
Normalize equivalent chars
 < (1) l [`] oeuvre est  [“] belle [”]  [¸]  Ã  1/2 € énième ‰ …  [⁽] EfficAce [⁾]   [！] 
 < (1) l ['] oeuvre est  [«] belle [»]  [,]  Ã  1/2 € énième ‰ …  [(] EfficAce [)]   [!]

Compute spans for equivalent substrings before and after normalization :

result.output[0:12]

"(1) l'oeuvre"

result.input[result.mapOutputIndexToInput(0):result.mapOutputIndexToInput(12)]

'🏻① l`\x9cuv\x7fre'

result.output[3:10]

" l'oeuv"

result.input[result.mapOutputIndexToInput(3):result.mapOutputIndexToInput(10)]

' l`\x9cuv\x7f'
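
One way to support this kind of span mapping is to record, for every output char, the index of the input char it came from. The sketch below shows the idea with a hypothetical function and three invented replacement rules. It makes no claim about how mapOutputIndexToInput is implemented.

```python
def normalize_with_mapping(text, replacements):
    """Replace chars one by one and remember, for each output char, its input index."""
    output, out_to_in = [], []
    for i, ch in enumerate(text):
        # "" drops the char, "oe" expands it to two output chars
        for out_ch in replacements.get(ch, ch):
            output.append(out_ch)
            out_to_in.append(i)
    out_to_in.append(len(text))  # sentinel so that an end-of-span index can be mapped too
    return "".join(output), out_to_in


# Hypothetical rules: expand a ligature, replace a backtick, drop a control char
replacements = {"\u0153": "oe", "`": "'", "\x7f": ""}
source = "l`\u0153uv\x7fre"
output, out_to_in = normalize_with_mapping(source, replacements)

# Map the output span [2:4] ("oe") back to the input
ligature = source[out_to_in[2]:out_to_in[4]]

# Map the whole output back to the whole input
recovered = source[out_to_in[0]:out_to_in[len(output)]]
```

With such a mapping, an entity detected in the normalized text can be highlighted at the right place in the original text.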

Performance test: about 2500 sentences per second, which is fast enough for now but will be optimized in a later version.

%timeit -n100 norm(teststring)

397 µs ± 89.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
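
The throughput quoted above follows directly from the mean time per call, as this small computation shows. It treats each call on the test string as one sentence.

```python
mean_seconds_per_call = 397e-6   # mean reported by %timeit above (397 µs)

# One call normalizes one test sentence, so throughput is the inverse of the mean time
calls_per_second = 1 / mean_seconds_per_call
```

The result is a little over 2500 calls per second.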

Appendix : Unicode utility functions

Unicode characters properties :

charname("🙂")

'Slightly Smiling Face'

charcategory("🙂")

'Symbol'

charsubcategory("🙂")

'Other'

charblock("🙂")

'Emoticons'

blockfamily('Emoticons')

'Symbols'
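
For comparison, Python's standard unicodedata module exposes similar properties, with less readable category codes. The sketch below uses only the standard library and does not claim that the frenchtext functions are built on it.

```python
import unicodedata

char = "\U0001F642"  # 🙂

# Official Unicode name, converted to title case for readability
name = unicodedata.name(char).title()

# Two-letter general category: 'S' = Symbol, 'o' = other
general_category = unicodedata.category(char)
```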

frenchtext
Release 0.0.5
