# Using this method, you can get each match from REPEATED CAPTURE GROUPS! (A very rare feature in regex engines)
# Besides that, you will see the exact position of each group/match.
df = pd.read_csv("https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
special = df.ds_regex_find_all_special(r'\b(Ma(\w)+)(\w+)\b')
special
                                                                       aa_start_0  ...  aa_match_6
aa_index aa_column aa_whole_match aa_whole_start aa_whole_end aa_group
7        Name      Master         9              15           0                9  ...         NaN
                                                              1                9  ...         NaN
                                                              2               11  ...         NaN
                                                              3               14  ...         NaN
10       Name      Marguerite     17             27           0               17  ...         NaN
...                                                                           ...  ...         ...
885      Name      Margaret       20             28           3               27  ...         NaN
887      Name      Margaret       14             22           0               14  ...         NaN
                                                              1               14  ...         NaN
                                                              2               16  ...         NaN
                                                              3               21  ...         NaN

# If you use any common regex engine, you can't get the repeated capture groups,
# since every new result overwrites the old one:
import re
re.findall(r'(Ma(\w)+)', 'Margaret')
Out[11]: [('Margaret', 't')]
# Using this method you will get all repeated capture groups, they won't be overwritten!
# Results for index 887 (aa_column=Name, aa_whole_match="Margaret", aa_whole_start=14, aa_whole_end=22):

aa_group  aa_start_0..6                aa_stop_0..6                 aa_match_0..6
0         14, <NA> x6                  22, <NA> x6                  Margaret, <NA> x6
1         14, <NA> x6                  21, <NA> x6                  Margare, <NA> x6
2         16, 17, 18, 19, 20, <NA> x2  17, 18, 19, 20, 21, <NA> x2  r, g, a, r, e, <NA> x2
3         21, <NA> x6                  22, <NA> x6                  t, <NA> x6

# If you want to convert the results to the best available dtype, use:
df.ds_regex_find_all_special(r'\b(Ma(\w)+)(\w+)\b', dtype_string=False)
Out[3]:
                                                                       aa_start_0  ...  aa_match_6
aa_index aa_column aa_whole_match aa_whole_start aa_whole_end aa_group
7        Name      Master         9              15           0                9  ...        <NA>
                                                              1                9  ...        <NA>
                                                              2               11  ...        <NA>
                                                              3               14  ...        <NA>
10       Name      Marguerite     17             27           0               17  ...        <NA>
...                                                                           ...  ...         ...
885      Name      Margaret       20             28           3               27  ...        <NA>
887      Name      Margaret       14             22           0               14  ...        <NA>
                                                              1               14  ...        <NA>
                                                              2               16  ...        <NA>
                                                              3               21  ...        <NA>

[764 rows x 21 columns]
aa_start_0       uint8
aa_start_1       Int64
aa_start_2       Int64
aa_start_3       Int64
aa_start_4       Int64
aa_start_5       Int64
aa_start_6       Int64
aa_stop_0        uint8
aa_stop_1        Int64
aa_stop_2        Int64
aa_stop_3        Int64
aa_stop_4        Int64
aa_stop_5        Int64
aa_stop_6        Int64
aa_match_0    category
aa_match_1    category
aa_match_2    category
aa_match_3    category
aa_match_4    category
aa_match_5    category
aa_match_6    category

Parameters:
df: Union[pd.DataFrame, pd.Series]
regular_expression: str
    Syntax from https://pypi.org/project/regex/
flags: int
    You can use any flag that is available here: https://pypi.org/project/regex/
    (default=regex.UNICODE)
dtype_string: bool
    If True, it returns all results as a string
    If False, data types are converted to the best available
    (default=True)
Returns:
Union[pd.Series, pd.DataFrame]
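
The repeated-capture-group feature comes from the third-party regex module (https://pypi.org/project/regex/), not the standard library re. As a rough sketch of where that information comes from (plain regex, no pandas, my own variable names; not this package's actual code):

import regex  # the third-party engine this package builds on

text = 'Graham, Miss. Margaret Edith'  # sample text; the whole match "Margaret" starts at position 14, as in the output above
m = regex.search(r'\b(Ma(\w)+)(\w+)\b', text)
if m is not None:
    for group_number in range(1, m.re.groups + 1):
        # captures()/spans() keep every repetition of a group instead of only the last one
        print(group_number, m.captures(group_number), m.spans(group_number))
# group 1 -> ['Margare'], group 2 -> ['r', 'g', 'a', 'r', 'e'], group 3 -> ['t']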
# Use regex.findall against a DataFrame/Series without having to fear any exception! You can get
# the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - whatever
# fits best!
# Some examples
df = pd.read_csv("https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
df.Name.ds_regex_find_all(regular_expression=r'(\bM\w+\b)\s+(\bW\w+\b)')
            result_0  result_1
426  Name      Maria  Winfield
472  Name       Mary     Worth
862  Name   Margaret    Welles

multilinetest = df.Name.map(lambda x: f'{x}\n' * 3)  # Every name 3x in each cell
multilinetest.ds_regex_find_all(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=False)
Out[3]:
            result_0  result_1
58   Name     Mirium      West
426  Name      Maria  Winfield
472  Name       Mary     Worth
862  Name   Margaret    Welles

multilinetest.ds_regex_find_all(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=True)
Out[7]:
            result_0  result_1
426  Name      Maria  Winfield
426  Name      Maria  Winfield
426  Name      Maria  Winfield
472  Name       Mary     Worth
472  Name       Mary     Worth
472  Name       Mary     Worth
862  Name   Margaret    Welles
862  Name   Margaret    Welles
862  Name   Margaret    Welles

# By using line_by_line=True you can be sure that the regex engine will check every single line!

Parameters:
df: Union[pd.DataFrame, pd.Series]
regular_expression: str
    Syntax from https://pypi.org/project/regex/
flags: int
    You can use any flag that is available here: https://pypi.org/project/regex/
    (default=regex.UNICODE)
dtype_string: bool
    If True, it returns all results as a string
    If False, data types are converted to the best available
    (default=True)
line_by_line: bool
    Split each cell into lines before searching. Useful if you want to use ^....$ more than once.
    (default=False)
Returns:
Union[pd.Series, pd.DataFrame]
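
To make line_by_line concrete, here is a small plain-re sketch (no pandas, my own variable names) of why index 58 appears in Out[3] but not in Out[7]: in one multi-line cell, \s+ can run across the newline and pair "Mirium" with the "West" of the next repetition, while no single line contains an M-word followed by a W-word.

import re

cell = "West, Miss. Constance Mirium\n" * 3  # one multi-line cell, as in multilinetest

# Searching the whole cell: \s+ matches the newline, so ("Mirium", "West") is found
print(re.findall(r'^.*(\bM\w+\b)\s+(\bW\w+\b)', cell))        # [('Mirium', 'West')]

# Searching line by line: no single line has an M-word followed by whitespace and a W-word
print([m for line in cell.splitlines()
       for m in re.findall(r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line)])   # []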
# If you have a huge list of words you want to search/sub/find_all with, you can try the Trie regex methods to get the job done faster.
# It is worth trying if:
# 1) your DataFrame/Series has a lot of text in each cell
# 2) you want to search for a lot of words in each cell
#
# The more words you have, and the more text there is in each cell, the faster it gets.
# If you want to know more about it, I recommend: https://stackoverflow.com/a/42789508/15096247

Example:
df = pd.read_csv("https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
allstrings = pd.DataFrame([[df.Name.to_string() * 2] * 2, [df.Name.to_string() * 2] * 2])  # let's create a little DataFrame with a lot of text in each cell
hugeregexlist = df.Name.str.extract(r'^\s*(\w+)').drop_duplicates()[0].to_list()  # let's get all names (first word) in the titanic DataFrame
# it should look like that: ['Braund', 'Cumings', 'Heikkinen', 'Futrelle', 'Allen', 'Moran', 'McCarthy', 'Palsson', 'Johnson', 'Nasser' ... ]
%timeit allstrings.ds_trie_regex_find_all(hugeregexlist, add_left_to_regex=r'\b', add_right_to_regex=r'\b')
776 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
allstrings.ds_trie_regex_find_all(hugeregexlist, add_left_to_regex=r'\b', add_right_to_regex=r'\b')
Out[6]:
     result_0  result_1  result_2  ...  result_2133  result_2134  result_2135
0 0    Braund    Harris   Cumings  ...     Johnston         Behr       Dooley
  1    Braund    Harris   Cumings  ...     Johnston         Behr       Dooley
1 0    Braund    Harris   Cumings  ...     Johnston         Behr       Dooley
  1    Braund    Harris   Cumings  ...     Johnston         Behr       Dooley

# Let's compare with a regular regex search
hugeregex = r"\b(?:" + "|".join([f'(?:{y})' for y in df.Name.str.extract(r'^\s*(\w+)').drop_duplicates()[0].to_list()]) + r")\b"  # let's create a regex from all names
# it should look like this: '\\b(?:(?:Braund)|(?:Cumings)|(?:Heikkinen)|(?:Futrelle)|(?:Allen)|(?:Moran)|(?:McCarthy)|(?:Palsson)|(?:Johnson)|(?:Na...
%timeit allstrings.ds_regex_find_all(hugeregex)
945 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Not bad, right? But it can still get a lot better! Try it yourself!
# Another good thing is that you can search in every cell, no matter what dtype it is.
# No exception will be thrown, because everything is converted to string before any action is performed.
# If you pass "dtype_string=False", each column will be converted to the best available dtype after the actions have been completed.

Parameters:
df: Union[pd.DataFrame, pd.Series]
wordlist: list[str]
    All strings you are looking for
add_left_to_regex: str
    If you want to add something before the generated Trie regex -> \b for example:
    allstrings.ds_trie_regex_find_all(hugeregexlist, add_left_to_regex=r'\b', add_right_to_regex=r'\b')
    (default="")
add_right_to_regex: str
    If you want to add something after the generated Trie regex -> \b for example:
    allstrings.ds_trie_regex_find_all(hugeregexlist, add_left_to_regex=r'\b', add_right_to_regex=r'\b')
    (default="")
flags: int
    You can use any flag that is available here: https://pypi.org/project/regex/
    (default=regex.UNICODE)
dtype_string: bool
    If True, it returns all results as a string
    If False, data types are converted to the best available
    (default=True)
line_by_line: bool
    Split each cell into lines before searching. Useful if you want to use ^....$ more than once.
    (default=False)
Returns:
Union[pd.Series, pd.DataFrame]
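
The Stack Overflow answer linked above explains the underlying trick: instead of a flat alternation like \b(?:(?:Mary)|(?:Maria)|(?:Margaret))\b, the word list is packed into a character trie so that common prefixes are matched only once. A minimal sketch of that idea (my own helper, not the code this package actually uses); add_left_to_regex / add_right_to_regex correspond to the \b added around the generated pattern:

import re

def trie_regex(words):
    # Build a character trie from the word list.
    trie = {}
    for word in words:
        node = trie
        for char in word:
            node = node.setdefault(char, {})
        node['$end$'] = True  # marks the end of a complete word

    # Recursively turn the trie into a regex that shares common prefixes.
    def to_pattern(node):
        word_ends_here = '$end$' in node
        branches = sorted(key for key in node if key != '$end$')
        if not branches:
            return ''
        alternatives = [re.escape(char) + to_pattern(node[char]) for char in branches]
        pattern = alternatives[0] if len(alternatives) == 1 else '(?:' + '|'.join(alternatives) + ')'
        # If a word ends here but longer words continue, the remainder is optional.
        return f'(?:{pattern})?' if word_ends_here else pattern

    return to_pattern(trie)

names = ['Mary', 'Maria', 'Margaret', 'Marguerite']
pattern = r'\b' + trie_regex(names) + r'\b'
print(pattern)                                             # \bMar(?:g(?:aret|uerite)|ia|y)\b
print(re.findall(pattern, 'Margaret and Maria met Mary'))  # ['Margaret', 'Maria', 'Mary']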
# Use regex.sub against a DataFrame/Series without having to fear any exception! You can get
# the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - whatever
# fits best!
#
# Some examples:
df = pd.read_csv("https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
df
     PassengerId  Survived  Pclass  ...     Fare Cabin Embarked
0              1         0       3  ...   7.2500   NaN        S
1              2         1       1  ...  71.2833   C85        C
2              3         1       3  ...   7.9250   NaN        S
3              4         1       1  ...  53.1000  C123        S
4              5         0       3  ...   8.0500   NaN        S
..           ...       ...     ...  ...      ...   ...      ...
886          887         0       2  ...  13.0000   NaN        S
887          888         1       1  ...  30.0000   B42        S
888          889         0       3  ...  23.4500   NaN        S
889          890         1       1  ...  30.0000  C148        C
890          891         0       3  ...   7.7500   NaN        Q

[891 rows x 12 columns]
subst = df.ds_regex_sub(regular_expression=r'^\b8\d(\d)\b', replace=r'\g<1>00000', dtype_string=False)
subst
Out[5]:
     PassengerId  Survived  Pclass  ...     Fare Cabin Embarked
0              1         0       3  ...   7.2500  <NA>        S
1              2         1       1  ...  71.2833   C85        C
2              3         1       3  ...   7.9250  <NA>        S
3              4         1       1  ...  53.1000  C123        S
4              5         0       3  ...   8.0500  <NA>        S
..           ...       ...     ...  ...      ...   ...      ...
886       700000         0       2  ...  13.0000  <NA>        S
887       800000         1       1  ...  30.0000   B42        S
888       900000         0       3  ...  23.4500  <NA>        S
889            0         1       1  ...  30.0000  C148        C
890       100000         0       3  ...   7.7500  <NA>        Q

[891 rows x 12 columns]
subst.dtypes
Out[8]:
PassengerId      uint32
Survived          uint8
Pclass            uint8
Name             string
Sex            category
Age              object
SibSp             uint8
Parch             uint8
Ticket           object
Fare            float64
Cabin          category
Embarked       category

# As you can see, the numbers that we have substituted have been converted to int
# Let's do something like math.floor in a very unconventional way :)
df.Fare
Out[16]:
0       7.2500
1      71.2833
2       7.9250
3      53.1000
4       8.0500
        ...
886    13.0000
887    30.0000
888    23.4500
889    30.0000
890     7.7500
Name: Fare, Length: 891, dtype: float64

Fareint = df.Fare.ds_regex_sub(r'(\d+)\.\d+$', r'\g<1>', dtype_string=False)
Fareint
0       7
1      71
2       7
3      53
4       8
..    ...
886    13
887    30
888    23
889    30
890     7

Fareint.dtypes
Out[18]:
Fare    uint16

# You should not use this method if there are other ways to convert float to int.
# It serves best for data cleaning, at least that's what I am using it for.

Parameters:
df: Union[pd.DataFrame, pd.Series]
regular_expression: str
    Syntax from https://pypi.org/project/regex/
replace: str
    The replacement you want to use (groups are allowed)
flags: int
    You can use any flag that is available here: https://pypi.org/project/regex/
    (default=regex.UNICODE)
dtype_string: bool
    If True, it returns all results as a string
    If False, data types are converted to the best available
    (default=True)
line_by_line: bool
    Split each cell into lines before searching. Useful if you want to use ^....$ more than once.
    (default=False)
Returns:
Union[pd.Series, pd.DataFrame]
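
As noted above, the regex route to the integer part is mainly useful while cleaning messy text. For an already-numeric column, the conventional equivalents would look roughly like this (plain pandas/NumPy, my own variable names):

import numpy as np
import pandas as pd

df = pd.read_csv("https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")

fare_int = df.Fare.astype("uint16")   # casting truncates the decimals (raises if the column contains NaN)
fare_floor = np.floor(df.Fare)        # floors the values but keeps the float64 dtype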
# Use regex.search against a DataFrame/Series without having to fear any exception! You can get
# the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - whatever
# fits best!
#
# Some examples
df = pd.read_csv("https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
multilinetest = df.Name.map(lambda x: f'{x}\n' * 3)  # Every name 3x in each cell to test line_by_line

# using line_by_line=False
multilinetest.ds_regex_search(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=False, flags=re.IGNORECASE)
Out[13]:
           result_0
58   Name  West, Miss. Constance Mirium\nWest
58   Name  Mirium
58   Name  West
426  Name  Clarke, Mrs. Charles V (Ada Maria Winfield
426  Name  Maria
426  Name  Winfield
472  Name  West, Mrs. Edwy Arthur (Ada Mary Worth
472  Name  Mary
472  Name  Worth
862  Name  Swift, Mrs. Frederick Joel (Margaret Welles
862  Name  Margaret
862  Name  Welles

# using line_by_line=True
multilinetest.ds_regex_search(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=True, flags=re.IGNORECASE)
Out[19]:
           result_0
426  Name  Clarke, Mrs. Charles V (Ada Maria Winfield
426  Name  Maria
426  Name  Winfield
426  Name  Clarke, Mrs. Charles V (Ada Maria Winfield
426  Name  Maria
426  Name  Winfield
426  Name  Clarke, Mrs. Charles V (Ada Maria Winfield
426  Name  Maria
426  Name  Winfield
472  Name  West, Mrs. Edwy Arthur (Ada Mary Worth
472  Name  Mary
472  Name  Worth
472  Name  West, Mrs. Edwy Arthur (Ada Mary Worth
472  Name  Mary
472  Name  Worth
472  Name  West, Mrs. Edwy Arthur (Ada Mary Worth
472  Name  Mary
472  Name  Worth
862  Name  Swift, Mrs. Frederick Joel (Margaret Welles
862  Name  Margaret
862  Name  Welles
862  Name  Swift, Mrs. Frederick Joel (Margaret Welles
862  Name  Margaret
862  Name  Welles
862  Name  Swift, Mrs. Frederick Joel (Margaret Welles
862  Name  Margaret
862  Name  Welles

# Now, we get a match for each line!

Parameters:
df: Union[pd.DataFrame, pd.Series]
regular_expression: str
    Syntax from https://pypi.org/project/regex/
flags: int
    You can use any flag that is available here: https://pypi.org/project/regex/
    (default=regex.UNICODE)
dtype_string: bool
    If True, it returns all results as a string
    If False, data types are converted to the best available
    (default=True)
line_by_line: bool
    Split each cell into lines before searching. Useful if you want to use ^....$ more than once.
    (default=False)
Returns:
Union[pd.Series, pd.DataFrame]
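
The row layout above (the whole match, then one row per capture group, repeated per line when line_by_line=True) mirrors what a single regex.search match object contains. A rough sketch of that idea (plain regex, my own variable names; the sample text is modelled on the Name at index 862):

import regex

cell = "Swift, Mrs. Frederick Joel (Margaret Welles Barron)"
m = regex.search(r'^.*(\bM\w+\b)\s+(\bW\w+\b)', cell, flags=regex.IGNORECASE)
if m is not None:
    # group(0) is the whole match, followed by each capture group -
    # the same three values that appear as separate result_0 rows per index above.
    print([m.group(i) for i in range(m.re.groups + 1)])
# ['Swift, Mrs. Frederick Joel (Margaret Welles', 'Margaret', 'Welles']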