a-pandas-ex-regex-enhancements

Get repeated capture groups, search without having to fear Exceptions in any df/Series, convert results to appropriate dtypes, use fast Trie regex


Keywords
pandas, dtypes, regex, regular, expressions, re, pandas-dataframe
License
MIT
Install
pip install a-pandas-ex-regex-enhancements==0.11

Documentation

Regex enhancements for Pandas DataFrame / Series

Installation

pip install a-pandas-ex-regex-enhancements

Usage

from a_pandas_ex_regex_enhancements import pd_add_regex_enhancements
pd_add_regex_enhancements()
import pandas as pd
The code above will add some methods to! You can use pandas like you did before, but you will have a couple of methods more:
  • pandas.DataFrame.ds_trie_regex_search / pandas.Series.ds_trie_regex_search
  • pandas.DataFrame.ds_trie_regex_sub / pandas.Series.ds_trie_regex_sub
  • pandas.DataFrame.ds_trie_regex_find_all / pandas.Series.ds_trie_regex_find_all
  • pandas.DataFrame.ds_regex_find_all / pandas.Series.ds_regex_find_all
  • pandas.DataFrame.ds_regex_find_all_special / pandas.Series.ds_regex_find_all_special
  • pandas.DataFrame.ds_regex_search / pandas.Series.ds_regex_search
  • pandas.DataFrame.ds_regex_sub / pandas.Series.ds_regex_sub

How to use the new methods

pandas.DataFrame.ds_regex_find_all_special / pandas.Series.ds_regex_find_all_special*
    #Using this method, you can get each match from REPEATED CAPTURE GROUPS! (A very rare feature in regex engines)
    #Besides that, you will see the exact position of each group/match.

    df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")

    special=df.ds_regex_find_all_special(r'\b(Ma(\w)+)(\w+)\b', dtype_string=False)


                                                                           aa_start_0  ... aa_match_6
    aa_index aa_column aa_whole_match aa_whole_start aa_whole_end aa_group             ...
    7        Name      Master         9              15           0                 9  ...        NaN
                                                                  1                 9  ...        NaN
                                                                  2                11  ...        NaN
                                                                  3                14  ...        NaN
    10       Name      Marguerite     17             27           0                17  ...        NaN
                                                                               ...  ...        ...
    885      Name      Margaret       20             28           3                27  ...        NaN
    887      Name      Margaret       14             22           0                14  ...        NaN
                                                                  1                14  ...        NaN
                                                                  2                16  ...        NaN
                                                                  3                21  ...        NaN

    #If you use any common regex engine, you can't get the repeated capture groups, since every new result overwrites the old one:
    import re
    re.findall('(Ma(\w)+)', 'Margaret')
    Out[11]: [('Margaret', 't')]

    #Using this method you will get all repeated capture groups, they won't be overwritten!

    #Results for index 887
                                                                       aa_start_0  aa_start_1  aa_start_2  aa_start_3  aa_start_4  aa_start_5  aa_start_6  aa_stop_0  aa_stop_1  aa_stop_2  aa_stop_3  aa_stop_4  aa_stop_5  aa_stop_6 aa_match_0 aa_match_1 aa_match_2 aa_match_3 aa_match_4 aa_match_5 aa_match_6
    aa_column aa_whole_match aa_whole_start aa_whole_end aa_group
    Name      Margaret       14             22           0                 14        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         22       <NA>       <NA>       <NA>       <NA>       <NA>       <NA>   Margaret       <NA>       <NA>       <NA>       <NA>       <NA>       <NA>
                                                         1                 14        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         21       <NA>       <NA>       <NA>       <NA>       <NA>       <NA>    Margare       <NA>       <NA>       <NA>       <NA>       <NA>       <NA>
                                                         2                 16          17          18          19          20        <NA>        <NA>         17         18         19         20         21       <NA>       <NA>          r          g          a          r          e       <NA>       <NA>
                                                         3                 21        <NA>        <NA>        <NA>        <NA>        <NA>        <NA>         22       <NA>       <NA>       <NA>       <NA>       <NA>       <NA>          t       <NA>       <NA>       <NA>       <NA>       <NA>       <NA>


    If you want to convert the results to the best available dtype, use:

    df.ds_regex_find_all_special(r'\b(Ma(\w)+)(\w+)\b', dtype_string=False)

    Out[3]:
                                                                            aa_start_0  ...  aa_match_6
    #aa_index aa_column aa_whole_match aa_whole_start aa_whole_end aa_group              ...
    7        Name      Master         9              15           0                  9  ...        <NA>
                                                                  1                  9  ...        <NA>
                                                                  2                 11  ...        <NA>
                                                                  3                 14  ...        <NA>
    10       Name      Marguerite     17             27           0                 17  ...        <NA>
                                                                                ...  ...         ...
    885      Name      Margaret       20             28           3                 27  ...        <NA>
    887      Name      Margaret       14             22           0                 14  ...        <NA>
                                                                  1                 14  ...        <NA>
                                                                  2                 16  ...        <NA>
                                                                  3                 21  ...        <NA>
    [764 rows x 21 columns]


    aa_start_0       uint8
    aa_start_1       Int64
    aa_start_2       Int64
    aa_start_3       Int64
    aa_start_4       Int64
    aa_start_5       Int64
    aa_start_6       Int64
    aa_stop_0        uint8
    aa_stop_1        Int64
    aa_stop_2        Int64
    aa_stop_3        Int64
    aa_stop_4        Int64
    aa_stop_5        Int64
    aa_stop_6        Int64
    aa_match_0    category
    aa_match_1    category
    aa_match_2    category
    aa_match_3    category
    aa_match_4    category
    aa_match_5    category
    aa_match_6    category

        Parameters:
            df: Union[pd.DataFrame, pd.Series]
            regular_expression: str
               Syntax from https://pypi.org/project/regex/
            flags:int
                You can use any flag that is available here: https://pypi.org/project/regex/
               (default  =regex.UNICODE)
            dtype_string:bool
                If True, it returns all results as a string
                If False, data types are converted to the best available
               (default  =True)
        Returns:
            Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_regex_find_all / pandas.Series.ds_regex_find_all
    #Use regex.findall against a DataFrame/Series without having to fear any exception! You can get
    #the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - Whatever
    #fits best!

    #Some examples

    df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")


    df.Name.ds_regex_find_all(regular_expression=r'(\bM\w+\b)\s+(\bW\w+\b)')
              result_0  result_1
    426 Name     Maria  Winfield
    472 Name      Mary     Worth
    862 Name  Margaret    Welles

    multilinetest=df.Name.map(lambda x: f'{x}\n' * 3) #Every name 3x in each cell

    multilinetest.ds_regex_find_all(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=False)

    Out[3]:
              result_0  result_1
    58  Name    Mirium      West
    426 Name     Maria  Winfield
    472 Name      Mary     Worth
    862 Name  Margaret    Welles


    multilinetest.ds_regex_find_all(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=True)
    Out[7]:
              result_0  result_1
    426 Name     Maria  Winfield
        Name     Maria  Winfield
        Name     Maria  Winfield
    472 Name      Mary     Worth
        Name      Mary     Worth
        Name      Mary     Worth
    862 Name  Margaret    Welles
        Name  Margaret    Welles
        Name  Margaret    Welles

    #By using line_by_line=True you can be sure that the regex engine will check every single line!

        Parameters:
            df: Union[pd.DataFrame, pd.Series]
            regular_expression: str
               Syntax from https://pypi.org/project/regex/
            flags:int
                You can use any flag that is available here: https://pypi.org/project/regex/
               (default  =regex.UNICODE)
            dtype_string:bool
                If True, it returns all results as a string
                If False, data types are converted to the best available
               (default  =True)
            line_by_line:bool
                If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
               (default  =False)
        Returns:
            Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_trie_regex_find_all / pandas.Series.ds_trie_regex_find_all
    #If you have a huge list of words you want to  search/sub/find_all on this list, you can try to use the Trie regex methods to get the job done faster
    #It is worth trying if:
    #1) your DataFrame/Series has a lot of text in each cell
    #2) you want to search for a lot of words in each cell
#
    #The more words you have, and the more text is in each cell, the faster it gets.
    #If you want to know more about, I recommend: https://stackoverflow.com/a/42789508/15096247

    Example:

    df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
    allstrings=pd.DataFrame([[df.Name.to_string() *2] *2,[df.Name.to_string() *2] *2]) #lets create a little dataframe with a lot of text in each cell
    hugeregexlist=df.Name.str.extract(r'^\s*(\w+)').drop_duplicates()[0].to_list() #lets get all names (first word) in the titanic DataFrame
    #it should look like that: ['Braund',  'Cumings',  'Heikkinen',  'Futrelle',  'Allen',  'Moran',  'McCarthy',  'Palsson',  'Johnson',  'Nasser' ... ]
    %timeit allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b')

    776 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    allstrings.ds_trie_regex_find_all(hugeregexlist, add_left_to_regex=r'\b', add_right_to_regex=r'\b')
    Out[6]:
        result_0 result_1 result_2  ... result_2133 result_2134 result_2135
    0 0   Braund   Harris  Cumings  ...    Johnston        Behr      Dooley
      1   Braund   Harris  Cumings  ...    Johnston        Behr      Dooley
    1 0   Braund   Harris  Cumings  ...    Johnston        Behr      Dooley
      1   Braund   Harris  Cumings  ...    Johnston        Behr      Dooley


    Let's compare with a regular regex search
    hugeregex=r"\b(?:" + "|".join([f'(?:{y})' for y in df.Name.str.extract(r'^\s*(\w+)').drop_duplicates()[0].to_list()]) + r")\b"  #let's create a regex from all names
    #it should look like this: '\\b(?:(?:Braund)|(?:Cumings)|(?:Heikkinen)|(?:Futrelle)|(?:Allen)|(?:Moran)|(?:McCarthy)|(?:Palsson)|(?:Johnson)|(?:Na...
    %timeit allstrings.ds_regex_find_all(hugeregex)

    945 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    #Not bad, right? But it still can a lot better! Try it yourself!

    #Another good thing is that you can search in every cell, no matter what dtype it is.
    #There won't be thrown any exception, because everything is converted to string before performing any action.
    #If you pass "dtype_string=False", each column will be converted to the best available dtype after the actions have been completed

        Parameters:
            df: Union[pd.DataFrame, pd.Series]
            wordlist: list[str]
               All strings you are looking for
            add_left_to_regex: str
                if you want to add something before the generated Trie regex -> \b for example
                allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b')
               (default  = "")
            add_right_to_regex: str
                if you want to add something after the generated Trie regex -> \b for example
                allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b')
               (default  = "")
            flags:int
                You can use any flag that is available here: https://pypi.org/project/regex/
               (default  =regex.UNICODE)
            dtype_string:bool
                If True, it returns all results as a string
                If False, data types are converted to the best available
               (default  =True)
            line_by_line:bool
                If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
               (default  =False)
        Returns:
            Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_regex_sub / pandas.Series.ds_regex_sub
    #Use regex.sub against a DataFrame/Series without having to fear any exception! You can get
    #the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - Whatever
    #fits best!
#
    #Some examples:
    df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")


         PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
    0              1         0       3  ...   7.2500   NaN         S
    1              2         1       1  ...  71.2833   C85         C
    2              3         1       3  ...   7.9250   NaN         S
    3              4         1       1  ...  53.1000  C123         S
    4              5         0       3  ...   8.0500   NaN         S
    ..           ...       ...     ...  ...      ...   ...       ...
    886          887         0       2  ...  13.0000   NaN         S
    887          888         1       1  ...  30.0000   B42         S
    888          889         0       3  ...  23.4500   NaN         S
    889          890         1       1  ...  30.0000  C148         C
    890          891         0       3  ...   7.7500   NaN         Q
    [891 rows x 12 columns]


    subst=df.ds_regex_sub(regular_expression=r'^\b8\d(\d)\b', replace=r'\g<1>00000',dtype_string=False)

    Out[5]:
         PassengerId  Survived  Pclass  ...     Fare Cabin Embarked
    0              1         0       3  ...   7.2500  <NA>        S
    1              2         1       1  ...  71.2833   C85        C
    2              3         1       3  ...   7.9250  <NA>        S
    3              4         1       1  ...  53.1000  C123        S
    4              5         0       3  ...   8.0500  <NA>        S
    ..           ...       ...     ...  ...      ...   ...      ...
    886       700000         0       2  ...  13.0000  <NA>        S
    887       800000         1       1  ...  30.0000   B42        S
    888       900000         0       3  ...  23.4500  <NA>        S
    889            0         1       1  ...  30.0000  C148        C
    890       100000         0       3  ...   7.7500  <NA>        Q
    [891 rows x 12 columns]


    subst.dtypes
    Out[8]:
    PassengerId      uint32
    Survived          uint8
    Pclass            uint8
    Name             string
    Sex            category
    Age              object
    SibSp             uint8
    Parch             uint8
    Ticket           object
    Fare            float64
    Cabin          category
    Embarked       category

    #As you can see, the numbers that we have substituted have been converted to int

    #Let's do something like math.floor in a very unconventional way :)

    df.Fare
    Out[16]:
    0       7.2500
    1      71.2833
    2       7.9250
    3      53.1000
    4       8.0500
            ...
    886    13.0000
    887    30.0000
    888    23.4500
    889    30.0000
    890     7.7500
    Name: Fare, Length: 891, dtype: float64

    Fareint=df.Fare.ds_regex_sub(r'(\d+)\.\d+$', r'\g<1>',dtype_string=False)

    0       7
    1      71
    2       7
    3      53
    4       8
    ..    ...
    886    13
    887    30
    888    23
    889    30
    890     7

    Fareint.dtypes
    Out[18]:
    Fare    uint16
    #You should not use this method if there are other ways to convert float to int.
    #It serves best for data cleaning, at least that's what I am using it for.

        Parameters:
            df: Union[pd.DataFrame, pd.Series]
            regular_expression: str
               Syntax from https://pypi.org/project/regex/
            replace: str
               the replacement you want to use (groups are allowed)
            flags:int
                You can use any flag that is available here: https://pypi.org/project/regex/
               (default  =regex.UNICODE)
            dtype_string:bool
                If True, it returns all results as a string
                If False, data types are converted to the best available
               (default  =True)
            line_by_line:bool
                If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
               (default  =False)
        Returns:
            Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_trie_regex_sub / pandas.Series.ds_trie_regex_sub
    #Check out the docs of df.trie_regex_find_all() for detailed information

    #Some examples with DataFrames / Series
    df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")
    #hugeregexlist = ['Braund',  'Cumings',  'Heikkinen',  'Futrelle',  'Allen',  'Moran',  'McCarthy',  'Palsson',  'Johnson',  'Nasser' ... ]

    df.Name.ds_trie_regex_sub(hugeregexlist, 'HANS',add_left_to_regex=r'^\b',add_right_to_regex=r'\b')

    Out[16]:
                                                     Name
    0                               HANS, Mr. Owen Harris
    1    HANS, Mrs. John Bradley (Florence Briggs Thayer)
    2                                   HANS, Miss. Laina
    3            HANS, Mrs. Jacques Heath (Lily May Peel)
    4                             HANS, Mr. William Henry
    ..                                                ...
    886                                 HANS, Rev. Juozas
    887                        HANS, Miss. Margaret Edith
    888              HANS, Miss. Catherine Helen "Carrie"
    889                             HANS, Mr. Karl Howell
    890                                 HANS, Mr. Patrick
    [891 rows x 1 columns]


    allstrings.ds_trie_regex_search(hugeregexlist,line_by_line=True)

    Out[25]:
          result_0
    0 0     Braund
      0    Cumings
      0  Heikkinen
      0   Futrelle
      0      Allen
    ..         ...
    1 1   Montvila
      1     Graham
      1   Johnston
      1       Behr
      1     Dooley
    [7124 rows x 1 columns]



        Parameters:
            df: Union[pd.DataFrame, pd.Series]
            wordlist: list[str]
               All strings you are looking for
            replace: str
               the replacement you want to use (groups are allowed)
            add_left_to_regex: str
                if you want to add something before the generated Trie regex -> \b for example
                allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b')
               (default  = "")
            add_right_to_regex: str
                if you want to add something after the generated Trie regex -> \b for example
                allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b')
               (default  = "")
            flags:int
                You can use any flag that is available here: https://pypi.org/project/regex/
               (default  =regex.UNICODE)
            dtype_string:bool
                If True, it returns all results as a string
                If False, data types are converted to the best available
               (default  =True)
            line_by_line:bool
                If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
               (default  =False)
        Returns:
            Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_regex_search / pandas.Series.ds_regex_search
    #Use regex.search against a DataFrame/Series without having to fear any exception! You can get
    #the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - Whatever
    #fits best!
#
    #Some examples

    df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")

    multilinetest=df.Name.map(lambda x: f'{x}\n' * 3) #Every name 3x in each cell to test line_by_line

    #using line_by_line=False
    multilinetest.ds_regex_search(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=False, flags=re.IGNORECASE)
    Out[13]:
                                                 result_0
    58  Name           West, Miss. Constance Mirium\nWest
        Name                                       Mirium
        Name                                         West
    426 Name   Clarke, Mrs. Charles V (Ada Maria Winfield
        Name                                        Maria
        Name                                     Winfield
    472 Name       West, Mrs. Edwy Arthur (Ada Mary Worth
        Name                                         Mary
        Name                                        Worth
    862 Name  Swift, Mrs. Frederick Joel (Margaret Welles
        Name                                     Margaret
        Name                                       Welles

    #using line_by_line=True
    multilinetest.ds_regex_search(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=True, flags=re.IGNORECASE)
    Out[19]:
                                                 result_0
    426 Name   Clarke, Mrs. Charles V (Ada Maria Winfield
        Name                                        Maria
        Name                                     Winfield
        Name   Clarke, Mrs. Charles V (Ada Maria Winfield
        Name                                        Maria
        Name                                     Winfield
        Name   Clarke, Mrs. Charles V (Ada Maria Winfield
        Name                                        Maria
        Name                                     Winfield
    472 Name       West, Mrs. Edwy Arthur (Ada Mary Worth
        Name                                         Mary
        Name                                        Worth
        Name       West, Mrs. Edwy Arthur (Ada Mary Worth
        Name                                         Mary
        Name                                        Worth
        Name       West, Mrs. Edwy Arthur (Ada Mary Worth
        Name                                         Mary
        Name                                        Worth
    862 Name  Swift, Mrs. Frederick Joel (Margaret Welles
        Name                                     Margaret
        Name                                       Welles
        Name  Swift, Mrs. Frederick Joel (Margaret Welles
        Name                                     Margaret
        Name                                       Welles
        Name  Swift, Mrs. Frederick Joel (Margaret Welles
        Name                                     Margaret
        Name                                       Welles

    #Now, we get a match for each line!

        Parameters:
            df: Union[pd.DataFrame, pd.Series]
            regular_expression: str
               Syntax from https://pypi.org/project/regex/
            flags:int
                You can use any flag that is available here: https://pypi.org/project/regex/
               (default  =regex.UNICODE)
            dtype_string:bool
                If True, it returns all results as a string
                If False, data types are converted to the best available
               (default  =True)
            line_by_line:bool
                If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
               (default  =False)
        Returns:
            Union[pd.Series, pd.DataFrame]
pandas.DataFrame.ds_regex_sub / pandas.Series.ds_regex_sub
    #Use regex.sub against a DataFrame/Series without having to fear any exception! You can get
    #the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - Whatever
    #fits best!
#
    #Some examples
    df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")


         PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
    0              1         0       3  ...   7.2500   NaN         S
    1              2         1       1  ...  71.2833   C85         C
    2              3         1       3  ...   7.9250   NaN         S
    3              4         1       1  ...  53.1000  C123         S
    4              5         0       3  ...   8.0500   NaN         S
    ..           ...       ...     ...  ...      ...   ...       ...
    886          887         0       2  ...  13.0000   NaN         S
    887          888         1       1  ...  30.0000   B42         S
    888          889         0       3  ...  23.4500   NaN         S
    889          890         1       1  ...  30.0000  C148         C
    890          891         0       3  ...   7.7500   NaN         Q
    [891 rows x 12 columns]


    subst=df.ds_regex_sub(regular_expression=r'^\b8\d(\d)\b', replace=r'\g<1>00000',dtype_string=False)

    Out[5]:
         PassengerId  Survived  Pclass  ...     Fare Cabin Embarked
    0              1         0       3  ...   7.2500  <NA>        S
    1              2         1       1  ...  71.2833   C85        C
    2              3         1       3  ...   7.9250  <NA>        S
    3              4         1       1  ...  53.1000  C123        S
    4              5         0       3  ...   8.0500  <NA>        S
    ..           ...       ...     ...  ...      ...   ...      ...
    886       700000         0       2  ...  13.0000  <NA>        S
    887       800000         1       1  ...  30.0000   B42        S
    888       900000         0       3  ...  23.4500  <NA>        S
    889            0         1       1  ...  30.0000  C148        C
    890       100000         0       3  ...   7.7500  <NA>        Q
    [891 rows x 12 columns]


    subst.dtypes
    Out[8]:
    PassengerId      uint32
    Survived          uint8
    Pclass            uint8
    Name             string
    Sex            category
    Age              object
    SibSp             uint8
    Parch             uint8
    Ticket           object
    Fare            float64
    Cabin          category
    Embarked       category

    #As you can see, the numbers that we have substituted have been converted to int

    #Let's do something like math.floor in a very unconventional way :)

    df.Fare
    Out[16]:
    0       7.2500
    1      71.2833
    2       7.9250
    3      53.1000
    4       8.0500
            ...
    886    13.0000
    887    30.0000
    888    23.4500
    889    30.0000
    890     7.7500
    Name: Fare, Length: 891, dtype: float64

    Fareint=df.Fare.ds_regex_sub(r'(\d+)\.\d+$', r'\g<1>',dtype_string=False)

    0       7
    1      71
    2       7
    3      53
    4       8
    ..    ...
    886    13
    887    30
    888    23
    889    30
    890     7

    Fareint.dtypes
    Out[18]:
    Fare    uint16
    #You should not use this method if there are other ways to convert float to int.
    #It serves best for data cleaning, at least that's what I am using it for.

        Parameters:
            df: Union[pd.DataFrame, pd.Series]
            regular_expression: str
               Syntax from https://pypi.org/project/regex/
            replace: str
               the replacement you want to use (groups are allowed)
            flags:int
                You can use any flag that is available here: https://pypi.org/project/regex/
               (default  =regex.UNICODE)
            dtype_string:bool
                If True, it returns all results as a string
                If False, data types are converted to the best available
               (default  =True)
            line_by_line:bool
                If you want to split the line before searching. Useful, if you want to use ^....$ more than once.
               (default  =False)
        Returns:
            Union[pd.Series, pd.DataFrame]