Python parse dataframe element

2019-08-13 07:48发布

I have a pandas dataframe column (Data Type) which I want to split into three columns

target_table_df = LoadS_A [['Attribute Name',
                              'Data Type',
                              'Primary Key Indicator']]

Example input (target_table_df)

                 Attribute Name      Data Type Primary Key Indicator
0                       ACC_LIM  DECIMAL(18,4)                 False
1                        ACC_NO   NUMBER(11,0)                 False
2                   ACC_OPEN_DT           DATE                 False
3                          ACCB  DECIMAL(18,4)                 False
4                          ACDB  DECIMAL(18,4)                 False
5                     AGRMNT_ID   NUMBER(11,0)                  True
6                     BRNCH_NUM   NUMBER(11,0)                 False
7                      CLRD_BAL  DECIMAL(18,4)                 False
8              CR_INT_ACRD_GRSS  DECIMAL(18,4)                 False
9               CR_INT_ACRD_NET  DECIMAL(18,4)                 False

I aim to:

  • Reassign 'Data Type' to the text preceding the parenthesis

[..if parenthesis exists in 'Data Type']:

  • Create new column 'Precision' and assign to first comma separated value
  • Create new column 'Scale' and assign to second comma separated value

Intended output would therefore become:

    Data Type   Precision   Scale
0   decimal 18  4
1   number  11  0
2   date        
3   decimal 18  4
4   decimal 18  4
5   number  4   0

I have tried in anger to achieve this but i'm new to dataframes....can't work out if I am to iterate over all rows or if there is a way to apply to all values in the dataframe?

Any help much appreciated

1条回答
迷人小祖宗
2楼-- · 2019-08-13 07:50

Use target_table_df['Data Type'].str.extract(pattern)

You'll need to assign pattern to be a regular expression that captures each of the components you're looking for.

pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'

([^\(]+) says grab as many non-open parenthesis characters you can up to the first open parenthesis.

\(([^,]*, says to grab the first set of non-comma characters after an open parenthesis and stop at the comma.

,(.*)\) says to grab the rest of the characters between the comma and the close parenthesis.

(\(([^,]*),(.*)\))? says the whole parenthesis thing may not even happen, grab it if you can.

Solution

everything together looks like this:

pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
df = s.str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]

# Formatting to get it how you wanted
df.columns = ['Data Type', 'Precision', 'Scale']
df.index.name = None
print df

I put a .iloc[:, [0, 2, 3]] at the end because the pattern I used grabs the whole parenthesis in column 1 and I wanted to skip it. Leave it off and see.

  Data Type Precision Scale
0   decimal        18     4
1    number        11     0
2      date       NaN   NaN
3   decimal        18     4
4   decimal        18     4
5    number        11     0
查看更多
登录 后发表回答