How do you read in a dataframe with lists using pd

Posted 2020-07-06 08:03

Question:

Here's some data from another question:

                          positive                 negative          neutral
1   [marvel, moral, bold, destiny]                       []   [view, should]
2                      [beautiful]      [complicated, need]               []
3                      [celebrate]   [crippling, addiction]            [big]

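What I would do first is add quotes around every word, and then:

import ast
import pandas as pd

# A regex separator needs the python parsing engine.
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
# applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
df = df.applymap(ast.literal_eval)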

Is there a smarter way to do this?

Answer 1:

For basic structures you can use yaml without having to add quotes:

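import pandas as pd
import yaml

# yaml.load requires an explicit Loader in PyYAML 6+, so use safe_load.
df = pd.read_clipboard(sep=r'\s{2,}', engine='python').applymap(yaml.safe_load)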

type(df.iloc[0, 0])
Out: list
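
A small sketch of what "basic structures" means here, assuming PyYAML is installed and using yaml.safe_load (the sample strings and variable names are made up for illustration). Unquoted words come back as strings, but YAML applies its own scalar rules, so bare words such as yes, no, or null, and anything numeric, are converted rather than kept as text:

```python
import yaml

# Unquoted words inside a flow sequence are parsed as plain strings.
words = yaml.safe_load("[marvel, moral, bold, destiny]")
empty = yaml.safe_load("[]")

# PyYAML follows YAML 1.1, so some bare words are not kept as strings.
surprises = yaml.safe_load("[yes, no, null, 42]")

print(words)
print(empty)
print(surprises)
```

If the word lists may contain such tokens, one of the string-splitting answers is probably the safer choice.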


Answer 2:

I did it this way:

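df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching.
df = df.apply(lambda x: x.str.replace(r'[\[\]]', '', regex=True).str.split(r',\s*', expand=False))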

P.S. I'm sure there must be a better way to do this.



Answer 3:

Another alternative is

In [43]:  df.applymap(lambda x: x[1:-1].split(', '))
Out[43]: 
                         positive                negative         neutral
1  [marvel, moral, bold, destiny]                      []  [view, should]
2                     [beautiful]     [complicated, need]              []
3                     [celebrate]  [crippling, addiction]           [big]

Note that this assumes the first and last characters in each cell are [ and ], and that there is exactly one space after each comma. An empty cell [] becomes [''] (a list holding one empty string), which pandas still displays as [].
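
A more tolerant variant of the same slicing idea, as a sketch only: parse_list_cell is a hypothetical helper, not something from pandas. It does not care how many spaces follow a comma and returns a truly empty list for an empty cell:

```python
def parse_list_cell(cell):
    """Turn a string like '[a,  b,c]' into ['a', 'b', 'c'] (hypothetical helper)."""
    # Drop surrounding whitespace, then any leading/trailing brackets.
    inner = cell.strip().strip("[]")
    # Split on commas, trim each piece, and discard empty pieces so that
    # '[]' becomes [] rather than [''].
    return [piece.strip() for piece in inner.split(",") if piece.strip()]


print(parse_list_cell("[marvel, moral, bold, destiny]"))
print(parse_list_cell("[]"))
print(parse_list_cell("[crippling,addiction ]"))
```

It could be applied cell by cell with df.applymap(parse_list_cell) in place of the lambda.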



Answer 4:

Another version:

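import ast
import re

# Turn every run of commas/whitespace into "','", turn the brackets into quotes,
# then wrap the result in [ ] so literal_eval sees a list of quoted strings.
df.applymap(lambda x:
            ast.literal_eval("[" + re.sub(r"[\[\]]", "'",
                                          re.sub(r"[,\s]+", "','", x)) + "]"))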
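
The nested substitutions amount to "add the quotes programmatically, then let ast.literal_eval parse the result". Here is a sketch of the same idea with the steps separated; TOKEN and quote_and_eval are names invented for this example. Unlike the one-liner, it treats only commas as separators, so an item containing a space stays in one piece, and an empty cell parses to an empty list:

```python
import ast
import re

# A token starts at a character that is not a bracket, comma, or whitespace,
# and runs up to the next bracket or comma.
TOKEN = re.compile(r"[^\[\],\s][^\[\],]*")


def quote_and_eval(cell):
    """Quote every bare item in the cell, then parse it (hypothetical helper)."""
    # repr() adds the quotes and escapes any quote characters inside the item.
    quoted = TOKEN.sub(lambda m: repr(m.group().strip()), cell)
    return ast.literal_eval(quoted)


print(quote_and_eval("[marvel, moral, bold, destiny]"))
print(quote_and_eval("[]"))
print(quote_and_eval("[New York, big]"))
```

Used as df.applymap(quote_and_eval), this replaces the manual "add quotes" step from the question.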


Answer 5:

Per help from @MaxU

df = pd.read_clipboard(sep=r'\s{2,}', engine='python')

Then:

>>> df.apply(lambda col: col.str[1:-1].str.split(', '))
                         positive                negative         neutral
1  [marvel, moral, bold, destiny]                      []  [view, should]
2                     [beautiful]     [complicated, need]              []
3                     [celebrate]  [crippling, addiction]           [big]

>>> df.apply(lambda col: col.str[1:-1].str.split(', ')).loc[3, 'negative']
['crippling', 'addiction']

And per the notes from @unutbu, who came up with a similar solution: this assumes the first and last characters in each cell are [ and ], and that there is exactly one space after each comma.
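
To try this without a clipboard, the sample can be fed through pd.read_csv, which accepts the same sep and engine arguments (read_clipboard hands the text to read_csv). This is a sketch, assuming pandas is installed. RAW, split_column, and parsed are names made up here, and the extra filtering step is an addition of this sketch, turning the [''] produced by an empty cell into a real empty list:

```python
import io

import pandas as pd

RAW = """\
                          positive                 negative          neutral
1   [marvel, moral, bold, destiny]                       []   [view, should]
2                      [beautiful]      [complicated, need]               []
3                      [celebrate]   [crippling, addiction]            [big]
"""

# Two or more spaces separate columns; single spaces inside a list do not.
df = pd.read_csv(io.StringIO(RAW), sep=r"\s{2,}", engine="python")


def split_column(col):
    """Strip the brackets from each cell and split it on ', '."""
    items = col.str[1:-1].str.split(", ")
    # '[]' yields [''], so drop empty strings to get a real empty list.
    return items.map(lambda words: [w for w in words if w])


parsed = df.apply(split_column)
print(parsed.loc[3, "negative"])
print(parsed.loc[1, "negative"])
```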