I am pulling a subset of data from a column based on conditions in another column being met.
I can get the correct values back, but they come back as a pandas.core.frame.DataFrame. How do I convert that to a list?
import pandas as pd
tst = pd.read_csv('C:\\SomeCSV.csv')
lookupValue = tst['SomeCol'] == "SomeValue"
ID = tst[lookupValue][['SomeCol']]
# How do I convert ID to a list?
Use .values to get a numpy.array, and then .tolist() to get a list.
For example:
import pandas as pd
df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9],
                   'b':[3,5,6,2,4,6,7,8,7,8,9]})
Result:
>>> df['a'].values.tolist()
[1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9]
or you can just use
>>> df['a'].tolist()
[1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9]
To drop duplicates you can do one of the following:
>>> df['a'].drop_duplicates().values.tolist()
[1, 3, 5, 7, 4, 6, 8, 9]
>>> list(set(df['a'])) # as pointed out by EdChum
[1, 3, 4, 5, 6, 7, 8, 9]
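Note that the two differ in ordering: drop_duplicates() preserves first-occurrence order, while set() makes no ordering guarantee. A small sketch:

import pandas as pd

s = pd.Series([3, 1, 3, 2, 1])

# drop_duplicates() keeps the first occurrence of each value, in order
print(s.drop_duplicates().tolist())   # [3, 1, 2]

# set() also removes duplicates, but the order is not guaranteed
print(list(set(s)))                   # e.g. [1, 2, 3]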
I'd like to clarify a few things:

1. As other answers have pointed out, the simplest thing to do is use pandas.Series.tolist(). I'm not sure why the top-voted answer leads off with pandas.Series.values.tolist(), since as far as I can tell it adds syntax/confusion with no added benefit.

2. tst[lookupValue][['SomeCol']] is a dataframe (as stated in the question), not a series (as stated in a comment to the question). This is because tst[lookupValue] is a dataframe, and slicing it with [['SomeCol']] asks for a list of columns (a list that happens to have a length of 1), so a dataframe is returned. If you remove the extra set of brackets, as in tst[lookupValue]['SomeCol'], then you are asking for just that one column rather than a list of columns, and you get a series back.

3. You need a series to use pandas.Series.tolist(), so you should definitely skip the second set of brackets in this case. FYI, if you ever end up with a one-column dataframe that isn't as easily avoidable as this one, you can use pandas.DataFrame.squeeze() to convert it to a series.

4. tst[lookupValue]['SomeCol'] gets a subset of a particular column via chained slicing: it slices once to get a dataframe with only certain rows left, and then slices again to get a certain column. You can get away with it here because you are only reading, not writing, but the proper way is tst.loc[lookupValue, 'SomeCol'], which returns a series (see the short sketch after this list).

5. Using the syntax from #4, you can reasonably do everything in one line:

ID = tst.loc[tst['SomeCol'] == 'SomeValue', 'SomeCol'].tolist()
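To make point 4 concrete, here is a minimal sketch (using a made-up df, not the asker's data) of why .loc matters once you start writing rather than just reading:

import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 1], 'colB': [4, 5, 6]})

# Chained slicing is tolerable for reading:
vals = df[df['colA'] == 1]['colB']

# But writing through it may hit a copy instead of df itself, and pandas
# warns about this (SettingWithCopyWarning):
#     df[df['colA'] == 1]['colB'] = 0      # may leave df unchanged

# A single .loc call reads and writes unambiguously:
df.loc[df['colA'] == 1, 'colB'] = 0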
Demo Code:
import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 1],
                   'colB': [4, 5, 6]})
filter_value = 1

print("df")
print(df)
print(type(df))

rows_to_keep = df['colA'] == filter_value
print("\ndf['colA'] == filter_value")
print(rows_to_keep)
print(type(rows_to_keep))

result = df[rows_to_keep]['colB']
print("\ndf[rows_to_keep]['colB']")
print(result)
print(type(result))

result = df[rows_to_keep][['colB']]
print("\ndf[rows_to_keep][['colB']]")
print(result)
print(type(result))

result = df[rows_to_keep][['colB']].squeeze()
print("\ndf[rows_to_keep][['colB']].squeeze()")
print(result)
print(type(result))

result = df.loc[rows_to_keep, 'colB']
print("\ndf.loc[rows_to_keep, 'colB']")
print(result)
print(type(result))

result = df.loc[df['colA'] == filter_value, 'colB']
print("\ndf.loc[df['colA'] == filter_value, 'colB']")
print(result)
print(type(result))

ID = df.loc[rows_to_keep, 'colB'].tolist()
print("\ndf.loc[rows_to_keep, 'colB'].tolist()")
print(ID)
print(type(ID))

ID = df.loc[df['colA'] == filter_value, 'colB'].tolist()
print("\ndf.loc[df['colA'] == filter_value, 'colB'].tolist()")
print(ID)
print(type(ID))
Result:
df
   colA  colB
0     1     4
1     2     5
2     1     6
<class 'pandas.core.frame.DataFrame'>

df['colA'] == filter_value
0     True
1    False
2     True
Name: colA, dtype: bool
<class 'pandas.core.series.Series'>

df[rows_to_keep]['colB']
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df[rows_to_keep][['colB']]
   colB
0     4
2     6
<class 'pandas.core.frame.DataFrame'>

df[rows_to_keep][['colB']].squeeze()
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df.loc[rows_to_keep, 'colB']
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df.loc[df['colA'] == filter_value, 'colB']
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df.loc[rows_to_keep, 'colB'].tolist()
[4, 6]
<class 'list'>

df.loc[df['colA'] == filter_value, 'colB'].tolist()
[4, 6]
<class 'list'>
You can use pandas.Series.tolist, e.g.:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
Run:
>>> df['a'].tolist()
You will get:
[1, 2, 3]
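If you need the whole DataFrame as a list of row lists rather than a single column, the usual counterpart is DataFrame.values.tolist(); a quick sketch with the same toy frame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# .values gives a 2-D numpy array; .tolist() turns it into a list of row lists
print(df.values.tolist())   # [[1, 4], [2, 5], [3, 6]]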
The above solution is good if all the data is of the same dtype. Numpy arrays are homogeneous containers, so when you do df.values the output is a numpy array: if the data has both int and float columns, everything is upcast to a common dtype (here float) and the columns lose their original dtypes.
Consider df:

     a  b
0  1.0  4
1  2.0  5
2  3.0  6

a    float64
b      int64
dtype: object
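To see the upcasting concretely, here is a quick check (a sketch that re-creates the same column layout as above):

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4, 5, 6]})

# The int column 'b' is upcast so the whole array can share one dtype
print(df.values.dtype)      # float64
print(df.values.tolist())   # [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]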
So if you want to keep the original dtypes, you can do something like:

row_list = df.to_csv(None, header=False, index=False).split('\n')

This returns each row as a string:

['1.0,4', '2.0,5', '3.0,6', '']

Then split each row to get a list of lists. Each element after splitting is a string, so we need to convert it to the required datatype:
def f(row_str):
    row_list = row_str.split(',')
    return [float(row_list[0]), int(row_list[1])]

df_list_of_list = list(map(f, row_list[:-1]))

Result:

[[1.0, 4], [2.0, 5], [3.0, 6]]
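Alternatively, you can keep the per-column types without going through CSV text by converting each column with tolist() and zipping the columns back into rows; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4, 5, 6]})

# Series.tolist() converts column by column, so 'a' stays float and 'b' stays int
cols = [df[c].tolist() for c in df.columns]
df_list_of_list = [list(row) for row in zip(*cols)]
print(df_list_of_list)   # [[1.0, 4], [2.0, 5], [3.0, 6]]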