For each row containing NaN values, I am trying to retrieve the indices of the columns where the NaN occurs.
import pandas as pd
from numpy import nan as NaN
d = [[11.4, 1.3, 2.0, NaN], [11.4, 1.3, NaN, NaN], [11.4, 1.3, 2.8, 0.7], [NaN, NaN, 2.8, 0.7]]
df = pd.DataFrame(data=d, columns=['A', 'B', 'C', 'D'])
print(df)
      A    B    C    D
0  11.4  1.3  2.0  NaN
1  11.4  1.3  NaN  NaN
2  11.4  1.3  2.8  0.7
3   NaN  NaN  2.8  0.7
I've already done the following :
- add a column with the count of NaN for each row
- get the indices of each row containing NaN values
What I want is a list like this (ideally with the names of the columns):
[ ['D'],['C','D'],['A','B'] ]
I hope there is a way to do this without testing each column of each row, like:
if pd.isnull(df.loc[i, column]):
I'm looking for a pandas way to be able to deal with my huge dataset.
Thanks in advance.
It should be efficient to use a scipy coordinate-format sparse matrix to retrieve the coordinates of the null values:
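A minimal sketch of this idea, assuming the null mask from `df.isnull()` is passed to `scipy.sparse.coo_matrix` (the variable names `coo`, `rows`, `cols` and `pairs` are only illustrative):

```python
import numpy as np
import pandas as pd
from scipy import sparse

nan = np.nan
df = pd.DataFrame([[11.4, 1.3, 2.0, nan], [11.4, 1.3, nan, nan],
                   [11.4, 1.3, 2.8, 0.7], [nan, nan, 2.8, 0.7]],
                  columns=['A', 'B', 'C', 'D'])

# Store the boolean null mask in coordinate (COO) format:
# only the True cells are kept, together with their (row, col) positions.
coo = sparse.coo_matrix(df.isnull().to_numpy())

# Positions of the stored nonzero (True) entries.
rows, cols = coo.nonzero()

# Map the integer column positions back to column names,
# giving pairs such as (0, 'D').
pairs = [(int(r), c) for r, c in zip(rows, df.columns[cols])]
print(pairs)
```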
Note that I'm calling the nonzero method in order to output just the coordinates of the nonzero entries in the underlying sparse matrix, since I don't care about the actual values, which are all True.
You can iterate through each row in the dataframe, create a mask of null values, and output their index (i.e. the columns in the dataframe).
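A rough sketch of the row-by-row approach, assuming `DataFrame.iterrows` is used for the iteration and rows without any NaN are skipped (the name `result` is just for illustration):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[11.4, 1.3, 2.0, nan], [11.4, 1.3, nan, nan],
                   [11.4, 1.3, 2.8, 0.7], [nan, nan, 2.8, 0.7]],
                  columns=['A', 'B', 'C', 'D'])

result = []
for _, row in df.iterrows():
    mask = row.isnull()          # True where this row holds NaN
    if mask.any():               # skip rows with no NaN at all
        # row.index holds the column names of the dataframe
        result.append(row.index[mask].tolist())

print(result)  # [['D'], ['C', 'D'], ['A', 'B']]
```

This is still a Python-level loop over rows, so it may be slow on a very large frame.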
Another way is to extract the entries which are NaN:
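One way this could look, assuming the null mask is reshaped with `unstack` so that every cell becomes one entry of a boolean Series (the names `df_null` and `t` are illustrative):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[11.4, 1.3, 2.0, nan], [11.4, 1.3, nan, nan],
                   [11.4, 1.3, 2.8, 0.7], [nan, nan, 2.8, 0.7]],
                  columns=['A', 'B', 'C', 'D'])

# One boolean entry per cell, indexed by (column name, row label).
df_null = df.isnull().unstack()

# Keep only the entries that are True, i.e. the NaN cells.
t = df_null[df_null]
print(t)
```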
This gets you most of the way and may be enough.
Although it may be easier to work with the Series:
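As a sketch, assuming the two index levels of the filtered mask are turned into a plain Series keyed by row label with the column name as the value (`t` and `s` are illustrative names):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[11.4, 1.3, 2.0, nan], [11.4, 1.3, nan, nan],
                   [11.4, 1.3, 2.8, 0.7], [nan, nan, 2.8, 0.7]],
                  columns=['A', 'B', 'C', 'D'])

# NaN cells as a boolean Series indexed by (column name, row label).
df_null = df.isnull().unstack()
t = df_null[df_null]

# Values: column names (level 0); index: row labels (level 1).
s = pd.Series(t.index.get_level_values(0), index=t.index.get_level_values(1))
print(s)
```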
For example, if you wanted the lists (though I don't think you would need them):
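A possible sketch, assuming a Series keyed by row label with column names as values, grouped on its index (the names `s` and `lists` are illustrative):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[11.4, 1.3, 2.0, nan], [11.4, 1.3, nan, nan],
                   [11.4, 1.3, 2.8, 0.7], [nan, nan, 2.8, 0.7]],
                  columns=['A', 'B', 'C', 'D'])

# NaN cells as a Series: index = row label, value = column name.
df_null = df.isnull().unstack()
t = df_null[df_null]
s = pd.Series(t.index.get_level_values(0), index=t.index.get_level_values(1))

# Collect the column names of each row into one list per row label.
lists = s.groupby(level=0).apply(list)
print(lists.tolist())  # [['D'], ['C', 'D'], ['A', 'B']]
```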
Another, simpler way is:
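A minimal sketch, assuming the simpler way means a boolean mask that flags the rows containing at least one NaN (`rows_with_nan` is an illustrative name):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[11.4, 1.3, 2.0, nan], [11.4, 1.3, nan, nan],
                   [11.4, 1.3, 2.8, 0.7], [nan, nan, 2.8, 0.7]],
                  columns=['A', 'B', 'C', 'D'])

# True for every row that has at least one NaN in any column.
rows_with_nan = df.isnull().any(axis=1)
print(rows_with_nan)
```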
To subset:
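Assuming the subset meant here is the rows that contain NaN, boolean indexing with a row mask would do it (`subset` is an illustrative name):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[11.4, 1.3, 2.0, nan], [11.4, 1.3, nan, nan],
                   [11.4, 1.3, 2.8, 0.7], [nan, nan, 2.8, 0.7]],
                  columns=['A', 'B', 'C', 'D'])

# Keep only the rows where at least one column is NaN.
subset = df[df.isnull().any(axis=1)]
print(subset)
```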
To get the integer index:
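One way to get integer positions, assuming `numpy.where` is applied to the null mask (`row_idx` and `col_idx` are illustrative names):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[11.4, 1.3, 2.0, nan], [11.4, 1.3, nan, nan],
                   [11.4, 1.3, 2.8, 0.7], [nan, nan, 2.8, 0.7]],
                  columns=['A', 'B', 'C', 'D'])

# Integer row and column positions of every NaN cell, in row-major order.
row_idx, col_idx = np.where(df.isnull().to_numpy())
print(row_idx)  # [0 1 1 3 3]
print(col_idx)  # [3 2 3 0 1]
```

The column positions can be mapped back to names with `df.columns[col_idx]`.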