I have a data frame that looks something like this:
NUM A B C D E F
p1 NaN -1.183 NaN NaN NaN 1.829711
p5 NaN NaN NaN NaN 1.267 -1.552721
p9 1.138 NaN NaN -1.179 NaN 1.227306
There is always a non-NaN value in column F and in at least one other column A-E.
I want to create a sub-table containing only those rows that contain certain combinations of non-NaN values in certain columns. There are a number of these desired combinations, including doublets and triplets. Here are three examples of combinations I want to pull:
- Rows which contain non-NaN values in columns A & B
- Rows which contain non-NaN values in C & D
- Rows which contain non-NaN values in A & B & C
I already know about the np.isfinite and pd.notnull functions from this question, but I do not know how to apply them to combinations of columns.
Also, once I have a set of conditions for the desired combinations, I do not know how to tell Pandas to remove rows ONLY if they do not match any of the desired combinations.
Many times, we need to perform logical operations on Boolean arrays (either NumPy arrays or pandas Series) as part of selecting a subset of a dataframe. Using the 'and', 'or' and 'not' operators for this will not work.
In [79]: df[pd.notnull(df['A']) and pd.notnull(df['F'])]
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
In Python, when using the 'and', 'or' and 'not' operators, non-Boolean variables are usually considered to be True unless they represent "empty" objects like [], int(0), float(0), None, etc. So, it would be confusing to use these same operators for doing array-wise Boolean operations in Pandas; some people would expect them to simply evaluate to True.
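For example, plain Python truthiness treats empty or zero objects as falsy:

bool([]), bool(0), bool(0.0), bool(None)   # all False
bool([1]), bool(-1.183)                    # non-empty / non-zero -> True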
Instead, we should use &, | and ~ for this.
In [69]: df[pd.notnull(df['A']) & pd.notnull(df['F'])]
Out[69]:
NUM A B C D E F
2 p9 1.138 NaN NaN -1.179 NaN 1.227306
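The same pattern works with | (or) and ~ (not). A quick sketch on the same df:

df[pd.notnull(df['A']) | pd.notnull(df['B'])]   # non-NaN in A or B
df[~pd.isnull(df['F'])]                         # F is non-NaN (same as notnull)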
An alternative, shorter but less flexible, way to do this is to use any(), all() or empty.
In [78]: df[pd.notnull(df[['A', 'F']]).all(axis=1)]
Out[78]:
NUM A B C D E F
2 p9 1.138 NaN NaN -1.179 NaN 1.227306
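Combining this with | also answers the second part of the question: keep a row if it matches any of the desired combinations. A sketch, using the three combinations listed in the question:

mask = (df[['A', 'B']].notnull().all(axis=1)          # non-NaN in A & B
        | df[['C', 'D']].notnull().all(axis=1)        # non-NaN in C & D
        | df[['A', 'B', 'C']].notnull().all(axis=1))  # non-NaN in A & B & C
subtable = df[mask]   # rows matching none of the combinations are dropped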
You can read more on this here.
You can use apply and a lambda function in which you pick the non-NaN value. You can check whether a value is NaN using np.isnan(...).
data="""NUM A B C D E F
p1 NaN -1.183 NaN NaN NaN 1.829711
p5 NaN NaN NaN NaN 1.267 -1.552721
p9 1.138 NaN NaN -1.179 NaN 1.227306"""
import numpy as np
import pandas as pd
from io import StringIO

# Read the whitespace-separated sample data into a dataframe
df = pd.read_csv(StringIO(data), delim_whitespace=True)
print(df)
# Rows which contain non-NaN values in columns A & B
df["A_B"]= df.apply(lambda x: x['A'] if np.isnan(x['B']) else x['B'] if np.isnan(x['A']) else 0, axis=1)
# Rows which contain non-NaN values in C & D
df["C_D"]= df.apply(lambda x: x['C'] if np.isnan(x['D']) else x['D'] if np.isnan(x['C']) else 0, axis=1)
# Rows which contain non-NaN values in A & B & C
df["A_B_C"]= df.apply(lambda x: x['C'] if np.isnan(x['A_B']) else x['A_B'] if np.isnan(x['C']) else 0, axis=1)
print(df)
# Rows which contain non-NaN values in A & B & C & D
df["A_B_C_D"] = df.apply(lambda x: x['A_B'] if np.isnan(x['C_D']) else x['C_D'] if np.isnan(x['A_B']) else 0, axis=1)
print(df)
Output:
NUM A B C D E F A_B C_D A_B_C
0 p1 NaN -1.183 NaN NaN NaN 1.829711 -1.183 NaN -1.183
1 p5 NaN NaN NaN NaN 1.267 -1.552721 NaN NaN NaN
2 p9 1.138 NaN NaN -1.179 NaN 1.227306 1.138 -1.179 1.138
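With these helper columns in place, you can pull the sub-tables by filtering on them. Per the lambdas above, a helper column is 0 exactly when both source columns are non-NaN, and NaN only when both are NaN (a sketch, assuming 0 never occurs as a real data value):

both_ab = df[df['A_B'] == 0]        # rows with non-NaN values in both A and B
any_ab = df[df['A_B'].notnull()]    # rows with a non-NaN value in A or B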
If you don't need to go through the conditional cases, you can use the other approach, which is explained in the other answer.
Let's say your dataframe is called df. You can use boolean masks like this.
# Specify a column combination that you want to pull
combo1 = ['A', 'B']
# Select rows in the data frame that have non-NaN values in every
# column of the combination specified above
notmissing = df.loc[:, combo1].notnull().all(axis=1)
df = df.loc[notmissing, :]
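To keep rows that match any of several combinations, one option is to OR the per-combination masks in a loop (a sketch, assuming the combinations listed in the question):

combos = [['A', 'B'], ['C', 'D'], ['A', 'B', 'C']]
mask = pd.Series(False, index=df.index)   # start with nothing selected
for combo in combos:
    mask |= df.loc[:, combo].notnull().all(axis=1)
subtable = df.loc[mask, :]                # rows matching at least one combination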