Pandas - Remove rows based on combinations of NaN

Published 2020-08-04 04:54

Question:

I have a data frame that looks something like this:

NUM   A      B        C      D        E        F
p1    NaN    -1.183   NaN    NaN      NaN      1.829711
p5    NaN    NaN      NaN    NaN      1.267   -1.552721
p9    1.138  NaN      NaN    -1.179   NaN      1.227306

There is always a non-NaN value in: column F and at least one other column A-E.

I want to create a sub-table containing only those rows which contain certain combinations of non-NaN values in columns. There are a number of these desired combinations including doublets and triplets. Here are examples of three such combinations I want to pull:

  1. Rows which contain non-NaN values in columns A & B
  2. Rows which contain non-NaN values in C & D
  3. Rows which contain non-NaN values in A & B & C

I already know about the np.isfinite and pd.notnull commands from this question but I do not know how to apply them to combinations of columns.

Also, once I have a list of commands for removing rows that do not match one of my desired combinations, I do not know how to tell Pandas to remove rows ONLY if they do not match any of the desired combinations.

Answer 1:

Many times, we need to do logical operations on Boolean arrays (either NumPy arrays or pandas Series) as part of selecting a subset of a dataframe. Using the 'and', 'or' and 'not' operators for this will not work:

In [79]: df[pd.notnull(df['A']) and pd.notnull(df['F'])]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In Python, the 'and', 'or' and 'not' operators treat non-Boolean operands as truthy unless they are "empty" objects like [], int(0), float(0), None etc. So it would be confusing to use these same operators for element-wise Boolean operations in pandas; some people would expect them to simply evaluate to True.

Instead, we should use &, | and ~ for this.

In [69]: df[pd.notnull(df['A']) & pd.notnull(df['F'])]
Out[69]:
  NUM      A   B   C      D   E         F
2  p9  1.138 NaN NaN -1.179 NaN  1.227306

A shorter, but less flexible, alternative is to use any(), all() or empty.

In [78]: df[pd.notnull(df[['A', 'F']]).all(axis=1)]
Out[78]:
  NUM      A   B   C      D   E         F
2  p9  1.138 NaN NaN -1.179 NaN  1.227306
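Extending this to the question's full requirement (keep a row only if it satisfies at least one of the desired combinations), the per-combination masks can be combined with `|`. A minimal sketch on the sample data, using the three combinations from the question; note that with these three sample rows none of the combinations is actually satisfied, so the result here is empty:

```python
import numpy as np
import pandas as pd

# The sample data from the question
df = pd.DataFrame({
    'NUM': ['p1', 'p5', 'p9'],
    'A': [np.nan, np.nan, 1.138],
    'B': [-1.183, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan],
    'D': [np.nan, np.nan, -1.179],
    'E': [np.nan, 1.267, np.nan],
    'F': [1.829711, -1.552721, 1.227306],
})

# One Boolean mask per desired combination of non-NaN columns
mask_ab = df[['A', 'B']].notnull().all(axis=1)        # non-NaN in A & B
mask_cd = df[['C', 'D']].notnull().all(axis=1)        # non-NaN in C & D
mask_abc = df[['A', 'B', 'C']].notnull().all(axis=1)  # non-NaN in A & B & C

# Keep rows matching ANY of the combinations; drop the rest
keep = mask_ab | mask_cd | mask_abc
sub = df[keep]
```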

You can read more on this here.



Answer 2:

You can use apply with a lambda function that picks the non-NaN value. You can check whether a value is NaN using numpy.isnan(...).

data="""NUM   A      B        C      D        E        F
p1    NaN    -1.183   NaN    NaN      NaN      1.829711
p5    NaN    NaN      NaN    NaN      1.267   -1.552721
p9    1.138  NaN      NaN    -1.179   NaN      1.227306"""

import numpy as np
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(data), delim_whitespace=True)
print(df)



# Rows which contain non-NaN values in columns A & B
df["A_B"]= df.apply(lambda x: x['A'] if np.isnan(x['B']) else x['B'] if np.isnan(x['A']) else 0, axis=1)

# Rows which contain non-NaN values in C & D
df["C_D"]= df.apply(lambda x: x['C'] if np.isnan(x['D']) else x['D'] if np.isnan(x['C']) else 0, axis=1)

# Rows which contain non-NaN values in A & B & C
df["A_B_C"]= df.apply(lambda x: x['C'] if np.isnan(x['A_B']) else x['A_B'] if np.isnan(x['C']) else 0, axis=1)
print(df)

# Rows which contain non-NaN values in A & B & C & D
df["A_B_C_D"]= df.apply(lambda x: x['A_B'] if np.isnan(x['C_D']) else x['C_D'] if np.isnan(x['A_B']) else 0, axis=1)
print(df)

Output:

  NUM      A      B   C      D      E         F    A_B    C_D  A_B_C
0  p1    NaN -1.183 NaN    NaN    NaN  1.829711 -1.183    NaN -1.183
1  p5    NaN    NaN NaN    NaN  1.267 -1.552721    NaN    NaN    NaN
2  p9  1.138    NaN NaN -1.179    NaN  1.227306  1.138 -1.179  1.138

If you don't need this conditional logic, you can use the approach explained in the other answer.
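If the goal is then to drop rows based on one of these helper columns, a possible follow-up is shown below. Note that, as built above, A_B is NaN only when both A and B are NaN, so this keeps rows where at least one of the pair is present (a strict "both non-NaN" check should still test the original columns):

```python
import numpy as np
import pandas as pd

# Just the A/B columns from the sample data
df = pd.DataFrame({
    'A': [np.nan, np.nan, 1.138],
    'B': [-1.183, np.nan, np.nan],
})

# Same helper-column idea as above: pick whichever of A/B is present, 0 if both are
df["A_B"] = df.apply(
    lambda x: x['A'] if np.isnan(x['B'])
    else x['B'] if np.isnan(x['A'])
    else 0,
    axis=1)

# Keep rows where the A/B pair contributed at least one value
sub = df[pd.notnull(df["A_B"])]
```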



Answer 3:

Let's say your dataframe is called df. You can use boolean masks like this.

# Specify column combinations that you want to pull 
combo1 = ['A', 'B'] 

# Select rows in the data frame that have non-NaN values in the combination
# of columns specified above

# .all(axis=1) collapses the Boolean frame to one value per row,
# requiring every column in the combination to be non-NaN
notmissing = df.loc[:, combo1].notnull().all(axis=1)
df = df.loc[notmissing, :]
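To cover several combinations at once (keeping only rows that match at least one of them, as the question asks), one possible generalization is to OR the per-combination masks in a loop. The column names and data below are illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative data: three rows, some values missing
df = pd.DataFrame({
    'A': [1.0, np.nan, 2.0],
    'B': [3.0, 4.0, np.nan],
    'C': [np.nan, 5.0, 6.0],
    'D': [np.nan, np.nan, 8.0],
})

# Column combinations that must all be non-NaN for a row to be kept
combos = [['A', 'B'], ['C', 'D']]

# Start from an all-False mask and OR in each combination's mask
keep = pd.Series(False, index=df.index)
for combo in combos:
    keep |= df[combo].notnull().all(axis=1)

# Row 0 matches A & B, row 2 matches C & D, row 1 matches neither
df = df.loc[keep, :]
```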