Get first non-null value per row

Posted 2020-02-14 03:11

I have a sample dataframe, shown below. For each row I want to check c1 first; if it is not null, use that value, otherwise check c2, and so on, finding the first non-null column and storing its value in the result column.

ID  c1  c2  c3  c4  result
1   a   b           a
2       cc  dd      cc
3           ee  ff  ee
4               gg  gg

This is what I am doing now, but I would like to know if there is a better method. (The column names do not follow any pattern; this is just a sample.)

df["result"] = np.where(df["c1"].notnull(), df["c1"], None)
df["result"] = np.where(df["result"].notnull(), df["result"], df["c2"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c3"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c4"])
df["result"] = np.where(df["result"].notnull(), df["result"], "unknown)

When there are many columns, this approach does not scale well.
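For reference, the same chained logic can be folded into a loop over the candidate columns, so the number of columns stops mattering (a minimal sketch, assuming the candidates are c1 through c4):

import numpy as np
import pandas as pd

candidates = ["c1", "c2", "c3", "c4"]  # substitute the real column names

# Start from "unknown" and overwrite right-to-left, so the leftmost
# non-null candidate wins in every row.
result = pd.Series("unknown", index=df.index)
for col in reversed(candidates):
    result = pd.Series(np.where(df[col].notnull(), df[col], result), index=df.index)
df["result"] = result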

4 Answers
看我几分像从前
Answer 2 · 2020-02-14 03:46

I am using lookup with the data from jpp: df.ne('') marks the non-empty cells, and idxmax(1) returns the first matching column label in each row.

df = df.set_index('ID')
s = df.ne('').idxmax(1)
df['Result'] = df.lookup(s.index, s)
df
Out[492]: 
   c1  c2  c3  c4 Result
ID                      
1   a   b              a
2      cc  dd         cc
3          ee  ff     ee
4              gg     gg
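One caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions the same row-wise pick can be done with plain numpy indexing (a sketch reusing s from above):

import numpy as np

# Positional index of each row's first non-empty column, then one
# fancy-indexed pick per row -- equivalent to the removed lookup().
idx = df.columns.get_indexer(s)
df['Result'] = df.to_numpy()[np.arange(len(df)), idx]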
聊天终结者
Answer 3 · 2020-02-14 03:51

Back-fill the NaNs along each row first, then select the first column with iloc:

df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')

Or:

df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')

print (df)
   ID   c1   c2  c3   c4 result
0   1    a    b   a  NaN      a
1   2  NaN   cc  dd   cc     cc
2   3  NaN   ee  ff   ee     ee
3   4  NaN  NaN  gg   gg     gg
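To see why this works: bfill(axis=1) pulls each row's first non-null value into the leftmost candidate column, so taking column 0 finishes the job (a quick illustration on the frame above):

# Intermediate step: column 'c1' now holds each row's first non-null value
# (still NaN for any all-NaN row, which fillna('unknown') then covers).
print(df[['c1','c2','c3','c4']].bfill(axis=1))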

Performance:

df = pd.concat([df] * 1000, ignore_index=True)


In [220]: %timeit df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.78 ms per loop

In [221]: %timeit df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.7 ms per loop

# jpp's solution
In [222]: %%timeit
     ...: cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)
     ...: 
     ...: df['result'] = [df.loc[i, cols[i]] for i in range(len(df.index))]
     ...: 
1 loop, best of 3: 180 ms per loop

# cᴏʟᴅsᴘᴇᴇᴅ's solution
In [223]: %timeit df['result'] = df.stack().groupby(level=0).first()
1 loop, best of 3: 606 ms per loop
chillily
Answer 4 · 2020-02-14 03:56

One way is to use pd.DataFrame.lookup with pd.Series.first_valid_index applied to the transposed dataframe:

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'c1': ['a', '', '', ''],
                   'c2': ['b', 'cc', '', ''],
                   'c3': ['' , 'dd', 'ee', ''],
                   'c4': ['', '', 'ff', 'gg']})

df = df.replace('', np.nan)

df['result'] = df.lookup(df.index, df.iloc[:, 1:].T.apply(pd.Series.first_valid_index))

print(df)

   ID   c1   c2   c3   c4 result
0   1    a    b  NaN  NaN      a
1   2  NaN   cc   dd  NaN     cc
2   3  NaN  NaN   ee   ff     ee
3   4  NaN  NaN  NaN   gg     gg
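The transpose-plus-apply step is what makes this slow on wide frames (see the timings in the answer above); the same per-row labels can also be computed vectorized (a sketch, assuming every row has at least one non-null candidate):

# First True per row of the notna mask == label of the first non-null column.
first_cols = df.iloc[:, 1:].notna().idxmax(axis=1)
df['result'] = df.lookup(df.index, first_cols)  # lookup removed in pandas 2.0; see the numpy variant above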
We Are One
Answer 5 · 2020-02-14 04:00

Setup

df = df.set_index('ID') # if necessary
df
     c1   c2  c3   c4
ID                   
1     a    b   a  NaN
2   NaN   cc  dd   cc
3   NaN   ee  ff   ee
4   NaN  NaN  gg   gg

Solution
stack + groupby + first
stack implicitly drops NaNs, so groupby.first is guaranteed to give you the first non-null value if it exists. Assigning the result back will expose NaN at any index that had no non-null value, which you can fill with a subsequent fillna call.

df['result'] = df.stack().groupby(level=0).first()
# df['result'] = df['result'].fillna('unknown') # if necessary 
df
     c1   c2  c3   c4 result
ID                          
1     a    b   a  NaN      a
2   NaN   cc  dd   cc     cc
3   NaN   ee  ff   ee     ee
4   NaN  NaN  gg   gg     gg

(Beware: this is slow for larger dataframes; for performance, use @jezrael's solution.)
