I have a sample dataframe, shown below. For each row, I want to check c1 first; if it is null, then check c2, and so on. In this way, I want to find the first non-null column and store its value in the result column.
ID  c1  c2  c3  c4  result
1   a   b           a
2       cc  dd      cc
3           ee  ff  ee
4               gg  gg
This is what I am using for now, but I would like to know if there is a better method. (The column names do not follow any pattern; this is just a sample.)
df["result"] = np.where(df["c1"].notnull(), df["c1"], None)
df["result"] = np.where(df["result"].notnull(), df["result"], df["c2"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c3"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c4"])
df["result"] = np.where(df["result"].notnull(), df["result"], "unknown")
When there are lots of columns, this method becomes long and repetitive.
I am using lookup and the data from Jpp.
Use back filling of NaNs first and then select the first column by iloc:
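The back-fill approach described above can be sketched as follows. The sample frame is a hypothetical reconstruction loosely modeled on the question's table (the extra all-empty row 5 is an assumption added to show the "unknown" fallback), and `cols` is an illustrative name for the columns listed in priority order.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; NaN marks an empty cell.
# Row 5 is an added all-empty row to exercise the fallback.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "c1": ["a", np.nan, np.nan, np.nan, np.nan],
    "c2": ["b", "cc", np.nan, np.nan, np.nan],
    "c3": [np.nan, "dd", "ee", np.nan, np.nan],
    "c4": [np.nan, np.nan, "ff", "gg", np.nan],
})

# Columns to check, in priority order (names need not follow a pattern).
cols = ["c1", "c2", "c3", "c4"]

# Back-fill along each row so the first non-null value ends up in the
# first column, take that column, then fill all-null rows with "unknown".
df["result"] = df[cols].bfill(axis=1).iloc[:, 0].fillna("unknown")

print(df)
```

Because the column order is given explicitly in `cols`, the same line should work for any number of columns without adding one statement per column.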
Or:
Performance:
One way is to use pd.DataFrame.lookup with pd.Series.first_valid_index applied on a transposed dataframe:
Setup
Solution
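A minimal sketch of this idea follows. It is not the original answer's code: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so plain NumPy indexing stands in for the lookup step here. The sample data, `cols`, `first_col`, and `col_pos` are illustrative assumptions, and the sketch assumes every row has at least one non-null value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; NaN marks an empty cell.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "c1": ["a", np.nan, np.nan, np.nan],
    "c2": ["b", "cc", np.nan, np.nan],
    "c3": [np.nan, "dd", "ee", np.nan],
    "c4": [np.nan, np.nan, "ff", "gg"],
})
cols = ["c1", "c2", "c3", "c4"]  # priority order

# Transpose so each original row becomes a column, then ask each column
# for the label of its first non-null entry.
first_col = df[cols].T.apply(pd.Series.first_valid_index)

# Stand-in for the removed DataFrame.lookup: translate the labels into
# integer positions and pick one value per row with NumPy indexing.
# Assumption: every row has at least one non-null value.
col_pos = df[cols].columns.get_indexer(first_col)
df["result"] = df[cols].to_numpy()[np.arange(len(df)), col_pos]

print(df)
```

On pandas versions older than 2.0, the last two steps could presumably be replaced by a single lookup call, as the answer describes.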
stack + groupby + first
stack implicitly drops NaNs, so groupby.first is guaranteed to give you the first non-null value if it exists. Assigning the result back will expose any NaNs at missing indices, which you can fillna with a subsequent call.
(Beware, this is slow for larger dataframes; for performance you may use @jezrael's solution.)
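The stack + groupby + first steps can be sketched as below. The sample data and the `cols` list are hypothetical, with an all-empty row 5 added to show the missing-index case. The explicit dropna() is a defensive assumption, in case stack does not drop NaNs in the installed pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; NaN marks an empty cell.
# Row 5 is an added all-empty row to show the missing-index case.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "c1": ["a", np.nan, np.nan, np.nan, np.nan],
    "c2": ["b", "cc", np.nan, np.nan, np.nan],
    "c3": [np.nan, "dd", "ee", np.nan, np.nan],
    "c4": [np.nan, np.nan, "ff", "gg", np.nan],
})
cols = ["c1", "c2", "c3", "c4"]  # priority order

# Reshape to one long Series indexed by (row, column), keeping only
# non-null values in their original column order.
stacked = df[cols].stack().dropna()

# The first remaining value per original row is the first non-null one.
first_values = stacked.groupby(level=0).first()

# Assignment aligns on the index, so rows with no values become NaN,
# which the subsequent fillna call replaces.
df["result"] = first_values
df["result"] = df["result"].fillna("unknown")

print(df)
```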