How do I find: Is the first non-NaN value in each column also the maximum of that column?

Posted 2020-07-18 09:41

For example:

      0     1
0  87.0   NaN
1   NaN  99.0
2   NaN   NaN
3   NaN   NaN
4   NaN  66.0
5   NaN   NaN
6   NaN  77.0
7   NaN   NaN
8   NaN   NaN
9  88.0   NaN

My expected output is: [False, True], since 87 is the first non-NaN value in column 0 but not that column's maximum; 99, however, is the first non-NaN value in column 1 and is indeed its max.
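For reference, the example frame can be reconstructed like this (a minimal sketch; the column labels are the integers 0 and 1, as shown above):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    0: [87.0, nan, nan, nan, nan, nan, nan, nan, nan, 88.0],
    1: [nan, 99.0, nan, nan, 66.0, nan, 77.0, nan, nan, nan],
})

# Column 0 starts with 87 but its max is 88 -> False
# Column 1 starts with 99, which is also its max -> True
```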

5 Answers
何必那么认真
Answer #2 · 2020-07-18 10:18

Option a): Just do groupby with first

(may not be 100% reliable)

df.groupby([1]*len(df)).first()==df.max()
Out[89]: 
       0     1
1  False  True

Option b): bfill

Or use bfill (back-fill each NaN with the next valid value in its column); after bfill, the first row holds each column's first non-NaN value:

df.bfill().iloc[0]==df.max()
Out[94]: 
0    False
1     True
dtype: bool

Option c): stack

df.stack().reset_index(level=1).drop_duplicates('level_1').set_index('level_1')[0]==df.max()
Out[102]: 
level_1
0    False
1     True
dtype: bool

Option d): idxmax with first_valid_index

df.idxmax()==df.apply(pd.Series.first_valid_index)
Out[105]: 
0    False
1     True
dtype: bool

Option e) (from Pir): idxmax with notna

df.notna().idxmax() == df.idxmax()     
Out[107]: 
0    False
1     True
dtype: bool
欢心
Answer #3 · 2020-07-18 10:30

Using pure NumPy (this should be very fast):

>>> np.isnan(df.values).argmin(axis=0) == df.fillna(-np.inf).values.argmax(axis=0)
array([False,  True])

The idea is to check whether the index of the first non-NaN value is also the index of the column-wise argmax.
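One caveat worth noting (my own observation, not part of the answer): a column that is entirely NaN makes both `argmin` and `argmax` fall back to index 0, so the comparison spuriously reports True for it:

```python
import numpy as np

# Column 0 is all NaN; column 1's max (5.0) is not its first value.
a = np.array([[np.nan, 1.0],
              [np.nan, 5.0]])

# argmin over booleans returns the first False, i.e. the first non-NaN row
first_valid = np.isnan(a).argmin(axis=0)
col_argmax = np.where(np.isnan(a), -np.inf, a).argmax(axis=0)
print(first_valid == col_argmax)  # [ True False] -- the all-NaN column compares True
```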

Timings

df = pd.concat([df]*1000).reset_index(drop=True) # setup

%timeit np.isnan(df.values).argmin(axis=0) == df.fillna(-np.inf).values.argmax(axis=0)
207 µs ± 8.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.groupby([1]*len(df)).first()==df.max()
9.78 ms ± 339 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.bfill().iloc[0]==df.max()
824 µs ± 47.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.stack().reset_index(level=1).drop_duplicates('level_1').set_index('level_1')[0]==df.max()
3.55 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.idxmax()==df.apply(pd.Series.first_valid_index)
1.5 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0)
1.13 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.values[(~np.isnan(df.values)).argmax(axis=0), np.arange(df.shape[1])] == df.max(axis=0).values
450 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
beautiful°
Answer #4 · 2020-07-18 10:30

We can use numpy's nanmax here for an efficient solution:

a = df.values
np.nanmax(a, 0) == a[np.isnan(a).argmin(0), np.arange(a.shape[1])]

array([False,  True])

Timings (covering all of the options presented here):


Functions

def chris(df):
    a = df.values
    return np.nanmax(a, 0) == a[np.isnan(a).argmin(0), np.arange(a.shape[1])]

def bradsolomon(df):
    return df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0).values

def wen1(df):
    return df.groupby([1]*len(df)).first()==df.max()

def wen2(df):
    return df.bfill().iloc[0]==df.max()

def wen3(df):
    return df.idxmax()==df.apply(pd.Series.first_valid_index)

def rafaelc(df):
    return np.isnan(df.values).argmin(axis=0) == df.fillna(-np.inf).values.argmax(axis=0)

def pir(df):
    return df.notna().idxmax() == df.idxmax()

Setup

res = pd.DataFrame(
       index=['chris', 'bradsolomon', 'wen1', 'wen2', 'wen3', 'rafaelc', 'pir'],
       columns=[10, 20, 30, 100, 500, 1000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        a = np.random.rand(c, c)
        a[a > 0.4] = np.nan
        df = pd.DataFrame(a)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");

plt.show()

Results

[Figure: relative timings vs. N on log-log axes]

淡お忘
Answer #5 · 2020-07-18 10:36

You can do something similar to Wen's answer with the underlying NumPy arrays:

>>> df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0).values
array([False,  True])

df.max(axis=0) gives the column-wise max.

The left-hand side indexes df.values, a 2-D array, with a pair of integer index arrays, producing a 1-D array that is compared element-wise to the per-column maxes.
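The indexing used here is NumPy's integer ("fancy") indexing: two index arrays are paired element-wise, so each column gets one chosen row. A minimal sketch:

```python
import numpy as np

a = np.array([[10, 20],
              [30, 40],
              [50, 60]])

rows = np.array([2, 1])        # one row index per column
cols = np.arange(a.shape[1])   # [0, 1]
print(a[rows, cols])           # picks a[2, 0] and a[1, 1] -> [50 40]
```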

If you exclude .values from the right-hand side, the result will just be a Pandas Series:

>>> df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0)
0    False
1     True
dtype: bool
一纸荒年 Trace。
Answer #6 · 2020-07-18 10:38

After posting the question I came up with this:

def nice_method_name_here(sr):
    # compare the first non-NaN value to the column max
    # (the original sr[sr > 0][0] only works when all values are positive
    # and the index happens to be positional; dropna/iloc avoids both issues)
    return sr.dropna().iloc[0] == sr.max()

print(df.apply(nice_method_name_here))

which seems to work, though I'm not sure yet!
