How do I find: Is the first non-NaN value in each column also the maximum of that column?

Posted 2020-07-18 09:41

For example:

      0     1
0  87.0   NaN
1   NaN  99.0
2   NaN   NaN
3   NaN   NaN
4   NaN  66.0
5   NaN   NaN
6   NaN  77.0
7   NaN   NaN
8   NaN   NaN
9  88.0   NaN

My expected output is: [False, True], since 87 is the first non-NaN value in column 0 but not that column's maximum; 99, however, is the first non-NaN value in column 1 and is indeed its max.
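For reference, the example frame can be reconstructed like this (a minimal sketch; the column labels are the integers 0 and 1, as shown above):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    0: [87.0, nan, nan, nan, nan, nan, nan, nan, nan, 88.0],
    1: [nan, 99.0, nan, nan, 66.0, nan, 77.0, nan, nan, nan],
})

# Column 0 starts with 87 but its max is 88 -> False
# Column 1 starts with 99, which is also its max -> True
```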

5 Answers
何必那么认真
Answer #2 · 2020-07-18 10:18

Option a): Just do groupby with first

(may not be 100% reliable)

df.groupby([1]*len(df)).first()==df.max()
Out[89]: 
       0     1
1  False  True

Option b): bfill

Or use bfill (back-fill each NaN with the next valid value in its column); after bfill, the first row holds each column's first non-NaN value:

df.bfill().iloc[0]==df.max()
Out[94]: 
0    False
1     True
dtype: bool

Option c): stack

df.stack().reset_index(level=1).drop_duplicates('level_1').set_index('level_1')[0]==df.max()
Out[102]: 
level_1
0    False
1     True
dtype: bool

Option d): idxmax with first_valid_index

df.idxmax()==df.apply(pd.Series.first_valid_index)
Out[105]: 
0    False
1     True
dtype: bool

Option e) (from Pir): idxmax with notna

df.notna().idxmax() == df.idxmax()     
Out[107]: 
0    False
1     True
dtype: bool
欢心
Answer #3 · 2020-07-18 10:30

Using pure NumPy (this should be very fast):

>>> np.isnan(df.values).argmin(axis=0) == df.fillna(-np.inf).values.argmax(axis=0)
array([False,  True])

The idea is to check whether the index of the first non-NaN value is also the index of the column-wise argmax.
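One caveat worth noting (my own observation, not part of the answer): a column that is entirely NaN makes both `argmin` and `argmax` fall back to index 0, so the comparison spuriously reports True for it:

```python
import numpy as np

# Column 0 is all NaN; column 1's max (5.0) is not its first value.
a = np.array([[np.nan, 1.0],
              [np.nan, 5.0]])

# argmin over booleans returns the first False, i.e. the first non-NaN row
first_valid = np.isnan(a).argmin(axis=0)
col_argmax = np.where(np.isnan(a), -np.inf, a).argmax(axis=0)
print(first_valid == col_argmax)  # [ True False] -- the all-NaN column compares True
```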

Timings

df = pd.concat([df]*1000).reset_index(drop=True) # setup

%timeit np.isnan(df.values).argmin(axis=0) == df.fillna(-np.inf).values.argmax(axis=0)
207 µs ± 8.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.groupby([1]*len(df)).first()==df.max()
9.78 ms ± 339 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.bfill().iloc[0]==df.max()
824 µs ± 47.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.stack().reset_index(level=1).drop_duplicates('level_1').set_index('level_1')[0]==df.max()
3.55 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.idxmax()==df.apply(pd.Series.first_valid_index)
1.5 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0)
1.13 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.values[(~np.isnan(df.values)).argmax(axis=0), np.arange(df.shape[1])] == df.max(axis=0).values
450 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
beautiful°
Answer #4 · 2020-07-18 10:30

We can use numpy's nanmax here for an efficient solution:

a = df.values
np.nanmax(a, 0) == a[np.isnan(a).argmin(0), np.arange(a.shape[1])]

array([False,  True])

Timings (covering all of the options presented here):


Functions

def chris(df):
    a = df.values
    return np.nanmax(a, 0) == a[np.isnan(a).argmin(0), np.arange(a.shape[1])]

def bradsolomon(df):
    return df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0).values

def wen1(df):
    return df.groupby([1]*len(df)).first()==df.max()

def wen2(df):
    return df.bfill().iloc[0]==df.max()

def wen3(df):
    return df.idxmax()==df.apply(pd.Series.first_valid_index)

def rafaelc(df):
    return np.isnan(df.values).argmin(axis=0) == df.fillna(-np.inf).values.argmax(axis=0)

def pir(df):
    return df.notna().idxmax() == df.idxmax()

Setup

res = pd.DataFrame(
       index=['chris', 'bradsolomon', 'wen1', 'wen2', 'wen3', 'rafaelc', 'pir'],
       columns=[10, 20, 30, 100, 500, 1000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        a = np.random.rand(c, c)
        a[a > 0.4] = np.nan
        df = pd.DataFrame(a)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");

plt.show()

Results

[Figure: relative timings vs. N on log-log axes]

淡お忘
Answer #5 · 2020-07-18 10:36

You can do something similar to Wen's answer with the underlying NumPy arrays:

>>> df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0).values
array([False,  True])

df.max(axis=0) gives the column-wise max.

The left-hand side indexes df.values, a 2-D array, with a pair of integer index arrays, producing a 1-D array that is compared element-wise to the per-column maxes.
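The indexing used here is NumPy's integer ("fancy") indexing: two index arrays are paired element-wise, so each column gets one chosen row. A minimal sketch:

```python
import numpy as np

a = np.array([[10, 20],
              [30, 40],
              [50, 60]])

rows = np.array([2, 1])        # one row index per column
cols = np.arange(a.shape[1])   # [0, 1]
print(a[rows, cols])           # picks a[2, 0] and a[1, 1] -> [50 40]
```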

If you exclude .values from the right-hand side, the result will just be a Pandas Series:

>>> df.values[df.notnull().idxmax(), np.arange(df.shape[1])] == df.max(axis=0)
0    False
1     True
dtype: bool
一纸荒年 Trace。
Answer #6 · 2020-07-18 10:38

After posting the question I came up with this:

def nice_method_name_here(sr):
    # compare the first non-NaN value to the column max
    # (the original sr[sr > 0][0] only works when all values are positive
    # and the index happens to be positional; dropna/iloc avoids both issues)
    return sr.dropna().iloc[0] == sr.max()

print(df.apply(nice_method_name_here))

which seems to work, though I'm not sure yet!
