Selecting from pandas dataframe (or numpy ndarray?)

Posted 2020-07-24 04:51

Question:

I find myself coding this sort of pattern a lot:

tmp = <some operation>
result = tmp[<boolean expression>]
del tmp

...where <boolean expression> is to be understood as a boolean expression involving tmp. (For the time being, tmp is always a pandas dataframe, but I suppose that the same pattern would show up if I were working with numpy ndarrays--not sure.)

For example:

tmp = df.xs('A')['II'] - df.xs('B')['II']
result = tmp[tmp < 0]
del tmp

As one can guess from the del tmp at the end, the only reason for creating tmp at all is so that I can use a boolean expression involving it inside an indexing expression applied to it.

I would love to eliminate the need for this (otherwise useless) intermediate, but I don't know of any efficient[1] way to do this. (Please correct me if I'm wrong!)

As second best, I'd like to push off this pattern to some helper function. The problem is finding a decent way to pass the <boolean expression> to it. I can only think of indecent ones. E.g.:

def filterobj(obj, criterion):
    return obj[eval(criterion % 'obj')]

This actually works[2]:

filterobj(df.xs('A')['II'] - df.xs('B')['II'], '%s < 0')

# Int
# 0     -1.650107
# 2     -0.718555
# 3     -1.725498
# 4     -0.306617
# Name: II

...but using eval always leaves me feeling all yukky 'n' stuff... Please let me know if there's some other way.


[1] E.g., any approach I can think of involving the filter built-in is probably inefficient, since it would apply the criterion (some lambda function) by iterating, "in Python", over the pandas (or numpy) object...

[2] The definition of df used in the last expression above would be something like this:

import itertools
import pandas as pd
import numpy as np
a = ('A', 'B')
i = range(5)
ix = pd.MultiIndex.from_tuples(list(itertools.product(a, i)),
                               names=('Alpha', 'Int'))
c = ('I', 'II', 'III')
df = pd.DataFrame(np.random.randn(len(ix), len(c)), index=ix, columns=c)

Answer 1:

Because of the way Python works, I think this one's going to be tough. I can only think of hacks which only get you part of the way there. Something like

def filterobj(obj, fn):
    return obj[fn(obj)]

filterobj(df.xs('A')['II'] - df.xs('B')['II'], lambda x: x < 0)

should work, unless I've missed something. Using lambdas this way is one of the usual tricks for delaying evaluation.
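As a rough sketch of the lambda trick on a small hand-made frame (the data below is made up for illustration and is not the question's df). It also assumes a pandas version in which the .loc indexer accepts a callable, in which case the same lambda can be used without any helper:

```python
import pandas as pd

# Hypothetical data standing in for the question's df.
ix = pd.MultiIndex.from_product([("A", "B"), range(3)], names=("Alpha", "Int"))
df = pd.DataFrame({"II": [1.0, 5.0, 2.0, 3.0, 4.0, 7.0]}, index=ix)


def filterobj(obj, fn):
    # fn receives obj and must return a boolean mask aligned with obj.
    return obj[fn(obj)]


# The lambda delays evaluation until filterobj hands it the real object.
via_helper = filterobj(df.xs("A")["II"] - df.xs("B")["II"], lambda x: x < 0)

# Same idea without a helper: .loc calls the lambda with the object being indexed.
via_loc = (df.xs("A")["II"] - df.xs("B")["II"]).loc[lambda x: x < 0]
```

In both forms the difference is computed once and never bound to a name, so there is no tmp to delete.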

Thinking out loud: one could make a this object which isn't evaluated but just sticks around as an expression, something like

>>> this
this
>>> this < 3
this < 3
>>> df[this < 3]
Traceback (most recent call last):
  File "<ipython-input-34-d5f1e0baecf9>", line 1, in <module>
    df[this < 3]
[...]
KeyError: u'no item named this < 3'

and then either special-case the treatment of this into pandas or still have a function like

def filterobj(obj, criterion):
    return obj[eval(str(criterion.subs({"this": "obj"})))]

(with enough work we could lose the eval, this is simply proof of concept) after which something like

>>> tmp = df["I"] + df["II"]
>>> tmp[tmp < 0]
Alpha  Int
A      4     -0.464487
B      3     -1.352535
       4     -1.678836
dtype: float64
>>> filterobj(df["I"] + df["II"], this < 0)
Alpha  Int
A      4     -0.464487
B      3     -1.352535
       4     -1.678836
dtype: float64

would work. I'm not sure any of this is worth the headache, though; Python simply isn't very conducive to this style.
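For what it's worth, here is one way the eval could be dropped, as a minimal sketch only. The class name _Deferred and its methods are hypothetical (nothing like this is claimed to exist in pandas), and it handles a single comparison against a value, not arbitrary expressions:

```python
import operator

import pandas as pd


class _Deferred:
    """Hypothetical placeholder: records one comparison, replays it later."""

    def __init__(self, op=None, other=None):
        self._op = op
        self._other = other

    def _record(self, op, other):
        # Build a new placeholder that remembers the operator and operand.
        return _Deferred(op, other)

    def __lt__(self, other):
        return self._record(operator.lt, other)

    def __le__(self, other):
        return self._record(operator.le, other)

    def __gt__(self, other):
        return self._record(operator.gt, other)

    def __ge__(self, other):
        return self._record(operator.ge, other)

    def __eq__(self, other):
        return self._record(operator.eq, other)

    def __ne__(self, other):
        return self._record(operator.ne, other)

    def __call__(self, obj):
        # Replay the recorded comparison on the real object.
        if self._op is None:
            return obj
        return self._op(obj, self._other)


this = _Deferred()


def filterobj(obj, criterion):
    # criterion is a recorded comparison such as `this < 0`.
    return obj[criterion(obj)]


s = pd.Series([-1.0, 2.0, -3.0])
result = filterobj(s, this < 0)
```

Here `this < 0` builds a small object instead of a boolean, and filterobj applies it to the series afterwards. Supporting arithmetic or combined conditions would need more operator overloads, which is where the headache starts.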



Answer 2:

This is as concise as I could get:

(df.xs('A')['II'] - df.xs('B')['II']).apply(lambda x: x if (x<0) else np.nan).dropna()

Int
0     -4.488312
1     -0.666710
2     -1.995535
Name: II
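Note that apply calls the lambda once per element, which is the Python-level iteration the question's footnote worries about. A possible vectorized variant, assuming a pandas version where Series.where accepts a callable condition, is sketched below on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for the difference of the two columns.
s = pd.Series([-4.5, 1.0, -0.5, 2.0])

# Element-wise version from the answer: the lambda runs once per value.
via_apply = s.apply(lambda x: x if x < 0 else np.nan).dropna()

# Vectorized version: the lambda receives the whole Series and returns a mask;
# values where the mask is False become NaN and are then dropped.
via_where = s.where(lambda v: v < 0).dropna()
```

Both expressions keep only the negative entries, with their original index labels.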


Tags: numpy pandas