Dynamic Expression Evaluation in pandas using pd.e

2019-01-01 16:14发布

问题:

Objective and Motivation

eval and query are powerful, but underrated functions in the pandas API suite, and their use is far from being fully documented or understood. With the right amount of care, query and eval can greatly simplify code, improve performance, and become a powerful tool for creating dynamic workflows.

The aim of this canonical QnA is to give users a better understanding of these functions, discussing some of the lesser known features, how they are used, and how best to use them, with clear and easy to understand examples. The two main topics this post will address are

  1. Understanding engine, parser and target arguments in pd.eval, and how they can be used to evaluate expressions
  2. Understanding the difference between pd.eval, df.eval and df.query, and when each function is appropriate to use for dynamic execution.

This post is not a substitute for the documentation (links in the answer), so please do go through that as well!


Question

I will frame a question in such a way that opens discussion for various features supported by eval.

Given two DataFrames

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list(\'ABCD\'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list(\'ABCD\'))

df1
   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
3  8  8  1  6
4  7  7  8  1

df2
   A  B  C  D
0  5  9  8  9
1  4  3  0  3
2  5  0  2  3
3  8  1  3  3
4  3  7  0  1

I would like to perform arithmetic on one or more columns using pd.eval. Specifically, I would like to port the following code:

x = 5
df2[\'D\'] = df1[\'A\'] + (df1[\'B\'] * x) 

...to code using eval. The reason for using eval is that I would like to automate many workflows, so creating them dynamically will be useful to me.

I am trying to better understand the engine and parser arguments to determine how best to solve my problem. I have gone through the documentation but the difference was not made clear to me.

  1. What arguments should be used to ensure my code is working at max performance?
  2. Is there a way to assign the result of the expression back to df2?
  3. Also, to make things more complicated, how do I pass x as an argument inside the string expression?

回答1:

This answer dives into the various features and functionality offered by pd.eval, df.query, and df.eval.

Setup
Examples will involve these DataFrames (unless otherwise specified).

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list(\'ABCD\'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list(\'ABCD\'))
df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list(\'ABCD\'))
df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list(\'ABCD\'))

pandas.eval - The \"Missing Manual\"

Note
Of the three functions being discussed, pd.eval is the most important. df.eval and df.query call pd.eval under the hood. Behaviour and usage is more or less consistent across the three functions, with some minor semantic variations which will be highlighted later. This section will introduce functionality that is common across all the three functions - this includes, (but not limited to) allowed syntax, precedence rules, and keyword arguments.

pd.eval can evaluate arithmetic expressions which can consist of variables and/or literals. These expressions must be passed as strings. So, to answer the question as stated, you can do

x = 5
pd.eval(\"df1.A + (df1.B * x)\")  

Some things to note here:

  1. The entire expression is a string
  2. df1, df2, and x refer to variables in the global namespace, these are picked up by eval when parsing the expression
  3. Specific columns are accessed using the attribute accessor index. You can also use \"df1[\'A\'] + (df1[\'B\'] * x)\" to the same effect.

I will be addressing the specific issue of reassignment in the section explaining the target=... attribute below. But for now, here are more simple examples of valid operations with pd.eval:

pd.eval(\"df1.A + df2.A\")   # Valid, returns a pd.Series object
pd.eval(\"abs(df1) ** .5\")  # Valid, returns a pd.DataFrame object

...and so on. Conditional expressions are also supported in the same way. The statements below are all valid expressions and will be evaluated by the engine.

pd.eval(\"df1 > df2\")        
pd.eval(\"df1 > 5\")    
pd.eval(\"df1 < df2 and df3 < df4\")      
pd.eval(\"df1 in [1, 2, 3]\")
pd.eval(\"1 < 2 < 3\")

A list detailing all the supported features and syntax can be found in the documentation. In summary,

  • Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
  • Comparison operations, including chained comparisons, e.g., 2 < df < df2
  • Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool list and tuple literals, e.g., [1, 2] or (1, 2)
  • Attribute access, e.g., df.a
  • Subscript expressions, e.g., df[0]
  • Simple variable evaluation, e.g., pd.eval(\'df\') (this is not very useful)
  • Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.

This section of the documentation also specifies syntax rules that are not supported, including set/dict literals, if-else statements, loops, and comprehensions, and generator expressions.

From the list, it is obvious you can also pass expressions involving the index, such as

pd.eval(\'df1.A * (df1.index > 1)\')

Parser Selection: The parser=... argument

pd.eval supports two different parser options when parsing the expression string to generate the syntax tree: pandas and python. The main difference between the two is highlighted by slightly differing precedence rules.

Using the default parser pandas, the overloaded bitwise operators & and | which implement vectorized AND and OR operations with pandas objects will have the same operator precedence as and and `or. So,

pd.eval(\"(df1 > df2) & (df3 < df4)\")

Will be the same as

pd.eval(\"df1 > df2 & df3 < df4\")
# pd.eval(\"df1 > df2 & df3 < df4\", parser=\'pandas\')

And also the same as

pd.eval(\"df1 > df2 and df3 < df4\")

Here, the parentheses are necessary. To do this conventionally, the parens would be required to override the higher precedence of bitwise operators:

(df1 > df2) & (df3 < df4)

Without that, we end up with

df1 > df2 & df3 < df4

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Use parser=\'python\' if you want to maintain consistency with python\'s actual operator precedence rules while evaluating the string.

pd.eval(\"(df1 > df2) & (df3 < df4)\", parser=\'python\')

The other difference between the two types of parsers are the semantics of the == and != operators with list and tuple nodes, which have the similar semantics as in and not in respectively, when using the \'pandas\' parser. For example,

pd.eval(\"df1 == [1, 2, 3]\")

Is valid, and will run with the same semantics as

pd.eval(\"df1 in [1, 2, 3]\")

OTOH, pd.eval(\"df1 == [1, 2, 3]\", parser=\'python\') will throw a NotImplementedError error.

Backend Selection: The engine=... argument

There are two options - numexpr (the default) and python. The numexpr option uses the numexpr backend which is optimized for performance.

With \'python\' backend, your expression is evaluated similar to just passing the expression to python\'s eval function. You have the flexibility of doing more inside expressions, such as string operations, for instance.

df = pd.DataFrame({\'A\': [\'abc\', \'def\', \'abacus\']})
pd.eval(\'df.A.str.contains(\"ab\")\', engine=\'python\')

0     True
1    False
2     True
Name: A, dtype: bool

Unfortunately, this method offers no performance benefits over the numexpr engine, and there are very few security measures to ensure that dangerous expressions are not evaluated, so USE AT YOUR OWN RISK! It is generally not recommended to change this option to \'python\' unless you know what you\'re doing.

local_dict and global_dict arguments

Sometimes, it is useful to supply values for variables used inside expressions, but not currently defined in your namespace. You can pass a dictionary to local_dict

For example,

pd.eval(\"df1 > thresh\")

UndefinedVariableError: name \'thresh\' is not defined

This fails because thresh is not defined. However, this works:

pd.eval(\"df1 > x\", local_dict={\'thresh\': 10})

This is useful when you have variables to supply from a dictionary. Alternatively, with the \'python\' engine, you could simply do this:

mydict = {\'thresh\': 5}
# Dictionary values with *string* keys cannot be accessed without 
# using the \'python\' engine.
pd.eval(\'df1 > mydict[\"thresh\"]\', engine=\'python\')

But this is going to possibly be much slower than using the \'numexpr\' engine and passing a dictionary to local_dict or global_dict. Hopefully, this should make a convincing argument for the use of these parameters.

The target (+ inplace) argument, and Assignment Expressions

This is not often a requirement because there are usually simpler ways of doing this, but you can assign the result of pd.eval to an object that implements __getitem__ such as dicts, and (you guessed it) DataFrames.

Consider the example in the question

x = 5
df2[\'D\'] = df1[\'A\'] + (df1[\'B\'] * x)

To assign a column \"D\" to df2, we do

pd.eval(\'D = df1.A + (df1.B * x)\', target=df2)

   A  B  C   D
0  5  9  8   5
1  4  3  0  52
2  5  0  2  22
3  8  1  3  48
4  3  7  0  42

This is not an in-place modification of df2 (but it can be... read on). Consider another example:

pd.eval(\'df1.A + df2.A\')

0    10
1    11
2     7
3    16
4    10
dtype: int32

If you wanted to (for example) assign this back to a DataFrame, you could use the target argument as follows:

df = pd.DataFrame(columns=list(\'FBGH\'), index=df1.index)
df
     F    B    G    H
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

df = pd.eval(\'B = df1.A + df2.A\', target=df)
# Similar to 
# df = df.assign(B=pd.eval(\'df1.A + df2.A\'))

df
     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

If you wanted to perform an in-place mutation on df, set inplace=True.

pd.eval(\'B = df1.A + df2.A\', target=df, inplace=True)
# Similar to 
# df[\'B\'] = pd.eval(\'df1.A + df2.A\')

df
     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

If inplace is set without a target, a ValueError is raised.

While the target argument is fun to play around with, you will seldom need to use it.

If you wanted to do this with df.eval, you would use an expression involving an assignment:

df = df.eval(\"B = @df1.A + @df2.A\")
# df.eval(\"B = @df1.A + @df2.A\", inplace=True)
df

     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

Note
One of pd.eval\'s unintended uses is parsing literal strings in a manner very similar to ast.literal_eval:

pd.eval(\"[1, 2, 3]\")
array([1, 2, 3], dtype=object)

It can also parse nested lists with the \'python\' engine:

pd.eval(\"[[1, 2, 3], [4, 5], [10]]\", engine=\'python\')
[[1, 2, 3], [4, 5], [10]]

And lists of strings:

pd.eval([\"[1, 2, 3]\", \"[4, 5]\", \"[10]\"], engine=\'python\')
[[1, 2, 3], [4, 5], [10]]

The problem, however, is for lists with length larger than 10:

pd.eval([\"[1]\"] * 100, engine=\'python\') # Works
pd.eval([\"[1]\"] * 101, engine=\'python\') 

AttributeError: \'PandasExprVisitor\' object has no attribute \'visit_Ellipsis\'

More information can this error, causes, fixes, and workarounds can be found here.


DataFrame.eval - A Juxtaposition with pandas.eval

As mentioned above, df.eval calls pd.eval under the hood. The v0.23 source code shows this:

def eval(self, expr, inplace=False, **kwargs):

    from pandas.core.computation.eval import eval as _eval

    inplace = validate_bool_kwarg(inplace, \'inplace\')
    resolvers = kwargs.pop(\'resolvers\', None)
    kwargs[\'level\'] = kwargs.pop(\'level\', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if \'target\' not in kwargs:
        kwargs[\'target\'] = self
    kwargs[\'resolvers\'] = kwargs.get(\'resolvers\', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)

eval creates arguments, does a little validation, and passes the arguments on to pd.eval.

For more, you can read on: when to use DataFrame.eval() versus pandas.eval() or python eval()

Usage Differences

Expressions with DataFrames v/s Series Expressions

For dynamic queries associated with entire DataFrames, you should prefer pd.eval. For example, there is no simple way to specify the equivalent of pd.eval(\"df1 + df2\") when you call df1.eval or df2.eval.

Specifying Column Names

Another other major difference is how columns are accessed. For example, to add two columns \"A\" and \"B\" in df1, you would call pd.eval with the following expression:

pd.eval(\"df1.A + df1.B\")

With df.eval, you need only supply the column names:

df1.eval(\"A + B\")

Since, within the context of df1, it is clear that \"A\" and \"B\" refer to column names.

You can also refer to the index and columns using index (unless the index is named, in which case you would use the name).

df1.eval(\"A + index\")

Or, more generally, for any DataFrame with an index having 1 or more levels, you can refer to the kth level of the index in an expression using the variable \"ilevel_k\" which stands for \"index at level k\". IOW, the expression above can be written as df1.eval(\"A + ilevel_0\").

These rules also apply to query.

Accessing Variables in Local/Global Namespace

Variables supplied inside expressions must be preceeded by the \"@\" symbol, to avoid confusion with column names.

A = 5
df1.eval(\"A > @A\") 

The same goes for query/

It goes without saying that your column names must follow the rules for valid identifier naming in python to be accessible inside eval. See here for a list of rules on naming identifiers.

Multiline Queries and Assignment

A little known fact is that eval support multiline expressions that deal with assignment. For example, to create two new columns \"E\" and \"F\" in df1 based on some arithmetic operations on some columns, and a third column \"G\" based on the previously created \"E\" and \"F\", we can do

df1.eval(\"\"\"
E = A + B
F = @df2.A + @df2.B
G = E >= F
\"\"\")

   A  B  C  D   E   F      G
0  5  0  3  3   5  14  False
1  7  9  3  5  16   7   True
2  2  4  7  6   6   5   True
3  8  8  1  6  16   9   True
4  7  7  8  1  14  10   True

...Nifty! However, note that this is not supported by query.


eval v/s query - Final Word

It helps to think of df.query as a function that uses pd.eval as a subroutine.

Typically, query (as the name suggests) is used to evaluate conditional expressions (i.e., expressions that result in True/False values) and return the rows corresponding to the True result. The result of the expression is then passed to loc (in most cases) to return the rows that satisfy the expression. According to the documentation,

The result of the evaluation of this expression is first passed to DataFrame.loc and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.__getitem__().

This method uses the top-level pandas.eval() function to evaluate the passed query.

In terms of similarity, query and df.eval are both alike in how they access column names and variables.

This key difference between the two, as mentioned above is how they handle the expression result. This becomes obvious when you actually run an expression through these two functions. For example, consider

df1.A

0    5
1    7
2    2
3    8
4    7
Name: A, dtype: int32

df2.B

0    9
1    3
2    0
3    1
4    7
Name: B, dtype: int32

To get all rows where \"A\" >= \"B\" in df1, we would use eval like this:

m = df1.eval(\"A >= B\")
m
0     True
1    False
2    False
3     True
4     True
dtype: bool

m represents the intermediate result generated by evaluating the expression \"A >= B\". We then use the mask to filter df1:

df1[m]
# df1.loc[m]

   A  B  C  D
0  5  0  3  3
3  8  8  1  6
4  7  7  8  1

However, with query, the intermediate result \"m\" is directly passed to loc, so with query, you would simply need to do

df1.query(\"A >= B\")

   A  B  C  D
0  5  0  3  3
3  8  8  1  6
4  7  7  8  1

Performance wise, it is exactly the same.

df1_big = pd.concat([df1] * 100000, ignore_index=True)

%timeit df1_big[df1_big.eval(\"A >= B\")]
%timeit df1_big.query(\"A >= B\")

14.7 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.7 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But the latter is more concise, and expresses the same operation in a single step.

Note that you can also do weird stuff with query like this (to, say, return all rows indexed by df1.index)

df1.query(\"index\")
# Same as df1.loc[df1.index] # Pointless,... I know

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
3  8  8  1  6
4  7  7  8  1

But don\'t.

Bottom line: Please use query when querying or filtering rows based on a conditional expression.