Hey I'm new to Pandas and I just came across df.query()
.
Why people would use df.query()
when you can directly filter your Dataframes using brackets notation ? The official pandas tutorial also seems to prefer the latter approach.
With brackets notation :
df[df['age'] <= 21]
With pandas query method :
df.query('age <= 21')
Besides some of the stylistic or flexibility differences that have been mentioned, is one canonically preferred - namely for performance of operations on large dataframes?
Consider the following sample DF:
In [307]: df
Out[307]:
sex age name
0 M 40 Max
1 F 35 Anna
2 M 29 Joe
3 F 18 Maria
4 F 23 Natalie
There are quite a few good reasons to prefer .query()
method.
it might be much shorter and cleaner compared to boolean indexing:
In [308]: df.query("20 <= age <= 30 and sex=='F'")
Out[308]:
sex age name
4 F 23 Natalie
In [309]: df[(df['age']>=20) & (df['age']<=30) & (df['sex']=='F')]
Out[309]:
sex age name
4 F 23 Natalie
you can prepare conditions (queries) programmatically:
In [315]: conditions = {'name':'Joe', 'sex':'M'}
In [316]: q = ' and '.join(['{}=="{}"'.format(k,v) for k,v in conditions.items()])
In [317]: q
Out[317]: 'name=="Joe" and sex=="M"'
In [318]: df.query(q)
Out[318]:
sex age name
2 M 29 Joe
PS there are also some disadvantages:
- we can't use
.query()
method for columns containing spaces or columns that consist only from digits
- not all functions can be applied or in some cases we have to use
engine='python'
instead of default engine='numexpr'
(which is faster)
NOTE: Jeff (one of the main Pandas contributors and a member of Pandas core team) once said:
Note that in reality .query is just a nice-to-have interface, in fact
it has very specific guarantees, meaning its meant to parse like a
query language, and not a fully general interface.
Some other interesting usages in the documentation.
Reuseable
A use case for query() is when you have a collection of DataFrame
objects that have a subset of column names (or index levels/names) in
common. You can pass the same query to both frames without having to
specify which frame you’re interested in querying -- (Source)
Example:
dfA = pd.DataFrame([[1,2,3], [4,5,6]], columns=["X", "Y", "Z"])
dfB = pd.DataFrame([[1,3,3], [4,1,6]], columns=["X", "Y", "Z"])
q = "(X > 3) & (Y < 10)"
print(dfA.query(q))
print(dfB.query(q))
X Y Z
1 4 5 6
X Y Z
1 4 1 6
More flexible syntax
df.query('a < b and b < c') # understand a bit more English
Support in
operator and not in
(alternative to isin
)
df.query('a in [3, 4, 5]') # select rows whose value of column a is in [2, 3, 4]
Special usage of == and != (similar to in
/not in
)
df.query('a == [1, 3, 5]') # select whose value of column a is in [1, 3, 5]
# equivalent to df.query('a in [1, 3, 5]')