I am transitioning from R to Python and have just begun using pandas. I have some R code that subsets nicely:
k1 <- subset(data, Product == p.id & Month < mn & Year == yr, select = c(Time, Product))
Now I want to do something similar in Python. This is what I have got so far:
import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")
#first, index the dataset by Product. And, get all that matches a given 'p.id' and time.
data.set_index('Product')
k = data.ix[[p.id, 'Time']]
# then, index this subset with Time and do more subsetting..
I am beginning to feel that I am doing this the wrong way. Perhaps there is a more elegant solution. Can anyone help? I need to extract the month and year from the timestamp I have and then do the subsetting. Perhaps there is a one-liner that will accomplish all this:
k1 <- subset(data, Product == p.id & Time >= start_time & Time < end_time, select = c(Time, Product))
Thanks.
Just for someone looking for a solution more similar to R:
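A minimal sketch of what such an R-like one-liner could look like. The sample data and the scalars `p_id`, `mn`, and `yr` are made up for illustration (R's `p.id` is not a valid Python name, so it is written `p_id` here):

```python
import pandas as pd

# Made-up sample data; the column names follow the question.
df = pd.DataFrame({
    "Product": ["A", "A", "B", "A"],
    "Month": [1, 5, 2, 3],
    "Year": [2013, 2013, 2013, 2014],
    "Time": pd.to_datetime(
        ["2013-01-15", "2013-05-20", "2013-02-10", "2014-03-01"]
    ),
})

# Hypothetical scalar values standing in for p.id, mn and yr.
p_id, mn, yr = "A", 4, 2013

# The boolean mask inside the first [] filters rows;
# the second [] selects the columns (like R's select=).
k1 = df[(df.Product == p_id) & (df.Month < mn) & (df.Year == yr)][["Time", "Product"]]
print(k1)
```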
No need for data.loc or query, but I do think it is a bit long.

I've found that you can use any subset condition for a given column by wrapping it in []. For instance, say you have a df with columns ['Product', 'Time', 'Year', 'Color'].
And let's say you want to include only products made before 2014. You could write that condition on the Year column inside df[...] to return all the rows where this is the case. You can also combine several conditions.
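As a rough illustration of the filtering described above, assuming a small made-up frame with those four columns:

```python
import pandas as pd

# Made-up data with the columns named above.
df = pd.DataFrame({
    "Product": ["p1", "p2", "p3", "p4"],
    "Time": ["2012-06", "2013-11", "2014-02", "2015-07"],
    "Year": [2012, 2013, 2014, 2015],
    "Color": ["red", "blue", "red", "green"],
})

# Rows for products made before 2014.
before_2014 = df[df["Year"] < 2014]

# Several conditions: wrap each one in parentheses and
# join them with & (and) or | (or).
red_before_2014 = df[(df["Year"] < 2014) & (df["Color"] == "red")]

print(before_2014)
print(red_before_2014)
```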
Then just choose the columns you want, as directed above; for instance, the product color and key for the df above.
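A sketch of that last step, again on made-up data. Taking "key" to mean the Product column is an assumption here, since the text does not say which column acts as the key:

```python
import pandas as pd

# Made-up data with the columns named above.
df = pd.DataFrame({
    "Product": ["p1", "p2", "p3", "p4"],
    "Time": ["2012-06", "2013-11", "2014-02", "2015-07"],
    "Year": [2012, 2013, 2014, 2015],
    "Color": ["red", "blue", "red", "green"],
})

# Filter the rows first, then keep only the columns of interest.
# "Product" is assumed to be the key column.
subset = df[df["Year"] < 2014][["Product", "Color"]]
print(subset)
```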
I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that the other variables are scalar values.

For now, you'll have to reference the DataFrame instance in every condition and wrap each comparison in parentheses.

The parentheses are necessary because of the precedence of the & operator relative to the comparison operators. The & operator is actually an overloaded bitwise operator, which has a higher precedence than the comparison operators, so a condition such as df.Month < mn & df.Year == yr would be parsed as df.Month < (mn & df.Year) == yr.

pandas 0.13 introduced the DataFrame.query() method (experimental at the time). It's extremely similar to subset, modulo the select argument: with query() you pass the filter as a string expression and select the columns separately.
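The two approaches described above can be sketched side by side. The data and the scalars `p_id`, `start_time`, and `end_time` are made up for illustration. In released versions of pandas, local Python variables inside a query string are referenced with an `@` prefix:

```python
import pandas as pd

# Made-up sample data with the two columns from the question.
df = pd.DataFrame({
    "Product": ["A", "A", "B", "A"],
    "Time": pd.to_datetime(
        ["2013-01-15", "2013-05-20", "2013-02-10", "2013-03-01"]
    ),
})

# Hypothetical scalar values.
p_id = "A"
start_time = pd.Timestamp("2013-01-01")
end_time = pd.Timestamp("2013-04-01")

# Boolean indexing: each comparison is parenthesized and
# references the DataFrame explicitly.
mask = (df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time)
k1_loc = df.loc[mask, ["Time", "Product"]]

# query(): the filter is a string; @name refers to a local variable.
k1_query = df.query(
    "Product == @p_id and Time >= @start_time and Time < @end_time"
)[["Time", "Product"]]

print(k1_loc)
print(k1_query)
```

Both forms select the same rows; query() tends to read closer to the R subset call, while boolean indexing needs no string parsing.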
The final query that you're interested in can even take advantage of chained comparisons, writing the time window as start_time <= Time < end_time inside the query string.
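A minimal sketch of such a chained comparison, assuming made-up data and hypothetical scalars `p_id`, `start_time`, and `end_time` (referenced with `@` inside the query string):

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({
    "Product": ["A", "A", "B", "A"],
    "Time": pd.to_datetime(
        ["2013-01-15", "2013-05-20", "2013-02-10", "2013-03-01"]
    ),
})

# Hypothetical scalar values.
p_id = "A"
start_time = pd.Timestamp("2013-01-01")
end_time = pd.Timestamp("2013-04-01")

# The chained comparison expresses the half-open time window in one clause.
k1 = df[["Time", "Product"]].query(
    "Product == @p_id and @start_time <= Time < @end_time"
)
print(k1)
```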
Creating an Empty Dataframe with known Column Name:
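One way this might look, assuming the column names used earlier in the thread:

```python
import pandas as pd

# An empty frame that already carries the expected column names.
empty_df = pd.DataFrame(columns=["Product", "Time", "Year", "Color"])
print(empty_df.columns.tolist())
print(len(empty_df))
```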
Creating a dataframe from csv:
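A small sketch using pd.read_csv. To keep it self-contained, the CSV content is made up and read from an in-memory string; with a real file you would pass its path instead:

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = """Product,Time,Year,Color
p1,2012-06-01,2012,red
p2,2013-11-15,2013,blue
"""

# parse_dates converts the Time column to datetime values on load.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Time"])
print(df.dtypes)
```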
Creating a dynamic filter to subset a dataframe:
Creating a dynamic filter to subset required columns of a dataframe:
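One possible reading of "dynamic filter" is a filter built at run time from a specification. The sketch below assumes that reading; the `conditions` dictionary and `required_columns` list are hypothetical names, and the data is made up:

```python
import pandas as pd

# Made-up data.
df = pd.DataFrame({
    "Product": ["p1", "p2", "p3", "p4"],
    "Year": [2012, 2013, 2014, 2012],
    "Color": ["red", "blue", "red", "green"],
})

# Hypothetical filter specification: column name -> required value.
conditions = {"Color": "red", "Year": 2012}
# Hypothetical list of columns to keep.
required_columns = ["Product", "Color"]

# Start with an all-True mask and narrow it one condition at a time.
mask = pd.Series(True, index=df.index)
for column, value in conditions.items():
    mask &= df[column] == value

filtered_rows = df[mask]                    # row subset only
result = df.loc[mask, required_columns]     # rows and columns together
print(result)
```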