I have a feature in my data set that is a pandas timestamp object. It has (among many others) the following attributes: year, hour, dayofweek, month.
I can create new features based on these attributes using some brute force methods:
df["year"] = df["timeStamp"].apply(lambda x : x.year)
df["hour"] = df["timeStamp"].apply(lambda x : x.hour)
. . .
However, I want to iterate over a list:
nomtimes = ["year", "hour", "month", "dayofweek"]
for i in nomtimes:
    df[i] = df["timeStamp"].apply(lambda x : x.i)
This raises AttributeError: 'Timestamp' object has no attribute 'i', and I understand why I am getting this error.
How can I get the quoted string to unquote so that I can pass it as an attribute?
You just need getattr():
df[i] = df["timeStamp"].apply(lambda x : getattr(x, i))
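As a minimal self-contained sketch of that loop (the sample timestamps are made up for illustration; the column name timeStamp and the list nomtimes come from the question):

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {"timeStamp": pd.to_datetime(["2018-05-05 15:00", "2015-01-30 11:00"])}
)

nomtimes = ["year", "hour", "month", "dayofweek"]
for name in nomtimes:
    # getattr(ts, name) looks up the attribute whose name is stored in `name`.
    df[name] = df["timeStamp"].apply(lambda ts: getattr(ts, name))

print(df)
```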
Don't use .apply here. pandas has various built-in utilities for handling datetime objects; use the dt attribute on the Series objects:
In [11]: start = datetime(2011, 1, 1)
...: end = datetime(2012, 1, 1)
...:
In [12]: df = pd.DataFrame({'data':pd.date_range(start, end)})
In [13]: df.dtypes
Out[13]:
data datetime64[ns]
dtype: object
In [14]: df['year'] = df.data.dt.year
In [15]: df['hour'] = df.data.dt.hour
In [16]: df['month'] = df.data.dt.month
In [17]: df['dayofweek'] = df.data.dt.dayofweek
In [18]: df.head()
Out[18]:
        data  year  hour  month  dayofweek
0 2011-01-01  2011     0      1          5
1 2011-01-02  2011     0      1          6
2 2011-01-03  2011     0      1          0
3 2011-01-04  2011     0      1          1
4 2011-01-05  2011     0      1          2
Or, dynamically as you wanted, using getattr:
In [24]: df = pd.DataFrame({'data':pd.date_range(start, end)})
In [25]: nomtimes = ["year", "hour", "month", "dayofweek"]
...:
In [26]: df.head()
Out[26]:
data
0 2011-01-01
1 2011-01-02
2 2011-01-03
3 2011-01-04
4 2011-01-05
In [27]: for t in nomtimes:
...:     df[t] = getattr(df.data.dt, t)
...:
In [28]: df.head()
Out[28]:
        data  year  hour  month  dayofweek
0 2011-01-01  2011     0      1          5
1 2011-01-02  2011     0      1          6
2 2011-01-03  2011     0      1          0
3 2011-01-04  2011     0      1          1
4 2011-01-05  2011     0      1          2
And if you must use a one-liner, go with:
In [30]: df = pd.DataFrame({'data':pd.date_range(start, end)})
In [31]: df.head()
Out[31]:
data
0 2011-01-01
1 2011-01-02
2 2011-01-03
3 2011-01-04
4 2011-01-05
In [32]: df = df.assign(**{t:getattr(df.data.dt,t) for t in nomtimes})
In [33]: df.head()
Out[33]:
        data  dayofweek  hour  month  year
0 2011-01-01          5     0      1  2011
1 2011-01-02          6     0      1  2011
2 2011-01-03          0     0      1  2011
3 2011-01-04          1     0      1  2011
4 2011-01-05          2     0      1  2011
operator.attrgetter
You can extract attributes in a loop:
from operator import attrgetter
for i in nomtimes:
    df[i] = df['timeStamp'].apply(attrgetter(i))
Here's a complete example:
import pandas as pd
from operator import attrgetter

df = pd.DataFrame({'timeStamp': ['2018-05-05 15:00', '2015-01-30 11:00']})
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
nomtimes = ['year', 'hour', 'month', 'dayofweek']
for i in nomtimes:
    # attrgetter(i) builds a callable that returns the attribute named by i
    df[i] = df['timeStamp'].apply(attrgetter(i))
print(df)
            timeStamp  year  hour  month  dayofweek
0 2018-05-05 15:00:00  2018    15      5          5
1 2015-01-30 11:00:00  2015    11      1          4
Your code does not work, but not because you are passing a string. The expression x.i never uses the value of i: the dot syntax looks for an attribute literally named i on the Timestamp, which is why the error message mentions 'i'. To look up an attribute by name, you need something like attrgetter (or getattr).
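To make the difference concrete, here is a small sketch on a single Timestamp. The timestamp value and the helper variable names are made up for illustration:

```python
import pandas as pd

ts = pd.Timestamp("2018-05-05 15:00")  # hypothetical sample value
i = "year"

# Dot syntax looks for an attribute literally called "i";
# the variable i is never consulted.
try:
    ts.i
    dot_lookup_failed = False
except AttributeError:
    dot_lookup_failed = True

# getattr takes the attribute name as a string, so the value of i is used.
by_name = getattr(ts, i)

print(dot_lookup_failed, by_name)
```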
Getting rid of the for loop
You might ask if there's any way to extract all attributes from a datetime object in one go rather than sequentially. The benefit of attrgetter is that you can specify multiple attributes directly to avoid a for loop altogether:
attributes = df['timeStamp'].apply(attrgetter(*nomtimes))
df[nomtimes] = pd.DataFrame(attributes.values.tolist(), index=df.index)
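It may help to see what attrgetter returns when given several names. The sketch below uses made-up sample timestamps; the names values and expanded are hypothetical and used only here:

```python
from operator import attrgetter

import pandas as pd

nomtimes = ["year", "hour", "month", "dayofweek"]

# On a single Timestamp, attrgetter with several names returns a tuple
# of values in the same order as the names.
ts = pd.Timestamp("2018-05-05 15:00")
values = attrgetter(*nomtimes)(ts)

# Applied to a column, this gives a Series of tuples, one tuple per row.
df = pd.DataFrame(
    {"timeStamp": pd.to_datetime(["2018-05-05 15:00", "2015-01-30 11:00"])}
)
attributes = df["timeStamp"].apply(attrgetter(*nomtimes))

# tolist() turns the tuples into rows of a new frame; passing the original
# index keeps the rows aligned with df.
expanded = pd.DataFrame(attributes.tolist(), columns=nomtimes, index=df.index)

print(values)
print(expanded)
```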
Using dt accessor instead of apply
But pd.Series.apply is just a thinly veiled loop. Often, it's not necessary. Borrowing @juanpa.arrivillaga's idea, you can access attributes directly via the pd.Series.dt accessor:
attributes = pd.concat(attrgetter(*nomtimes)(df['timeStamp'].dt), axis=1, keys=nomtimes)
df = df.join(attributes)
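As a quick sanity check, the sketch below builds the same columns both ways on made-up sample data and compares them. The names via_dt, via_apply and same are hypothetical:

```python
from operator import attrgetter

import pandas as pd

df = pd.DataFrame(
    {"timeStamp": pd.to_datetime(["2018-05-05 15:00", "2015-01-30 11:00"])}
)
nomtimes = ["year", "hour", "month", "dayofweek"]

# Vectorized: each dt attribute is computed for the whole column at once.
via_dt = pd.concat(
    attrgetter(*nomtimes)(df["timeStamp"].dt), axis=1, keys=nomtimes
)

# Element-wise: attrgetter is called once per Timestamp.
via_apply = pd.DataFrame(
    df["timeStamp"].apply(attrgetter(*nomtimes)).tolist(),
    columns=nomtimes,
    index=df.index,
)

# Compare the underlying values; integer dtypes may differ between the two.
same = bool((via_dt.values == via_apply.values).all())
print(same)
```

Both approaches should give the same values; the dt version simply avoids calling a Python function for every row.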