Suppose I have a df which has columns 'ID', 'col_1', 'col_2', and I define a function:
f = lambda x, y : my_function_expression
Now I want to apply f to df's two columns 'col_1' and 'col_2' to calculate a new column 'col_3' element-wise, somewhat like:
df['col_3'] = df[['col_1','col_2']].apply(f)
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'
How can I do this?
Added a detailed sample below:
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
    return mylist[sta:end+1]
#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below
  ID  col_1  col_2            col_3
0  1      0      1       ['a', 'b']
1  2      2      4  ['c', 'd', 'e']
2  3      3      5  ['d', 'e', 'f']
I suppose you don't want to change the get_sublist function and just want to use the DataFrame's apply method to do the job. To get the result you want, I've written two helper functions: get_sublist_list and unlist. As the names suggest, the first returns a list containing the sublist, and the second extracts that sublist from the list. Finally, we call apply twice, applying the two functions in turn to the df[['col_1','col_2']] DataFrame.

If you don't use [] to enclose the get_sublist call, then get_sublist_list returns a plain list, and apply raises ValueError: could not broadcast input array from shape (3) into shape (2), as @Ted Petrou mentioned.

I'm going to put in a vote for np.vectorize. It lets you pass any number of columns without dealing with the DataFrame inside the function, so it's great for functions you don't control, or for something like sending two columns and a constant into a function (e.g. col_1, col_2, 'foo').
My example for your question:
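The original snippet for this answer did not survive, so the following is only a sketch of how np.vectorize could be applied to the question's sample data. It assumes `otypes=[object]` is needed, because get_sublist returns lists of different lengths and NumPy would otherwise try to infer a fixed element type from the first result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta, end):
    return mylist[sta:end + 1]

# Wrap the two-argument function; otypes=[object] keeps each returned
# list as a single element of an object array (assumption, see above).
vectorized_sublist = np.vectorize(get_sublist, otypes=[object])

# Pass the two columns as separate positional arguments.
df['col_3'] = vectorized_sublist(df['col_1'], df['col_2'])
print(df)
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, so it mainly saves boilerplate rather than making the function itself faster.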
Returning a list from apply is a dangerous operation, as the resulting object is not guaranteed to be either a Series or a DataFrame, and exceptions might be raised in certain cases. Let's walk through a simple example:

There are three possible outcomes when returning a list from apply:
1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.
2) When the length of the returned list is equal to the number of columns then a DataFrame is returned and each column gets the corresponding value in the list.
3) If the length of the returned list equals the number of columns for the first row, but at least one other row returns a list with a different number of elements, a ValueError is raised.
Answering the problem without apply
Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.

Create larger dataframe
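The code for this step is missing here, so this is a minimal sketch of one "basic iterative" approach: a list comprehension over zip of the two columns. The name `df_big` and the repeat count of 10,000 are assumptions chosen for illustration, not the answer's original values.

```python
import pandas as pd

mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta, end):
    return mylist[sta:end + 1]

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

# Hypothetical larger frame: the 3-row sample repeated 10,000 times.
df_big = pd.concat([df] * 10_000, ignore_index=True)

# Iterate over both columns in lockstep and call the function per pair,
# avoiding the per-row Series construction that apply(axis=1) performs.
df_big['col_3'] = [get_sublist(s, e)
                   for s, e in zip(df_big['col_1'], df_big['col_2'])]
```

To compare approaches, each variant can be timed with `timeit` on the same frame.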
Timings
@Thomas answer
A simple solution is:
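The snippet that followed is not preserved. One plausible form, shown here as a sketch, wraps the two-argument function in a lambda that receives each row and unpacks the two columns. It assumes a reasonably recent pandas, where list results from apply come back as a Series of lists.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta, end):
    return mylist[sta:end + 1]

# axis=1 hands each row to the lambda as a Series; the lambda pulls out
# the two values and forwards them to the unchanged two-argument function.
df['col_3'] = df.apply(lambda row: get_sublist(row['col_1'], row['col_2']), axis=1)
print(df)
```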
Here's an example using apply on the dataframe, which I am calling with axis = 1.

Note the difference: instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.

Depending on your use case, it is sometimes helpful to create a pandas group object and then use apply on the group.
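The row-wise rewrite described in this answer can be sketched as follows. The body of `f` (adding the two columns) is a placeholder chosen for illustration, since the original function expression is not given.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def f(row):
    # `row` is a pandas Series holding one row; index it by column label.
    # The addition is a hypothetical stand-in for the real expression.
    return row['col_1'] + row['col_2']

# axis=1 applies f to each row rather than to each column.
df['col_3'] = df[['col_1', 'col_2']].apply(f, axis=1)
print(df)
```

The trade-off is that `f` is now tied to the column names, so it is less reusable than a plain two-argument function.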