I have the following indexed DataFrame with named columns and a non-contiguous row index:
          a         b         c         d
2  0.671399  0.101208 -0.181532  0.241273
3  0.446172 -0.243316  0.051767  1.577318
5  0.614758  0.075793 -0.451460 -0.012493
I would like to add a new column, 'e', to the existing data frame and do not want to change anything in the data frame (i.e., the new column always has the same length as the DataFrame).
0 -0.335485
1 -1.166658
2 -0.385571
dtype: float64
I tried different versions of join, append, and merge, but I did not get the result I wanted, only errors at most. How can I add column e to the above example?
If you want to set the whole new column to an initial base value (e.g. None), you can do this:
df1['e'] = None
This assigns the "object" dtype to the column, so later you're free to put complex data types, like lists, into individual cells.
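A minimal sketch of the point above, assuming a shortened version of the question's df1 (only columns a and b are reproduced here). It shows that broadcasting None creates an object column and that a single cell can then hold a list:

```python
import pandas as pd

# Shortened stand-in for the question's DataFrame (columns c and d omitted).
df1 = pd.DataFrame(
    {"a": [0.671399, 0.446172, 0.614758], "b": [0.101208, -0.243316, 0.075793]},
    index=[2, 3, 5],
)

# Broadcasting None to every row creates an object-dtype column.
df1["e"] = None
print(df1["e"].dtype)

# An object column can hold arbitrary Python objects; .at sets one cell by label.
df1.at[3, "e"] = [1, 2, 3]
print(df1)
```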
Foolproof:
Example:
Let me just add that, just like for hum3, .loc didn't solve the SettingWithCopyWarning and I had to resort to df.insert(). In my case the false positive was generated by "fake" chained indexing dict['a']['e'], where 'e' is the new column and dict['a'] is a DataFrame coming from a dictionary.
Also note that if you know what you are doing, you can switch off the warning using
pd.options.mode.chained_assignment = None
and then use one of the other solutions given here.
If the data frame and Series object have the same index, pandas.concat also works here:
In case they don't have the same index:
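The original snippets for the concat approach are not preserved here, so the following is only a sketch under assumptions: df1 is shortened to two columns, and relabelling the Series with df1's index is one possible way to handle the mismatched case.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": [0.671399, 0.446172, 0.614758], "b": [0.101208, -0.243316, 0.075793]},
    index=[2, 3, 5],
)
e = pd.Series([-0.335485, -1.166658, -0.385571], name="e")  # default index 0, 1, 2

# Mismatched indexes: concat aligns on labels, so the result holds the
# union of both indexes and NaN wherever one side has no value.
misaligned = pd.concat([df1, e], axis=1)

# Matching indexes: relabel the values with df1's index first (assumed fix),
# then concat adds the Series as a column named after Series.name.
e_aligned = pd.Series(e.to_numpy(), index=df1.index, name="e")
aligned = pd.concat([df1, e_aligned], axis=1)
print(aligned)
```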
I assume that the index values in e match those in df1.
The easiest way is to initiate a new column named e and assign it the values from your series e.
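The snippet that belonged here is missing. One way consistent with the description is to assign the Series' underlying values positionally, sketched below with a shortened df1 (an assumption, not necessarily the original code):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": [0.671399, 0.446172, 0.614758], "b": [0.101208, -0.243316, 0.075793]},
    index=[2, 3, 5],
)
e = pd.Series([-0.335485, -1.166658, -0.385571])

# .values is a plain NumPy array, so pandas cannot align on labels;
# the values are written top to bottom in row order.
df1["e"] = e.values
print(df1)
```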
assign (Pandas 0.16.0+)
As of Pandas 0.16.0, you can also use assign, which assigns new columns to a DataFrame and returns a new object (a copy) with all the original columns in addition to the new ones.
As per this example (which also includes the source code of the assign function), you can also include more than one column:
In context with your example:
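A short sketch of assign on a shortened df1. The second column name f and its lambda are made up for illustration; they only show that several columns can be passed at once:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": [0.671399, 0.446172, 0.614758], "b": [0.101208, -0.243316, 0.075793]},
    index=[2, 3, 5],
)
e = pd.Series([-0.335485, -1.166658, -0.385571])

# assign returns a new DataFrame; df1 itself is left untouched.
with_e = df1.assign(e=e.values)

# Several columns at once; a callable receives the DataFrame being built.
# The column "f" is hypothetical and exists only for this example.
with_e_f = df1.assign(e=e.values, f=lambda d: d["a"] * 2)
print(with_e_f)
```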
The description of this new feature when it was first introduced can be found here.
Super simple column assignment
A pandas dataframe behaves like an ordered dict of columns.
This means that __getitem__ ([]) can be used to get a certain column, and __setitem__ ([] =) can be used to assign a new column.
For example, this dataframe can have a column added to it by simply using the [] accessor.
Note that this works even if the index of the dataframe is off.
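The dataframe used in the original example is not preserved, so this sketch substitutes a small made-up one with a non-contiguous index to show the [] = assignment described above:

```python
import pandas as pd

# Hypothetical stand-in frame; the index is deliberately "off" (2, 3, 5).
df = pd.DataFrame({"a": [1, 2, 3]}, index=[2, 3, 5])

# __setitem__: a plain list has no labels, so it is assigned in row order.
df["e"] = [10, 20, 30]

# __getitem__: fetch the column back as a Series.
col = df["e"]
print(col)
```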
[]= is the way to go, but watch out!
However, if you have a pd.Series and try to assign it to a dataframe where the indexes are off, you will run into trouble. See example:
This is because a pd.Series by default has an index enumerated from 0 to n-1, and the pandas [] = method tries to be "smart".
What actually is going on.
When you use the [] = method, pandas quietly aligns the right-hand series to the index of the left-hand dataframe, which amounts to a left join on the index: dataframe labels missing from the series get NaN, and series labels absent from the dataframe are dropped.
df['column'] = series
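The failing example itself is missing from this copy. A sketch with made-up data that reproduces the index alignment, where only label 2 exists on both sides:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[2, 3, 5])
series = pd.Series([10.0, 20.0, 30.0])  # default index: 0, 1, 2

# Assignment aligns on labels, not positions.
df["e"] = series

# Label 2 is the only one present in both indexes, so row 2 receives the
# series value stored under label 2 and the other rows become NaN.
print(df)

# The same result, spelled out: reindex the series to the frame's index.
expected = series.reindex(df.index)
print(expected)
```

Note that the values stored under series labels 0 and 1 never reach the dataframe at all.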
Side note
This quickly causes cognitive dissonance, since the []= method is trying to do a lot of different things depending on the input, and the outcome cannot be predicted unless you just know how pandas works. I would therefore advise against []= in code bases, but when exploring data in a notebook, it is fine.
Going around the problem
If you have a pd.Series and want it assigned from top to bottom, or if you are writing production code and you are not sure of the index order, it is worth safeguarding against this kind of issue.
You could downcast the pd.Series to a np.ndarray or a list; this will do the trick.
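Both downcasting variants can be sketched as follows, using made-up data; to_numpy() is used here for the array conversion, which is an assumption about what the lost snippet did:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[2, 3, 5])
series = pd.Series([10.0, 20.0, 30.0])  # default index: 0, 1, 2

# Variant 1: a NumPy array carries no labels, so assignment is positional.
df["e"] = series.to_numpy()

# Variant 2: a plain Python list behaves the same way.
df["f"] = list(series)
print(df)
```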
But this is not very explicit.
Some coder may come along and say "Hey, this looks redundant, I'll just optimize this away".
Explicit way
Setting the index of the pd.Series to be the index of the df is explicit.
Or more realistically, you probably have a pd.Series already available.
It can now be assigned.
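A sketch of the explicit approach with made-up data: either build the Series with the dataframe's index, or overwrite the index of a Series you already have before assigning it.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[2, 3, 5])

# Build the Series with the frame's index from the start.
fresh = pd.Series([10.0, 20.0, 30.0], index=df.index)
df["e"] = fresh

# Or relabel a Series that already exists (here with a default 0..2 index).
existing = pd.Series([1.5, 2.5, 3.5])
existing.index = df.index  # lengths must match, otherwise pandas raises
df["f"] = existing
print(df)
```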
Alternative way with df.reset_index()
Since the index dissonance is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index. This should be faster, but it is not very clean, since your function now probably does two things.
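A sketch of the reset_index route, again with made-up data. Passing drop=True is an assumption about the lost snippet; it discards the old labels instead of keeping them as a column:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[2, 3, 5])
series = pd.Series([10.0, 20.0, 30.0])  # default index: 0, 1, 2

# Replace the frame's index with 0..n-1 so it lines up with the series.
# The original labels (2, 3, 5) are lost in the process.
df = df.reset_index(drop=True)
df["e"] = series
print(df)
```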
Note on df.assign
While df.assign makes it more explicit what you are doing, it actually has all the same problems as the above []=.
Just watch out with df.assign that your column is not called self. It will cause errors. This makes df.assign smelly, since there are these kinds of artifacts in the function.
You may say, "Well, I'll just not use self then". But who knows how this function changes in the future to support new arguments. Maybe your column name will be an argument in a new update of pandas, causing problems with upgrading.
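Both caveats about df.assign can be sketched with made-up data. The second part assumes a pandas version in which assign is declared as assign(self, **kwargs); the outcome is captured rather than asserted, since it depends on that signature:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[2, 3, 5])
series = pd.Series([10.0, 20.0, 30.0])  # default index: 0, 1, 2

# assign aligns a Series on the index exactly like []= does:
# only label 2 is shared, so the other rows become NaN.
aligned = df.assign(e=series)
print(aligned)

# A column called "self" collides with the method's own first parameter
# when assign is declared as assign(self, **kwargs).
try:
    df.assign(self=series.to_numpy())
    self_error = None
except TypeError as exc:
    self_error = exc
print(self_error)
```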