I'm struggling to understand the concept behind column naming conventions, given that one of the following attempts to create a new column appears to fail:
import pandas as pd
df = pd.DataFrame({'a': range(0, 10, 2), 'c': range(0, 1000, 200)},
                  columns=list('ac'))
df['b'] = 10*df.a
df
gives the following result:
However, if I instead try to create column b with the following line, there is no error message, but the dataframe df still has only the columns a and c.
df.b = 10*df.a ### rather than the previous df['b'] = 10*df.a ###
What has pandas done and why is my command incorrect?
What you did was add an attribute b to your df rather than a column. Printing df afterwards shows that no new column has been added, which means we get a KeyError if we try df['b'].

To avoid this ambiguity you should always use square brackets when assigning. For instance, if you had a column named index, sum or max, then df.index would return the index and not the index column, and similarly df.sum and df.max would return the DataFrame methods rather than those columns.

I strongly advise always using square brackets: it avoids any ambiguity, and the latest IPython is able to resolve column names using square brackets. It's also useful to think of a dataframe as a dict of Series, in which case it makes sense to use square brackets for assigning and returning a column.
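A minimal sketch of the behaviour described in this answer, assuming a reasonably recent pandas and a dataframe shaped like the one in the question. The variable names that hold the results are only illustrative. Recent pandas versions emit a UserWarning on the dot assignment, which is silenced here to keep the example short.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'a': range(0, 10, 2), 'c': range(0, 1000, 200)})

# Dot assignment with a new name sets a plain Python attribute, not a column.
# (Assumption: recent pandas warns here; the warning is ignored for brevity.)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.b = 10 * df.a

columns_after_dot = list(df.columns)   # still only 'a' and 'c'
has_attribute = hasattr(df, 'b')       # the attribute lives on the object

try:
    df['b']
    lookup_failed = False
except KeyError:
    lookup_failed = True               # there is no column named 'b'

# Bracket assignment goes through the column machinery and adds the column.
df['b'] = 10 * df.a
columns_after_brackets = list(df.columns)

print(columns_after_dot, has_attribute, lookup_failed, columns_after_brackets)
```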
Always use square brackets for assigning columns
Dot notation is a convenience for accessing columns in a dataframe. If a column name conflicts with an existing property or method (e.g. a column named 'max'), then you need to use square brackets to access that column, e.g. df['max']. You also need to use square brackets when the column name contains spaces, e.g. df['max value'].

A DataFrame is just an object which has the usual properties and methods. If you use dot notation for assignment with a name that is not already a column, you are creating an attribute on the dataframe object. So df.val = 2 will give df an attribute val that has a value of two. This is very different from df['val'] = 2, which creates a new column in the dataframe and assigns each element in that column the value of two.

To be safe, use square bracket notation, which will always provide the correct result.
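The points above can be sketched as follows. The small dataframe and the result variable names are made up for illustration; only the column names 'max', 'max value' and 'val' come from the text.

```python
import pandas as pd

# Hypothetical data: one column shadows a method, one has a space in its name.
df = pd.DataFrame({'max': [1, 2, 3], 'max value': [10, 20, 30]})

dot_is_method = callable(df.max)        # df.max is the DataFrame method
col_max = df['max'].tolist()            # brackets reach the column
col_spaced = df['max value'].tolist()   # names with spaces need brackets

df.val = 2                              # plain attribute on the object
attr_only = 'val' not in df.columns     # no column was created

df['val'] = 2                           # new column, one value per row
val_column = df['val'].tolist()

print(dot_is_method, col_max, col_spaced, attr_only, val_column)
```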
As an aside, your columns=list('ac') is largely redundant. Written as a standalone statement, it would just create a variable named columns that is never used; passed to pd.DataFrame as a keyword argument, it only selects and orders columns that the dictionary keys already name. You may have meant df.columns = list('ac'), but you already assigned those names in the creation of the dataframe, so I'm not sure what the intent is with this line of code. Also remember that dictionaries were not guaranteed to be ordered before Python 3.7, so in older environments pd.DataFrame({'a': [...], 'b': [...]}) could potentially return a dataframe with columns ['b', 'a']. If this were the case, then assigning column names afterwards could mix up the column headers.

The issue has to do with how attributes are handled in Python. Python places no general restriction on setting new attributes on an object, so you can attach an arbitrary attribute to an ordinary instance at any time.
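As an illustration of that point, here is a minimal sketch in plain Python. Box is a hypothetical, empty class invented for this example and has nothing to do with pandas.

```python
class Box:
    """A hypothetical, empty class used only for illustration."""


box = Box()

# No declaration is needed: assignment simply creates the attribute.
box.anything = 42

print(vars(box))  # the instance dictionary now holds the new attribute
```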
So when you do an assignment like df.b = 10*df.a, it is ambiguous whether you want to add an attribute or a new column, and an attribute is set. The easiest way to see what is going on is to use pdb and step through the code. Running the dot-notation assignment under pdb.run will step into __setattr__(), whereas pdb.run("df['a2'] = x") will step into __setitem__().
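For readers who would rather not step through pdb interactively, the same dispatch can be observed with a toy class. This is only a sketch: Recorder is a hypothetical stand-in that logs which hook Python calls, not a model of how pandas implements these methods.

```python
class Recorder:
    """Hypothetical stand-in for a DataFrame that logs which hook ran."""

    def __init__(self):
        # Bypass our own __setattr__ while setting up internal state.
        object.__setattr__(self, 'calls', [])
        object.__setattr__(self, 'columns', {})

    def __setattr__(self, name, value):
        # Invoked for dot assignment: rec.name = value
        self.calls.append(('__setattr__', name))
        object.__setattr__(self, name, value)

    def __setitem__(self, key, value):
        # Invoked for bracket assignment: rec[key] = value
        self.calls.append(('__setitem__', key))
        self.columns[key] = value


rec = Recorder()
rec.a1 = [1, 2, 3]      # dot notation: stored as an attribute
rec['a2'] = [1, 2, 3]   # bracket notation: stored as a "column"

print(rec.calls)
print(rec.columns)
```

Only the bracket assignment reaches the container that holds the columns, which mirrors why the dot assignment in the question left df unchanged.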