Is there an easy way to check whether two data frames are different copies or views of the same underlying data that doesn't involve manipulations? I'm trying to get a grip on when each is generated, and given how idiosyncratic the rules seem to be, I'd like an easy way to test.
For example, I thought "id(df.values)" would be stable across views, but they don't seem to be:
# Make two data frames that are views of same data.
df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index = ['row1','row2'],
columns = ['a','b','c','d'])
df2 = df.iloc[0:2,:]
# Demonstrate they are views:
df.iloc[0,0] = 99
df2.iloc[0,0]
Out[70]: 99
# Now try and compare the id on values attribute
# Different despite being views!
id(df.values)
Out[71]: 4753564496
id(df2.values)
Out[72]: 4753603728
# And we can of course compare df and df2
df is df2
Out[73]: False
Other answers I've looked up that try to give rules, but don't seem consistent, and also don't answer this question of how to test:
And of course: - http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
UPDATE: Comments below seem to answer the question -- looking at the df.values.base
attribute rather than df.values
attribute does it, as does a reference to the df._is_copy
attribute (though the latter is probably very bad form since it's an internal).
You might trace the memory your pandas/python environment is consuming, and, on the assumption that a copy will utilise more memory than a view, be able to decide one way or another.
I believe there are libraries out there that will present the memory usage within the python environment itself - e.g. Heapy/Guppy.
There ought to be a metric you can apply that takes a baseline picture of memory usage prior to creating the object under inspection, then another picture afterwards. Comparison of the two memory maps (assuming nothing else has been created and we can isolate the change is due to the new object) should provide an idea of whether a view or copy has been produced.
We'd need to get an idea of the different memory profiles of each type of implementation, but some experimentation should yield results.
Answers from HYRY and Marius in comments!
One can check either by:
testing equivalence of the
values.base
attribute rather than thevalues
attribute, as in:df.values.base is df2.values.base
instead ofdf.values is df2.values
._is_view
attribute (df2._is_view
isTrue
).Thanks everyone!