I am a bit confused about the argument copy in DataFrame.merge() after a co-worker asked me about that.
The docstring of DataFrame.merge() states:
copy : boolean, default True
If False, do not copy data unnecessarily
The pandas documentation states:
copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.
The docstring kind of implies that copying the data is not necessary and might be skipped nearly always. The documentation, on the other hand, says that copying data can't be avoided in many cases.
My questions are:
- What are those cases?
- What are the downsides?
Disclaimer: I'm not very experienced with pandas and this is the first time I dug into its source, so I can't guarantee that I'm not missing something in my assessment below.
The relevant bits of code have been recently refactored. I'll discuss the subject in terms of the current stable version 0.20, but I don't suspect functional changes compared to earlier versions.
The investigation starts with the source of merge in pandas/core/reshape/merge.py (formerly pandas/tools/merge.py). Ignoring some doc-aware decorators:
Calling merge passes the copy parameter on to the constructor of class _MergeOperation and then calls its get_result() method. The first few lines with context:
Now that second comment is highly suspicious. Moving on, the copy kwarg is bound to an eponymous instance attribute, which only seems to reappear once within the class:
We can then track down the concatenate_block_managers function in pandas/core/internals.py, which just passes the copy kwarg on to concatenate_join_units.
We have reached the final resting place of the original copy keyword argument in concatenate_join_units:
As you can see, the only thing that copy does is rebind a copy of concat_values here to the same name, in the special case of concatenation when there's really nothing to concatenate.
Now, at this point my lack of pandas knowledge starts to show, because I'm not really sure what exactly is going on this deep inside the call stack. But the above hot-potato scheme, with the copy keyword argument ending in that no-op-like branch of a concatenation function, is perfectly consistent with the "TODO" comment above, the documentation quoted in the question (emphasis mine), and the related discussion on an old issue:
Based on these hints I suspect that in the vast majority of real use cases copying is inevitable, and the copy keyword argument is never used. However, since skipping a copy step might improve performance for the small number of exceptions (without any performance impact whatsoever on the majority of use cases in the meantime), the choice was implemented.
I suspect that the rationale is something like this: the upside of not copying unless necessary (which is only possible in a very few special cases) is that the code avoids some memory allocations and copies in those cases. The downside is that not returning a copy might lead to surprises if one doesn't expect that mutating the return value of merge could in any way affect the original dataframe. So the default value of the copy keyword argument is True, and the user only ends up without a copy from merge if they explicitly volunteer for this (but even then they'll still likely end up with a copy).