I have a Pandas dataframe named data_match. It contains columns '_worker_id', '_unit_id', and 'caption'. (Please see attached screenshot for some of the rows in this dataframe)
The index is not in ascending order, and I want it to be 0, 1, 2, 3, 4...n. So I ran the following line to reset the index:
data_match=data_match.reset_index(drop=True)
This returned the correct output on my computer using Python 3.6. However, when my coworker ran the same line on his computer, also using Python 3.6, the '_worker_id' column was removed.
Is this due to the 'drop=True' argument passed to 'reset_index'? If so, I don't understand why it worked on my computer and not on my coworker's. Can anybody advise?
As the saying goes, "What happens in your interpreter stays in your
interpreter". It's impossible to explain the discrepancy without seeing the
full history of commands entered into both Python interactive sessions.
However, it is possible to venture a guess:
df.reset_index(drop=True) drops the current index of the DataFrame and replaces it with an index of
increasing integers. It never drops columns.
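As a minimal sketch of that behaviour, assuming a small made-up frame (the name shuffled and its values are illustrative, not taken from the question's data), the index is relabelled while the columns and the row order stay as they were:

```python
import pandas as pd

# Hypothetical frame whose index is not in ascending order.
shuffled = pd.DataFrame(
    {'_worker_id': list('ABC'), 'caption': ['x', 'y', 'z']},
    index=[2, 0, 1],
)

# drop=True discards the old index instead of inserting it as a column.
result = shuffled.reset_index(drop=True)

print(result.index.tolist())    # [0, 1, 2]
print(result.columns.tolist())  # ['_worker_id', 'caption']
```

Note that the rows are only relabelled, not sorted: the row that had index 2 is still first.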
So, in your interactive session, _worker_id was a column. In your co-worker's
interactive session, _worker_id must have been an index level.
The visual difference can be somewhat subtle. For example, below, df has a
_worker_id column while df2 has a _worker_id index level:
In [190]: df = pd.DataFrame({'foo':[1,2,3], '_worker_id':list('ABC')}); df
Out[190]:
  _worker_id  foo
0          A    1
1          B    2
2          C    3
In [191]: df2 = df.set_index('_worker_id', append=True); df2
Out[191]:
              foo
  _worker_id
0 A             1
1 B             2
2 C             3
Notice that the name _worker_id appears one line below foo when it is an
index level, and on the same line as foo when it is a column. That is the only
visual clue you get when looking at the str or repr of a DataFrame.
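If the printed output is too ambiguous, the same question can be asked programmatically. The following is a sketch only: where_is is a hypothetical helper written for this answer, not a pandas function, and it relies on the columns and index.names attributes:

```python
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3], '_worker_id': list('ABC')})
df2 = df.set_index('_worker_id', append=True)

def where_is(frame, name):
    """Report whether `name` is a column, an index level, or absent."""
    if name in frame.columns:
        return 'column'
    if name in frame.index.names:
        return 'index level'
    return 'missing'

print(where_is(df, '_worker_id'))   # column
print(where_is(df2, '_worker_id'))  # index level
```

Running a check like this in both sessions before calling reset_index should show whether the two DataFrames really have the same structure.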
So to repeat: When _worker_id is a column, the column is unaffected by
df.reset_index(drop=True):
In [194]: df.reset_index(drop=True)
Out[194]:
  _worker_id  foo
0          A    1
1          B    2
2          C    3
But _worker_id is dropped when it is part of the index:
In [195]: df2.reset_index(drop=True)
Out[195]:
   foo
0    1
1    2
2    3
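If your co-worker's DataFrame does turn out to have _worker_id as an index level, one possible way to keep it is to move that level back into the columns first and only then renumber the rows. This is a sketch under that assumption, reusing the df2 from the example above; the name fixed is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3], '_worker_id': list('ABC')})
df2 = df.set_index('_worker_id', append=True)

# Step 1: turn only the _worker_id level back into a column.
# Step 2: replace whatever index remains with 0, 1, 2, ...
fixed = df2.reset_index(level='_worker_id').reset_index(drop=True)

print(fixed.columns.tolist())  # ['_worker_id', 'foo']
print(fixed.index.tolist())    # [0, 1, 2]
```

The level argument restricts the first reset_index call to the named level, so nothing is discarded until the second call, which drops only the leftover integer index.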