Why did 'reset_index(drop=True)' remove a column?

Posted 2019-09-14 17:15

Question:

I have a Pandas dataframe named data_match. It contains columns '_worker_id', '_unit_id', and 'caption'. (Please see attached screenshot for some of the rows in this dataframe)

The index is not in ascending order, and I want it to be 0, 1, 2, 3, 4...n. So I ran the following to reset the index:
data_match=data_match.reset_index(drop=True)

I got the correct output on my computer using Python 3.6. However, when my coworker ran the same call on his computer, also using Python 3.6, the '_worker_id' column was removed.

Is this due to the 'drop=True' argument passed to 'reset_index'? If so, I don't understand why it worked on my computer and not on my coworker's. Can anybody advise?

Answer 1:

As the saying goes, "What happens in your interpreter stays in your interpreter". It's impossible to explain the discrepancy without seeing the full history of commands entered into both Python interactive sessions.

However, it is possible to venture a guess:

df.reset_index(drop=True) drops the current index of the DataFrame and replaces it with an index of increasing integers. It never drops columns.
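As a minimal sketch of that behaviour, here is a small frame whose index is deliberately out of order. The data and the 'caption' values are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical data; the index is deliberately not in ascending order.
df = pd.DataFrame(
    {"_worker_id": list("ABC"), "caption": ["x", "y", "z"]},
    index=[2, 0, 1],
)

# drop=True discards the old index instead of inserting it as a column.
result = df.reset_index(drop=True)

print(list(result.index))    # [0, 1, 2]
print(list(result.columns))  # ['_worker_id', 'caption']
```

The old index values (2, 0, 1) are gone, the rows keep their order, and both columns survive.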

So, in your interactive session, _worker_id was a column. In your co-worker's interactive session, _worker_id must have been an index level.

The visual difference can be somewhat subtle. For example, below, df has a _worker_id column while df2 has a _worker_id index level:

In [190]: df = pd.DataFrame({'foo':[1,2,3], '_worker_id':list('ABC')}); df
Out[190]: 
  _worker_id  foo
0          A    1
1          B    2
2          C    3

In [191]: df2 = df.set_index('_worker_id', append=True); df2
Out[191]: 
              foo
  _worker_id     
0 A             1
1 B             2
2 C             3

Notice that the name _worker_id appears one line below foo when it is an index level, and on the same line as foo when it is a column. That is the only visual clue you get when looking at the str or repr of a DataFrame.
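If you would rather not rely on the printed layout, you can ask the DataFrame directly. The sketch below rebuilds df and df2 from above and checks df.columns and df.index.names; the helper where_is is a made-up name for illustration, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "_worker_id": list("ABC")})
df2 = df.set_index("_worker_id", append=True)

def where_is(frame, name):
    """Hypothetical helper: report whether `name` is a column or an index level."""
    if name in frame.columns:
        return "column"
    if name in frame.index.names:
        return "index level"
    return "missing"

print(where_is(df, "_worker_id"))   # column
print(where_is(df2, "_worker_id"))  # index level
print(list(df2.index.names))        # [None, '_worker_id']
```

Running the same check in both sessions before calling reset_index should show whether the two DataFrames really have the same shape.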

So to repeat: When _worker_id is a column, the column is unaffected by df.reset_index(drop=True):

In [194]: df.reset_index(drop=True)
Out[194]: 
  _worker_id  foo
0          A    1
1          B    2
2          C    3

But _worker_id is dropped when it is part of the index:

In [195]: df2.reset_index(drop=True)
Out[195]: 
   foo
0    1
1    2
2    3
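
If your co-worker's frame does turn out to have _worker_id in the index, one possible fix (a sketch, assuming the goal is to keep _worker_id as a column and still get a 0..n index) is to move that level back into the columns first and only then drop whatever index remains:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "_worker_id": list("ABC")})
df2 = df.set_index("_worker_id", append=True)

# Step 1: move only the _worker_id level out of the index and into the columns.
# Step 2: discard the remaining index and replace it with 0..n-1.
fixed = df2.reset_index(level="_worker_id").reset_index(drop=True)

print(list(fixed.columns))  # ['_worker_id', 'foo']
print(list(fixed.index))    # [0, 1, 2]
```

Without drop=True, reset_index turns index levels into columns rather than discarding them, which is why the first call preserves _worker_id.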