Question:
I have data in different columns but I don't know how to extract it to save it in another variable.
index  a  b  c
1      2  3  4
2      3  4  5
How do I select 'a' and 'b' and save them into df1?
I tried
df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']
Neither seems to work.
Answer 1:
The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can return only those columns by passing a list into the __getitem__ syntax (the []'s).
df1 = df[['a','b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:
df1 = df.iloc[:,0:2] # Remember that Python does not slice inclusive of the ending index.
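As a rough illustration of the two options above, the sketch below builds a small frame shaped like the one in the question (an assumption: 'a', 'b', 'c' are ordinary columns and the row labels are 1 and 2) and checks that selecting by name and selecting by position give the same result.

```python
import pandas as pd

# Hypothetical frame modelled on the question's data.
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]}, index=[1, 2])

# Option 1: pass a list of column names to [].
by_name = df[["a", "b"]]

# Option 2: slice by position; the stop index (2) is excluded.
by_position = df.iloc[:, 0:2]

print(list(by_name.columns))         # ['a', 'b']
print(by_name.equals(by_position))   # True
```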
Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).
Sometimes, however, there are indexing conventions in Pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This can happen with the second way of indexing, so you can add the copy() function to get a regular copy. When this happens, changing what you think is the sliced object can sometimes alter the original object. It is always good to be on the lookout for this.
df1 = df.iloc[:,0:2].copy() # To avoid the case where changing df1 also changes df
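A minimal sketch of why the explicit copy() is useful, assuming the same small frame as in the question (the value 99 is arbitrary and only for illustration): after copying, writes to the subset leave the original untouched.

```python
import pandas as pd

# Hypothetical frame modelled on the question's data.
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]}, index=[1, 2])

# Take the first two columns by position and make the copy explicit.
df1 = df.iloc[:, 0:2].copy()

# Modify the subset only.
df1.loc[1, "a"] = 99

print(df1.loc[1, "a"])  # 99
print(df.loc[1, "a"])   # 2, the original frame is unchanged
```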
Answer 2:
Assuming your column names (df.columns) are ['index','a','b','c'], then the data you want is in the 3rd & 4th columns. If you don't know their names when your script runs, you can do this
newdf = df[df.columns[2:4]] # Remember, Python is 0-offset! The "3rd" entry is at slot 2.
As EMS points out in his answer, df.ix slices columns a bit more concisely, but the .columns slicing interface might be more natural because it uses the vanilla 1-D Python list indexing/slicing syntax.
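A small sketch of the .columns slicing idea, under the assumption stated in this answer that 'index' is an ordinary column sitting in front of 'a', 'b' and 'c':

```python
import pandas as pd

# Hypothetical frame whose first column is literally named 'index'.
df = pd.DataFrame({"index": [1, 2], "a": [2, 3], "b": [3, 4], "c": [4, 5]})

# df.columns is sliced like a plain Python sequence: positions 2 and 3.
selected = df.columns[2:4]
newdf = df[selected]

print(list(selected))       # ['b', 'c']
print(list(newdf.columns))  # ['b', 'c']
```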
WARN: 'index' is a bad name for a DataFrame column. That same label is also used for the real df.index attribute, an Index array. So your column is returned by df['index'] and the real DataFrame index is returned by df.index. An Index is a special kind of array (similar to a Series) optimized for lookup of its elements' values. For df.index it's for looking up rows by their label. The df.columns attribute is also a pd.Index array, for looking up columns by their labels.
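To make the naming clash concrete, here is a short sketch (the frame is hypothetical) showing that df['index'] and df.index return different things when a column happens to be called 'index':

```python
import pandas as pd

# Hypothetical frame with a column literally named 'index'.
df = pd.DataFrame({"index": [1, 2], "a": [2, 3]})

column_values = df["index"]  # the column that happens to be called 'index'
row_labels = df.index        # the real row index (default labels 0 and 1)

print(column_values.tolist())            # [1, 2]
print(row_labels.tolist())               # [0, 1]
print(isinstance(df.columns, pd.Index))  # True
```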
Answer 3:
As of version 0.11.0, columns can be sliced in the manner you tried using the .loc indexer:
df.loc[:, 'C':'E']
returns columns C through E.
A demo on a randomly generated DataFrame:
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])
df.head()
Out:
     A   B   C   D   E   F
R0  99  78  61  16  73   8
R1  62  27  30  80   7  76
R2  15  53  80  27  44  77
R3  75  65  47  30  84  86
R4  18   9  41  62   1  82
To get the columns from C to E (note that unlike integer slicing, 'E' is included in the columns):
df.loc[:, 'C':'E']
Out:
     C   D   E
R0  61  16  73
R1  30  80   7
R2  80  27  44
R3  47  30  84
R4  41  62   1
R5   5  58   0
...
The same works for selecting rows based on labels. Get the rows 'R6' to 'R10' from those columns:
df.loc['R6':'R10', 'C':'E']
Out:
      C   D   E
R6   51  27  31
R7   83  19  18
R8   11  67  65
R9   78  27  29
R10   7  16  94
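The inclusive end point of label slicing can be checked against positional slicing. The sketch below uses a small deterministic frame (an assumption made for reproducibility, not the random data above):

```python
import numpy as np
import pandas as pd

# Hypothetical 3x6 frame with columns A..F and row labels R0..R2.
df = pd.DataFrame(np.arange(18).reshape(3, 6),
                  columns=list("ABCDEF"),
                  index=["R0", "R1", "R2"])

by_label = df.loc[:, "C":"E"]   # label slice: 'E' is included
by_position = df.iloc[:, 2:5]   # positional slice: position 5 is excluded

print(list(by_label.columns))        # ['C', 'D', 'E']
print(by_label.equals(by_position))  # True
```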
.loc also accepts a boolean array so you can select the columns whose corresponding entry in the array is True. For example, df.columns.isin(list('BCD')) returns array([False, True, True, True, False, False], dtype=bool) - True if the column name is in the list ['B', 'C', 'D']; False, otherwise.
df.loc[:, df.columns.isin(list('BCD'))]
Out:
     B   C   D
R0  78  61  16
R1  27  30  80
R2  53  80  27
R3  65  47  30
R4   9  41  62
R5  78   5  58
...
Answer 4:
In [39]: df
Out[39]:
   index  a  b  c
0      1  2  3  4
1      2  3  4  5
In [40]: df1 = df[['b', 'c']]
In [41]: df1
Out[41]:
   b  c
0  3  4
1  4  5
Answer 5:
I realize this question is quite old, but in the latest version of pandas there is an easy way to do exactly this. Column names (which are strings) can be sliced in whatever manner you like.
columns = ['b', 'c']
df1 = pd.DataFrame(df, columns=columns)
Answer 6:
You could provide a list of columns to be dropped and get back a DataFrame with only the columns needed, using the drop() function on a Pandas DataFrame.
Just saying
colsToDrop = ['a']
df.drop(colsToDrop, axis=1)
would return a DataFrame with just the columns b and c.
See the pandas documentation for the drop method.
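A brief sketch of the drop approach, assuming the three-column frame from the question. It uses the columns= keyword as an equivalent to axis=1 and shows that drop returns a new DataFrame rather than modifying df:

```python
import pandas as pd

# Hypothetical frame modelled on the question's data.
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]})

# Drop column 'a'; the result is a new object and df keeps all its columns.
df1 = df.drop(columns=["a"])

print(list(df1.columns))  # ['b', 'c']
print(list(df.columns))   # ['a', 'b', 'c']
```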
Answer 7:
I found this method to be very useful:
# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]
More details can be found here
Answer 8:
Just use the following; it will select the b and c columns:
df1 = pd.DataFrame()  # not required; the next line rebinds df1
df1 = df[['b', 'c']]
Then you can just display df1:
df1
Answer 9:
If you want to get one element by row index and column name, you can do it just like df['b'][0]. It is as simple as you can imagine.
Or you can use df.ix[0,'b'], which mixes index and label usage.
Note: Since v0.20, ix has been deprecated in favour of loc / iloc.
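Since ix is deprecated, a possible way to write the same scalar lookup with the supported indexers is sketched below. The frame is hypothetical and uses the default row labels 0 and 1, so label 0 and position 0 refer to the same row here.

```python
import pandas as pd

# Hypothetical frame with default integer row labels.
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]})

by_label = df.loc[0, "b"]     # row label 0, column label 'b'
fast_scalar = df.at[0, "b"]   # same lookup, intended for single values
# Purely positional: translate the column name to its position first.
by_position = df.iloc[0, df.columns.get_loc("b")]

print(by_label, fast_scalar, by_position)  # 3 3 3
```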
Answer 10:
The approaches discussed in the answers above assume that either the user knows the column indices to drop or subset on, or the user wishes to subset a DataFrame using a range of columns (for instance 'C':'E'). pandas.DataFrame.drop() is certainly an option for subsetting data based on a user-defined list of columns (though you have to be careful to always work on a copy of the DataFrame, and the inplace parameter should not be set to True).
Another option is to use df.columns.difference() (that is, pandas.Index.difference), which does a set difference on column names and returns an Index containing the desired columns. Following is the solution:
df = pd.DataFrame([[2,3,4],[3,4,5]],columns=['a','b','c'],index=[1,2])
columns_for_differencing = ['a']
df1 = df.copy()[df.columns.difference(columns_for_differencing)]
print(df1)
The output would be:
   b  c
1  3  4
2  4  5
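One detail worth knowing about this approach: with its default settings, difference returns the remaining labels sorted, which may not match the original column order. The sketch below uses a hypothetical frame whose columns are deliberately out of alphabetical order and contrasts it with a plain list comprehension that keeps the original order.

```python
import pandas as pd

# Hypothetical frame with columns in non-alphabetical order.
df = pd.DataFrame([[4, 2, 3], [5, 3, 4]], columns=["c", "a", "b"])
columns_for_differencing = ["a"]

# Set difference on the column labels; the result comes back sorted.
sorted_cols = df.columns.difference(columns_for_differencing)

# Order-preserving alternative using a list comprehension.
ordered_cols = [col for col in df.columns if col not in columns_for_differencing]

print(list(sorted_cols))  # ['b', 'c']
print(ordered_cols)       # ['c', 'b']
```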
Answer 11:
With pandas,
with column names:
dataframe[['column1','column2']]
with iloc, column indices can be used like:
dataframe.iloc[:,[1,2]]
with loc, column names can be used like:
dataframe.loc[:,['column1','column2']]
Hope it helps!
Answer 12:
Starting in 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated in favor of .reindex. So, the answer to your question is:
df1 = df.reindex(columns=['b','c'])
In prior versions, using .loc[list-of-labels] would work as long as at least 1 of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated and now shows a warning message. The recommended alternative is to use .reindex().
Read more at https://pandas.pydata.org/pandas-docs/stable/indexing.html#reindexing
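The difference between the two behaviours can be sketched as follows. The column name 'z' is a made-up missing label, and the KeyError branch assumes a recent pandas release (1.0 or later), where the deprecation described above has become an error.

```python
import pandas as pd

# Hypothetical frame modelled on the question's data.
df = pd.DataFrame({"a": [2, 3], "b": [3, 4], "c": [4, 5]})

# reindex tolerates a missing label and fills that column with NaN.
df1 = df.reindex(columns=["b", "z"])
print(list(df1.columns))      # ['b', 'z']
print(df1["z"].isna().all())  # True

# Plain [] selection with a missing label raises KeyError in recent versions.
try:
    df[["b", "z"]]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```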
Answer 13:
I am quite sure that this is not an optimized approach, but it can be considered a different one.
Using iterrows:
df1 = pd.DataFrame()  # create an empty DataFrame
for index, row in df.iterrows():
    # copy columns 'A' and 'B' one row at a time
    df1.loc[index, 'A'] = df.loc[index, 'A']
    df1.loc[index, 'B'] = df.loc[index, 'B']
df1.head()
Answer 14:
Below is my code:
import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name=2)
print(df)
df1 = df[['emp_id', 'date']]
print(df1)
Output:
   emp_id       date  count
0    1001  11/1/2018      3
1    1002  11/1/2018      4
2          11/2/2018      2
3          11/3/2018      4
   emp_id       date
0    1001  11/1/2018
1    1002  11/1/2018
2          11/2/2018
3          11/3/2018
The first DataFrame is the master one. I just copied two columns into df1.