Parsing specific columns from a dataset in Python

Posted 2019-05-26 05:48

I have a dataset with ten columns, and I am only interested in analyzing the data from six of them. It is in a txt file, and I want to load the file and pull out columns (0, 1, 2, 4, 6, 7) with the headings (time, mode, event, xcoord, ycoord, phi). Here is an example of what the data looks like:

1385940076332   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076336   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076339   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076342   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076346   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076350   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076353   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076356   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000

When I use the following code to parse the data into columns, it only appears to count the data, but I would like to be able to list the data for further analysis. Here is the code I am using, from @alko:

import pandas as pd
df = pd.read_csv('filtered.txt', header=None, false_values=None, sep='\s+')[[0, 1, 2, 4, 6, 7]]
df.columns = ['time', 'mode', 'event', 'xcoord', 'ycoord', 'phi']
print df  

Here is what that code returns:

class 'pandas.core.frame.DataFrame'
Int64Index: 115534 entries, 0 to 115533
Data columns (total 6 columns): 
time      115534  non-null values
mode      115534  non-null values
event     115534  non-null values
xcoord    115534  non-null values
ycoord    115534  non-null values
phi       115534  non-null values
dtypes: float64(3), int64(2), object(1)

So the goal is to pull out these 6 columns from the 10 original, label them, and list them.

1 answer
干净又极端
#2 · 2019-05-26 06:28

You can use pandas' read_csv parser:

import pandas as pd
from io import StringIO  # Python 3 location of StringIO

# Sample rows from the question: whitespace-separated, no header row
s = """1385940076332   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076336   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076339   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076342   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076346   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076350   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076353   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076356   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000"""

# Parse on runs of whitespace, then keep six of the ten columns by position
df = pd.read_csv(StringIO(s), header=None, sep=r'\s+')[[0, 2, 3, 4, 6, 7]]
df.columns = ['time', 'mode', 'event', 'xcoord', 'ycoord', 'phi']
print(df)
# Output as originally posted (float formatting may differ across pandas versions):
#             time mode           event  xcoord     ycoord  phi
# 0  1385940076332    M  subject_avatar     -30 -59.028107  180
# 1  1385940076336    M  subject_avatar     -30 -59.028107  180
# 2  1385940076339    M  subject_avatar     -30 -59.028107  180
# 3  1385940076342    M  subject_avatar     -30 -59.028107  180
# 4  1385940076346    M  subject_avatar     -30 -59.028107  180
# 5  1385940076350    M  subject_avatar     -30 -59.028107  180
# 6  1385940076353    M  subject_avatar     -30 -59.028107  180
# 7  1385940076356    M  subject_avatar     -30 -59.028107  180

Note that I corrected the column indices, since the ones given in the question do not seem to be correct.
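
As a possible variation, the selection and labelling can also be done inside read_csv itself with its usecols and names parameters, and the rows can then be listed explicitly rather than relying on the default print of a large frame. The sketch below is only an illustration: it uses the indices from the question (0, 1, 2, 4, 6, 7) as an assumption about which columns are wanted, and two sample rows in place of the real file, so swap in whichever indices match your data.

```python
import io

import pandas as pd

# Two sample rows copied from the question (whitespace-separated, no header).
sample = """\
1385940076332   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076336   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
"""

# Assumption: these are the positions wanted; adjust them to your data.
wanted = [0, 1, 2, 4, 6, 7]
labels = ['time', 'mode', 'event', 'xcoord', 'ycoord', 'phi']

df = pd.read_csv(
    io.StringIO(sample),  # for the real data, pass the file path instead
    sep=r'\s+',           # split on runs of whitespace
    header=None,          # the file has no header row
    usecols=wanted,       # parse only these column positions
    names=labels,         # one label per selected column
)

print(df.head())          # first rows only
print(df.to_string())     # every row, regardless of frame size
rows = df.values.tolist() # plain Python lists, one per row
```

For the real file, replacing io.StringIO(sample) with 'filtered.txt' should be enough. With 115534 rows, df.head() or slicing such as df[:20] is probably more practical than to_string().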
