Question:
I have a dataset in this format:
I need to import the data and work with it.
The main problem is that the first and the fourth columns are strings while the second and third columns are floats and ints, respectively.
I'd like to put the data in a matrix or at least obtain a list of each column's data.
I tried to read the whole dataset as a string but it's a mess:
f = open ( 'input.txt' , 'r')
l = [ map(str,line.split('\t')) for line in f ]
What could be a good solution?
Answer 1:
You seem to have CSV data (with tabs as the delimiter) so why not use the csv module?
import csv

with open('data.csv') as f:
    reader = csv.reader(f, delimiter='\t')
    data = [(col1, float(col2), int(col3), col4)
            for col1, col2, col3, col4 in reader]

data is a list of tuples containing the converted data (column 2 -> float, column 3 -> int). If data.csv contains (with tabs, not spaces):
thing1 5.005069 284 D
thing2 5.005049 142 D
thing3 5.005066 248 D
thing4 5.005037 124 D
data would contain:
[('thing1', 5.005069, 284, 'D'),
('thing2', 5.005049, 142, 'D'),
('thing3', 5.005066, 248, 'D'),
('thing4', 5.005037, 124, 'D')]
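Since the question also asks for a list of each column's data, the row tuples can be transposed with zip. The following is only a minimal sketch, assuming the same four-column layout; the in-memory sample string stands in for data.csv, and the names sample, names, floats, ints and flags are illustrative, not part of the original answer.

```python
import csv
import io

# Illustrative sample standing in for data.csv (tab-delimited).
sample = "thing1\t5.005069\t284\tD\nthing2\t5.005049\t142\tD\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
data = [(col1, float(col2), int(col3), col4)
        for col1, col2, col3, col4 in reader]

# zip(*data) regroups the row tuples into one tuple per column.
names, floats, ints, flags = (list(col) for col in zip(*data))

print(floats)
print(ints)
```

With a real file, replacing io.StringIO(sample) with the open file object from the with block should behave the same way.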
Answer 2:
You can use pandas, which is great for reading CSV files, tab-delimited files, and similar formats. pandas will almost always infer the data types correctly, and a column's values come back as a NumPy array when accessed by row or column as demonstrated.
I used this tab delimited 'test.txt' file:
bbbbffdd 434343 228 D
bbbWWWff 43545343 289 E
ajkfbdafa 2345345 2312 F
Here is the pandas code. It reads your file into a DataFrame in one line of Python. You can change the sep value to match your file's delimiter.
import pandas as pd
X = pd.read_csv('test.txt', sep="\t", header=None)
Then try:
print(X)
           0         1     2  3
0   bbbbffdd    434343   228  D
1   bbbWWWff  43545343   289  E
2  ajkfbdafa   2345345  2312  F
print(X[0])
0 bbbbffdd
1 bbbWWWff
2 ajkfbdafa
print(X[2])
0 228
1 289
2 2312
print(X[1][1:])
1 43545343
2 2345345
You can add column names as:
X.columns = ['random_letters', 'number', 'simple_number', 'letter']
And then get the columns as:
X['number'].values
array([ 434343, 43545343, 2345345])
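The column names can also be supplied while reading, via the names parameter of read_csv, and the inferred types can be checked with dtypes. This is a hedged sketch rather than part of the original answer: the sample string stands in for test.txt, and io.StringIO is used only so the example runs without a file on disk.

```python
import io

import pandas as pd

# Illustrative sample standing in for the tab-delimited test.txt.
sample = (
    "bbbbffdd\t434343\t228\tD\n"
    "bbbWWWff\t43545343\t289\tE\n"
    "ajkfbdafa\t2345345\t2312\tF\n"
)

X = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["random_letters", "number", "simple_number", "letter"],
)

# Show which type pandas inferred for each column.
print(X.dtypes)

# Plain Python list of one column's values.
numbers = X["number"].tolist()
print(numbers)
```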
Answer 3:
Here's a solution to read in the data and convert those second and third columns to numeric types:
rows = []
with open('input.txt', 'r') as f:
    for line in f:
        # Split on any whitespace (including tab characters)
        row = line.split()
        # Convert strings to numeric values:
        row[1] = float(row[1])
        row[2] = int(row[2])
        # Append to our list of lists:
        rows.append(row)
print(rows)

With the following input.txt:
string1 5.005069 284 D
string2 5.005049 142 D
string3 5.005066 284 D
string4 5.005037 124 D
It produces the following output:
[['string1', 5.005069, 284, 'D'],
['string2', 5.005049, 142, 'D'],
['string3', 5.005066, 284, 'D'],
['string4', 5.005037, 124, 'D']]
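The per-column conversions can also be driven by a list of converter functions, which avoids hard-coding the indices. This is a sketch under the assumption that every line has exactly four whitespace-separated fields; converters and parse_line are hypothetical names introduced here for illustration.

```python
# One converter per column, in order (assumed layout: str, float, int, str).
converters = [str, float, int, str]


def parse_line(line):
    """Split a whitespace-delimited line and convert each field."""
    fields = line.split()
    if len(fields) != len(converters):
        raise ValueError(
            "expected %d fields, got %d" % (len(converters), len(fields))
        )
    return [convert(field) for convert, field in zip(converters, fields)]


# Illustrative lines standing in for the contents of input.txt.
lines = ["string1\t5.005069\t284\tD\n", "string2\t5.005049\t142\tD\n", "\n"]

# Skip blank lines, convert the rest.
rows = [parse_line(line) for line in lines if line.strip()]
print(rows)
```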
Answer 4:
Split and transpose the list:
with open('in.txt', 'r') as f:  # use with to open your files; it closes them automatically
    l = [x.split() for x in f]
rows = [list(x) for x in zip(*l)]
# list() is needed on Python 3, where map returns an iterator
rows[1], rows[2] = list(map(float, rows[1])), list(map(int, rows[2]))
In [16]: rows
Out[16]:
[['bbbbffdd', 'bbbWWWff', 'ajkfbdafa'],
[434343.0, 43545343.0, 2345345.0],
[228, 289, 2312],
['D', 'E', 'F']]
Answer 5:
Use numpy.loadtxt("data.txt", dtype=str) to read the data as an array of rows, [[row1], [row2], [row3], ...], where each row holds one element per column: [row1] = [col1, col2, col3, ...].
Passing dtype=str reads each entry as a string. This is needed here because the default dtype is float, which fails on the text columns.
You can then convert the relevant values to integer, float, etc. with a for loop.
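As an alternative to a for loop, whole columns of the string array can be converted with astype. This is a minimal sketch, not the original answer's code: the sample string stands in for data.txt, the tab delimiter is an assumption about the file, and ndmin=2 is passed so that a single-row file still yields a 2-D array.

```python
import io

import numpy as np

# Illustrative sample standing in for data.txt (tab-delimited).
sample = "thing1\t5.005069\t284\tD\nthing2\t5.005049\t142\tD\n"

# Read every field as a string; each row of the array is one line of input.
raw = np.loadtxt(io.StringIO(sample), dtype=str, delimiter="\t", ndmin=2)

# Convert whole columns at once instead of looping over entries.
floats = raw[:, 1].astype(float)
ints = raw[:, 2].astype(int)

print(raw.shape)
print(floats)
print(ints)
```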
Reference: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.loadtxt.html