I have a dataset in this format:
I need to import the data and work with it.
The main problem is that the first and the fourth columns are strings while the second and third columns are floats and ints, respectively.
I'd like to put the data in a matrix or at least obtain a list of each column's data.
I tried to read the whole dataset as a string but it's a mess:
with open('input.txt', 'r') as f:
    # every field comes back as a string
    l = [line.rstrip('\n').split('\t') for line in f]
What could be a good solution?
Split each line on tabs and transpose the resulting list:
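A minimal sketch of that idea, assuming four tab-separated columns. The sample text below is made up and only stands in for the contents of input.txt; the variable names are illustrative too.

```python
# Hypothetical sample standing in for the contents of input.txt
sample = "alpha\t1.5\t3\tfoo\nbeta\t2.25\t7\tbar\n"

# Split every line into its tab-separated fields (all strings at this point)
rows = [line.split('\t') for line in sample.splitlines()]

# zip(*rows) transposes the list of rows into one tuple per column
names, floats, ints, labels = zip(*rows)

# Convert the numeric columns; the string columns stay as they are
floats = [float(x) for x in floats]
ints = [int(x) for x in ints]
```

With a real file you would build rows from the open file object instead of sample.splitlines().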
You seem to have CSV data (with tabs as the delimiter) so why not use the csv module?
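Something along these lines should work. It is only a sketch: the io.StringIO object and its made-up rows stand in for the real file, which you would open with open('data.csv', newline='').

```python
import csv
import io

# Hypothetical stand-in for: open('data.csv', newline='')
f = io.StringIO("alpha\t1.5\t3\tfoo\nbeta\t2.25\t7\tbar\n")

# delimiter='\t' tells the reader that fields are separated by tabs
reader = csv.reader(f, delimiter='\t')

# Convert column 2 to float and column 3 to int while reading
data = [(c1, float(c2), int(c3), c4) for c1, c2, c3, c4 in reader]
```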
data is a list of tuples containing the converted data (column 2 -> float, column 3 -> int). If data.csv contains (with tabs, not spaces):
data would contain:
You can use pandas. It is great for reading CSV files, tab-delimited files, etc. pandas almost always infers the data types correctly and returns the values as a NumPy array when you access rows or columns, as demonstrated.
I used this tab-delimited 'test.txt' file:
Here is the pandas code. Your file is read into a DataFrame with a single line of Python. You can change the 'sep' value to suit your file.
Then try:
You can add column names as:
And then get the columns as:
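The pandas steps described above could look roughly like this. The sample rows and the column names are invented for illustration, and io.StringIO stands in for the path 'test.txt'.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-delimited 'test.txt'
sample = io.StringIO("alpha\t1.5\t3\tfoo\nbeta\t2.25\t7\tbar\n")

# header=None because the file has no header row; sep='\t' for tabs
df = pd.read_csv(sample, sep='\t', header=None)

# Column names are assumptions chosen for this example
df.columns = ['name', 'value', 'qty', 'label']

# Each column can be pulled out as a NumPy array with an inferred dtype
values = df['value'].to_numpy()
qtys = df['qty'].to_numpy()
```

Printing df.dtypes shows which type pandas inferred for each column.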
Here's a solution to read in the data and convert those second and third columns to numeric types:
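One way to write it in plain Python, as a sketch. read_rows and its converters argument are names made up for this example, and the two sample lines are invented.

```python
def read_rows(lines, converters=(str, float, int, str)):
    """Split tab-separated lines and apply one converter per column."""
    rows = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        # Pair each field with the converter for its column position
        rows.append([conv(field) for conv, field in zip(converters, fields)])
    return rows

# Hypothetical lines standing in for the contents of input.txt
sample_lines = ["alpha\t1.5\t3\tfoo\n", "beta\t2.25\t7\tbar\n"]
rows = read_rows(sample_lines)
```

Since a file object iterates over its lines, you can also pass an open file straight to read_rows.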
With the following input.txt:
It produces the following output:
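Use numpy.loadtxt("data.txt", dtype=str, delimiter="\t") to read the data as an array of rows, where each row holds one element per column.
Passing dtype=str reads every entry as a string. The default dtype is float, which fails on the text columns.
You can then convert the corresponding values to integer, float, etc. with a for loop.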
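A short sketch of this approach, assuming tab-separated input. The sample rows are made up, io.StringIO stands in for "data.txt", and astype is used for the conversion in place of an explicit for loop.

```python
import io
import numpy as np

# Hypothetical stand-in for "data.txt"
sample = io.StringIO("alpha\t1.5\t3\tfoo\nbeta\t2.25\t7\tbar\n")

# dtype=str keeps every entry as a string; delimiter='\t' splits on tabs
data = np.loadtxt(sample, dtype=str, delimiter='\t')

# Convert the numeric columns afterwards (column index 1 -> float, 2 -> int)
floats = data[:, 1].astype(float)
ints = data[:, 2].astype(int)
```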
Reference: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.loadtxt.html