I need to create a data frame using data stored in a file. For that I want to use the read_csv method. However, the separator is not very regular. Some columns are separated by tabs (\t), others are separated by spaces. Moreover, some columns can be separated by 2, 3 or more spaces, or even by a combination of spaces and tabs (for example 3 spaces, two tabs and then 1 space).
Is there a way to tell pandas to treat these files properly?
By the way, I do not have this problem if I use plain Python. I use:
    for line in open(file_name):
        fld = line.split()
And it works perfectly. It does not care if there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?
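For reference, the behaviour being relied on is that str.split() with no argument splits on any run of whitespace and ignores leading and trailing whitespace. A minimal sketch, where the sample lines are made up for illustration:

```python
# Hypothetical sample lines mixing spaces and tabs between fields.
lines = [
    "alpha   1\t\t 2.5",
    "beta \t 2    3.5",
]

# split() with no argument treats any run of spaces/tabs as one separator.
rows = [line.split() for line in lines]
print(rows)
```

Every field comes back as a string here, so any type conversion still has to be done by hand.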
We may consider a regular expression as the separator, to take care of all the combinations and of zero or more occurrences.
Pandas has two CSV readers; only one is flexible regarding redundant leading white space, while the other is not.
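The snippets naming the two readers are not preserved here. As a separate illustration, and as an assumption about what was meant, read_csv has a skipinitialspace parameter that drops spaces following each delimiter. A small sketch with made-up data:

```python
import io

import pandas as pd

# Hypothetical comma-separated sample with extra spaces after each comma.
raw = "name,  value\nalpha,   1\nbeta, 2\n"

# Default behaviour: spaces after the comma stay in the header names.
strict = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops the spaces that follow each delimiter.
lenient = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(strict.columns))
print(list(lenient.columns))
```

This only helps with white space after a delimiter; white space before it is left alone.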
Neither is out-of-the-box flexible regarding trailing white space, see the answers with regular expressions. Avoid delim_whitespace, as it also allows just spaces (without , or \t) as separators.
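One way to read the regular-expression suggestion, sketched under the assumption that the file is comma-separated with stray spaces on either side of the commas. The pattern and the sample data are illustrative, not taken from the other answers:

```python
import io

import pandas as pd

# Hypothetical sample: spaces both before and after the commas.
raw = "name , value\nalpha ,  1\nbeta   ,2\n"

# \s*,\s* consumes optional white space on both sides of each comma.
# A general regex separator needs the slower Python parsing engine.
df = pd.read_csv(io.StringIO(raw), sep=r"\s*,\s*", engine="python")

print(df)
```

White space at the very end of a line is not next to a comma, so this pattern does not remove it; such fields would need stripping afterwards.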
A separator that matches runs of white space would treat any combination of any number of spaces and tabs as a single delimiter.
From the documentation, you can use either a regex or delim_whitespace:
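The original snippet is missing, so the following is a sketch of the regex form with made-up data. The delim_whitespace=True keyword is described in the documentation as equivalent to sep=r"\s+"; it is deprecated as of pandas 2.2, so only the regex form is used here.

```python
import io

import pandas as pd

# Hypothetical sample mixing runs of spaces and tabs, as in the question.
raw = "id   name\t\tscore\n1 \t alpha   3.5\n2\tbeta \t\t 4.0\n"

# \s+ matches one or more white space characters (spaces and/or tabs).
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

print(df)
```

Unlike the plain split() loop, read_csv also infers column types, so id comes back as integers and score as floats.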