How to make separator in pandas read_csv more flex

I need to created a data frame using data stored in a file. For that I want to use read_csv method. However, the separator is not very regular. Some columns are separated by tabs (\t), other are separated by spaces. Moreover, some columns can be separated by 2 or 3 or more spaces or even by a combination of spaces and tabs (for example 3 spaces, two tabs and then 1 space).

Is there a way to tell pandas to treat these files properly?

By the way, I do not have this problem if I use Python. I use:

for line in file(file_name):
   fld = line.split()

And it works perfect. It does not care if there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?

标签： python csv pandas dataframe whitespace

4条回答

看淡一切

2楼-- · 2019-01-02 18:50

We may consider this to take care of all the combination and zero or more occurrences.

pd.read_csv("whitespace.csv", header = None, sep = "[ \t]*,[ \t]*")

0人赞添加讨论(0) 举报

笑指拈花

3楼-- · 2019-01-02 18:51

Pandas has two csv readers, only is flexible regarding redundant leading white space:

pd.read_csv("whitespace.csv", skipinitialspace=True)

while one is not

pd.DataFrame.from_csv("whitespace.csv")

Neither is out-of-the-box flexible regarding trailing white space, see the answers with regular expressions. Avoid delim_whitespace, as it also allows just spaces (without , or \t) as separators.

0人赞添加讨论(0) 举报

后来的你喜欢了谁

4楼-- · 2019-01-02 18:59

>>> pd.read_csv("whitespace.csv", header = None, sep = "\s+|\t+|\s+\t+|\t+\s+")

would use any combination of any number of spaces and tabs as the separator.

0人赞添加讨论(0) 举报

零度萤火

5楼-- · 2019-01-02 19:05

From the documentation, you can use either a regex or delim_whitespace:

>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print repr(line)
...     
'a\t  b\tc 1 2\n'
'd\t  e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4

0人赞添加讨论(0) 举报

How to make separator in pandas read_csv more flex

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间