How do you handle column names having spaces in th

Posted 2019-02-25 22:38

Question:

This is a real problem I've faced for a long time.

Take this dataframe:

         A         B  THRESHOLD
       NaN       NaN        NaN
 -0.041158 -0.161571   0.329038
  0.238156  0.525878   0.110370
  0.606738  0.854177  -0.095147
  0.200166  0.385453   0.166235

It is easy enough to copy using pd.read_clipboard. However, if one of the column names has a space:

         A         B     Col #3
       NaN       NaN        NaN
 -0.041158 -0.161571   0.329038
  0.238156  0.525878   0.110370
  0.606738  0.854177  -0.095147
  0.200166  0.385453   0.166235

Then, it is read like this:

          A         B       Col  #3
0       NaN       NaN       NaN NaN
1 -0.041158 -0.161571  0.329038 NaN
2  0.238156  0.525878  0.110370 NaN
3  0.606738  0.854177 -0.095147 NaN
4  0.200166  0.385453  0.166235 NaN
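
The split can be reproduced without the clipboard. This is a minimal sketch, assuming the copied text is held in a string (the name `text` is only for illustration) and parsed with `pd.read_csv` using `\s+`, the default separator of `pd.read_clipboard`:

```python
import io
import pandas as pd

# A shortened copy of the second table above, as it would arrive from the clipboard.
text = """\
         A         B     Col #3
       NaN       NaN        NaN
 -0.041158 -0.161571   0.329038
  0.238156  0.525878   0.110370
"""

# With r"\s+", any run of whitespace splits fields, so the single space
# inside "Col #3" produces two header names for three data columns.
df_split = pd.read_csv(io.StringIO(text), sep=r"\s+")
print(list(df_split.columns))  # ['A', 'B', 'Col', '#3']
```

The header yields four names while each data row has only three values, which is why the last column is filled with NaN.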

How can I prevent that?

Answer 1:

In this situation, I make sure all my columns are two or more spaces apart and then use sep=r'\s\s+' as the delimiter. That way, a column heading that contains a single space, such as Col #3 above, is treated as one column.

         A         B     Col #3
       NaN       NaN        NaN
 -0.041158  -0.161571   0.329038
  0.238156   0.525878   0.110370
  0.606738   0.854177  -0.095147
  0.200166   0.385453   0.166235

df = pd.read_clipboard(sep=r'\s\s+')

You do get this warning, but you can ignore it since the data is parsed correctly. Or you could pass engine='python' if your OCD gets the best of you. :)

C:\Program Files\Anaconda3\lib\site-packages\pandas\io\clipboards.py:63: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  return read_table(StringIO(text), sep=sep, **kwargs)

print(df)

          A         B    Col #3
0       NaN       NaN       NaN
1 -0.041158 -0.161571  0.329038
2  0.238156  0.525878  0.110370
3  0.606738  0.854177 -0.095147
4  0.200166  0.385453  0.166235
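
The same idea can be tried without touching the clipboard. This is a minimal sketch, assuming the two-space-separated text is held in a string (the name `text` is only for illustration) and parsed with `pd.read_csv`, with engine='python' named explicitly so the warning is not raised:

```python
import io
import pandas as pd

# A shortened copy of the table above, with every column at least two spaces apart.
text = """\
         A         B     Col #3
       NaN       NaN        NaN
 -0.041158  -0.161571   0.329038
  0.238156   0.525878   0.110370
"""

# Only runs of two or more whitespace characters separate fields,
# so the single space inside "Col #3" stays part of the column name.
df = pd.read_csv(io.StringIO(text), sep=r"\s\s+", engine="python")
print(list(df.columns))  # ['A', 'B', 'Col #3']
```

The catch is that every gap between columns, in the data rows as well as the header, must be at least two spaces wide.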


Answer 2:

To drive home the point I was making in the comments, I used re, io and pd.read_table. I copied the exact text from your post and applied a first round of re.sub to remove leading whitespace. Then I replaced any run of spaces that is preceded by a digit with two spaces; this works for the case at hand because the column names consist mostly of letters. Once that was done, I wrapped the resulting string in an io.StringIO object and fed it to pd.read_table. This is essentially the same as pasting the text into Sublime Text, applying two search-and-replace operations, and then copying the result and feeding it to pd.read_clipboard.

The following snippet of code illustrates the point:

import pandas as pd
import re
import io


text = """         A         B     Col #3
        NaN       NaN        NaN
  -0.041158 -0.161571   0.329038
   0.238156  0.525878   0.110370
   0.606738  0.854177  -0.095147
   0.200166  0.385453   0.166235"""


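# Strip leading spaces on every line, then turn any run of spaces that follows a digit into two spaces.
cleaned = re.sub(r"(?<=[0-9]) +", "  ", re.sub(r"^ +", "", text, flags=re.MULTILINE))
with io.StringIO(cleaned) as fs:
    df = pd.read_table(fs, header=0, sep=r"\s{2,}", engine="python")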


#           A         B    Col #3
# 0       NaN       NaN       NaN
# 1 -0.041158 -0.161571  0.329038
# 2  0.238156  0.525878  0.110370
# 3  0.606738  0.854177 -0.095147
# 4  0.200166  0.385453  0.166235

Thanks for asking the question.