This is a real problem I've faced for a long time.
Take this dataframe:
A B THRESHOLD
NaN NaN NaN
-0.041158 -0.161571 0.329038
0.238156 0.525878 0.110370
0.606738 0.854177 -0.095147
0.200166 0.385453 0.166235
It is easy enough to copy using pd.read_clipboard
. However, if one of the column names has a space:
A B Col #3
NaN NaN NaN
-0.041158 -0.161571 0.329038
0.238156 0.525878 0.110370
0.606738 0.854177 -0.095147
0.200166 0.385453 0.166235
Then, it is read like this:
A B Col #3
0 NaN NaN NaN NaN
1 -0.041158 -0.161571 0.329038 NaN
2 0.238156 0.525878 0.110370 NaN
3 0.606738 0.854177 -0.095147 NaN
4 0.200166 0.385453 0.166235 NaN
How can I prevent that?
What I do in this situation is that I make all my columns two or more spaces apart, then I use sep='\s\s+' for my delimiter, this way when I do have column headings with a single space such as, Col #3 above it treats it as one column.
A B Col #3
NaN NaN NaN
-0.041158 -0.161571 0.329038
0.238156 0.525878 0.110370
0.606738 0.854177 -0.095147
0.200166 0.385453 0.166235
df = pd.read_clipboard(sep='\s\s+')
You do get this warning, but you can ignore it since it as done it right. Or you could put the engine='python'
if your OCD gets the best of you. :)
C:\Program
Files\Anaconda3\lib\site-packages\pandas\io\clipboards.py:63:
ParserWarning: Falling back to the 'python' engine because the 'c'
engine does not support regex separators (separators > 1 char and
different from '\s+' are interpreted as regex); you can avoid this
warning by specifying engine='python'. return
read_table(StringIO(text), sep=sep, **kwargs)
print(df)
A B Col #3
0 NaN NaN NaN
1 -0.041158 -0.161571 0.329038
2 0.238156 0.525878 0.110370
3 0.606738 0.854177 -0.095147
4 0.200166 0.385453 0.166235
Using re
, io
and pd.read_table
to drive the point I was making in the comments, I copied the exact text you have in the post, applied a first round of re.sub
to remove any leading whitespace. Then, I replaced any space that is preceded by a number--this is unique to the case at hand since the column names are mostly string characters--with 2 spaces. Once all that is done, I converted the resulting string into an io.StringIO
object and fed the latter to the pd.read_table
function. This essentially the same thing as copying the text and pasting it in sublime text
, and then applying to search and replace operations before you finally copy the resulting string and feed it to pd.read_clipboard
.
The following snippet of code illustrates the point:
import pandas as pd
import re
import io
text = """ A B Col #3
NaN NaN NaN
-0.041158 -0.161571 0.329038
0.238156 0.525878 0.110370
0.606738 0.854177 -0.095147
0.200166 0.385453 0.166235"""
with io.StringIO(re.sub("(?<=[0-9]) +", " ", re.sub("^ +", "", text))) as fs:
df = pd.read_table(fs, header=0, sep="\s{2,}",engine='python')
# A B Col #3
# 0 NaN NaN NaN
# 1 -0.041158 -0.161571 0.329038
# 2 0.238156 0.525878 0.110370
# 3 0.606738 0.854177 -0.095147
# 4 0.200166 0.385453 0.166235
Thanks for asking the question.