I am having some trouble with leading and trailing whitespace in a data.frame. For example, I would like to look at a specific row in a data.frame based on a certain condition:
> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)]
[1] codeHelper country dummyLI dummyLMI dummyUMI
[6] dummyHInonOECD dummyHIOECD dummyOECD
<0 rows> (or 0-length row.names)
I was wondering why I didn't get the expected output, since the country Austria obviously existed in my data.frame. After looking through my code history and trying to figure out what went wrong, I tried:
> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
codeHelper country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18 AUT Austria 0 0 0 0 1
dummyOECD
18 1
All I have changed in the command is an additional whitespace after Austria.
Further annoying problems obviously arise, e.g. when I want to merge two frames based on the country column. One data.frame uses "Austria " while the other frame has "Austria". The matching doesn't work.
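A minimal reproduction of the merge problem, using two made-up data frames (the names a and b and their columns are hypothetical, not the real data):

```r
# Hypothetical data: 'a' carries a trailing space in "Austria ", 'b' does not.
a <- data.frame(country = c("Austria ", "Spain"), x = 1:2,
                stringsAsFactors = FALSE)
b <- data.frame(country = c("Austria", "Spain"), y = 3:4,
                stringsAsFactors = FALSE)

# The two spellings are different strings, so the comparison is FALSE.
same <- "Austria " == "Austria"

# merge() does an inner join by default; the Austria rows never match
# and are silently dropped.
merged <- merge(a, b, by = "country")
print(merged)
```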
- Is there a nice way to 'show' the whitespace on my screen so that I am aware of the problem?
- And can I remove the leading and trailing whitespace in R?
So far I have used a simple Perl script that removes the whitespace, but it would be nice if I could somehow do it inside R.
ad1) To see whitespace you could call print.data.frame directly with modified arguments (for example quote = TRUE, which prints strings in quotation marks). See also ?print.data.frame for other options.
The best method is trimws(). It can be applied to every column to clean the entire data frame.
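A minimal sketch of applying trimws() to a whole data frame. The data frame df and its columns are made up for illustration, and trimws() is only available in base R from version 3.2.0 onwards. Only character columns are touched here so that numeric columns keep their type; factor columns would need converting with as.character() first.

```r
# Hypothetical example data with stray whitespace in the character columns.
df <- data.frame(code    = c("AUT", "USA "),
                 country = c("Austria ", " USA"),
                 value   = c(1, 2),
                 stringsAsFactors = FALSE)

# Find the character columns and trim only those.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], trimws)

print(df, quote = TRUE)  # quotes make any remaining whitespace visible
```

trimws() also takes a which argument ("both", "left" or "right") if only one side should be trimmed.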
I tried trim(). It works well with whitespace as well as with '\n':
x = '\n Harden, J.\n '
trim(x)
Probably the best way is to handle the trailing whitespace when you read your data file. If you use read.csv or read.table you can set the parameter strip.white=TRUE.
If you want to clean strings afterwards, you could use a small function that removes leading whitespace, trailing whitespace, or both.
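Both approaches could be sketched roughly as follows. The inline CSV is made-up data passed through the text argument, and the helper names trim_leading, trim_trailing and trim_both are hypothetical, chosen only for this example.

```r
# Made-up CSV content with a trailing and a leading space in 'country'.
csv <- "code,country\nAUT,Austria \nESP, Spain\n"

# Default behaviour keeps the whitespace ...
raw <- read.csv(text = csv, stringsAsFactors = FALSE)
# ... while strip.white = TRUE removes it from unquoted character fields.
clean <- read.csv(text = csv, strip.white = TRUE, stringsAsFactors = FALSE)

# Regex-based helpers for cleaning strings after the data has been read.
trim_leading  <- function(x) sub("^\\s+", "", x)          # leading only
trim_trailing <- function(x) sub("\\s+$", "", x)          # trailing only
trim_both     <- function(x) gsub("^\\s+|\\s+$", "", x)   # both ends

print(trim_both(raw$country))
```

Because \\s also matches tabs and newlines, these helpers strip more than plain spaces.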
Such a function can then be applied to myDummy$country.
To 'show' the whitespace you could print the strings surrounded by quotation marks ("), which makes whitespace easier to spot.
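As an illustration, here is one possible way to show and then remove the whitespace in a column. The myDummy built here is a two-row stand-in with invented contents, not the real data, and the gsub() pattern is an assumption about how the trimming is done.

```r
# Stand-in for the real myDummy (invented contents).
myDummy <- data.frame(codeHelper = c("AUT", "ESP"),
                      country    = c("Austria ", "Spain"),
                      stringsAsFactors = FALSE)

# Show: wrap each value in quotation marks so the trailing space stands out.
quoted <- paste0('"', myDummy$country, '"')
print(quoted, quote = FALSE)

# Flag values that start or end with whitespace.
has_edge_ws <- grepl("^\\s|\\s$", myDummy$country)

# Clean: strip leading and trailing whitespace in place.
myDummy$country <- gsub("^\\s+|\\s+$", "", myDummy$country)
```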
After this, you'll need to force R not to recognize "Austria " as a level, e.g. by rebuilding the factor with only the intended levels. Let's pretend you also have "USA" and "Spain" as levels.
This is a little less intimidating than the highest voted response, but it should still work.
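A minimal sketch of the re-leveling step, assuming the country column is a factor. The trimming step that precedes it is not shown in this answer, so trimws() is used here as an assumption, and the vector country is made-up data.

```r
# Made-up factor whose levels include the padded "Austria ".
country <- factor(c("Austria ", "USA", "Spain", "Austria "))
old_levels <- levels(country)

# Trim the values (assumed step), then rebuild the factor with the
# intended levels only, so "Austria " no longer exists as a level.
country <- factor(trimws(as.character(country)),
                  levels = c("Austria", "USA", "Spain"))

print(levels(country))
```

Any value that is not listed in levels becomes NA, so it is worth checking anyNA(country) afterwards.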
Another option is to use the stri_trim function from the stringi package, which defaults to removing leading and trailing whitespace.
For only removing leading whitespace, use stri_trim_left. For only removing trailing whitespace, use stri_trim_right. When you want to remove other leading or trailing characters, you have to specify that with pattern =.
See also ?stri_trim for more info.
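A short sketch of these stringi functions, assuming the stringi package is installed; the input strings are made up. Note that pattern names the class of characters to keep at the edge, so "\\P{P}" (anything that is not punctuation) has the effect of stripping punctuation.

```r
library(stringi)

x <- "  Austria \n"

both  <- stri_trim(x)        # default: trims both sides
left  <- stri_trim_left(x)   # leading whitespace only
right <- stri_trim_right(x)  # trailing whitespace only

# Trim trailing punctuation instead of whitespace.
no_slash <- stri_trim_right("r-project.org/", pattern = "\\P{P}")

print(c(both, left, right, no_slash), quote = TRUE)
```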