I am having some trouble with leading and trailing whitespace in a data.frame. E.g., I want to look at a specific row in a data.frame based on a certain condition:
> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)]
[1] codeHelper country dummyLI dummyLMI dummyUMI
[6] dummyHInonOECD dummyHIOECD dummyOECD
<0 rows> (or 0-length row.names)
I was wondering why I didn't get the expected output, since the country Austria obviously exists in my data.frame. After looking through my code history and trying to figure out what went wrong, I tried:
> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
codeHelper country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18 AUT Austria 0 0 0 0 1
dummyOECD
18 1
All I changed in the command is an additional space after Austria.
Further annoying problems obviously arise, e.g. when I want to merge two frames based on the country column. One data.frame uses "Austria " while the other has "Austria", so the matching doesn't work.
- Is there a nice way to 'show' the whitespace on my screen so that I am aware of the problem?
- And can I remove the leading and trailing whitespace in R?
So far I have used a simple Perl script to remove the whitespace, but it would be nice if I could somehow do it inside R.
As of R 3.2.0, a new function, trimws(), was introduced for removing leading/trailing whitespace.
See: http://stat.ethz.ch/R-manual/R-patched/library/base/html/trimws.html
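A minimal sketch of how trimws() could be applied to the situation in the question. The data frame below is made up for illustration and only mimics the question's country column:

```r
# Hypothetical data frame; the values are invented for this example.
myDummy <- data.frame(
  codeHelper = c("AUT", "BEL"),
  country    = c("Austria ", " Belgium"),
  stringsAsFactors = FALSE
)

# trimws() strips whitespace on both sides by default;
# which = "left" or which = "right" restricts it to one side.
myDummy$country <- trimws(myDummy$country)

# The comparison now matches without the trailing space.
austria <- myDummy[myDummy$country == "Austria", ]
austria
```

If the column is a factor rather than a character vector, converting it with as.character() first is probably the safer route.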
To manipulate the whitespace, use str_trim() in the stringr package. The package manual is dated Feb 15, 2013, and the package is on CRAN. The function can also handle string vectors.
(credit goes to commenter: R. Cotton)
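As a short illustration, assuming stringr is installed (install.packages("stringr")); the input values are made up:

```r
# Assumes the stringr package is available.
library(stringr)

x <- c("  Austria ", "Belgium  ", " Czech Republic")  # invented example values

both  <- str_trim(x)                  # trims both sides (the default)
left  <- str_trim(x, side = "left")   # leading whitespace only
right <- str_trim(x, side = "right")  # trailing whitespace only

both
```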
I created a trim.strings() function to trim leading and/or trailing whitespace.
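The function body is not shown here, so the following is only a sketch of what such a helper might look like. The `side` argument and its three values are assumptions made for this example:

```r
# Sketch only: the `side` argument and its allowed values are hypothetical.
trim.strings <- function(x, side = c("both", "leading", "trailing")) {
  side <- match.arg(side)
  switch(side,
    leading  = sub("^\\s+", "", x),           # strip whitespace at the start
    trailing = sub("\\s+$", "", x),           # strip whitespace at the end
    both     = gsub("^\\s+|\\s+$", "", x)     # strip both ends
  )
}

a <- c("   abc def  ", "xyz   ")  # invented example values
trim.strings(a)
trim.strings(a, side = "leading")
trim.strings(a, side = "trailing")
```

Interior whitespace (as in "abc def") is left untouched, since both patterns are anchored to the start or end of the string.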
I'd prefer to add this as a comment on user56's answer, but I am not yet able to, so I am writing it as an independent answer. Leading and trailing blanks can also be removed with the trim() function from the gdata package:
Usage example:
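A minimal sketch of such usage, assuming gdata is installed (install.packages("gdata")); the data frame is made up for illustration:

```r
# Assumes the gdata package is available.
library(gdata)

# Invented data frame with padded strings.
df <- data.frame(
  country = c("Austria ", " Belgium "),
  code    = c(" AUT", "BEL "),
  stringsAsFactors = FALSE
)

# trim() accepts a character vector ...
single <- trim("  Austria  ")

# ... and also a data frame, in which case the columns are trimmed.
df <- trim(df)
df$country == "Austria"
```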
Use grep or grepl to find observations with whitespace, and sub to get rid of it.
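For example, with base R only (the values are made up). Wrapping each value in brackets is one simple way to make the whitespace visible on screen:

```r
x <- c("Austria ", " Belgium", "Chile", "  Denmark  ")  # invented example values

# Flag entries that start or end with whitespace.
has_ws <- grepl("^\\s+|\\s+$", x)
x[has_ws]

# Bracket the values so stray spaces become visible.
paste0("[", x, "]")

# sub() replaces only the first match, so strip each side in turn
# (gsub() with the combined pattern would also work).
clean <- sub("\\s+$", "", sub("^\\s+", "", x))
clean
```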
Another related problem occurs if you have multiple spaces in between inputs.
You can then easily split such a string into "real" tokens by passing a regular expression as the split argument of strsplit(). Note that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
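A small sketch of this behaviour, using a made-up input string:

```r
s <- "  first   second third    fourth"  # invented input with uneven spacing

# One or more spaces are treated as a single separator.
tokens <- strsplit(s, " +")[[1]]
tokens
# The match at the start of the string produces a leading "" element:
# "" "first" "second" "third" "fourth"

# Trimming before splitting avoids the empty token.
clean_tokens <- strsplit(trimws(s), " +")[[1]]
clean_tokens
```

Using "\\s+" instead of " +" would also treat tabs and newlines as separators.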