How to trim leading and trailing whitespace?

Posted 2018-12-31 08:29

I am having some trouble with leading and trailing whitespace in a data.frame. For example, I want to look at a specific row in a data.frame based on a certain condition:

> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)] 

[1] codeHelper     country        dummyLI    dummyLMI       dummyUMI       
[6] dummyHInonOECD dummyHIOECD    dummyOECD      
<0 rows> (or 0-length row.names)

I was wondering why I didn't get the expected output, since the country Austria obviously exists in my data.frame. After looking through my code history and trying to figure out what went wrong, I tried:

> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
   codeHelper  country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18        AUT Austria        0        0        0              0           1
   dummyOECD
18         1

The only change in the command is an additional space after Austria.

Further annoying problems obviously arise, for example when I want to merge two frames based on the country column. One data.frame uses "Austria " while the other has "Austria", so the matching doesn't work.

  1. Is there a nice way to 'show' the whitespace on my screen so that I am aware of the problem?
  2. Can I remove the leading and trailing whitespace in R?

So far I have used a simple Perl script to remove the whitespace, but it would be nice to do it inside R.

13 answers
人间绝色
#2 · 2018-12-31 09:16

As of R 3.2.0, a new function was introduced for removing leading/trailing whitespace:

trimws()

See: http://stat.ethz.ch/R-manual/R-patched/library/base/html/trimws.html
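As a minimal sketch of how trimws() could address the merge problem from the question: the two data frames below (`left` and `right`) and their contents are hypothetical stand-ins, not the asker's actual data.

```r
# Hypothetical data frames mimicking the question:
# `left` has a trailing space in "Austria ", `right` does not.
left <- data.frame(country = c("Austria ", "Belgium"),
                   dummyOECD = c(1, 1),
                   stringsAsFactors = FALSE)
right <- data.frame(country = c("Austria", "Belgium"),
                    codeHelper = c("AUT", "BEL"),
                    stringsAsFactors = FALSE)

# Before trimming, only "Belgium" matches on the key column
rows_before <- nrow(merge(left, right, by = "country"))

# Strip leading and trailing whitespace from the key column, then merge again
left$country <- trimws(left$country)
merged <- merge(left, right, by = "country")
rows_after <- nrow(merged)
```

After trimming, both rows should match. If several columns are affected, the same call can be applied to each character column in turn.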

查无此人
#3 · 2018-12-31 09:16

To trim whitespace, use str_trim() from the stringr package. The package is on CRAN, and its manual is dated Feb 15, 2013. The function also handles character vectors.

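install.packages("stringr", dependencies=TRUE)  # one-time install from CRAN
require(stringr)
example(str_trim)                               # run the documented examples
d4$clean2<-str_trim(d4$V2)                      # d4 is an existing data frame; V2 is the column to clean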

(credit goes to commenter: R. Cotton)

大哥的爱人
#4 · 2018-12-31 09:23

I wrote a trim.strings() function to trim leading and/or trailing whitespace:

# Arguments:    x - character vector
#            side - side(s) on which to remove whitespace 
#                   default : "both"
#                   possible values: c("both", "leading", "trailing")

trim.strings <- function(x, side = "both") {
    if (is.na(match(side, c("both", "leading", "trailing")))) {
        side <- "both"
    }
    if (side == "leading") {
        sub("^\\s+", "", x)
    } else if (side == "trailing") {
        sub("\\s+$", "", x)
    } else {
        gsub("^\\s+|\\s+$", "", x)
    }
}

For illustration,

a <- c("   ABC123 456    ", " ABC123DEF          ")

# returns string without leading and trailing whitespace
trim.strings(a)
# [1] "ABC123 456" "ABC123DEF" 

# returns string without leading whitespace
trim.strings(a, side = "leading")
# [1] "ABC123 456    "      "ABC123DEF          "

# returns string without trailing whitespace
trim.strings(a, side = "trailing")
# [1] "   ABC123 456" " ABC123DEF"   
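
For comparison, base R's trimws() (R 3.2.0 and later) offers the same three modes through its `which` argument. The sketch below reuses the vector `a` from the illustration; the variable names `both`, `leading`, and `trailing` are only for this example.

```r
a <- c("   ABC123 456    ", " ABC123DEF          ")

# which = "both" is the default and trims both ends
both <- trimws(a)

# which = "left" trims leading whitespace only
leading <- trimws(a, which = "left")

# which = "right" trims trailing whitespace only
trailing <- trimws(a, which = "right")
```

These calls should give the same results as trim.strings() with side set to "both", "leading", and "trailing" respectively.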
步步皆殇っ
#5 · 2018-12-31 09:27

I would have preferred to post this as a comment on user56's answer, but I cannot comment yet, so I am writing it as a separate answer. Leading and trailing blanks can also be removed with the trim() function from the gdata package:

require(gdata)
example(trim)

Usage example:

> trim("   Remove leading and trailing blanks    ")
[1] "Remove leading and trailing blanks"
与风俱净
#6 · 2018-12-31 09:28

Use grep or grepl to find observations with trailing whitespace, and sub to get rid of it.

names<-c("Ganga Din\t","Shyam Lal","Bulbul ")
grep("[[:space:]]+$",names)
[1] 1 3
grepl("[[:space:]]+$",names)
[1]  TRUE FALSE  TRUE
sub("[[:space:]]+$","",names)
[1] "Ganga Din" "Shyam Lal" "Bulbul"  
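
The same idea can be extended to catch leading whitespace too and to make the stray characters visible, which touches on the first part of the question. This is only a sketch: the data frame `df` and its values are hypothetical.

```r
# Hypothetical data frame with stray whitespace in the country column
df <- data.frame(country = c("Austria ", "Belgium", " Chile"),
                 stringsAsFactors = FALSE)

# Flag values with whitespace at either end
has_ws <- grepl("^[[:space:]]+|[[:space:]]+$", df$country)

# Row numbers of the affected values
which(has_ws)

# dput() prints the strings with quotes, so the extra spaces are visible
dput(df$country[has_ws])

# gsub() rather than sub(), so both ends are cleaned in a single call
df$country <- gsub("^[[:space:]]+|[[:space:]]+$", "", df$country)
```

Printing a data frame does not quote strings, which is why the trailing space in the question was invisible; printing the column as a plain character vector or using dput() shows the quotes.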
旧时光的记忆
#7 · 2018-12-31 09:29

Another related problem occurs if you have multiple spaces in between inputs:

> a <- "  a string         with lots   of starting, inter   mediate and trailing   whitespace     "

You can then easily split this string into "real" tokens by passing a regular expression as the split argument:

> strsplit(a, split=" +")
[[1]]
 [1] ""           "a"          "string"     "with"       "lots"      
 [6] "of"         "starting,"  "inter"      "mediate"    "and"       
[11] "trailing"   "whitespace"

Note that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
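
One possible way to avoid the empty first token is to trim the string before splitting. The sketch below assumes R 3.2.0 or later for trimws(); the names `tokens` and `squished` are only for illustration.

```r
a <- "  a string         with lots   of starting, inter   mediate and trailing   whitespace     "

# Trim both ends first, so there is no match at the start of the string
tokens <- strsplit(trimws(a), split = " +")[[1]]

# Alternatively, collapse every run of spaces into a single space
squished <- gsub(" +", " ", trimws(a))
```

Here `tokens` holds only the non-empty words, and `squished` is the same text with single spaces between words.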
