How to trim leading and trailing whitespace?

Posted 2018-12-31 08:29

I am having some trouble with leading and trailing whitespace in a data.frame. For example, I want to look at a specific row in a data.frame based on a certain condition:

> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)] 

[1] codeHelper     country        dummyLI    dummyLMI       dummyUMI       
[6] dummyHInonOECD dummyHIOECD    dummyOECD      
<0 rows> (or 0-length row.names)

I wondered why I didn't get the expected output, since the country Austria obviously exists in my data.frame. After looking through my code history and trying to figure out what went wrong, I tried:

> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
   codeHelper  country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18        AUT Austria        0        0        0              0           1
   dummyOECD
18         1

All I changed in the command is an additional space after Austria.

Further annoying problems obviously arise, for example when I want to merge two data frames based on the country column. One data.frame uses "Austria " while the other has "Austria", so the matching doesn't work.

  1. Is there a nice way to 'show' the whitespace on my screen so that I am aware of the problem?
  2. And can I remove the leading and trailing whitespace in R?

So far I have used a simple Perl script to remove the whitespace, but it would be nice if I could somehow do it inside R.

13 answers
美炸的是我
Reply #2 · 2018-12-31 09:05

Regarding 1): to see whitespace, you can call print.data.frame directly with modified arguments:

print(head(iris), quote=TRUE)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
# 1        "5.1"       "3.5"        "1.4"       "0.2" "setosa"
# 2        "4.9"       "3.0"        "1.4"       "0.2" "setosa"
# 3        "4.7"       "3.2"        "1.3"       "0.2" "setosa"
# 4        "4.6"       "3.1"        "1.5"       "0.2" "setosa"
# 5        "5.0"       "3.6"        "1.4"       "0.2" "setosa"
# 6        "5.4"       "3.9"        "1.7"       "0.4" "setosa"

See also ?print.data.frame for other options.
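
As a minimal sketch of the same idea on data that actually contains stray whitespace: the data frame df and its values below are made up for illustration and are not the asker's data. Quoting makes the trailing space visible, and grepl() can flag the affected rows.

```r
# Hypothetical data: the first country name carries a trailing space
df <- data.frame(code = c("AUT", "BEL"),
                 country = c("Austria ", "Belgium"),
                 stringsAsFactors = FALSE)

# Quoting the cells makes stray whitespace visible
print(df, quote = TRUE)

# Flag values that start or end with whitespace
has_ws <- grepl("^\\s|\\s$", df$country)
df[has_ws, ]
```

The grepl() check is handy on large frames where scanning the printed output by eye is not practical.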

琉璃瓶的回忆
Reply #3 · 2018-12-31 09:06

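The best method is trimws(), which is part of base R since version 3.2.0.

The following code applies this function to an entire data frame. Note that it converts every column, including numeric ones, to character:

mydataframe <- data.frame(lapply(mydataframe, trimws), stringsAsFactors = FALSE)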
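
If the data frame also holds numeric columns, a variant that trims only the character columns may be preferable. The following is a sketch under that assumption; the example data and the helper name is_chr are hypothetical.

```r
# Hypothetical data frame with one character and one numeric column
mydataframe <- data.frame(country = c(" Austria ", "Spain "),
                          value = c(1.5, 2),
                          stringsAsFactors = FALSE)

# Trim only the character columns so that numeric columns keep their type
is_chr <- vapply(mydataframe, is.character, logical(1))
mydataframe[is_chr] <- lapply(mydataframe[is_chr], trimws)

str(mydataframe)
```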

何处买醉
Reply #4 · 2018-12-31 09:07

I tried trim(). It works well with spaces as well as with '\n':

x = '\n Harden, J.\n '
trim(x)

美炸的是我
Reply #5 · 2018-12-31 09:09

Probably the best way is to handle the trailing whitespace when you read your data file. If you use read.csv or read.table you can set the argument strip.white=TRUE.
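
A small sketch of what strip.white does, reading from an inline string instead of a file. The CSV content and the names csv_text, raw and clean are invented for this illustration.

```r
# Hypothetical CSV content with padding around the country names
csv_text <- "code,country\nAUT, Austria \nBEL,Belgium  \n"

# Default behaviour: padding inside the fields is kept
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# strip.white = TRUE removes leading and trailing whitespace from unquoted fields
clean <- read.csv(text = csv_text, strip.white = TRUE, stringsAsFactors = FALSE)

print(raw, quote = TRUE)
print(clean, quote = TRUE)
```

Note that strip.white only affects unquoted fields, so values wrapped in quotes in the file would still need trimming afterwards.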

If you want to clean strings afterwards you could use one of these functions:

# returns string w/o leading whitespace
trim.leading <- function (x)  sub("^\\s+", "", x)

# returns string w/o trailing whitespace
trim.trailing <- function (x) sub("\\s+$", "", x)

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

To use one of these functions on myDummy$country:

 myDummy$country <- trim(myDummy$country)
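
As a sketch of why this matters for merging, the example below uses two small made-up data frames (a and b are hypothetical, as are their values) in which one country name carries a trailing space. The trim helper is repeated so the snippet runs on its own.

```r
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Hypothetical frames: "Austria " in a, "Austria" in b
a <- data.frame(country = c("Austria ", "Spain"), gdp = c(1, 2),
                stringsAsFactors = FALSE)
b <- data.frame(country = c("Austria", "Spain"), pop = c(9, 47),
                stringsAsFactors = FALSE)

# Before trimming, the padded key does not match
before <- merge(a, b, by = "country")

# Trim the key column in both frames, then merge again
a$country <- trim(a$country)
b$country <- trim(b$country)
merged <- merge(a, b, by = "country")

nrow(before)
nrow(merged)
```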

To 'show' the whitespace you could use:

 paste(myDummy$country)

which shows the strings surrounded by quotation marks ("), making whitespace easier to spot.

只若初见
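Reply #6 · 2018-12-31 09:09

myDummy$country <- as.character(myDummy$country)  # avoid NA from an unknown factor level
myDummy$country[myDummy$country == "Austria "] <- "Austria"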

After this, you'll need to force R not to recognize "Austria " as a level. Let's pretend you also have "USA" and "Spain" as levels:

myDummy$country = factor(myDummy$country, levels=c("Austria", "USA", "Spain"))

A little less intimidating than the highest voted response, but it should still work.
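
If country is already a factor, another possible route is to trim the levels themselves rather than listing them by hand. This is only a sketch on a made-up factor; it assumes R 3.2.0 or later for trimws().

```r
# Hypothetical factor with a padded and an unpadded spelling of the same country
country <- factor(c("Austria ", "Austria", "USA"))

# Assigning trimmed levels renames them; levels that become identical are merged
levels(country) <- trimws(levels(country))

levels(country)
table(country)
```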

刘海飞了
Reply #7 · 2018-12-31 09:13

Another option is to use the stri_trim function from the stringi package which defaults to removing leading and trailing whitespace:

> library(stringi)
> x <- c("  leading space","trailing space   ")
> stri_trim(x)
[1] "leading space"  "trailing space"

For only removing leading whitespace, use stri_trim_left. For only removing trailing whitespace, use stri_trim_right. When you want to remove other leading or trailing characters, you have to specify that with pattern =.

See also ?stri_trim for more info.
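
A short sketch of the pattern argument, assuming the stringi package is installed. The input vector is made up. As far as I understand the documentation, pattern describes the characters to keep at the edges, so trimming underscores means passing a class that matches everything except "_".

```r
library(stringi)  # assumes the stringi package is installed

# Hypothetical strings padded with underscores instead of whitespace
x <- c("__leading", "trailing__", "__both__")

# pattern names the characters to keep: anything that is not "_"
both  <- stri_trim_both(x, pattern = "[^_]")
left  <- stri_trim_left(x, pattern = "[^_]")
right <- stri_trim_right(x, pattern = "[^_]")

both
left
right
```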
