I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1gb) and has multiple columns that contain white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this as I've spent the last 24hrs trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr answer that will operate only on the character columns to remove trailing and leading whitespace. You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
You could use the trimws function (R 3.2 and later) on all the columns.
myData[,c(1)]=trimws(myData[,c(1)])
You can loop this for all the columns in your dataset. It has good performance with large datasets as well.
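A minimal sketch of that loop, assuming a data frame named myData as above (the columns id, name and city are made up for illustration). It only touches character columns, because calling trimws on a numeric column would convert it to character:

```r
# Hypothetical example data; only the name myData comes from the answer above
myData <- data.frame(
  id   = 1:3,
  name = c(" Ann ", "Bob  ", "  Cy"),
  city = c("Oslo ", " Rome", "Lima"),
  stringsAsFactors = FALSE
)

# Trim leading and trailing whitespace, one column at a time
for (j in seq_along(myData)) {
  if (is.character(myData[[j]])) {
    myData[[j]] <- trimws(myData[[j]])
  }
}

str(myData)
```

Assigning back column by column keeps the other columns, such as the integer id, in their original class.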
If you want to maintain the variable classes in your data.frame, you should know that using apply will clobber them because it outputs a matrix where all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk, you can loop through the columns of your data.frame and trim the white space off only columns of class factor or character (and maintain your data classes).
R is simply not the right tool for such a file size. However, you have 2 options:
Use ffdply and ff base (the ff and ffbase packages)
Use sed (my preference)
I think that a simple approach with sapply also works, given a df like:
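The data frame from the original answer is not shown here. As a stand-in, the following made-up example has a column N containing '4 ' and '5 ', so the rest of the answer can be followed. The names dat and N come from the text; the column L and all values are assumptions:

```r
# Made-up stand-in data: N holds '4 ' and '5 ', so the whole column is character
dat <- data.frame(
  L = c("a b", " c", "d ", "e", "f"),
  N = c(1:3, "4 ", "5 "),
  stringsAsFactors = FALSE
)

class(dat$N)
```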
You will notice that dat$N is going to become class character due to '4 ' & '5 ' (you can check with class(dat$N)).
To get rid of the spaces on the numeric column, simply convert to numeric with as.numeric or as.integer.
dat$N<-as.numeric(dat$N)
If you want to remove all the spaces, do:
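One way this could look, sketched with gsub inside sapply (an assumption, since the original code is not shown). The data frame is rebuilt here so the snippet runs on its own; column L and its values are made up:

```r
# Made-up stand-in for the dat described above
dat <- data.frame(
  L = c("a b", " c", "d ", "e", "f"),
  N = c(1:3, "4 ", "5 "),
  stringsAsFactors = FALSE
)

# sapply returns a character matrix, so wrap it back into a data frame
dat <- as.data.frame(
  sapply(dat, function(x) gsub(" ", "", x, fixed = TRUE)),
  stringsAsFactors = FALSE
)

dat$L
```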
And again use as.numeric on col N (because sapply will convert it to character).
If you're dealing with large data sets like this, you could really benefit from the speed of data.table.
I would expect this to be the fastest solution. It uses the set function of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
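A sketch of what such a set() loop could look like, assuming the data.table package is installed. The table DT and its columns are made up for illustration, and trimws is used here as one possible cleaning function:

```r
library(data.table)

# Hypothetical example table
DT <- data.table(
  id   = 1:3,
  name = c(" Ann ", "Bob  ", "  Cy"),
  city = c("Oslo ", " Rome", "Lima")
)

# Positions of the character columns
char_cols <- which(vapply(DT, is.character, logical(1)))

# set() updates each column by reference, without copying the table
for (j in char_cols) {
  set(DT, j = j, value = trimws(DT[[j]]))
}

DT
```

To strip every space rather than just the leading and trailing ones, the value could be swapped for gsub(" ", "", DT[[j]], fixed = TRUE).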