Question:
I've just started using R and I'm not sure how to use my dataset with the following sample code:
sample(x, size, replace = FALSE, prob = NULL)
I have a dataset that I need to put into a training (75%) and testing (25%) set.
I'm not sure what information I'm supposed to put into x and size.
Is x the dataset file, and size the number of samples I have?
Answer 1:
There are numerous ways to partition data. For a more complete approach, take a look at the createDataPartition function in the caret package.
Here is a simple example:
data(mtcars)
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mtcars))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)
train <- mtcars[train_ind, ]
test <- mtcars[-train_ind, ]
Answer 2:
It can be easily done by:
set.seed(101) # Set Seed so that same sample can be reproduced in future also
# Now selecting 75% of the data as a sample from the total 'n' rows of the data
sample <- sample.int(n = nrow(data), size = floor(.75*nrow(data)), replace = FALSE)
train <- data[sample, ]
test <- data[-sample, ]
By using caTools package:
require(caTools)
set.seed(101)
sample = sample.split(data$anycolumn, SplitRatio = .75)
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)
Answer 3:
This is almost the same code, but tidier:
bound <- floor((nrow(df)/4)*3) #define % of training and test set
df <- df[sample(nrow(df)), ] #sample rows
df.train <- df[1:bound, ] #get training set
df.test <- df[(bound+1):nrow(df), ] #get test set
Answer 4:
I would use dplyr for this; it makes it super simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating sets but also for traceability during your project. Add one if your data doesn't contain it already.
library(dplyr) # provides %>%, sample_frac() and anti_join()
mtcars$id <- 1:nrow(mtcars)
train <- mtcars %>% dplyr::sample_frac(.75)
test <- dplyr::anti_join(mtcars, train, by = 'id')
Answer 5:
I will split 'a' into train (70%) and test (30%):
a # original data frame
library(dplyr)
train<-sample_frac(a, 0.7)
sid<-as.numeric(rownames(train)) # because rownames() returns character
test<-a[-sid,]
done
Answer 6:
library(caret)
intrain<-createDataPartition(y=sub_train$classe,p=0.7,list=FALSE)
training<-m_train[intrain,]
testing<-m_train[-intrain,]
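For readers without caret installed, the idea can be approximated in base R. As far as I understand, createDataPartition samples within the levels of y when y is a factor, so that both parts keep similar class proportions. The following is only a minimal sketch of that stratified idea, not caret's actual implementation; it assumes the built-in iris data and an arbitrary seed.

```r
# Stratified 70/30 split in base R: sample 70% of the rows within each class
# so that train and test keep roughly the same class proportions.
set.seed(123)                                   # arbitrary seed
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  # rows[sample.int(...)] avoids the sample(x) pitfall when x has length 1
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}), use.names = FALSE)

training <- iris[train_idx, ]
testing  <- iris[-train_idx, ]

table(training$Species)  # 35 rows per species
table(testing$Species)   # 15 rows per species
```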
Answer 7:
My solution is basically the same as dickoa's but a little easier to interpret:
data(mtcars)
n = nrow(mtcars)
trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
train = mtcars[trainIndex ,]
test = mtcars[-trainIndex ,]
Answer 8:
If you type:
?sample
it will open a help page that explains what the parameters of the sample function mean.
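As a rough illustration of those parameters (a minimal sketch on the built-in mtcars data, with an arbitrary seed): x is the set of values to draw from, here the row numbers rather than the dataset itself, and size is how many of them to draw.

```r
# x: the values to draw from (here: the row numbers of the data frame)
# size: how many values to draw
# replace = FALSE: no row number is drawn twice
set.seed(42)                      # arbitrary seed, only for repeatability
n <- nrow(mtcars)                 # 32 rows
idx <- sample(x = seq_len(n), size = floor(0.75 * n), replace = FALSE)

length(idx)              # 24, i.e. floor(0.75 * 32)
anyDuplicated(idx) == 0  # TRUE: sampling without replacement
```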
I am not an expert, but here is some code I have:
data <- data.frame(matrix(rnorm(400), nrow=100))
splitdata <- split(data[1:nrow(data),],sample(rep(1:4,as.integer(nrow(data)/4))))
test <- splitdata[[1]]
train <- rbind(splitdata[[2]],splitdata[[3]],splitdata[[4]])
This will give you 75% train and 25% test.
Answer 9:
My solution shuffles the rows, then takes the first 75% of the rows as train and the last 25% as test. Super simples!
row_count <- nrow(orders_pivotted)
shuffled_rows <- sample(row_count)
train_count <- floor(row_count*0.75)
train <- orders_pivotted[head(shuffled_rows,train_count),]
test <- orders_pivotted[tail(shuffled_rows,row_count-train_count),] # all remaining rows, so none is dropped
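The shuffle-then-cut idea can also be wrapped in a small function. This is only a sketch: split_rows is a hypothetical helper name, mtcars stands in for your own data, and the test set is taken as every row not in train, so no row is left out when the row count is not divisible by four.

```r
# split_rows is a hypothetical helper name; prop is the training fraction
split_rows <- function(df, prop = 0.75) {
  n <- nrow(df)
  shuffled <- sample(n)                       # random permutation of 1..n
  n_train <- floor(n * prop)
  list(
    train = df[head(shuffled, n_train), , drop = FALSE],
    test  = df[tail(shuffled, n - n_train), , drop = FALSE]  # all remaining rows
  )
}

set.seed(1)                                   # arbitrary seed
parts <- split_rows(mtcars)
nrow(parts$train) + nrow(parts$test) == nrow(mtcars)  # TRUE: no row is lost
```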
Answer 10:
Below is a function that creates a list of sub-samples of the same size. This is not exactly what you wanted, but it might prove useful for others. In my case I used it to build multiple classification trees on smaller samples to test for overfitting:
df_split <- function (df, number){
  sizedf <- length(df[,1])
  bound <- sizedf/number
  list <- list()
  for (i in 1:number){
    list[i] <- list(df[((i*bound+1)-bound):(i*bound),])
  }
  return(list)
}
Example :
x <- matrix(c(1:10), ncol=1)
x
#       [,1]
#  [1,]    1
#  [2,]    2
#  [3,]    3
#  [4,]    4
#  [5,]    5
#  [6,]    6
#  [7,]    7
#  [8,]    8
#  [9,]    9
# [10,]   10
x.split <- df_split(x,5)
x.split
# [[1]]
# [1] 1 2
# [[2]]
# [1] 3 4
# [[3]]
# [1] 5 6
# [[4]]
# [1] 7 8
# [[5]]
# [1] 9 10
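When the number of rows is not a multiple of number, bound becomes fractional and the chunk boundaries are harder to reason about. One possible alternative, sketched here under the assumption that chunks differing in size by at most one row are acceptable, is to compute a group label per row and let split() do the work. df_chunks is a hypothetical name, and mtcars is used only as example data.

```r
# Split the rows into `number` consecutive chunks whose sizes differ by at most 1.
# df_chunks is a hypothetical helper name, not taken from the function above.
df_chunks <- function(df, number) {
  n <- nrow(df)
  # group label for each row: 1, 1, ..., 2, 2, ..., number
  group <- ceiling(seq_len(n) * number / n)
  split(as.data.frame(df), group)
}

parts <- df_chunks(mtcars, 5)   # 32 rows into 5 chunks
sapply(parts, nrow)             # chunk sizes, summing to 32
```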
Answer 11:
Use the caTools package. Sample code:
library(caTools)
data # your data frame
split = sample.split(data$DependentColumnName, SplitRatio = 0.6)
training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)
Answer 12:
Use base R. The runif function generates uniformly distributed values between 0 and 1. For a given cutoff value (train.size in the example below), approximately the same percentage of random records will always fall below the cutoff.
data(mtcars)
set.seed(123)
#desired proportion of records in training set
train.size<-.7
#true/false vector of values above/below the cutoff above
train.ind<-runif(nrow(mtcars))<train.size
#train
train.df<-mtcars[train.ind,]
#test
test.df<-mtcars[!train.ind,]
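Note the word "approximately": with a runif cutoff each row is assigned independently, so the training set size varies from run to run. A small sketch contrasting this with sample(), which fixes the size exactly (built-in mtcars data, arbitrary seed; approx.ind and exact.ind are names made up for this illustration):

```r
set.seed(123)                      # arbitrary seed
n <- nrow(mtcars)
train.size <- 0.7

# runif cutoff: each row is assigned independently, so the count varies
approx.ind <- runif(n) < train.size
sum(approx.ind)                    # close to 0.7 * n, but not guaranteed equal

# sample(): the training set has exactly floor(0.7 * n) rows
exact.ind <- seq_len(n) %in% sample(n, floor(train.size * n))
sum(exact.ind)                     # always 22 for the 32 rows of mtcars
```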
Answer 13:
A briefer and simpler way, using the awesome dplyr library:
library(dplyr)
set.seed(275) #to get repeatable data
data.train <- sample_frac(Default, 0.7)
train_index <- as.numeric(rownames(data.train))
data.test <- Default[-train_index, ]
Answer 14:
require(caTools)
set.seed(101) #This is used to create same samples everytime
split1=sample.split(data$anycol,SplitRatio=2/3)
train=subset(data,split1==TRUE)
test=subset(data,split1==FALSE)
The sample.split() function returns a logical vector (split1 here) with one element per row; about 2/3 of its values are TRUE and the rest are FALSE. It does not add a column to the data frame. The rows where split1 is TRUE are copied into train and the other rows into test.
Answer 15:
I can suggest using the rsample package:
library(rsample)
# choosing 75% of the data to be the training data
data_split <- initial_split(data, prop = .75)
# extracting training data and test data as two separate data frames
data_train <- training(data_split)
data_test <- testing(data_split)
Answer 16:
Assuming df is your data frame, and that you want to create 75% train and 25% test
all <- 1:nrow(df)
train_i <- sort(sample(all, round(nrow(df)*0.75,digits = 0),replace=FALSE))
test_i <- all[-train_i]
Then create the train and test data frames:
df_train <- df[train_i,]
df_test <- df[test_i,]
Answer 17:
Beware of sample for splitting if you need reproducible results. If your data changes even slightly, the split will change even if you use set.seed. For example, imagine the sorted list of IDs in your data is all the numbers between 1 and 10. If you dropped just one observation, say 4, sampling by location would yield a different result because 5 to 10 have all moved places.
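The 1-to-10 case can be checked directly. This is a small sketch with made-up variable names, reusing the same sampled positions on both versions of the data:

```r
ids_full    <- 1:10                        # sorted IDs 1..10
ids_dropped <- ids_full[ids_full != 4]     # same data with observation 4 removed

set.seed(1)                                # same seed, same positions
pos <- sample(length(ids_dropped), 4)      # 4 row positions out of 9

picked_full    <- ids_full[pos]            # IDs selected from the full data
picked_dropped <- ids_dropped[pos]         # IDs selected after the drop

# Any position >= 4 now points at a different ID (shifted by one)
data.frame(pos, picked_full, picked_dropped)
```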
An alternative method is to use a hash function to map IDs into some pseudo random numbers and then sample on the mod of these numbers. This sample is more stable because assignment is now determined by the hash of each observation, and not by its relative position.
For example:
require(openssl) # for md5
require(data.table) # for the demo data
set.seed(1) # this won't help `sample`
population <- as.character(1e5:(1e6-1)) # some made up ID names
N <- 1e4 # sample size
sample1 <- data.table(id = sort(sample(population, N))) # randomly sample N ids
sample2 <- sample1[-sample(N, 1)] # randomly drop one observation from sample1
# samples are all but identical
sample1
sample2
nrow(merge(sample1, sample2))
[1] 9999
# row splitting yields very different test sets, even though we've set the seed
test <- sample(N-1, N/2, replace = F)
test1 <- sample1[test, .(id)]
test2 <- sample2[test, .(id)]
nrow(test1)
[1] 5000
nrow(merge(test1, test2))
[1] 2653
# to fix that, we can use a hash function and sample on the first hex digit of the hash
md5_bit_mod <- function(x, m = 2L) {
# Inputs:
# x: a character vector of ids
# m: the modulo divisor (modify for split proportions other than 50:50)
# Output: remainders from dividing the first digit of the md5 hash of x by m
as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}
# hash splitting preserves the similarity, because the assignment of test/train
# is determined by the hash of each obs., and not by its relative location in the data
# which may change
test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
nrow(merge(test1a, test2a))
[1] 5057
nrow(test1a)
[1] 5057
The sample size is not exactly 5000 because assignment is probabilistic, but that shouldn't be a problem in large samples thanks to the law of large numbers.
See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html
and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo
Answer 18:
set.seed(123)
llwork<-sample(1:nrow(mydata),round(0.75*nrow(mydata),digits=0)) # nrow(), since length() of a data frame counts columns
wmydata<-mydata[llwork, ]
tmydata<-mydata[-llwork, ]
Answer 19:
There is a very simple way to select a number of rows using R's row and column indexes. This lets you cleanly split the data set at a given number of rows, say the first 80% of your data.
In R all rows and columns are indexed, so DataSetName[1,1] is the value in the first row and first column of "DataSetName". I can select rows using [x,] and columns using [,x].
For example: if I have a data set conveniently named "data" with 100 rows, I can view the first 80 rows using
View(data[1:80,])
In the same way I can select these rows and subset them using:
train = data[1:80,]
test = data[81:100,]
Now I have my data split into two parts without the possibility of resampling. Quick and easy.