I have a data frame 'math.numeric' with 32 variables. Each row represents a student and each variable is an attribute. The students have been put into 5 groups based on their final grade.
The data looks as follows:
head(math.numeric)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason ... group
1 1 18 2 1 1 4 4 1 5 1 2
1 1 17 2 1 2 1 1 1 3 1 2
1 1 15 2 2 2 1 1 1 3 3 3
1 1 15 2 1 2 4 2 2 4 2 4
1 1 16 2 1 2 3 3 3 3 2 3
1 2 16 2 2 2 4 3 4 3 4 4
I am performing t-tests on each variable for group 1 vs. all the other groups to identify significantly different attributes with this group. I am looking to pull out the p-values for each test such as:
t.test(subset(math.numeric$school, math.numeric$group == 1),
subset(math.numeric$school, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$sex, math.numeric$group == 1),
subset(math.numeric$sex, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$age, math.numeric$group == 1),
subset(math.numeric$age, math.numeric$group != 1))$p.value
I have been trying to figure out how I can create a loop to do this instead of writing out each test one at a time. I have tried a for loop, and lapply, but so far I have not had any luck.
I am fairly new to this, so any help would be appreciated.
Courtney
Your example data is not sufficient to actually carry out t-tests on all subgroups. For that reason, I take the iris
dataset, which contains 3 species of plants: Setosa, Versicolor, and Virginica. These are my groups. You will have to adjust your code accordingly. Below I show how to test one groups versus all other groups, one group versus each other group, and all combinations of individual groups.
One group versus all other groups combined:
First, let's say I want to compare Versicolor and Virginica to Setosa, i.e. Setosa is my group 1
to which all other groups should be compared. An easy way to achieve what you want is the following:
sapply(names(iris)[-ncol(iris)], function(x){
t.test(iris[iris$Species=="setosa", x],
iris[iris$Species!="setosa", x])$p.value
})
Sepal.Length Sepal.Width Petal.Length Petal.Width
7.709331e-32 1.035396e-13 1.746188e-69 1.347804e-60
Here, I have supplied the names of the different variables in the dataset names(iris)
- exlcuding the column indicating the grouping variable [-ncol(iris)]
(since it is the last column) - as vector to sapply
, which passes the corresponding names as arguments to the function that I have defined.
One group versus each of the other groups:
In case you want to make groupwise comparisons for all groups, the following may be helpful: First, create a dataframe of all group x variable combinations that you are going to do, excluding the grouping variable itself and the reference group, of course. This can be achieved by:
comps <- expand.grid(unique(iris$Species)[-1], # excluding Setosa as reference group
names(iris)[-ncol(iris)] # excluding group column
)
head(comps)
Var1 Var2
1 versicolor Sepal.Length
2 virginica Sepal.Length
3 versicolor Sepal.Width
4 virginica Sepal.Width
5 versicolor Petal.Length
6 virginica Petal.Length
Here, Var1
are the different species, and Var2
the different variables for which comparisons are to be done. The reference group 1
or Setosa is implicit in this case. Now, I can use apply
to create the tests. I do this by using each row of comps
as argument with two elements, the first of which indicates which group's turn it is, and the second argument indicates which variable should be compared. These will be used to subset the original dataframe.
comps$pval <- apply(comps, 1, function(x) {
t.test(iris[iris$Species=="setosa", x[2]], iris[iris$Species==x[1], x[2]])$p.value
} )
where group 1 aka Setosa is hard-coded in the function. This gives me a dataframe with p-values for all combinations (with Setosa as reference group) so that they are easy to look up:
head(comps)
Var1 Var2 pval
1 versicolor Sepal.Length 3.746743e-17
2 virginica Sepal.Length 3.966867e-25
3 versicolor Sepal.Width 2.484228e-15
4 virginica Sepal.Width 4.570771e-09
5 versicolor Petal.Length 9.934433e-46
6 virginica Petal.Length 9.269628e-50
All combinations of groups:
You can expand the above easily to produce a dataframe that contains p-values of t-tests for each combination of groups. One approach would be:
comps <- expand.grid(unique(iris$Species), unique(iris$Species), names(iris)[-ncol(iris)])
This now has three columns. The first two are the groups, and the third the variable:
head(comps)
Var1 Var2 Var3
1 setosa setosa Sepal.Length
2 versicolor setosa Sepal.Length
3 virginica setosa Sepal.Length
4 setosa versicolor Sepal.Length
5 versicolor versicolor Sepal.Length
6 virginica versicolor Sepal.Length
You can use this to carry out the tests:
comps$pval <- apply(comps, 1, function(x) {
t.test(iris[iris$Species==x[1], x[3]], iris[iris$Species==x[2], x[3]])$p.value
} )
I get an error message: what should I do?
t.test
may throw out an error message if the sample size is too small or if the values are constant for one group. This is problematic since it might only occur for specific groups, and you may not know in advance which one it is. Yet the error will disrupt the entire function call to apply
, and you will not be able to see any results.
A way to circumvent this and to identify the problematic groups is to wrap the function t.test
around dplyr::failwith
(see also ?tryCatch
). To show how this works, consider the following:
smalln <- data.frame(a=1, b=2)
t.test(smalln$a, smalln$b)
> Error in t.test.default(smalln$a, smalln$b) : not enough 'x' observations
failproof.t <- failwith(default="Some default of your liking", t.test, quiet = T)
failproof.t(smalln$a, smalln$b)
[1] "Some default of your liking"
That way, whenever t.test
would throw out an error, you get a character as a result instead and the computation continues with other groups. Needless to say, you could also set default
to a number, or anything else. It does not have to be a character.
Statistical disclaimer:
Having said all of this, note that conducting a several t-tests is not necessarily good statistical practice. You may want to adjust your p-values to account for multiple testing, or you may want to use alternative test procedures that conducts joint tests.
Hows this?
pvals <- numeric() #the vector of p values
k <- 1 #in case you choose to use a subset not continuing from 1
# "for(i in seq(1,5))" is just doing the pvalues for the first 5 columns. You could do a
# list, like "c(1,2,4)" (in place of "seq(1,5)"), to do tests for columns 1, 2, and 4.
# To do all of the columns, try "for(i in seq(1,(ncol(math.numeric)-1)))".
for(i in seq(1,5)){
# using your code to grab the p-values and store them in the kth element of "pvals"
pvals[k] <- t.test(subset(math.numeric[,i], math.numeric$group == 1),
subset(math.numeric[,i], math.numeric$group != 1))$p.value
#iterating the "pvals" vector entry counter
k=k+1
}
pvals #printing the p values for each test
Consider splitting the data frame by group and using mapply()
across the columns. Output becomes a compiled list of tests' statistics: statistic, parameter, p-value, confid. interval, etc.
# FILTER ROWS AND SUBSET NUMERIC COLS
group1df <- df[df$group==1, 1:ncol(df)-1]
othgroupdf <- df[df$group!=1, 1:ncol(df)-1]
# T-TEST FCT
tfct <- function(v1, v2){
t.test(v1, v2)
}
# RUN T-TESTS BY COL, SAVE RESULTS TO LIST
ttests <- mapply(tfct, group1df, othgroupdf)