Wednesday, July 20, 2011

R - t.test for subsets of a data frame (ddply or tapply)

(two sample t-test for groups in data frame here)
First we create a data frame, on which we can work
x <- rnorm(100)                 
y <- sample(1:3, 100, replace=T)
z <- sample(1:2, 100, replace=T)

df <- data.frame(group=y,sex=z,sds=x)
head(df)
group sex       sds
1     1   2 1.3412663
2     1   1 0.8000326
3     2   2 0.0371029
4     1   2 0.2064684
5     1   2 1.6429816
6     3   1 1.3271138
In the following I use ddply() from the plyr-package, so if you did not install and/or load it you have to do it now - the command library() without any argument gives you a list of the installed packages:
install.packages("plyr") # install the package
library(plyr) # load it
Here is the function for doing the t.tests:
t.test.plyr <- function(x, var, mean=0 ){
  y <- rep(NA,10)
  y[6] <- nrow(x)[1]              # count observations
  if(nrow(x) < 2) return(y)       # exits if too less observations
  res <- t.test(x[var], mu=mean)  # doing the test

  y[1] <- res$statistic           # extract values of interest
  y[2] <- res$p.value      
  y[3] <- res$estimate     
  y[4] <- res$conf.int[1]  
  y[5] <- res$conf.int[2]  
  y[7] <- res$parameter    
  y[8] <- res$method       
  y[9] <- res$alternative  
  y[10] <- res$null.value   

  names(y) <- c("statistic","p.value","estimate","conf.int1", "conf.int2", "nobs","dof","method","alternative","null.value")
  y 
}

where t.test.plyr() is a function of:
- x - the data frame
- var - the variable which should be testet
- mean - the mean of the t-test

now we can use ddply in the following way:
result <- ddply(df, .(group, sex), t.test.plyr, "sds")
result
group sex           statistic           p.value            estimate
1     1   1   0.774917157596946 0.445379429071851   0.132214420988956
2     1   2  -0.415594591752269  0.68359099812399 -0.0987341821354776
3     2   1   0.130257003222609 0.899578855830346  0.0602600950757679
4     2   2 -0.0579045286518569 0.954965406753824 -0.0150105343694587
5     3   1    1.18044920537142 0.256202807853187   0.406567809727671
6     3   2    1.49215606704641  0.15126489039304   0.294865619363907
           conf.int1         conf.int2 nobs dof            method alternative
1 -0.218494854049804 0.482923696027717   27  26 One Sample t-test   two.sided
2 -0.605109702463049 0.407641338192093   16  15 One Sample t-test   two.sided
3  -1.00655416438508  1.12707435453662    9   8 One Sample t-test   two.sided
4 -0.592608791299465 0.562587722560548   11  10 One Sample t-test   two.sided
5 -0.327541518602652  1.14067713805799   16  15 One Sample t-test   two.sided
6 -0.117342538637965  0.70707377736578   21  20 One Sample t-test   two.sided
  null.value
1          0
2          0
3          0
4          0
5          0
6          0
If you want to change the mu in you t.test, add the argument mean:
result <- ddply(df, .(group, sex), t.test.plyr, "sds", mean=1)
result
If there is no need or you do not want a data frame as result you can just use tapply(). tapply results in a list with as many elements as subsets, each element contains the results of each t.test (as list).
res <- tapply(df$sds, list(df$sex,df$group), t.test)
res
1      2      3     
1 List,9 List,9 List,9
2 List,9 List,9 List,9
So the first element of res contains the t.test results for sex=1 and group=1 etc pp
res[[1]]
One Sample t-test

data:  X[[1L]] 
t = 0.7749, df = 26, p-value = 0.4454
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 -0.2184949  0.4829237 
sample estimates:
mean of x 
0.1322144

3 comments :

  1. What you are doing here is basically testing if any value has a mean significantly different from zero, right? Would it not be more interesting to test the groups against each other?

    Actually, that's what I wanted to code when your blog showed up as a search result.

    ReplyDelete
    Replies
    1. do you mean something like this?

      > xx <- data.frame(vals=rnorm(100),group=sample(letters[1:2],100,replace = T))

      > t.test(xx$vals~xx$group)

      Welch Two Sample t-test

      data: xx$vals by xx$group
      t = 0,6923, df = 84,572, p-value = 0,4906
      alternative hypothesis: true difference in means is not equal to 0
      95 percent confidence interval:
      -0,2801770 0,5794806
      sample estimates:
      mean in group a mean in group b
      0,02413266 -0,12551915

      Delete
  2. yes, i was looking for the same thing. do you have an idea how to implement this? i have been trying to add an addition variable like this
    t.test.plyr <- function(x, var,group){
    y <- rep(NA,10)
    y[6] <- nrow(x)[1] # count observations
    if(nrow(x) < 2) return(y) # exits if too less observations
    res <- t.test(x[var]~x[group]) # doing the test

    y[1] <- res$statistic # extract values of interest
    y[2] <- res$p.value
    y[3] <- res$estimate
    y[4] <- res$conf.int[1]
    y[5] <- res$conf.int[2]
    y[7] <- res$parameter
    y[8] <- res$method
    y[9] <- res$alternative
    y[10] <- res$null.value

    names(y) <- c("statistic","p.value","estimate","conf.int1", "conf.int2", "nobs","dof","method","alternative","null.value")
    y
    }

    result <- ddply(df, .(group), t.test.plyr, "sds", "sex")
    but its giving me
    Error in if (stderr < 10 * .Machine$double.eps * max(abs(mx), abs(my))) stop("data are essentially constant") :
    missing value where TRUE/FALSE needed
    In addition: Warning messages:
    1: In mean.default(x) : argument is not numeric or logical: returning NA
    2: In var(x) :
    Calling var(x) on a factor x is deprecated and will become an error.
    Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
    3: In mean.default(y) : argument is not numeric or logical: returning NA
    4: In var(y) :
    Calling var(x) on a factor x is deprecated and will become an error.
    Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
    Called from: t.test.default(x = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L), y = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L))

    Any Ideas?

    ReplyDelete