Saturday, June 25, 2011

R - working with unique() and duplicated()

duplicated() vs. unique()

  • first we create a vector we can work with:
x <- sample(LETTERS[1:10], 20, replace=TRUE)
x
 [1] "J" "C" "J" "C" "F" "J" "E" "J" "H" "A" "C" "G" "I" "A" "F" "H" "J" "C" "C"
[20] "D"
  • unique() returns every distinct element of x exactly once, in order of first appearance; repeated elements are dropped
unique(x)
[1] "J" "C" "F" "E" "H" "A" "G" "I" "D"
  • duplicated() returns a logical vector of the same length as x; TRUE marks an element that already occurred earlier in x
duplicated(x)
 [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
[13] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
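  • a few helpers built on the same idea, sketched on a small made-up vector v (not the random x from above): sum() counts the repeats, which() gives their positions, and anyDuplicated() returns the position of the first repeat (or 0 if there is none)

```r
# Small, made-up vector for illustration (not the x from this post)
v <- c("b", "a", "b", "c", "a", "b")

n_rep   <- sum(duplicated(v))    # number of repeated elements
pos_rep <- which(duplicated(v))  # positions of the repeated elements
first   <- anyDuplicated(v)      # position of the first repeat, 0 if none

n_rep
# [1] 3
pos_rep
# [1] 3 5 6
first
# [1] 3
```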
  • so if we want the same result as from unique(), we index x with the negated logical vector:
x[!duplicated(x)]
[1] "J" "C" "F" "E" "H" "A" "G" "I" "D"
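  • a quick way to check this equivalence yourself, sketched on a small made-up vector v (not the random x from above):

```r
# Small, made-up vector for illustration (not the x from this post)
v <- c("b", "a", "b", "c", "a", "b")

# keep only the elements that are NOT marked as duplicated
first_seen <- v[!duplicated(v)]
first_seen
# [1] "b" "a" "c"

# unique() returns the same elements in the same order
identical(first_seen, unique(v))
# [1] TRUE
```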
  • if we want only the repeated occurrences (i.e. the vector without each first occurrence), we use the following line
x[duplicated(x)]
 [1] "J" "C" "J" "J" "C" "A" "F" "H" "J" "C" "C"
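  • duplicated() never marks the first occurrence; if you want every copy of a value that occurs more than once (first occurrence included), one possible sketch combines a forward and a backward scan via the fromLast argument (again on a made-up vector v):

```r
# Small, made-up vector for illustration (not the x from this post)
v <- c("b", "a", "b", "c", "a", "b")

# TRUE for every element that has an earlier OR a later copy
dup_any <- duplicated(v) | duplicated(v, fromLast = TRUE)

v[dup_any]
# [1] "b" "a" "b" "a" "b"

# table() shows how often each value occurs
table(v)
```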
  • we can use these commands the same way on data frames; let x code the persons, and add a numeric value y which could be a measurement; the row order describes the order in which the measurements were taken
y  <- rnorm(20, mean=10)
df <- data.frame(person=x, meas=y)
df
   person      meas
1       J 11.180452
2       C 10.235697
3       J 10.908622
4       C 10.677399
5       F  8.564007
6       J 10.070557
7       E 10.144191
8       J 10.872314
9       H 11.635032
10      A 10.448090
11      C 10.642052
12      G  8.689660
13      I 10.007930
14      A  8.321125
15      F 10.610739
16      H  9.060412
17      J 10.678726
18      C  8.513766
19      C  8.851564
20      D 12.793154
  • we extract the first measurement of each person with the following line (the comma before the closing bracket is important: it says we want whole rows, i.e. all columns)
df[!duplicated(df$person), ]
   person      meas
1       J 11.180452
2       C 10.235697
5       F  8.564007
7       E 10.144191
9       H 11.635032
10      A 10.448090
12      G  8.689660
13      I 10.007930
20      D 12.793154
  • and with the following command we extract the follow-up measurements
df[duplicated(df$person), ]
   person      meas
3       J 10.908622
4       C 10.677399
6       J 10.070557
8       J 10.872314
11      C 10.642052
14      A  8.321125
15      F 10.610739
16      H  9.060412
17      J 10.678726
18      C  8.513766
19      C  8.851564
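  • the same indexing can pick the last measurement of each person instead of the first; a minimal sketch, assuming a small made-up data frame d (not the df from above) and using the fromLast argument of duplicated():

```r
# Made-up data for illustration (not the df from this post)
d <- data.frame(person = c("J", "C", "J", "C", "F"),
                meas   = c(11.2, 10.2, 10.9, 10.7, 8.6))

# fromLast = TRUE scans from the bottom, so the LAST row of each person
# is the one that is not marked as duplicated
last_meas <- d[!duplicated(d$person, fromLast = TRUE), ]
last_meas
#   person meas
# 3      J 10.9
# 4      C 10.7
# 5      F  8.6
```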
  • we can also apply duplicated() repeatedly; the result of the following function is a list of vectors, where the first contains the first occurrences, the second the second occurrences, etc.; it is easy to change so that it returns logical vectors which can be used to index a data frame, or so that it works on a data frame directly (both of which would be more useful)
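sep.meas <- function(dupl){
  res <- list()
  # each pass peels off one layer of first occurrences
  while(length(dupl) > 0){
    is.dup <- duplicated(dupl)
    res[[length(res) + 1]] <- dupl[!is.dup]  # current first occurrences
    dupl <- dupl[is.dup]                     # the rest goes into the next pass
  }
  res
}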
  • if we use it on x we get the following result
sep.meas(x)
[[1]]
[1] "J" "C" "F" "E" "H" "A" "G" "I" "D"

[[2]]
[1] "J" "C" "A" "F" "H"

[[3]]
[1] "J" "C"

[[4]]
[1] "J" "C"

[[5]]
[1] "J" "C"
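  • one possible way to get the index variant mentioned above is to number the occurrences within each value instead of looping; this is only a sketch on a made-up vector v, and the name occ is an illustrative choice, not part of the function above:

```r
# Small, made-up vector for illustration (not the x from this post)
v <- c("b", "a", "b", "c", "a", "b")

# occurrence number of each element within its own value:
# 1 = first occurrence, 2 = second occurrence, ...
occ <- ave(seq_along(v), v, FUN = seq_along)
occ
# [1] 1 1 2 1 2 3

# logical index selecting only the second occurrences
v[occ == 2]
# [1] "b" "a"

# the same grouping as a list, one entry per occurrence number
split(v, occ)
```

Since occ has the same length as the input, a comparison like occ == 2 could also be used to index the rows of a data frame built from the same vector.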
