duplicated() vs. unique()

• first we create a vector we can work with (sample() draws at random, so your letters will differ from the ones shown here):
x <- sample(LETTERS[1:10], 20, replace=TRUE)
x

 [1] "J" "C" "J" "C" "F" "J" "E" "J" "H" "A" "C" "G" "I" "A" "F" "H" "J" "C" "C"
[20] "D"

• unique() gives us a vector containing each distinct element of x once, in order of first appearance; repeated elements are dropped
unique(x)

[1] "J" "C" "F" "E" "H" "A" "G" "I" "D"

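• duplicated() gives a logical vector of the same length as x: TRUE marks an element that has already occurred earlier in x, FALSE marks a first occurrence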
duplicated(x)

 [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
[13] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
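
• a small sketch of what the TRUE values mean (the vector v is made up for illustration): by default the first occurrence is never flagged; with the argument fromLast=TRUE the scan runs from the end, so the last occurrence is the one left unflagged

```r
# Hypothetical toy vector, not part of the example data above
v <- c("a", "b", "a", "c", "b")

# Default: the second "a" and the second "b" are flagged
d_first <- duplicated(v)                  # FALSE FALSE  TRUE FALSE  TRUE

# fromLast = TRUE: the earlier "a" and "b" are flagged instead
d_last <- duplicated(v, fromLast = TRUE)  #  TRUE  TRUE FALSE FALSE FALSE

# Positions of the flagged elements
which(d_first)                            # 3 5
```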

• so if we want the same result as the one from unique() we have to index x with the negated logical vector:
x[!duplicated(x)]

[1] "J" "C" "F" "E" "H" "A" "G" "I" "D"
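
• as a quick check, here is x typed in by hand from the output above (so the example does not depend on the random draw); both ways of selecting first occurrences should give identical vectors

```r
# The vector printed above, typed in so the example is reproducible
x <- c("J", "C", "J", "C", "F", "J", "E", "J", "H", "A",
       "C", "G", "I", "A", "F", "H", "J", "C", "C", "D")

# Both expressions keep the first occurrence of each value, in original order
same <- identical(unique(x), x[!duplicated(x)])
same   # TRUE
```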

• if we want just the repeated occurrences (i.e. the vector without the first occurrence of each element) we use the following line
x[duplicated(x)]

 [1] "J" "C" "J" "J" "C" "A" "F" "H" "J" "C" "C"
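
• a few related one-liners, sketched on the same x (typed in from the output above): sum() counts the repeated occurrences, unique() on the repeated part lists which values occur more than once, and anyDuplicated() gives the position of the first repeat

```r
# The vector printed above, typed in so the example is reproducible
x <- c("J", "C", "J", "C", "F", "J", "E", "J", "H", "A",
       "C", "G", "I", "A", "F", "H", "J", "C", "C", "D")

n_rep     <- sum(duplicated(x))           # number of repeated occurrences: 11
repeated  <- unique(x[duplicated(x)])     # values seen at least twice: J C A F H
first_dup <- anyDuplicated(x)             # index of the first repeat: 3 (0 if none)
```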

• we can use these commands the same way on data frames: let x code persons and add a numeric value y, which could be a measurement; the order of the rows is the order in which the measurements were taken
y  <- rnorm(20, mean=10)
df <- data.frame(person=x, meas=y)
df

   person      meas
1       J 11.180452
2       C 10.235697
3       J 10.908622
4       C 10.677399
5       F  8.564007
6       J 10.070557
7       E 10.144191
8       J 10.872314
9       H 11.635032
10      A 10.448090
11      C 10.642052
12      G  8.689660
13      I 10.007930
14      A  8.321125
15      F 10.610739
16      H  9.060412
17      J 10.678726
18      C  8.513766
19      C  8.851564
20      D 12.793154

• we extract the first measurement of each person with the following command (the comma after the condition is important: it says we want the whole row)
df[!duplicated(df$person), ]

   person      meas
1       J 11.180452
2       C 10.235697
5       F  8.564007
7       E 10.144191
9       H 11.635032
10      A 10.448090
12      G  8.689660
13      I 10.007930
20      D 12.793154

• and with the following command we can extract the follow-up measurements
df[duplicated(df$person), ]

   person      meas
3       J 10.908622
4       C 10.677399
6       J 10.070557
8       J 10.872314
11      C 10.642052
14      A  8.321125
15      F 10.610739
16      H  9.060412
17      J 10.678726
18      C  8.513766
19      C  8.851564
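
• the same idea also gives the last measurement of each person; this is only a sketch on a small made-up data frame (df2 and its values are hypothetical), using the fromLast=TRUE argument of duplicated() so that the last row of each person is the one not flagged

```r
# Hypothetical data: five measurements of three persons, in order of recording
df2 <- data.frame(person = c("A", "B", "A", "C", "B"),
                  meas   = c(10.1, 9.8, 10.4, 11.0, 9.5))

# Keep the rows that are NOT duplicates when scanning from the end
last <- df2[!duplicated(df2$person, fromLast = TRUE), ]
last
#   person meas
# 3      A 10.4
# 4      C 11.0
# 5      B  9.5
```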

• we can also apply duplicated() repeatedly in a loop; the result of the following function is a list of vectors, where the first contains the first occurrences, the second the second occurrences, etc.; you can change it easily so that it gives back logical vectors which can be used to index a data frame, or so that it works on a data frame itself (both would be more useful)
sep.meas <- function(dupl){
  res <- list()
  # each pass peels off the first remaining occurrence of every value
  while(length(dupl) > 0){
    res[[length(res) + 1]] <- dupl[!duplicated(dupl)]
    dupl <- dupl[duplicated(dupl)]
  }
  res
}

• if we use it on x we get the following result
sep.meas(x)

[[1]]
[1] "J" "C" "F" "E" "H" "A" "G" "I" "D"

[[2]]
[1] "J" "C" "A" "F" "H"

[[3]]
[1] "J" "C"

[[4]]
[1] "J" "C"

[[5]]
[1] "J" "C"
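
• a possible loop-free variant, given here only as a sketch: ave() with seq_along numbers the occurrences of each value (1 for the first, 2 for the second, ...); the name occ is made up for illustration; splitting x by occ should reproduce the list above, and a comparison such as occ == 2 gives a logical vector that could index a data frame

```r
# The vector printed above, typed in so the example is reproducible
x <- c("J", "C", "J", "C", "F", "J", "E", "J", "H", "A",
       "C", "G", "I", "A", "F", "H", "J", "C", "C", "D")

# occ[i] is 1 the first time x[i] appears, 2 the second time, and so on
occ <- ave(seq_along(x), x, FUN = seq_along)

# Grouping by occurrence number gives one vector per "round" of measurements
res <- unname(split(x, occ))
res[[2]]          # "J" "C" "A" "F" "H"

# Logical index of all second occurrences, usable as a row index: df[second, ]
second <- occ == 2
```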