R, Ruby, Perl und ich: R - working with unique() and duplicated()

Saturday, June 25, 2011


duplicated() vs. unique()

First we create a vector to work with (sample() draws at random, so your letters will differ from the ones shown here):

x <- sample(LETTERS[1:10], 20, replace=TRUE)
x

 [1] "J" "C" "J" "C" "F" "J" "E" "J" "H" "A" "C" "G" "I" "A" "F" "H" "J" "C" "C"
[20] "D"

unique() returns a vector containing each distinct element of x once, in order of first appearance; repeated elements are dropped

unique(x)

[1] "J" "C" "F" "E" "H" "A" "G" "I" "D"

duplicated() returns a logical vector of the same length as x; an element is TRUE if its value has already appeared earlier in x

duplicated(x)

 [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
[13] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

So to get the same result as unique(), we index x with the negated vector:

x[!duplicated(x)]

[1] "J" "C" "F" "E" "H" "A" "G" "I" "D"
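
As a quick check, here is a small sketch that types in the example vector by hand (so that it is reproducible) and confirms that both approaches give identical results. The name first.occ is only chosen for this illustration; negating with ! is the usual way to write a comparison with FALSE.

```r
# the sampled vector shown above, entered literally
x <- c("J","C","J","C","F","J","E","J","H","A",
       "C","G","I","A","F","H","J","C","C","D")

first.occ <- x[!duplicated(x)]   # keep only the first occurrence of each value
identical(first.occ, unique(x))  # TRUE: same elements in the same order
```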

If we want only the repeated occurrences (i.e. the vector without the first occurrence of each value), we use the following line:

x[duplicated(x)]

 [1] "J" "C" "J" "J" "C" "A" "F" "H" "J" "C" "C"
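
Note that duplicated() never flags the first occurrence of a value. If you want every element whose value appears more than once, first occurrence included, one option is to combine it with a second pass from the end via the fromLast argument. A minimal sketch with the example vector typed in by hand (rep.all is just a name picked for this illustration):

```r
x <- c("J","C","J","C","F","J","E","J","H","A",
       "C","G","I","A","F","H","J","C","C","D")

# TRUE for every element whose value occurs at least twice in x
rep.all <- duplicated(x) | duplicated(x, fromLast = TRUE)

x[rep.all]    # all occurrences of the repeated values
x[!rep.all]   # values that occur exactly once: "E" "G" "I" "D"
```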

We can use these functions the same way on data frames. Let x code persons, and let us add a numeric value that could be a measurement; the order of the rows is the order in which the measurements were taken.

y  <- rnorm(20, mean=10)
df <- data.frame(person=x, meas=y)
df

   person      meas
1       J 11.180452
2       C 10.235697
3       J 10.908622
4       C 10.677399
5       F  8.564007
6       J 10.070557
7       E 10.144191
8       J 10.872314
9       H 11.635032
10      A 10.448090
11      C 10.642052
12      G  8.689660
13      I 10.007930
14      A  8.321125
15      F 10.610739
16      H  9.060412
17      J 10.678726
18      C  8.513766
19      C  8.851564
20      D 12.793154

We extract the first measurement of each person with the following line (the comma after the logical index is important: it says we want whole rows):

df[!duplicated(df$person), ]

   person      meas
1       J 11.180452
2       C 10.235697
5       F  8.564007
7       E 10.144191
9       H 11.635032
10      A 10.448090
12      G  8.689660
13      I 10.007930
20      D 12.793154

And with the following command we can extract the follow-up measurements:

df[duplicated(df$person), ]

   person      meas
3       J 10.908622
4       C 10.677399
6       J 10.070557
8       J 10.872314
11      C 10.642052
14      A  8.321125
15      F 10.610739
16      H  9.060412
17      J 10.678726
18      C  8.513766
19      C  8.851564
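
The same idea can give the last measurement of each person instead of the first: duplicated() accepts a fromLast argument that scans from the end. A minimal sketch on a small made-up data frame (df.small and its values are illustrative only, not the data from above):

```r
# hypothetical miniature version of the data frame
df.small <- data.frame(person = c("J", "C", "J", "C", "F"),
                       meas   = c(11.2, 10.2, 10.9, 10.7, 8.6))

# scanning from the end, the last row of each person is the "first" one seen
last.meas <- df.small[!duplicated(df.small$person, fromLast = TRUE), ]
last.meas   # rows 3, 4 and 5
```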

We can also apply duplicated() repeatedly. The following function returns a list of vectors: the first contains the first occurrence of each value, the second the second occurrence, and so on. It is easy to adapt, for example to return logical vectors that can be used to index a data frame, or to work on a data frame directly (both of which would be more useful).

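sep.meas <- function(dupl){
  res <- list()
  # each pass peels off the first remaining occurrence of every value
  while(length(dupl) > 0){
    res[[length(res) + 1]] <- dupl[!duplicated(dupl)]
    # continue with what is left over
    dupl <- dupl[duplicated(dupl)]
  }
  res
}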

Applied to x, it gives the following result:

sep.meas(x)

[[1]]
[1] "J" "C" "F" "E" "H" "A" "G" "I" "D"

[[2]]
[1] "J" "C" "A" "F" "H"

[[3]]
[1] "J" "C"

[[4]]
[1] "J" "C"

[[5]]
[1] "J" "C"
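
As mentioned above, an index is often more useful than the values themselves. One possible variant (a sketch, certainly not the only way) numbers the occurrences of each value with ave(); the names occ, second and by.occ are made up for this example. Comparing occ with a number gives a logical vector that can index a data frame, and split() reproduces the grouping returned by sep.meas().

```r
x <- c("J","C","J","C","F","J","E","J","H","A",
       "C","G","I","A","F","H","J","C","C","D")

# occ[i] says that x[i] is the occ[i]-th occurrence of its value
occ <- ave(seq_along(x), x, FUN = seq_along)

second <- occ == 2        # logical index, usable e.g. as df[second, ]
x[second]                 # "J" "C" "A" "F" "H"

by.occ <- split(x, occ)   # same grouping as sep.meas(x), named "1" to "5"
```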
