Wednesday, July 6, 2011

R - ddply

ddply

data frame in, data frame out

example

  • first we create a data frame with the following columns: id, date, val
  • to create the id column we use rep(): the first argument, 1:5 (the vector (1,2,3,4,5)), is repeated five times, giving a vector of length 25
  • the vector dates is generated in the following way:
    • generate a vector of 25 random integers between 1 and 1500 with sample.int()
    • pass it to as.Date() with the origin argument, which converts the integers into dates by counting days from the origin (0 would map to the origin itself, 2000-01-01, 1 maps to 2000-01-02, and so on), so we get a vector of dates between 2000-01-02 and 2004-02-09
  • vals are generated by rnorm(), which draws normally distributed random numbers (here with mean 5 and standard deviation 1)
  • last but not least, we put the three together in one data frame (and show the first lines)
id <- rep(1:5,5)
dates <- as.Date(sample.int(1500, 25, replace=TRUE), origin="2000-01-01")
vals <- rnorm(25, mean=5, sd=1)
df <- data.frame(id=id, date=dates, val=vals)
head(df)
  id       date      val
1  1 2001-12-31 5.778680
2  2 2002-08-02 6.982799
3  3 2002-04-23 5.925903
4  4 2000-08-03 3.527375
5  5 2002-08-29 5.239211
6  1 2003-01-28 5.118337
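The way as.Date() counts days from the origin can be checked with a few hand-picked offsets. This is a minimal sketch; the names offsets and mapped are made up for illustration and do not appear in the code above.

```r
# Integers are interpreted as day offsets from the origin date.
offsets <- c(0, 1, 1500)   # hypothetical offsets: origin itself, one day later, the largest possible draw
mapped <- as.Date(offsets, origin = "2000-01-01")
print(mapped)
# [1] "2000-01-01" "2000-01-02" "2004-02-09"
```

Since sample.int(1500, ...) never returns 0, the earliest date that can occur in the data is 2000-01-02 and the latest is 2004-02-09.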
  • id encodes a person, date contains the day of the measurement and val the measured value
  • now we want to add a column which contains the date of each person's first measurement
  • to do that we load plyr and use the function ddply() (the first two letters stand for data frame in, data frame out)
  • the first argument is the data frame we pass to ddply()
  • the second argument defines the groups (we want the min for each person, so our grouping variable is id)
  • transform says we want to change the existing data frame, for example recode a variable or add a new one (another choice would be summarise, if we want to aggregate the data); here we add a variable named start, which is the min() of date per person
library(plyr)
res <- ddply(df, .(id), transform, start=min(date))
res
id  date        val               start
1   2001-12-31  5.77867959202519  2001-03-07
1   2003-01-28  5.11833655328897  2001-03-07
1   2003-11-14  3.75075463437505  2001-03-07
1   2001-03-07  3.36305093773468  2001-03-07
1   2001-03-07  6.6141583789233   2001-03-07
2   2002-08-02  6.98279884579167  2000-05-29
2   2003-12-30  5.4450019138646   2000-05-29
2   2000-05-29  5.92567982716667  2000-05-29
2   2000-11-26  4.60169956597544  2000-05-29
2   2002-08-04  5.02395812458     2000-05-29
3   2002-04-23  5.92590324593659  2000-02-02
3   2003-04-28  6.59246056959167  2000-02-02
3   2002-10-20  5.89780464300662  2000-02-02
3   2000-02-02  4.0548618816269   2000-02-02
3   2000-05-30  5.4722633402378   2000-02-02
4   2000-08-03  3.52737458048037  2000-08-03
4   2002-11-02  3.98853894162285  2000-08-03
4   2002-11-13  4.54373088519655  2000-08-03
4   2003-08-02  4.2144245184346   2000-08-03
4   2003-12-19  4.98767478298279  2000-08-03
5   2002-08-29  5.23921121209821  2000-04-27
5   2000-04-27  4.20400241411961  2000-04-27
5   2003-07-26  6.1821957272646   2000-04-27
5   2003-05-18  4.53515404638362  2000-04-27
5   2001-07-23  3.06695274940501  2000-04-27
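For comparison, here is a minimal sketch of the summarise alternative mentioned above. It uses a small hand-made data frame (toy, with made-up values rather than the random data from this post) and assumes plyr is installed. Instead of adding a column to every row, summarise collapses each group to a single row.

```r
library(plyr)

# Hypothetical toy data: two persons with two measurements each
toy <- data.frame(
  id   = c(1, 1, 2, 2),
  date = as.Date(c("2001-03-07", "2001-12-31", "2000-05-29", "2002-08-02")),
  val  = c(3.4, 5.8, 5.9, 7.0)
)

# One row per id, holding only the grouping variable and the aggregated column
first_dates <- ddply(toy, .(id), summarise, start = min(date))
print(first_dates)
```

The result has one row per person, so it is the better choice when only the aggregated value is needed; transform is the better choice when the original rows have to be kept.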
  • if we change the function call just a little, we get the days elapsed since each person's first measurement:
res <- ddply(df, .(id), transform, dayselapsed=date-min(date))
res
id  date        val               dayselapsed
1   2001-12-31  5.77867959202519  299
1   2003-01-28  5.11833655328897  692
1   2003-11-14  3.75075463437505  982
1   2001-03-07  3.36305093773468  0
1   2001-03-07  6.6141583789233   0
2   2002-08-02  6.98279884579167  795
2   2003-12-30  5.4450019138646   1310
2   2000-05-29  5.92567982716667  0
2   2000-11-26  4.60169956597544  181
2   2002-08-04  5.02395812458     797
3   2002-04-23  5.92590324593659  811
3   2003-04-28  6.59246056959167  1181
3   2002-10-20  5.89780464300662  991
3   2000-02-02  4.0548618816269   0
3   2000-05-30  5.4722633402378   118
4   2000-08-03  3.52737458048037  0
4   2002-11-02  3.98853894162285  821
4   2002-11-13  4.54373088519655  832
4   2003-08-02  4.2144245184346   1094
4   2003-12-19  4.98767478298279  1233
5   2002-08-29  5.23921121209821  854
5   2000-04-27  4.20400241411961  0
5   2003-07-26  6.1821957272646   1185
5   2003-05-18  4.53515404638362  1116
5   2001-07-23  3.06695274940501  452
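Subtracting two Date vectors gives a difftime object, not a plain number. If the elapsed days are needed as an ordinary numeric column (for plotting or modelling, say), one possible approach is to wrap the difference in as.numeric(). This is a sketch on a small hand-made data frame (toy, with made-up values), assuming plyr is installed.

```r
library(plyr)

# Hypothetical toy data: two persons with two measurements each
toy <- data.frame(
  id   = c(1, 1, 2, 2),
  date = as.Date(c("2001-03-07", "2001-12-31", "2000-05-29", "2002-08-02")),
  val  = c(3.4, 5.8, 5.9, 7.0)
)

# Convert the difftime to a plain number of days within each group
res_num <- ddply(toy, .(id), transform,
                 dayselapsed = as.numeric(date - min(date), units = "days"))
print(res_num$dayselapsed)
# [1]   0 299   0 795
```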
