R, Ruby, Perl und ich: R

ddply

dataframe in dataframe out

example

First we create a data frame with the following columns: id, date, val.
To create the id column we use rep(), where the first argument 1:5 (the vector (1,2,3,4,5)) is repeated five times, so we get a vector of length 25.
The vector dates is generated in the following way:
- generate a vector of 25 random integers between 1 and 1500 (sample.int())
- use as.Date() with the origin argument, which converts the integers into dates, where the origin is day 0 (0 is converted to the origin, i.e. 2000-01-01, 1 is mapped to 2000-01-02, and so on), so we get a vector of dates between the start of 2000 and early 2004 (day 1500 is 2004-02-09)
vals is generated by the random number function rnorm() (normally distributed with mean 5 and standard deviation 1).
Last but not least, we put the three together in one data frame (and show the first lines).

id <- rep(1:5,5)
dates <- as.Date(sample.int(1500, 25, replace=TRUE), origin="2000-01-01")
vals <- rnorm(25, mean=5, sd=1)
df <- data.frame(id=id, date=dates, val=vals)
head(df)

  id       date      val
1  1 2001-12-31 5.778680
2  2 2002-08-02 6.982799
3  3 2002-04-23 5.925903
4  4 2000-08-03 3.527375
5  5 2002-08-29 5.239211
6  1 2003-01-28 5.118337
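
To make the two building blocks from above a bit more concrete, here is a small sketch in base R. The offsets 0, 1 and 1500 are picked by hand for illustration (they are not the random numbers used above), and the variable names offsets, mapped and id_check are my own.

```r
# Day offsets are counted from the origin: 0 is the origin itself.
offsets <- c(0, 1, 1500)
mapped <- as.Date(offsets, origin = "2000-01-01")
print(mapped)
# [1] "2000-01-01" "2000-01-02" "2004-02-09"

# rep() repeats the whole vector 1:5 five times, one id after the other.
id_check <- rep(1:5, 5)
print(length(id_check))
# [1] 25
print(head(id_check, 7))
# [1] 1 2 3 4 5 1 2
```

So the largest possible date in the example is 2004-02-09, and every id from 1 to 5 appears exactly five times.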

id encodes a person, date contains the day of the measurement and val the measured value.
Now we want to add a column which contains the date of each person's first measurement.
To do that we have to load plyr, then we use the function ddply() (the first two letters stand for data frame in, data frame out).
The first argument of the function is the data frame we pass to ddply().
The second argument defines the groups (we want the min for each person, so our grouping variable is id).
transform says we want to change the existing data frame, e.g. recode a variable or add a new one (another choice would be summarise, if we want to aggregate the data). So we add a variable named start, and it should be the min() of date per person.

library(plyr)
res <- ddply(df, .(id), transform, start=min(date))
res


id	date	val	start
1	2001-12-31	5.77867959202519	2001-03-07
1	2003-01-28	5.11833655328897	2001-03-07
1	2003-11-14	3.75075463437505	2001-03-07
1	2001-03-07	3.36305093773468	2001-03-07
1	2001-03-07	6.6141583789233	2001-03-07
2	2002-08-02	6.98279884579167	2000-05-29
2	2003-12-30	5.4450019138646	2000-05-29
2	2000-05-29	5.92567982716667	2000-05-29
2	2000-11-26	4.60169956597544	2000-05-29
2	2002-08-04	5.02395812458	2000-05-29
3	2002-04-23	5.92590324593659	2000-02-02
3	2003-04-28	6.59246056959167	2000-02-02
3	2002-10-20	5.89780464300662	2000-02-02
3	2000-02-02	4.0548618816269	2000-02-02
3	2000-05-30	5.4722633402378	2000-02-02
4	2000-08-03	3.52737458048037	2000-08-03
4	2002-11-02	3.98853894162285	2000-08-03
4	2002-11-13	4.54373088519655	2000-08-03
4	2003-08-02	4.2144245184346	2000-08-03
4	2003-12-19	4.98767478298279	2000-08-03
5	2002-08-29	5.23921121209821	2000-04-27
5	2000-04-27	4.20400241411961	2000-04-27
5	2003-07-26	6.1821957272646	2000-04-27
5	2003-05-18	4.53515404638362	2000-04-27
5	2001-07-23	3.06695274940501	2000-04-27
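
The text above mentions summarise as the alternative to transform. As a rough sketch of the difference, assuming plyr is installed: the data frame toy below is a small hand-made example (its values are only loosely based on the output above, not the real random data), and the column n is my own addition.

```r
library(plyr)

# Small hand-made data set, for illustration only.
toy <- data.frame(
  id   = c(1, 1, 2, 2, 2),
  date = as.Date(c("2001-03-07", "2001-12-31",
                   "2000-05-29", "2000-11-26", "2002-08-04")),
  val  = c(3.4, 5.8, 5.9, 4.6, 5.0)
)

# summarise collapses each group to a single row,
# instead of keeping every measurement like transform does.
agg <- ddply(toy, .(id), summarise, start = min(date), n = length(val))
print(agg)
```

Here the result should have one row per id (two rows in total), with the first date and the number of measurements for each person.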

If we change the call just a little, we get the days elapsed since each person's first measurement:

res <- ddply(df, .(id), transform, dayselapsed=date-min(date))
res


id	date	val	dayselapsed
1	2001-12-31	5.77867959202519	299
1	2003-01-28	5.11833655328897	692
1	2003-11-14	3.75075463437505	982
1	2001-03-07	3.36305093773468	0
1	2001-03-07	6.6141583789233	0
2	2002-08-02	6.98279884579167	795
2	2003-12-30	5.4450019138646	1310
2	2000-05-29	5.92567982716667	0
2	2000-11-26	4.60169956597544	181
2	2002-08-04	5.02395812458	797
3	2002-04-23	5.92590324593659	811
3	2003-04-28	6.59246056959167	1181
3	2002-10-20	5.89780464300662	991
3	2000-02-02	4.0548618816269	0
3	2000-05-30	5.4722633402378	118
4	2000-08-03	3.52737458048037	0
4	2002-11-02	3.98853894162285	821
4	2002-11-13	4.54373088519655	832
4	2003-08-02	4.2144245184346	1094
4	2003-12-19	4.98767478298279	1233
5	2002-08-29	5.23921121209821	854
5	2000-04-27	4.20400241411961	0
5	2003-07-26	6.1821957272646	1185
5	2003-05-18	4.53515404638362	1116
5	2001-07-23	3.06695274940501	452
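
One detail worth knowing: subtracting two dates in R gives a difftime object, not a plain number. If you want an ordinary numeric column (for plotting or modelling, say), one option is to convert inside the call. This is only a sketch, assuming plyr is installed; toy and res_num are made-up names and the four rows are hand-picked from the output above.

```r
library(plyr)

# Four hand-picked rows, for illustration only.
toy <- data.frame(
  id   = c(1, 1, 2, 2),
  date = as.Date(c("2001-12-31", "2001-03-07", "2000-05-29", "2000-11-26")),
  val  = c(5.8, 3.4, 5.9, 4.6)
)

# date - min(date) is a difftime; as.numeric(..., units = "days")
# turns it into a plain number of days.
res_num <- ddply(toy, .(id), transform,
                 dayselapsed = as.numeric(date - min(date), units = "days"))
print(res_num$dayselapsed)
# [1] 299   0   0 181
```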

Wednesday, July 6, 2011
