R, Ruby, Perl und ich: July 2011

Tuesday, July 26, 2011

R - recode() data

recode() - from package lordif

library(lordif)  # maybe you have first to install the package (install.packages("lordif"))

creating example data

z <- sample(0:1, 25, replace=T)
original <- c(0,1)
new <- c("boy", "girl")
z
original
new

z:  1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 0
original: 0 1
new:  "boy"  "girl"

so z is a vector consisting of 1 and 0
original contains the distinct values of z
and new contains the values which should replace the original values in the following way: the first item of original is replaced by the first value of new, the second by the second etc pp

z <- recode(z, original, new)
z

 [1] "girl" "girl" "girl" "boy"  "boy"  "boy"  "girl" "girl" "girl" "girl"
[11] "boy"  "boy"  "girl" "girl" "girl" "girl" "boy"  "girl" "girl" "girl"
[21] "girl" "girl" "girl" "girl" "boy"

Wednesday, July 20, 2011

R - t.test for subsets of a data frame (ddply or tapply)

(two sample t-test for groups in data frame here)
First we create a data frame, on which we can work

x <- rnorm(100)                 
y <- sample(1:3, 100, replace=T)
z <- sample(1:2, 100, replace=T)

df <- data.frame(group=y,sex=z,sds=x)
head(df)

group sex       sds
1     1   2 1.3412663
2     1   1 0.8000326
3     2   2 0.0371029
4     1   2 0.2064684
5     1   2 1.6429816
6     3   1 1.3271138

In the following I use ddply() from the plyr-package, so if you did not install and/or load it you have to do it now - the command library() without any argument gives you a list of the installed packages:

install.packages("plyr") # install the package
library(plyr) # load it

Here is the function for doing the t.tests:

t.test.plyr <- function(x, var, mean=0 ){
  y <- rep(NA,10)
  y[6] <- nrow(x)[1]              # count observations
  if(nrow(x) < 2) return(y)       # exits if too less observations
  res <- t.test(x[var], mu=mean)  # doing the test

  y[1] <- res$statistic           # extract values of interest
  y[2] <- res$p.value      
  y[3] <- res$estimate     
  y[4] <- res$conf.int[1]  
  y[5] <- res$conf.int[2]  
  y[7] <- res$parameter    
  y[8] <- res$method       
  y[9] <- res$alternative  
  y[10] <- res$null.value   

  names(y) <- c("statistic","p.value","estimate","conf.int1", "conf.int2", "nobs","dof","method","alternative","null.value")
  y 
}

where t.test.plyr() is a function of:
- x - the data frame
- var - the variable which should be testet
- mean - the mean of the t-test

now we can use ddply in the following way:

result <- ddply(df, .(group, sex), t.test.plyr, "sds")
result

group sex           statistic           p.value            estimate
1     1   1   0.774917157596946 0.445379429071851   0.132214420988956
2     1   2  -0.415594591752269  0.68359099812399 -0.0987341821354776
3     2   1   0.130257003222609 0.899578855830346  0.0602600950757679
4     2   2 -0.0579045286518569 0.954965406753824 -0.0150105343694587
5     3   1    1.18044920537142 0.256202807853187   0.406567809727671
6     3   2    1.49215606704641  0.15126489039304   0.294865619363907
           conf.int1         conf.int2 nobs dof            method alternative
1 -0.218494854049804 0.482923696027717   27  26 One Sample t-test   two.sided
2 -0.605109702463049 0.407641338192093   16  15 One Sample t-test   two.sided
3  -1.00655416438508  1.12707435453662    9   8 One Sample t-test   two.sided
4 -0.592608791299465 0.562587722560548   11  10 One Sample t-test   two.sided
5 -0.327541518602652  1.14067713805799   16  15 One Sample t-test   two.sided
6 -0.117342538637965  0.70707377736578   21  20 One Sample t-test   two.sided
  null.value
1          0
2          0
3          0
4          0
5          0
6          0

If you want to change the mu in you t.test, add the argument mean:

result <- ddply(df, .(group, sex), t.test.plyr, "sds", mean=1)
result

If there is no need or you do not want a data frame as result you can just use tapply(). tapply results in a list with as many elements as subsets, each element contains the results of each t.test (as list).

res <- tapply(df$sds, list(df$sex,df$group), t.test)
res

1      2      3     
1 List,9 List,9 List,9
2 List,9 List,9 List,9

So the first element of res contains the t.test results for sex=1 and group=1 etc pp

res[[1]]

One Sample t-test

data:  X[[1L]] 
t = 0.7749, df = 26, p-value = 0.4454
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 -0.2184949  0.4829237 
sample estimates:
mean of x 
0.1322144

Friday, July 15, 2011

Ubuntu - convert pdf to images (resolution selectable)

nice command line tool:
pdftoppm

creates one image (png, jpeg pbm or pgm) per page.

e.g. if source.pdf has two pages

pdftoppm -png -r 600 source.pdf target

creates 2 images, solution: 600 DPI, file names: target-01.png, target-02.png

Thursday, July 14, 2011

Emacs - change style of word wrapping

use word wrap

M-x visual-line-mode

Emacs displays long lines by truncation

M-x toggle-truncate-lines

VBA excel - sheets exist ?

The function:

Public Function SheetExists(SheetName As String) As Boolean
    Dim ws As Worksheet
    SheetExists = False
    For Each ws In ThisWorkbook.Worksheets
        If ws.Name = SheetName Then SheetExists = True
    Next ws
End Function

create sheet if it does not exist:

For i = 1 to n
    If SheetExists("Sheet" & CStr(i)) Then
        MsgBox ("Sheet" & CStr(i) & " exists")
    Else
        Worksheets.Add(After:=Worksheets(Worksheets.Count)).Name = "Sheet" & CStr(i)
    End If
Next i

Saturday, July 9, 2011

R ggplot - barplot

barplot

load the package ggplot2 and the diamond data
first we create a object p of class ggplot, where we map clarity to x (horizontal position)

p <- ggplot(diamonds, aes(x=clarity))
summary(p)

data: carat, cut, color, clarity, depth, table, price, x, y, z
  [53940x10]
mapping:  x = clarity
faceting: facet_grid(. ~ ., FALSE)

p contains all information ggplot needs to build a barplot, all we have to do is adding a layer with geom="bar";

p + layer(geom="bar")

to this simple barplot we can add the following (optional) aesthetics:
- colour: color of the borders of the bins
- fill: color of the filling
- size:
- linetype: line type of the borders
- weight:
- alpha: transparency of fill [0,1]

p + layer(geom="bar", geom_params=list(fill="steelblue", colour="black", linetype=4, alpha=0.3))

but a barplot can visualize more information about the data - so we add a new mapping to p

p <- ggplot(diamonds, aes(x=clarity, fill=color)) # where color is the variable of the dataframe diamonds

now the default output looks like:

p + layer(geom="bar")

the plot can be customized by the same aesthetics, fill would override the aesthetic mapping of p
additional we can add a position adjustment

p + layer(geom="bar", position="dodge")

p + layer(geom="bar", position="fill")

if we flip (coord_flip()) the underlying coordinate system, we can rotate the plot

p + layer(geom="bar", position="stack") + coord_flip()

or we even can change the coordinate system to polar coordinates:

p + layer(geom="bar", position="stack") + coord_polar()

R - filled.contour plot

filled.contour

if you have a matrix with nxm dimensions and every entry represents a value of interest and the indices of the entry describe the position, then you can use filled.contour() to visualize the matrix
first we create a simple matrix to demonstrate how it works:

x <- y <- 1:10 # create two vectors with the integers from 1 to 10
z <- outer(x,y) # create a matrix as the outer product of the two vectors
z # show the matrix

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    2    3    4    5    6    7    8    9    10
 [2,]    2    4    6    8   10   12   14   16   18    20
 [3,]    3    6    9   12   15   18   21   24   27    30
 [4,]    4    8   12   16   20   24   28   32   36    40
 [5,]    5   10   15   20   25   30   35   40   45    50
 [6,]    6   12   18   24   30   36   42   48   54    60
 [7,]    7   14   21   28   35   42   49   56   63    70
 [8,]    8   16   24   32   40   48   56   64   72    80
 [9,]    9   18   27   36   45   54   63   72   81    90
[10,]   10   20   30   40   50   60   70   80   90   100

now let us plot the matrix:

filled.contour(z)

the matrix is mapped to a 1x1 square: the entry with the indices (1,1) and the value 1 to the point with the coordinates (0,0) and the entry with the indices (10,10) and the value (100) to the points with the coordinates (1,1); if you want to change the axes you can add a x and a y argument to the plot;so let's use the vectors x and y from above (with the entries 1 to 10) and also change to colors
filled.contour() uses the cm palette as default (cyan for low, magenta for heigh values);

filled.contour(x=x,y=y,z, col=heat.colors(20))

more palettes you find here

And a last example

my.seq <- seq(-pi, pi, length=50) # creating a vector as a sequence from 0 to 2*pi with 50 entries
my.seq2 <- seq(-0.5*pi, 1.5*pi, length=50) # creating a vector as a sequence from 0 to 2*pi with 50 entries
my.matrix <- outer(sin(my.seq),cos(my.seq2)) # creating the matrix using sin, cos and outer
filled.contour(x=my.seq, y=my.seq2,my.matrix, plot.title=title(main="Products Sin(x)*Cos(y) on [-Pi,Pi]x[-0.5Pi,1.5Pi]", xlab="x", ylab="y"), key.title=title(main="products"), col=terrain.colors(20))

Latex Beamer - package beamercolor

Manual.
You can download the .sty file: here BeamerColor.sty
create a folder in your latex file directory

/usr/share/texmf-texlive/tex/latex/

the path can differ from the path above (e.g. if you use TeTeX).
copy the example.sty file in this directory

cp /home/me/pathtopackage/example.sty /usr/share/texmf-texlive/tex/latex/

run

sudo mktexlsr

add the line \usepackage{example} to you document

Friday, July 8, 2011

RExcel - do not work after installing new R version

When a new version of R is installed, a few things have to be done.

rcom and rscprooxy have to be installed in the new version.
install.packages(c("rcom", "rscproxy"))

the line
library(rcom)
has to be added to the etc\RProfile.site (standard path Win: C:\Programme\R\R-Version\etc\)

comRegisterRegistry() has to be run (in the new version)

bin\RSetReg.exe has to be run (path Win: C:\Programme\R\R-2.12.1\bin\i386)

Thursday, July 7, 2011

R ggplot2 - simple heatplot

simple heatplot

load the library ggplot2, if you have not installed it yet - install it via install.packages("ggplot2")
create the example data frame
map it to the plot

x <- rep(1:20,20)
y <- rep(1:20, rep(20,20))
z <- rnorm(400, mean=0, sd=15)
df <- data.frame(x=x,y=y,z=z)
ggplot(df, aes(x=x,y=y, fill=z)) + geom_tile()

now we can change the colors (colorchart) by add a scale_fill_gradient(low="green", high="red")

ggplot(df, aes(x=x,y=y, fill=z)) + geom_tile()+scale_fill_gradient(low="green", high="red")

an a second one (with more soothing colors)

ggplot(df, aes(x=x,y=y, fill=z)) + geom_tile()+scale_fill_gradient(low="aliceblue", high="midnightblue")

Emacs - make buffer names unique - eindeutige Buffernamen

add the line

(require 'uniquify)

to your emacs configuration file ( .emacs in your home directory )

restart emacs
type: M-x customize-option
then choose the variable: uniquify-buffer-name-style
set it to one of these: forward or reverse or post-forward or post-forward-angle-brackets (just hit Enter)
set for the current session and save for future sessions then exit

R - add a table to a plot

addtable2plot()

suppose you have a vector with different dates of events and you want to investigates if there are month with more or less events, the first step maybe is to draw a barplot
so let us generate first a vector with the different dates:
and than we create a vector with the extracted month (as factor)

my.dat <- as.Date(sample.int(500, 1000, replace=T), origin="2000-01-01") 
# extract month, use the abbreviated form of months
my.months <- months(my.dat, abbreviate=T)
df <- data.frame(date=my.dat, month=my.months)
table(df$month)

Apr Aug Dez Feb Jan Jul Jun Mai Mär Nov Okt Sep 
122  72  66 103 121  61  51  85 129  59  64  67

as one can see the month are in alphabetical order so we have to correct this
therefore we use ISOdatetime and format for creating a vector of the 12 months in the correct order and use this vector for creating a labeled factor out of df$month

df$month <- factor(df$month, levels=format(ISOdatetime(2000,1:12,1,0,0,0),"%b"))
# show the first 50 elements of my.
table(df$month)

Jan Feb Mär Apr Mai Jun Jul Aug Sep Okt Nov Dez 
121 103 129 122  85  51  61  72  67  64  59  66

now we can create hour barplot (use the package plotrix)

# if it is not installed yet, install it via install.packages("plotrix")
# load the package
library(plotrix) 

# create the table for plotting (the same as above)
my.table <- with(df, table(month)) # equivalent to table(df$month)
# plot it! (choosing the color from here)

barp(my.table, col=colors()[c(462:471,473,477)], xlab="Month", ylab="count")

now we create the same plot but add the table

# create the table for the plot - i want the horizontal table so i have to transpose it
m2 <- t(as.matrix(my.table)) 
# barplot
barp(my.table, col=colors()[c(462:471,473,477)], xlab="month", ylab="count")
# add the table: the first two arguments are the position relative to the coord system of the plot
# the third argument is the table (a data frame or a matrix), the bty: "o" draw a box, "n" no box
# bg is the argument for the background 
addtable2plot(3,140,m2, bty="o", bg="aliceblue")

now add the table with ggplot2

require(gridExtra)
require(ggplot2)
# generate random dates
my.dat <- as.Date(sample.int(500, 1000, replace=T), origin="2000-01-01") 
# extract month
my.months <- factor(month(my.dat),levels=1:12,labels=month.abb)
df <- data.frame(date=my.dat, month=my.months)

# prepare table
mytable <- tableGrob(t(as.data.frame(table(df$month))),show.rownames = F,show.colnames = F)

ggplot(df,aes(x=month,fill=month)) +
    geom_bar() +
    ylim(0,150) +
    annotation_custom(mytable,xmin = 4.5,xmax=15,ymin=130,ymax = 150) +
    scale_fill_manual(values=colors()[c(462:471,473,477)],guide="none")
ggsave("tabletoplot.png")

R - generate a vector with names of months (like LETTERS)

> format(ISOdatetime(2000,1:12,1,0,0,0),"%b")

gives:

 [1] "Jan" "Feb" "Mrz" "Apr" "Mai" "Jun" "Jul" "Aug" "Sep" "Okt" "Nov" "Dez"

where the month names depend on local settings. If you use %B instead of %b you get the non abbreviated months. How ever, there is also an easier way if you want the months in English: the vector

> month.name

gives

 [1] "January"   "February"  "March"     "April"     "May"       "June"     
 [7] "July"      "August"    "September" "October"   "November"  "December"

> month.abb

gives

 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Wednesday, July 6, 2011

R - ddply

ddply

dataframe in dataframe out

example

first we create a data frame with the following columns: id, date, value
to create the id column we use rep(), where the first argument 1:5 (the vector (1,2,3,4,5)) is repeated five times, so we have a vector with length 25
the vector dates is generated in the following way:
- generate a vector of random numbers between 1 and 1500 and length 25 (sample.int())
- use as.Date with the origin argument, this converts the integers into Date, where the the origin is day 0 (in the case 0 is convert to the origin i.e. 2000-01-01, 1 is mapped to 2000-01-02 etc pp, so we get a vector of dates between 2000 and beginning 2005
vals are generated by the random numbers function rnorm() (normal distributed with mean 5 and standard deviation 1
last but not least put the three together in one data frame (and let us show the first lines)

id <- rep(1:5,5)
dates <- as.Date(sample.int(1500, 25, replace=T), origin="2000-01-01") 
vals <- rnorm(25, mean=5, sd=1)
df <- data.frame(id=id, date=dates, val=vals)
head(df)

  id       date      val
1  1 2001-12-31 5.778680
2  2 2002-08-02 6.982799
3  3 2002-04-23 5.925903
4  4 2000-08-03 3.527375
5  5 2002-08-29 5.239211
6  1 2003-01-28 5.118337

id encodes a person, date contains the day of the measurement and val the values
now we want to add a column which contains the the first measurement in time
therefore we had to load plyr, then we use the function ddply() (the meaning of the first two letters is data frame in data frame out)
the first argument of the function is the data frame we pass to ddply
the second argument defines the groups (we want the min of each person so our grouping variable is id)
transform says we want to change existing data frame - like recode a variable or add a new (another choice would be summarise - if we want to aggregate the data) - so we add a variable named start and it should be the min() of our date per person

res <- ddply(df, .(id), transform, start=min(date))
res


id	date	val	start
1	2001-12-31	5.77867959202519	2001-03-07
1	2003-01-28	5.11833655328897	2001-03-07
1	2003-11-14	3.75075463437505	2001-03-07
1	2001-03-07	3.36305093773468	2001-03-07
1	2001-03-07	6.6141583789233	2001-03-07
2	2002-08-02	6.98279884579167	2000-05-29
2	2003-12-30	5.4450019138646	2000-05-29
2	2000-05-29	5.92567982716667	2000-05-29
2	2000-11-26	4.60169956597544	2000-05-29
2	2002-08-04	5.02395812458	2000-05-29
3	2002-04-23	5.92590324593659	2000-02-02
3	2003-04-28	6.59246056959167	2000-02-02
3	2002-10-20	5.89780464300662	2000-02-02
3	2000-02-02	4.0548618816269	2000-02-02
3	2000-05-30	5.4722633402378	2000-02-02
4	2000-08-03	3.52737458048037	2000-08-03
4	2002-11-02	3.98853894162285	2000-08-03
4	2002-11-13	4.54373088519655	2000-08-03
4	2003-08-02	4.2144245184346	2000-08-03
4	2003-12-19	4.98767478298279	2000-08-03
5	2002-08-29	5.23921121209821	2000-04-27
5	2000-04-27	4.20400241411961	2000-04-27
5	2003-07-26	6.1821957272646	2000-04-27
5	2003-05-18	4.53515404638362	2000-04-27
5	2001-07-23	3.06695274940501	2000-04-27

if we change the function just a little we get the days elapsed from the first measurment:

res <- ddply(df, .(id), transform, dayselapsed=date-min(date))
res


id	date	val	dayselapsed
1	2001-12-31	5.77867959202519	299
1	2003-01-28	5.11833655328897	692
1	2003-11-14	3.75075463437505	982
1	2001-03-07	3.36305093773468	0
1	2001-03-07	6.6141583789233	0
2	2002-08-02	6.98279884579167	795
2	2003-12-30	5.4450019138646	1310
2	2000-05-29	5.92567982716667	0
2	2000-11-26	4.60169956597544	181
2	2002-08-04	5.02395812458	797
3	2002-04-23	5.92590324593659	811
3	2003-04-28	6.59246056959167	1181
3	2002-10-20	5.89780464300662	991
3	2000-02-02	4.0548618816269	0
3	2000-05-30	5.4722633402378	118
4	2000-08-03	3.52737458048037	0
4	2002-11-02	3.98853894162285	821
4	2002-11-13	4.54373088519655	832
4	2003-08-02	4.2144245184346	1094
4	2003-12-19	4.98767478298279	1233
5	2002-08-29	5.23921121209821	854
5	2000-04-27	4.20400241411961	0
5	2003-07-26	6.1821957272646	1185
5	2003-05-18	4.53515404638362	1116
5	2001-07-23	3.06695274940501	452

Subscribe to: Posts ( Atom )