Tuesday, July 26, 2011

R - recode() data

recode() - from package lordif

library(lordif)  # maybe you have first to install the package (install.packages("lordif"))
  • creating example data
z <- sample(0:1, 25, replace=T)
original <- c(0,1)
new <- c("boy", "girl")
z
original
new
z:  1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 0
original: 0 1
new:  "boy"  "girl"
  • so z is a vector consisting of 1 and 0
  • original contains the distinct values of z
  • and new contains the values which should replace the original values in the following way: the first item of original is replaced by the first value of new, the second by the second etc pp
z <- recode(z, original, new)
z
 [1] "girl" "girl" "girl" "boy"  "boy"  "boy"  "girl" "girl" "girl" "girl"
[11] "boy"  "boy"  "girl" "girl" "girl" "girl" "boy"  "girl" "girl" "girl"
[21] "girl" "girl" "girl" "girl" "boy"

u

Wednesday, July 20, 2011

R - t.test for subsets of a data frame (ddply or tapply)

(two sample t-test for groups in data frame here)
First we create a data frame, on which we can work
x <- rnorm(100)                 
y <- sample(1:3, 100, replace=T)
z <- sample(1:2, 100, replace=T)

df <- data.frame(group=y,sex=z,sds=x)
head(df)
group sex       sds
1     1   2 1.3412663
2     1   1 0.8000326
3     2   2 0.0371029
4     1   2 0.2064684
5     1   2 1.6429816
6     3   1 1.3271138
In the following I use ddply() from the plyr-package, so if you did not install and/or load it you have to do it now - the command library() without any argument gives you a list of the installed packages:
install.packages("plyr") # install the package
library(plyr) # load it
Here is the function for doing the t.tests:
t.test.plyr <- function(x, var, mean=0 ){
  y <- rep(NA,10)
  y[6] <- nrow(x)[1]              # count observations
  if(nrow(x) < 2) return(y)       # exits if too less observations
  res <- t.test(x[var], mu=mean)  # doing the test

  y[1] <- res$statistic           # extract values of interest
  y[2] <- res$p.value      
  y[3] <- res$estimate     
  y[4] <- res$conf.int[1]  
  y[5] <- res$conf.int[2]  
  y[7] <- res$parameter    
  y[8] <- res$method       
  y[9] <- res$alternative  
  y[10] <- res$null.value   

  names(y) <- c("statistic","p.value","estimate","conf.int1", "conf.int2", "nobs","dof","method","alternative","null.value")
  y 
}

where t.test.plyr() is a function of:
- x - the data frame
- var - the variable which should be testet
- mean - the mean of the t-test

now we can use ddply in the following way:
result <- ddply(df, .(group, sex), t.test.plyr, "sds")
result
group sex           statistic           p.value            estimate
1     1   1   0.774917157596946 0.445379429071851   0.132214420988956
2     1   2  -0.415594591752269  0.68359099812399 -0.0987341821354776
3     2   1   0.130257003222609 0.899578855830346  0.0602600950757679
4     2   2 -0.0579045286518569 0.954965406753824 -0.0150105343694587
5     3   1    1.18044920537142 0.256202807853187   0.406567809727671
6     3   2    1.49215606704641  0.15126489039304   0.294865619363907
           conf.int1         conf.int2 nobs dof            method alternative
1 -0.218494854049804 0.482923696027717   27  26 One Sample t-test   two.sided
2 -0.605109702463049 0.407641338192093   16  15 One Sample t-test   two.sided
3  -1.00655416438508  1.12707435453662    9   8 One Sample t-test   two.sided
4 -0.592608791299465 0.562587722560548   11  10 One Sample t-test   two.sided
5 -0.327541518602652  1.14067713805799   16  15 One Sample t-test   two.sided
6 -0.117342538637965  0.70707377736578   21  20 One Sample t-test   two.sided
  null.value
1          0
2          0
3          0
4          0
5          0
6          0
If you want to change the mu in you t.test, add the argument mean:
result <- ddply(df, .(group, sex), t.test.plyr, "sds", mean=1)
result
If there is no need or you do not want a data frame as result you can just use tapply(). tapply results in a list with as many elements as subsets, each element contains the results of each t.test (as list).
res <- tapply(df$sds, list(df$sex,df$group), t.test)
res
1      2      3     
1 List,9 List,9 List,9
2 List,9 List,9 List,9
So the first element of res contains the t.test results for sex=1 and group=1 etc pp
res[[1]]
One Sample t-test

data:  X[[1L]] 
t = 0.7749, df = 26, p-value = 0.4454
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 -0.2184949  0.4829237 
sample estimates:
mean of x 
0.1322144

Friday, July 15, 2011

Ubuntu - convert pdf to images (resolution selectable)

nice command line tool:
pdftoppm

creates one image (png, jpeg pbm or pgm) per page.

e.g. if source.pdf has two pages
pdftoppm -png -r 600 source.pdf target
creates 2 images, solution: 600 DPI, file names: target-01.png, target-02.png

Thursday, July 14, 2011

Emacs - change style of word wrapping

use word wrap
M-x visual-line-mode
Emacs displays long lines by truncation
M-x toggle-truncate-lines

VBA excel - sheets exist ?

The function:
Public Function SheetExists(SheetName As String) As Boolean
    Dim ws As Worksheet
    SheetExists = False
    For Each ws In ThisWorkbook.Worksheets
        If ws.Name = SheetName Then SheetExists = True
    Next ws
End Function
create sheet if it does not exist:
For i = 1 to n
    If SheetExists("Sheet" & CStr(i)) Then
        MsgBox ("Sheet" & CStr(i) & " exists")
    Else
        Worksheets.Add(After:=Worksheets(Worksheets.Count)).Name = "Sheet" & CStr(i)
    End If
Next i

Saturday, July 9, 2011

R ggplot - barplot

barplot

  • load the package ggplot2 and the diamond data
  • first we create a object p of class ggplot, where we map clarity to x (horizontal position)
p <- ggplot(diamonds, aes(x=clarity))
summary(p)
data: carat, cut, color, clarity, depth, table, price, x, y, z
  [53940x10]
mapping:  x = clarity
faceting: facet_grid(. ~ ., FALSE)
  • p contains all information ggplot needs to build a barplot, all we have to do is adding a layer with geom="bar";
p + layer(geom="bar")

  • to this simple barplot we can add the following (optional) aesthetics:
    • colour: color of the borders of the bins
    • fill: color of the filling
    • size:
    • linetype: line type of the borders
    • weight:
    • alpha: transparency of fill [0,1]
p + layer(geom="bar", geom_params=list(fill="steelblue", colour="black", linetype=4, alpha=0.3))

  • but a barplot can visualize more information about the data - so we add a new mapping to p
p <- ggplot(diamonds, aes(x=clarity, fill=color)) # where color is the variable of the dataframe diamonds
  • now the default output looks like:
p + layer(geom="bar")

  • the plot can be customized by the same aesthetics, fill would override the aesthetic mapping of p
  • additional we can add a position adjustment
p + layer(geom="bar", position="dodge") 

  • or
p + layer(geom="bar", position="fill") 

  • if we flip (coordflip()) the underlying coordinate system, we can rotate the plot
p + layer(geom="bar", position="stack") + coord_flip()

  • or we even can change the coordinate system to polar coordinates:
p + layer(geom="bar", position="stack") + coord_polar()

R - filled.contour plot

filled.contour

  • if you have a matrix with nxm dimensions and every entry represents a value of interest and the indices of the entry describe the position, then you can use filled.contour() to visualize the matrix
  • first we create a simple matrix to demonstrate how it works:
x <- y <- 1:10 # create two vectors with the integers from 1 to 10
z <- outer(x,y) # create a matrix as the outer product of the two vectors
z # show the matrix
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    2    3    4    5    6    7    8    9    10
 [2,]    2    4    6    8   10   12   14   16   18    20
 [3,]    3    6    9   12   15   18   21   24   27    30
 [4,]    4    8   12   16   20   24   28   32   36    40
 [5,]    5   10   15   20   25   30   35   40   45    50
 [6,]    6   12   18   24   30   36   42   48   54    60
 [7,]    7   14   21   28   35   42   49   56   63    70
 [8,]    8   16   24   32   40   48   56   64   72    80
 [9,]    9   18   27   36   45   54   63   72   81    90
[10,]   10   20   30   40   50   60   70   80   90   100
  • now let us plot the matrix:
filled.contour(z)

  • the matrix is mapped to a 1x1 square: the entry with the indices (1,1) and the value 1 to the point with the coordinates (0,0) and the entry with the indices (10,10) and the value (100) to the points with the coordinates (1,1); if you want to change the axes you can add a x and a y argument to the plot;so let's use the vectors x and y from above (with the entries 1 to 10) and also change to colors
  • filled.contour() uses the cm palette as default (cyan for low, magenta for heigh values);
filled.contour(x=x,y=y,z, col=heat.colors(20))

  • more palettes you find here

And a last example

my.seq <- seq(-pi, pi, length=50) # creating a vector as a sequence from 0 to 2*pi with 50 entries
my.seq2 <- seq(-0.5*pi, 1.5*pi, length=50) # creating a vector as a sequence from 0 to 2*pi with 50 entries
my.matrix <- outer(sin(my.seq),cos(my.seq2)) # creating the matrix using sin, cos and outer
filled.contour(x=my.seq, y=my.seq2,my.matrix, plot.title=title(main="Products Sin(x)*Cos(y) on [-Pi,Pi]x[-0.5Pi,1.5Pi]", xlab="x", ylab="y"), key.title=title(main="products"), col=terrain.colors(20))

Latex Beamer - package beamercolor

Manual.
You can download the .sty file: here BeamerColor.sty
create a folder in your latex file directory
/usr/share/texmf-texlive/tex/latex/
the path can differ from the path above (e.g. if you use TeTeX).
copy the example.sty file in this directory
cp /home/me/pathtopackage/example.sty /usr/share/texmf-texlive/tex/latex/

run
sudo mktexlsr
add the line \usepackage{example} to you document

Friday, July 8, 2011

RExcel - do not work after installing new R version

When a new version of R is installed, a few things have to be done.

rcom and rscprooxy have to be installed in the new version.
 install.packages(c("rcom", "rscproxy")) 

the line
library(rcom) 
has to be added to the etc\RProfile.site (standard path Win: C:\Programme\R\R-Version\etc\)

comRegisterRegistry() has to be run (in the new version)

bin\RSetReg.exe has to be run  (path Win: C:\Programme\R\R-2.12.1\bin\i386)

Thursday, July 7, 2011

R ggplot2 - simple heatplot

simple heatplot

  • load the library ggplot2, if you have not installed it yet - install it via install.packages("ggplot2")
  • create the example data frame
  • map it to the plot
x <- rep(1:20,20)
y <- rep(1:20, rep(20,20))
z <- rnorm(400, mean=0, sd=15)
df <- data.frame(x=x,y=y,z=z)
ggplot(df, aes(x=x,y=y, fill=z)) + geom_tile()

  • now we can change the colors (colorchart) by add a scale_fill_gradient(low="green", high="red")
ggplot(df, aes(x=x,y=y, fill=z)) + geom_tile()+scale_fill_gradient(low="green", high="red")

  • an a second one (with more soothing colors)
ggplot(df, aes(x=x,y=y, fill=z)) + geom_tile()+scale_fill_gradient(low="aliceblue", high="midnightblue")

Emacs - make buffer names unique - eindeutige Buffernamen

  • add the line 
(require 'uniquify)

to your emacs configuration file ( .emacs in your home directory )

  • restart emacs
  • type: M-x customize-option
  • then choose the variable: uniquify-buffer-name-style
  • set it to one of these: forward or reverse or post-forward or post-forward-angle-brackets (just hit Enter)
  • set for the current session and save for future sessions then exit

R - add a table to a plot

addtable2plot()

  • suppose you have a vector with different dates of events and you want to investigates if there are month with more or less events, the first step maybe is to draw a barplot
  • so let us generate first a vector with the different dates:
  • and than we create a vector with the extracted month (as factor)
my.dat <- as.Date(sample.int(500, 1000, replace=T), origin="2000-01-01") 
# extract month, use the abbreviated form of months
my.months <- months(my.dat, abbreviate=T)
df <- data.frame(date=my.dat, month=my.months)
table(df$month) 
Apr Aug Dez Feb Jan Jul Jun Mai Mär Nov Okt Sep 
122  72  66 103 121  61  51  85 129  59  64  67
  • as one can see the month are in alphabetical order so we have to correct this
  • therefore we use ISOdatetime and format for creating a vector of the 12 months in the correct order and use this vector for creating a labeled factor out of df$month
df$month <- factor(df$month, levels=format(ISOdatetime(2000,1:12,1,0,0,0),"%b"))
# show the first 50 elements of my.
table(df$month)
Jan Feb Mär Apr Mai Jun Jul Aug Sep Okt Nov Dez 
121 103 129 122  85  51  61  72  67  64  59  66
  • now we can create hour barplot (use the package plotrix)
# if it is not installed yet, install it via install.packages("plotrix")
# load the package
library(plotrix) 

# create the table for plotting (the same as above)
my.table <- with(df, table(month)) # equivalent to table(df$month)
# plot it! (choosing the color from here)

barp(my.table, col=colors()[c(462:471,473,477)], xlab="Month", ylab="count")

  • now we create the same plot but add the table
# create the table for the plot - i want the horizontal table so i have to transpose it
m2 <- t(as.matrix(my.table)) 
# barplot
barp(my.table, col=colors()[c(462:471,473,477)], xlab="month", ylab="count")
# add the table: the first two arguments are the position relative to the coord system of the plot
# the third argument is the table (a data frame or a matrix), the bty: "o" draw a box, "n" no box
# bg is the argument for the background 
addtable2plot(3,140,m2, bty="o", bg="aliceblue")

now add the table with ggplot2


require(gridExtra)
require(ggplot2)
# generate random dates
my.dat <- as.Date(sample.int(500, 1000, replace=T), origin="2000-01-01") 
# extract month
my.months <- factor(month(my.dat),levels=1:12,labels=month.abb)
df <- data.frame(date=my.dat, month=my.months)

# prepare table
mytable <- tableGrob(t(as.data.frame(table(df$month))),show.rownames = F,show.colnames = F)

ggplot(df,aes(x=month,fill=month)) +
    geom_bar() +
    ylim(0,150) +
    annotation_custom(mytable,xmin = 4.5,xmax=15,ymin=130,ymax = 150) +
    scale_fill_manual(values=colors()[c(462:471,473,477)],guide="none")
ggsave("tabletoplot.png")



R - generate a vector with names of months (like LETTERS)

> format(ISOdatetime(2000,1:12,1,0,0,0),"%b")
gives:
 [1] "Jan" "Feb" "Mrz" "Apr" "Mai" "Jun" "Jul" "Aug" "Sep" "Okt" "Nov" "Dez"
where the month names depend on local settings. If you use %B instead of %b you get the non abbreviated months. How ever, there is also an easier way if you want the months in English: the vector
> month.name
 
gives
 [1] "January"   "February"  "March"     "April"     "May"       "June"     
 [7] "July"      "August"    "September" "October"   "November"  "December" 
or
> month.abb
 
gives
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Wednesday, July 6, 2011

R - ddply

ddply

dataframe in dataframe out

example

  • first we create a data frame with the following columns: id, date, value
  • to create the id column we use rep(), where the first argument 1:5 (the vector (1,2,3,4,5)) is repeated five times, so we have a vector with length 25
  • the vector dates is generated in the following way:
    • generate a vector of random numbers between 1 and 1500 and length 25 (sample.int())
    • use as.Date with the origin argument, this converts the integers into Date, where the the origin is day 0 (in the case 0 is convert to the origin i.e. 2000-01-01, 1 is mapped to 2000-01-02 etc pp, so we get a vector of dates between 2000 and beginning 2005
  • vals are generated by the random numbers function rnorm() (normal distributed with mean 5 and standard deviation 1
  • last but not least put the three together in one data frame (and let us show the first lines)
id <- rep(1:5,5)
dates <- as.Date(sample.int(1500, 25, replace=T), origin="2000-01-01") 
vals <- rnorm(25, mean=5, sd=1)
df <- data.frame(id=id, date=dates, val=vals)
head(df)
  id       date      val
1  1 2001-12-31 5.778680
2  2 2002-08-02 6.982799
3  3 2002-04-23 5.925903
4  4 2000-08-03 3.527375
5  5 2002-08-29 5.239211
6  1 2003-01-28 5.118337
  • id encodes a person, date contains the day of the measurement and val the values
  • now we want to add a column which contains the the first measurement in time
  • therefore we had to load plyr, then we use the function ddply() (the meaning of the first two letters is data frame in data frame out)
  • the first argument of the function is the data frame we pass to ddply
  • the second argument defines the groups (we want the min of each person so our grouping variable is id)
  • transform says we want to change existing data frame - like recode a variable or add a new (another choice would be summarise - if we want to aggregate the data) - so we add a variable named start and it should be the min() of our date per person
res <- ddply(df, .(id), transform, start=min(date))
res
iddatevalstart
12001-12-315.778679592025192001-03-07
12003-01-285.118336553288972001-03-07
12003-11-143.750754634375052001-03-07
12001-03-073.363050937734682001-03-07
12001-03-076.61415837892332001-03-07
22002-08-026.982798845791672000-05-29
22003-12-305.44500191386462000-05-29
22000-05-295.925679827166672000-05-29
22000-11-264.601699565975442000-05-29
22002-08-045.023958124582000-05-29
32002-04-235.925903245936592000-02-02
32003-04-286.592460569591672000-02-02
32002-10-205.897804643006622000-02-02
32000-02-024.05486188162692000-02-02
32000-05-305.47226334023782000-02-02
42000-08-033.527374580480372000-08-03
42002-11-023.988538941622852000-08-03
42002-11-134.543730885196552000-08-03
42003-08-024.21442451843462000-08-03
42003-12-194.987674782982792000-08-03
52002-08-295.239211212098212000-04-27
52000-04-274.204002414119612000-04-27
52003-07-266.18219572726462000-04-27
52003-05-184.535154046383622000-04-27
52001-07-233.066952749405012000-04-27
  • if we change the function just a little we get the days elapsed from the first measurment:
res <- ddply(df, .(id), transform, dayselapsed=date-min(date))
res
iddatevaldayselapsed
12001-12-315.77867959202519299
12003-01-285.11833655328897692
12003-11-143.75075463437505982
12001-03-073.363050937734680
12001-03-076.61415837892330
22002-08-026.98279884579167795
22003-12-305.44500191386461310
22000-05-295.925679827166670
22000-11-264.60169956597544181
22002-08-045.02395812458797
32002-04-235.92590324593659811
32003-04-286.592460569591671181
32002-10-205.89780464300662991
32000-02-024.05486188162690
32000-05-305.4722633402378118
42000-08-033.527374580480370
42002-11-023.98853894162285821
42002-11-134.54373088519655832
42003-08-024.21442451843461094
42003-12-194.987674782982791233
52002-08-295.23921121209821854
52000-04-274.204002414119610
52003-07-266.18219572726461185
52003-05-184.535154046383621116
52001-07-233.06695274940501452