R, Ruby, Perl und ich: July 2013

Friday, July 26, 2013

cdc - now available: 2012 BRFSS data

Data and documentation are available here:
http://www.cdc.gov/brfss/annual_data/annual_2012.html

And this is the link to the
maps 1984 - 2012

Monday, July 22, 2013

R ggbio - visualize a single chromosome

plotIdeogram() is a wrapper function, synonyms to plotSingleChrom()
requires packages rtracklayer, GenomicRanges, BioGenerics, parallel
takes a name of a genome as argument and downloads the cytoband table from UCSC

require(ggbio)
p <- plotIdeogram()

1: hg19       2: hg18       3: hg17       4: hg16       5: vicPac1
  6: dasNov3    7: otoGar3    8: papHam1    9: papAnu2   10: felCat5
 11: felCat4   12: felCat3   13: panTro4   14: panTro3   15: panTro2
 16: panTro1   17: bosTau7   18: bosTau6   19: bosTau4   20: bosTau3
 21: bosTau2   22: canFam3   23: canFam2   24: canFam1   25: turTru2
 26: loxAfr3   27: musFur1   28: nomLeu3   29: nomLeu2   30: nomLeu1
 31: gorGor3   32: cavPor3   33: eriEur1   34: equCab2   35: equCab1
 36: dipOrd1   37: triMan1   38: calJac3   39: calJac1   40: pteVam1
 41: myoLuc2   42: mm10      43: mm9       44: mm8       45: mm7    
 46: micMur1   47: hetGla2   48: hetGla1   49: monDom5   50: monDom4
 51: monDom1   52: ponAbe2   53: ailMel1   54: susScr3   55: susScr2
 56: ochPri2   57: ornAna1   58: oryCun2   59: rn5       60: rn4    
 61: rn3       62: rheMac3   63: rheMac2   64: proCap1   65: oviAri1
 66: sorAra1   67: choHof1   68: speTri2   69: saiBol1   70: sorAra1
 71: sarHar1   72: echTel1   73: tupBel1   74: macEug2   75: cerSim1
 76: gadMor1   77: melUnd1   78: galGal4   79: galGal3   80: galGal2
 81: latCha1   82: fr3       83: fr2       84: fr1       85: petMar2
 86: petMar1   87: anoCar2   88: anoCar1   89: oryLat2   90: geoFor1
 91: oreNil2   92: chrPic1   93: gasAcu1   94: tetNig2   95: tetNig1
 96: melGal1   97: xenTro3   98: xenTro2   99: xenTro1  100: taeGut1
101: danRer7  102: danRer6  103: danRer5  104: danRer4  105: danRer3
106: ci2      107: ci1      108: braFlo1  109: strPur2  110: strPur1
111: apiMel2  112: apiMel1  113: anoGam1  114: droAna2  115: droAna1
116: droEre1  117: droGri1  118: dm3      119: dm2      120: dm1    
121: droMoj2  122: droMoj1  123: droPer1  124: dp3      125: dp2    
126: droSec1  127: droSim1  128: droVir2  129: droVir1  130: droYak2
131: droYak1  132: caePb2   133: caePb1   134: cb3      135: cb1    
136: ce10     137: ce6      138: ce4      139: ce2      140: caeJap1
141: caeRem3  142: caeRem2  143: priPac1  144: aplCal1  145: sacCer3
146: sacCer2  147: sacCer1

Selection: 1
Loading…
Done
use chr1 automatically

plot chromosom 1 from the hg19 genome (chromosom 1 is the default)

p <- plotIdeogram(genome="hg19")
p

plot second chromosome

p <- plotIdeogram(genome="hg19",subchr="chr2" )
p

get rid of the cytoband

p <- plotIdeogram(genome="hg19",subchr="chr2",cytoband=F)
p

or change aspect ratio (you can change it also afterwards via + theme(aspect.ratio= )

p <- plotIdeogram(genome="hg19",subchr="chr2",aspect.ratio=1/10 )
p

have a closer look at the structure of p

attributes(p)

$.S3Class
[1] "gg"     "ggplot"

$class
[1] "ideogram"
attr(,"package")
[1] "ggbio"

$names
[1] "data"        "layers"      "scales"      "mapping"     "theme"      
[6] "coordinates" "facet"       "plot_env"    "labels"     

$xlab
[1] ""

$main
[1] ""

$ideogram

$xlabel
[1] FALSE

$ideogram.data
GRanges with 62 ranges and 2 metadata columns:
       seqnames                 ranges strand   |     name gieStain
          <Rle>              <IRanges>  <Rle>   | <factor> <factor>
   [1]     chr2   [       0,  4400000]      *   |    p25.3     gneg
   [2]     chr2   [ 4400000,  7100000]      *   |    p25.2   gpos50
   [3]     chr2   [ 7100000, 12200000]      *   |    p25.1     gneg
   [4]     chr2   [12200000, 16700000]      *   |    p24.3   gpos75
   [5]     chr2   [16700000, 19200000]      *   |    p24.2     gneg
   ...      ...                    ...    ... ...      ...      ...
  [58]     chr2 [225200000, 226100000]      *   |    q36.2     gneg
  [59]     chr2 [226100000, 231000000]      *   |    q36.3  gpos100
  [60]     chr2 [231000000, 235600000]      *   |    q37.1     gneg
  [61]     chr2 [235600000, 237300000]      *   |    q37.2   gpos50
  [62]     chr2 [237300000, 243199373]      *   |    q37.3     gneg
  ---
  seqlengths:
    chr1 chr10 chr11 chr12 chr13 chr14 ...  chr6  chr7  chr8  chr9  chrX  chrY
      NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA    NA

$subchr
[1] "chr2"

$aspect.ratio
[1] 0.1

$color
[1] "red"

$fill
[1] "red"

$alpha
[1] 0.7

$size
[1] 1

$zoom.offset
[1] 0.1

so we see: data are stored within p, so it is not necessary to download them again
change aspect ratio (you can change it also afterwards via + theme(aspect.ratio= )

md <- attr(p,"ideogram.data")
plotIdeogram(md, aspect.ratio = 0.3)

so you can always download data and load it later using the getIdeogram() function, which has three arguments:
- genome must be one of the result from ucscGenomes()$db
- subchr defines a subset (character vectors)
- cytobands must be TRUE or FALSE
you can also use the underlying functions available in the rtracklayer package
ucscGenomes() returns a data frame with information about the available UCSC genomes (identifier, specie, date, official name)

head(ucscGenomes())

db   species      date                               name
1    hg19     Human Feb. 2009 Genome Reference Consortium GRCh37
2    hg18     Human Mar. 2006                    NCBI Build 36.1
3    hg17     Human  May 2004                      NCBI Build 35
4    hg16     Human Jul. 2003                      NCBI Build 34
5 vicPac1    Alpaca Jul. 2008          Broad Institute VicPac1.0
6 dasNov3 Armadillo Dec. 2011            Broad Institute DasNov3

plot ideogram of hg19 (data are contained in the biovizBase package)

data(hg19IdeogramCyto, package="biovizBase" )
plotIdeogram(hg19IdeogramCyto)

add zoom.region

plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7) )

change color of the zoom region (border and filling)

plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue" )

change zoom offset (zoom.offset) and size of the frame (size)

p <- plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue",zoom.offset=4,size=2)
p

visualize chromosome 2 instead

p + xlim(GRanges("chr2", IRanges(1e7, 5e7)))

add labels on the x-axis

plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue",zoom.offset=4,size=2,xlabel=T)

plot overview

autoplot(seqinfo(hg19Ideogram)[paste("chr",c(14:20,"X","Y"),sep="")])

Friday, July 19, 2013

R ggbio - tracks

container for a plot you want to align
constructor tracks()
first we produce two plots:

require(ggbio)
require(gridExtra)
df1 <- data.frame(time = 1:100, score = sin((1:100)/20)*10)
p1 <- ggplot(data = df1, aes(x = time, y = score)) +
      geom_line()
df2 <- data.frame(time = 30:120, score = sin((30:120)/20)*10, value = rnorm(120-30 + 1))
p2 <- ggplot(data = df2, aes(x = time, y = score)) +
    geom_line() +
    geom_point(size = 2, aes(color = value))
grid.arrange(p1,p2)

the two plots have different scale on the x-axis
we want to align them on exactly the same x-axis to make it more easy to compare each other

tracks(p1,p2)

now add a theme (included in the ggbio package)

tracks(time1=p1,time2=p2) +
    xlim(1,40) +
    theme_tracks_sunset()

add a title

tracks(time1=p1,time2=p2,title="My title")

set background colors (for each of the single plots)

tracks(time1=p1,time2=p2,track.plot.color=c("darkred","darkgreen"))

set background color (for the whole tracks)

tracks(time1=p1,time2=p2,track.bg.color="darkred")

set color of the label borders and the filling

tracks(time1=p1,time2=p2,label.bg.color="darkred",label.bg.fill="lightblue")

set color and size of the label text and the width of the labels

tracks(time1=p1,time2=p2,label.text.color="midnightblue",label.text.cex=2,label.width=unit(2.5,"cm"))

other zoom methods:
- scale along with the limits of a GRange

require(GenomicRanges)
gr <- GRanges("chr", IRanges(1, 40))
tracks(time1 = p1, time2 = p2) + xlim(gr)

scale along with the limits of a IRange

require(GenomicRanges)
gr <- GRanges("chr", IRanges(1, 40))
tracks(time1 = p1, time2 = p2) + xlim(ranges(gr))

change limits afterwards

require(GenomicRanges)
gr <- GRanges("chr", IRanges(1, 40))
trks <- tracks(time1 = p1, time2 = p2) + xlim(ranges(gr))
xlim(trks)

[1]  1 40

xlim(trks) <- c(1,30)
trks

Thursday, July 18, 2013

R - multiple comparisons

along with chapter 16 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor

main function in R to obtain corrected p-vals p.adjust
to get available methods type:

p.adjust.methods

[1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
[6] "BY"         "fdr"        "none"

simulate 10 raw p-vals on gene level

raw.p.vals <- runif(10,0,0.5)
names(raw.p.vals) <- paste("gene",1:length(raw.p.vals),sep="")
raw.p.vals

     gene1      gene2      gene3      gene4      gene5      gene6      gene7 
0.30128864 0.45386933 0.11451989 0.17423164 0.36300090 0.32262681 0.24504471 
     gene8      gene9     gene10 
0.27808264 0.08773618 0.49969999

get the corrected vals for example
- Holm's stepwise correction

p.adjust(raw.p.vals,method="holm")

    gene1     gene2     gene3     gene4     gene5     gene6     gene7     gene8 
1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 
    gene9    gene10 
0.8773618 1.0000000

False Discovery Rate

p.adjust(raw.p.vals,method="fdr")

    gene1     gene2     gene3     gene4     gene5     gene6     gene7     gene8 
0.4537511 0.4997000 0.4537511 0.4537511 0.4537511 0.4537511 0.4537511 0.4537511 
    gene9    gene10 
0.4537511 0.4997000

the multtest package (available from bioconductor) contains also a function for correction (mt.rawp2adjp())

require(multtest)
mymethods <- c("Bonferroni", "Holm", "Hochberg", "SidakSS", "SidakSD","BH", "BY","ABH","TSBH")
mt.rawp2adjp(raw.p.vals,proc=mymethods)

$adjp
            rawp Bonferroni      Holm Hochberg   SidakSS   SidakSD        BH BY
 [1,] 0.08773618  0.8773618 0.8773618   0.4997 0.6007872 0.6007872 0.4537511  1
 [2,] 0.11451989  1.0000000 1.0000000   0.4997 0.7036615 0.6653358 0.4537511  1
 [3,] 0.17423164  1.0000000 1.0000000   0.4997 0.8525712 0.7837949 0.4537511  1
 [4,] 0.24504471  1.0000000 1.0000000   0.4997 0.9398532 0.8602188 0.4537511  1
 [5,] 0.27808264  1.0000000 1.0000000   0.4997 0.9615519 0.8602188 0.4537511  1
 [6,] 0.30128864  1.0000000 1.0000000   0.4997 0.9722682 0.8602188 0.4537511  1
 [7,] 0.32262681  1.0000000 1.0000000   0.4997 0.9796633 0.8602188 0.4537511  1
 [8,] 0.36300090  1.0000000 1.0000000   0.4997 0.9890001 0.8602188 0.4537511  1
 [9,] 0.45386933  1.0000000 1.0000000   0.4997 0.9976397 0.8602188 0.4997000  1
[10,] 0.49969999  1.0000000 1.0000000   0.4997 0.9990176 0.8602188 0.4997000  1
      ABH TSBH_0.05
 [1,]  NA 0.4537511
 [2,]  NA 0.4537511
 [3,]  NA 0.4537511
 [4,]  NA 0.4537511
 [5,]  NA 0.4537511
 [6,]  NA 0.4537511
 [7,]  NA 0.4537511
 [8,]  NA 0.4537511
 [9,]  NA 0.4997000
[10,]  NA 0.4997000

$index
 [1]  9  3  4  7  8  1  6  5  2 10

$h0.ABH
[1] NA

$h0.TSBH
h0.TSBH_0.05 
          10 

Warnmeldung:
In min(which(diff(h0.m, na.rm = TRUE) > 0), na.rm = TRUE) :
  kein nicht-fehlendes Argument für min; gebe Inf zurück

Wednesday, July 17, 2013

Ubuntu - merge several pdf

gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=app.pdf -dBATCH application.pdf 201302cv.pdf

Sunday, July 7, 2013

R - Representation of microarray data

along with chapter 7 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor

using the data of T. Golub's paper from 1999 on leukemia classification
the data is contained in a package and can be installed

source("http://bioconductor.org/biocLite.R")
biocLite("golubEsets")

then load the package and the data

require(golubEsets)
data(Golub_Merge)
Golub_Merge

ExpressionSet (storageMode: lockedEnvironment)
assayData: 7129 features, 72 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: 39 40 ... 33 (72 total)
  varLabels: Samples ALL.AML ... Source (11 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 10521349 
Annotation: hu6800

Golub_Merge is of class ExpressionSet
the ExpressionSet class is part of the Biobase package
this class is designed to combine several different sources of information into one single structure
is input for many Bioconductor functions
it consists of:
- expression data from microarray experiments (assayData)
- meta data describing samples in experiments (phenoData)
- annotations and meta-data about the features on the chip or technology used for the experiment (featureData,annotation)
- information related to the protocol used for processing each sample (and usually extracted from manufacturer files, protocolData)
- and a exible structure to describe the experiment (experimentData)
so we can get experiment-level metadata along with the pubmed ID

experimentData(Golub_Merge)

Experiment data
  Experimenter name: Golub TR et al. 
  Laboratory: Whitehead 
  Contact information: 
 
  Title: ALL/AML discrimination 
  URL: www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html 
  PMIDs: 10521349 

  Abstract: A 133 word abstract is available. Use 'abstract' method.

show the first part of the abstract

substr(abstract(Golub_Merge),1,102)

[1] "Although cancer classification has improved over the past 30 years, there has been no general approach"

one get get the dimension of the expression data

dim(exprs(Golub_Merge))

look at the five rows of the first 5 columns

exprs(Golub_Merge)[1:5,1:5]

                 39   40   42   47   48
AFFX-BioB-5_at -342  -87   22 -243 -130
AFFX-BioB-M_at -200 -248 -153 -218 -177
AFFX-BioB-3_at   41  262   17 -163  -28
AFFX-BioC-5_at  328  295  276  182  266
AFFX-BioC-3_at -224 -226 -211 -289 -170

retrieve information on experimental phenotypes (again we look only at the first five samples/rows)

pData(Golub_Merge)[1:5,]

Samples ALL.AML BM.PB T.B.cell  FAB      Date Gender pctBlasts Treatment
39      39     ALL    BM   B-cell <NA>                F        NA      <NA>
40      40     ALL    BM   B-cell <NA> 5/16/1980      F        NA      <NA>
42      42     ALL    BM   B-cell <NA>      <NA>      F        NA      <NA>
47      47     ALL    BM   B-cell <NA>  9/5/1986      M        NA      <NA>
48      48     ALL    BM   B-cell <NA> 2/28/1992      F        NA      <NA>
     PS Source
39 0.78   DFCI
40 0.68   DFCI
42 0.42   DFCI
47 0.81   DFCI
48 0.94   DFCI

how are the assay reporters named?

featureNames(Golub_Merge)[1:5]

[1] "AFFX-BioB-5_at" "AFFX-BioB-M_at" "AFFX-BioB-3_at" "AFFX-BioC-5_at"
[5] "AFFX-BioC-3_at"

how are the samples named?

sampleNames(Golub_Merge)

 [1] "39" "40" "42" "47" "48" "49" "41" "43" "44" "45" "46" "70" "71" "72" "68"
[16] "69" "67" "55" "56" "59" "52" "53" "51" "50" "54" "57" "58" "60" "61" "65"
[31] "66" "63" "64" "62" "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"
[46] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
[61] "27" "34" "35" "36" "37" "38" "28" "29" "30" "31" "32" "33"

show the distribution of the primary outcome

table(Golub_Merge$ALL.AML)

ALL AML 
 47  25

R - annotation of a microarray platform

along with chapter 7 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor

use the annotation() accessor to get the type of microarray used

annotation(Golub_Merge)

so we use the hu6800.db package, which provides detailed information about the hu6800 platform. The package is updated biannually.
install the package via biocLite("hu6800.db")
if you want to see what objects this package supports type

ls("package:hu6800.db")

[1] "hu6800"              "hu6800ACCNUM"        "hu6800ALIAS2PROBE"  
 [4] "hu6800CHR"           "hu6800CHRLENGTHS"    "hu6800CHRLOC"       
 [7] "hu6800CHRLOCEND"     "hu6800.db"           "hu6800_dbconn"      
[10] "hu6800_dbfile"       "hu6800_dbInfo"       "hu6800_dbschema"    
[13] "hu6800ENSEMBL"       "hu6800ENSEMBL2PROBE" "hu6800ENTREZID"     
[16] "hu6800ENZYME"        "hu6800ENZYME2PROBE"  "hu6800GENENAME"     
[19] "hu6800GO"            "hu6800GO2ALLPROBES"  "hu6800GO2PROBE"     
[22] "hu6800MAP"           "hu6800MAPCOUNTS"     "hu6800OMIM"         
[25] "hu6800ORGANISM"      "hu6800ORGPKG"        "hu6800PATH"         
[28] "hu6800PATH2PROBE"    "hu6800PFAM"          "hu6800PMID"         
[31] "hu6800PMID2PROBE"    "hu6800PROSITE"       "hu6800REFSEQ"       
[34] "hu6800SYMBOL"        "hu6800UNIGENE"       "hu6800UNIPROT"

to get the mappings between manufacturer identifiers and gene abbreviations you can use hu6800SYMBOL and mget()

mget(ls(hu6800SYMBOL)[1:5],hu6800SYMBOL )

$A28102_at
[1] "GABRA3"

$AB000114_at
[1] "OMD"

$AB000115_at
[1] "IFI44L"

$AB000220_at
[1] "SEMA3C"

$AB000381_s_at
[1] "GML"

you can also use the DBI package infrastructure
- list the tables available for NCBI Entrez Gene metadata

require(DBI)
hh <- org.Hs.eg_dbconn()
dbListTables(hh)

[1] "accessions"            "alias"                 "chrlengths"           
 [4] "chromosome_locations"  "chromosomes"           "cytogenetic_locations"
 [7] "ec"                    "ensembl"               "ensembl2ncbi"         
[10] "ensembl_prot"          "ensembl_trans"         "gene_info"            
[13] "genes"                 "go"                    "go_all"               
[16] "go_bp"                 "go_bp_all"             "go_cc"                
[19] "go_cc_all"             "go_mf"                 "go_mf_all"            
[22] "kegg"                  "map_counts"            "map_metadata"         
[25] "metadata"              "ncbi2ensembl"          "omim"                 
[28] "pfam"                  "prosite"               "pubmed"               
[31] "refseq"                "sqlite_stat1"          "ucsc"                 
[34] "unigene"               "uniprot"

query for genes annotated as sarcoma oncogenes

SQL <- "select * from gene_info gi, genes g where gi._id=g._id and gene_name like '%arcoma%oncogene%'"
sops <- dbGetQuery(hh,SQL)
head(sops)

_id                                                     gene_name symbol
1  314              v-raf murine sarcoma 3611 viral oncogene homolog   ARAF
2  559                v-raf murine sarcoma viral oncogene homolog B1   BRAF
3 1156             v-crk sarcoma virus CT10 oncogene homolog (avian)    CRK
4 1157        v-crk sarcoma virus CT10 oncogene homolog (avian)-like   CRKL
5 1834                                       feline sarcoma oncogene    FES
6 1859 Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog    FGR
   _id gene_id
1  314     369
2  559     673
3 1156    1398
4 1157    1399
5 1834    2242
6 1859    2268

find the LYN gene

sops[grep("LYN",sops$symbol),]

so we see it's in the 13th row and the gene id is 4067

sops[grep("LYN",sops$symbol),]

the we can use get() to get the probe set identifier

get("4067",revmap(hu6800ENTREZID))

[1] "M16038_at"

display the expression data of the gene by disease group

boxplot(exprs(Golub_Merge)["M16038_at",]~Golub_Merge$ALL.AML,col=c("darkgreen","gold"),notch=T)

and the ggplot2 version

require(ggplot2)
gmdata <- data.frame(expr=exprs(Golub_Merge)["M16038_at",],ALL.AML=Golub_Merge$ALL.AML)
ggplot(gmdata,aes(x=ALL.AML,y=expr,fill=ALL.AML)) +
    geom_boxplot(notch=T) +
    scale_fill_manual(values=c("darkgreen","gold")) +
    xlab("")
ggsave("ggboxpl.png")

Friday, July 5, 2013

Ubuntu - Replace string in all files in one directory

use du

grep -rl 'string' somedir/ | xargs sed -i 's/string/string2/g'

Monday, July 1, 2013

R search

RDocumentation.org

RSeek

Subscribe to: Posts ( Atom )