Monday, July 22, 2013

R ggbio - visualize a single chromosome

  • plotIdeogram() is a wrapper function, synonyms to plotSingleChrom()
  • requires packages rtracklayer, GenomicRanges, BioGenerics, parallel
  • takes a name of a genome as argument and downloads the cytoband table from UCSC
require(ggbio)
p <- plotIdeogram()
1: hg19       2: hg18       3: hg17       4: hg16       5: vicPac1
  6: dasNov3    7: otoGar3    8: papHam1    9: papAnu2   10: felCat5
 11: felCat4   12: felCat3   13: panTro4   14: panTro3   15: panTro2
 16: panTro1   17: bosTau7   18: bosTau6   19: bosTau4   20: bosTau3
 21: bosTau2   22: canFam3   23: canFam2   24: canFam1   25: turTru2
 26: loxAfr3   27: musFur1   28: nomLeu3   29: nomLeu2   30: nomLeu1
 31: gorGor3   32: cavPor3   33: eriEur1   34: equCab2   35: equCab1
 36: dipOrd1   37: triMan1   38: calJac3   39: calJac1   40: pteVam1
 41: myoLuc2   42: mm10      43: mm9       44: mm8       45: mm7    
 46: micMur1   47: hetGla2   48: hetGla1   49: monDom5   50: monDom4
 51: monDom1   52: ponAbe2   53: ailMel1   54: susScr3   55: susScr2
 56: ochPri2   57: ornAna1   58: oryCun2   59: rn5       60: rn4    
 61: rn3       62: rheMac3   63: rheMac2   64: proCap1   65: oviAri1
 66: sorAra1   67: choHof1   68: speTri2   69: saiBol1   70: sorAra1
 71: sarHar1   72: echTel1   73: tupBel1   74: macEug2   75: cerSim1
 76: gadMor1   77: melUnd1   78: galGal4   79: galGal3   80: galGal2
 81: latCha1   82: fr3       83: fr2       84: fr1       85: petMar2
 86: petMar1   87: anoCar2   88: anoCar1   89: oryLat2   90: geoFor1
 91: oreNil2   92: chrPic1   93: gasAcu1   94: tetNig2   95: tetNig1
 96: melGal1   97: xenTro3   98: xenTro2   99: xenTro1  100: taeGut1
101: danRer7  102: danRer6  103: danRer5  104: danRer4  105: danRer3
106: ci2      107: ci1      108: braFlo1  109: strPur2  110: strPur1
111: apiMel2  112: apiMel1  113: anoGam1  114: droAna2  115: droAna1
116: droEre1  117: droGri1  118: dm3      119: dm2      120: dm1    
121: droMoj2  122: droMoj1  123: droPer1  124: dp3      125: dp2    
126: droSec1  127: droSim1  128: droVir2  129: droVir1  130: droYak2
131: droYak1  132: caePb2   133: caePb1   134: cb3      135: cb1    
136: ce10     137: ce6      138: ce4      139: ce2      140: caeJap1
141: caeRem3  142: caeRem2  143: priPac1  144: aplCal1  145: sacCer3
146: sacCer2  147: sacCer1  

Selection: 1
Loading…
Done
use chr1 automatically

  • plot chromosom 1 from the hg19 genome (chromosom 1 is the default)
p <- plotIdeogram(genome="hg19")
p

  • plot second chromosome
p <- plotIdeogram(genome="hg19",subchr="chr2" )
p

  • get rid of the cytoband
p <- plotIdeogram(genome="hg19",subchr="chr2",cytoband=F)
p

  • or change aspect ratio (you can change it also afterwards via + theme(aspect.ratio= )
p <- plotIdeogram(genome="hg19",subchr="chr2",aspect.ratio=1/10 )
p

  • have a closer look at the structure of p
attributes(p)
$.S3Class
[1] "gg"     "ggplot"

$class
[1] "ideogram"
attr(,"package")
[1] "ggbio"

$names
[1] "data"        "layers"      "scales"      "mapping"     "theme"      
[6] "coordinates" "facet"       "plot_env"    "labels"     

$xlab
[1] ""

$main
[1] ""

$ideogram

$xlabel
[1] FALSE

$ideogram.data
GRanges with 62 ranges and 2 metadata columns:
       seqnames                 ranges strand   |     name gieStain
          <Rle>              <IRanges>  <Rle>   | <factor> <factor>
   [1]     chr2   [       0,  4400000]      *   |    p25.3     gneg
   [2]     chr2   [ 4400000,  7100000]      *   |    p25.2   gpos50
   [3]     chr2   [ 7100000, 12200000]      *   |    p25.1     gneg
   [4]     chr2   [12200000, 16700000]      *   |    p24.3   gpos75
   [5]     chr2   [16700000, 19200000]      *   |    p24.2     gneg
   ...      ...                    ...    ... ...      ...      ...
  [58]     chr2 [225200000, 226100000]      *   |    q36.2     gneg
  [59]     chr2 [226100000, 231000000]      *   |    q36.3  gpos100
  [60]     chr2 [231000000, 235600000]      *   |    q37.1     gneg
  [61]     chr2 [235600000, 237300000]      *   |    q37.2   gpos50
  [62]     chr2 [237300000, 243199373]      *   |    q37.3     gneg
  ---
  seqlengths:
    chr1 chr10 chr11 chr12 chr13 chr14 ...  chr6  chr7  chr8  chr9  chrX  chrY
      NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA    NA

$subchr
[1] "chr2"

$aspect.ratio
[1] 0.1

$color
[1] "red"

$fill
[1] "red"

$alpha
[1] 0.7

$size
[1] 1

$zoom.offset
[1] 0.1


  • so we see: data are stored within p, so it is not necessary to download them again
  • change aspect ratio (you can change it also afterwards via + theme(aspect.ratio= )
md <- attr(p,"ideogram.data")
plotIdeogram(md, aspect.ratio = 0.3)

  • so you can always download data and load it later using the getIdeogram() function, which has three arguments:
    • genome must be one of the result from ucscGenomes()$db
    • subchr defines a subset (character vectors)
    • cytobands must be TRUE or FALSE

  • you can also use the underlying functions available in the rtracklayer package
  • ucscGenomes() returns a data frame with information about the available UCSC genomes (identifier, specie, date, official name)
head(ucscGenomes())
db   species      date                               name
1    hg19     Human Feb. 2009 Genome Reference Consortium GRCh37
2    hg18     Human Mar. 2006                    NCBI Build 36.1
3    hg17     Human  May 2004                      NCBI Build 35
4    hg16     Human Jul. 2003                      NCBI Build 34
5 vicPac1    Alpaca Jul. 2008          Broad Institute VicPac1.0
6 dasNov3 Armadillo Dec. 2011            Broad Institute DasNov3
  • plot ideogram of hg19 (data are contained in the biovizBase package)
data(hg19IdeogramCyto, package="biovizBase" )
plotIdeogram(hg19IdeogramCyto)

  • add zoom.region
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7) )

  • change color of the zoom region (border and filling)
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue" )

  • change zoom offset (zoom.offset) and size of the frame (size)
p <- plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue",zoom.offset=4,size=2)
p

  • visualize chromosome 2 instead
p + xlim(GRanges("chr2", IRanges(1e7, 5e7)))

  • add labels on the x-axis
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue",zoom.offset=4,size=2,xlabel=T)

  • plot overview
autoplot(seqinfo(hg19Ideogram)[paste("chr",c(14:20,"X","Y"),sep="")])

Friday, July 19, 2013

R ggbio - tracks

  • container for a plot you want to align
  • constructor tracks()
  • first we produce two plots:
require(ggbio)
require(gridExtra)
df1 <- data.frame(time = 1:100, score = sin((1:100)/20)*10)
p1 <- ggplot(data = df1, aes(x = time, y = score)) +
      geom_line()
df2 <- data.frame(time = 30:120, score = sin((30:120)/20)*10, value = rnorm(120-30 + 1))
p2 <- ggplot(data = df2, aes(x = time, y = score)) +
    geom_line() +
    geom_point(size = 2, aes(color = value))
grid.arrange(p1,p2)

  • the two plots have different scale on the x-axis
  • we want to align them on exactly the same x-axis to make it more easy to compare each other
tracks(p1,p2)

  • now add a theme (included in the ggbio package)
tracks(time1=p1,time2=p2) +
    xlim(1,40) +
    theme_tracks_sunset()

  • add a title
tracks(time1=p1,time2=p2,title="My title")

  • set background colors (for each of the single plots)
tracks(time1=p1,time2=p2,track.plot.color=c("darkred","darkgreen"))

  • set background color (for the whole tracks)
tracks(time1=p1,time2=p2,track.bg.color="darkred")

  • set color of the label borders and the filling
tracks(time1=p1,time2=p2,label.bg.color="darkred",label.bg.fill="lightblue")

  • set color and size of the label text and the width of the labels
tracks(time1=p1,time2=p2,label.text.color="midnightblue",label.text.cex=2,label.width=unit(2.5,"cm"))

  • other zoom methods:
    • scale along with the limits of a GRange

require(GenomicRanges)
gr <- GRanges("chr", IRanges(1, 40))
tracks(time1 = p1, time2 = p2) + xlim(gr)

  • scale along with the limits of a IRange
require(GenomicRanges)
gr <- GRanges("chr", IRanges(1, 40))
tracks(time1 = p1, time2 = p2) + xlim(ranges(gr))

  • change limits afterwards
require(GenomicRanges)
gr <- GRanges("chr", IRanges(1, 40))
trks <- tracks(time1 = p1, time2 = p2) + xlim(ranges(gr))
xlim(trks)
[1]  1 40
xlim(trks) <- c(1,30)
trks

Thursday, July 18, 2013

R - multiple comparisons

along with chapter 16 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor
  • main function in R to obtain corrected p-vals p.adjust
  • to get available methods type:
p.adjust.methods
[1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
[6] "BY"         "fdr"        "none"
  • simulate 10 raw p-vals on gene level
raw.p.vals <- runif(10,0,0.5)
names(raw.p.vals) <- paste("gene",1:length(raw.p.vals),sep="")
raw.p.vals
     gene1      gene2      gene3      gene4      gene5      gene6      gene7 
0.30128864 0.45386933 0.11451989 0.17423164 0.36300090 0.32262681 0.24504471 
     gene8      gene9     gene10 
0.27808264 0.08773618 0.49969999
  • get the corrected vals for example
    • Holm's stepwise correction

p.adjust(raw.p.vals,method="holm")
    gene1     gene2     gene3     gene4     gene5     gene6     gene7     gene8 
1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 
    gene9    gene10 
0.8773618 1.0000000
  • False Discovery Rate
p.adjust(raw.p.vals,method="fdr")
    gene1     gene2     gene3     gene4     gene5     gene6     gene7     gene8 
0.4537511 0.4997000 0.4537511 0.4537511 0.4537511 0.4537511 0.4537511 0.4537511 
    gene9    gene10 
0.4537511 0.4997000
  • the multtest package (available from bioconductor) contains also a function for correction (mt.rawp2adjp())
require(multtest)
mymethods <- c("Bonferroni", "Holm", "Hochberg", "SidakSS", "SidakSD","BH", "BY","ABH","TSBH")
mt.rawp2adjp(raw.p.vals,proc=mymethods)
$adjp
            rawp Bonferroni      Holm Hochberg   SidakSS   SidakSD        BH BY
 [1,] 0.08773618  0.8773618 0.8773618   0.4997 0.6007872 0.6007872 0.4537511  1
 [2,] 0.11451989  1.0000000 1.0000000   0.4997 0.7036615 0.6653358 0.4537511  1
 [3,] 0.17423164  1.0000000 1.0000000   0.4997 0.8525712 0.7837949 0.4537511  1
 [4,] 0.24504471  1.0000000 1.0000000   0.4997 0.9398532 0.8602188 0.4537511  1
 [5,] 0.27808264  1.0000000 1.0000000   0.4997 0.9615519 0.8602188 0.4537511  1
 [6,] 0.30128864  1.0000000 1.0000000   0.4997 0.9722682 0.8602188 0.4537511  1
 [7,] 0.32262681  1.0000000 1.0000000   0.4997 0.9796633 0.8602188 0.4537511  1
 [8,] 0.36300090  1.0000000 1.0000000   0.4997 0.9890001 0.8602188 0.4537511  1
 [9,] 0.45386933  1.0000000 1.0000000   0.4997 0.9976397 0.8602188 0.4997000  1
[10,] 0.49969999  1.0000000 1.0000000   0.4997 0.9990176 0.8602188 0.4997000  1
      ABH TSBH_0.05
 [1,]  NA 0.4537511
 [2,]  NA 0.4537511
 [3,]  NA 0.4537511
 [4,]  NA 0.4537511
 [5,]  NA 0.4537511
 [6,]  NA 0.4537511
 [7,]  NA 0.4537511
 [8,]  NA 0.4537511
 [9,]  NA 0.4997000
[10,]  NA 0.4997000

$index
 [1]  9  3  4  7  8  1  6  5  2 10

$h0.ABH
[1] NA

$h0.TSBH
h0.TSBH_0.05 
          10 

Warnmeldung:
In min(which(diff(h0.m, na.rm = TRUE) > 0), na.rm = TRUE) :
  kein nicht-fehlendes Argument für min; gebe Inf zurück

Wednesday, July 17, 2013

Ubuntu - merge several pdf

gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=app.pdf -dBATCH application.pdf 201302cv.pdf 

Sunday, July 7, 2013

R - Representation of microarray data

along with chapter 7 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor
  • using the data of T. Golub's paper from 1999 on leukemia classification
  • the data is contained in a package and can be installed
source("http://bioconductor.org/biocLite.R")
biocLite("golubEsets")
  • then load the package and the data
require(golubEsets)
data(Golub_Merge)
Golub_Merge  
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7129 features, 72 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: 39 40 ... 33 (72 total)
  varLabels: Samples ALL.AML ... Source (11 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 10521349 
Annotation: hu6800

  • Golub_Merge is of class ExpressionSet
  • the ExpressionSet class is part of the Biobase package
  • this class is designed to combine several different sources of information into one single structure
  • is input for many Bioconductor functions
  • it consists of:
    • expression data from microarray experiments (assayData)
    • meta data describing samples in experiments (phenoData)
    • annotations and meta-data about the features on the chip or technology used for the experiment (featureData,annotation)
    • information related to the protocol used for processing each sample (and usually extracted from manufacturer files, protocolData)
    • and a exible structure to describe the experiment (experimentData)

  • so we can get experiment-level metadata along with the pubmed ID
experimentData(Golub_Merge)
Experiment data
  Experimenter name: Golub TR et al. 
  Laboratory: Whitehead 
  Contact information: 
 
  Title: ALL/AML discrimination 
  URL: www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html 
  PMIDs: 10521349 

  Abstract: A 133 word abstract is available. Use 'abstract' method.

  • show the first part of the abstract
substr(abstract(Golub_Merge),1,102)
[1] "Although cancer classification has improved over the past 30 years, there has been no general approach"
  • one get get the dimension of the expression data
dim(exprs(Golub_Merge))
  • look at the five rows of the first 5 columns
exprs(Golub_Merge)[1:5,1:5]
                 39   40   42   47   48
AFFX-BioB-5_at -342  -87   22 -243 -130
AFFX-BioB-M_at -200 -248 -153 -218 -177
AFFX-BioB-3_at   41  262   17 -163  -28
AFFX-BioC-5_at  328  295  276  182  266
AFFX-BioC-3_at -224 -226 -211 -289 -170
  • retrieve information on experimental phenotypes (again we look only at the first five samples/rows)
pData(Golub_Merge)[1:5,]
Samples ALL.AML BM.PB T.B.cell  FAB      Date Gender pctBlasts Treatment
39      39     ALL    BM   B-cell <NA>                F        NA      <NA>
40      40     ALL    BM   B-cell <NA> 5/16/1980      F        NA      <NA>
42      42     ALL    BM   B-cell <NA>      <NA>      F        NA      <NA>
47      47     ALL    BM   B-cell <NA>  9/5/1986      M        NA      <NA>
48      48     ALL    BM   B-cell <NA> 2/28/1992      F        NA      <NA>
     PS Source
39 0.78   DFCI
40 0.68   DFCI
42 0.42   DFCI
47 0.81   DFCI
48 0.94   DFCI

  • how are the assay reporters named?
featureNames(Golub_Merge)[1:5]
[1] "AFFX-BioB-5_at" "AFFX-BioB-M_at" "AFFX-BioB-3_at" "AFFX-BioC-5_at"
[5] "AFFX-BioC-3_at"
  • how are the samples named?
sampleNames(Golub_Merge)
 [1] "39" "40" "42" "47" "48" "49" "41" "43" "44" "45" "46" "70" "71" "72" "68"
[16] "69" "67" "55" "56" "59" "52" "53" "51" "50" "54" "57" "58" "60" "61" "65"
[31] "66" "63" "64" "62" "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"
[46] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
[61] "27" "34" "35" "36" "37" "38" "28" "29" "30" "31" "32" "33"
  • show the distribution of the primary outcome
table(Golub_Merge$ALL.AML)
ALL AML 
 47  25

R - annotation of a microarray platform

along with chapter 7 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor

  • use the annotation() accessor to get the type of microarray used

annotation(Golub_Merge)

  • so we use the hu6800.db package, which provides detailed information about the hu6800 platform. The package is updated biannually.
  • install the package via biocLite("hu6800.db")
  • if you want to see what objects this package supports type

ls("package:hu6800.db")

[1] "hu6800"              "hu6800ACCNUM"        "hu6800ALIAS2PROBE"  
 [4] "hu6800CHR"           "hu6800CHRLENGTHS"    "hu6800CHRLOC"       
 [7] "hu6800CHRLOCEND"     "hu6800.db"           "hu6800_dbconn"      
[10] "hu6800_dbfile"       "hu6800_dbInfo"       "hu6800_dbschema"    
[13] "hu6800ENSEMBL"       "hu6800ENSEMBL2PROBE" "hu6800ENTREZID"     
[16] "hu6800ENZYME"        "hu6800ENZYME2PROBE"  "hu6800GENENAME"     
[19] "hu6800GO"            "hu6800GO2ALLPROBES"  "hu6800GO2PROBE"     
[22] "hu6800MAP"           "hu6800MAPCOUNTS"     "hu6800OMIM"         
[25] "hu6800ORGANISM"      "hu6800ORGPKG"        "hu6800PATH"         
[28] "hu6800PATH2PROBE"    "hu6800PFAM"          "hu6800PMID"         
[31] "hu6800PMID2PROBE"    "hu6800PROSITE"       "hu6800REFSEQ"       
[34] "hu6800SYMBOL"        "hu6800UNIGENE"       "hu6800UNIPROT"

  • to get the mappings between manufacturer identifiers and gene abbreviations you can use hu6800SYMBOL and mget()
mget(ls(hu6800SYMBOL)[1:5],hu6800SYMBOL )
$A28102_at
[1] "GABRA3"

$AB000114_at
[1] "OMD"

$AB000115_at
[1] "IFI44L"

$AB000220_at
[1] "SEMA3C"

$AB000381_s_at
[1] "GML"
  • you can also use the DBI package infrastructure
    • list the tables available for NCBI Entrez Gene metadata

require(DBI)
hh <- org.Hs.eg_dbconn()
dbListTables(hh)
[1] "accessions"            "alias"                 "chrlengths"           
 [4] "chromosome_locations"  "chromosomes"           "cytogenetic_locations"
 [7] "ec"                    "ensembl"               "ensembl2ncbi"         
[10] "ensembl_prot"          "ensembl_trans"         "gene_info"            
[13] "genes"                 "go"                    "go_all"               
[16] "go_bp"                 "go_bp_all"             "go_cc"                
[19] "go_cc_all"             "go_mf"                 "go_mf_all"            
[22] "kegg"                  "map_counts"            "map_metadata"         
[25] "metadata"              "ncbi2ensembl"          "omim"                 
[28] "pfam"                  "prosite"               "pubmed"               
[31] "refseq"                "sqlite_stat1"          "ucsc"                 
[34] "unigene"               "uniprot"
  • query for genes annotated as sarcoma oncogenes
SQL <- "select * from gene_info gi, genes g where gi._id=g._id and gene_name like '%arcoma%oncogene%'"
sops <- dbGetQuery(hh,SQL)
head(sops)
_id                                                     gene_name symbol
1  314              v-raf murine sarcoma 3611 viral oncogene homolog   ARAF
2  559                v-raf murine sarcoma viral oncogene homolog B1   BRAF
3 1156             v-crk sarcoma virus CT10 oncogene homolog (avian)    CRK
4 1157        v-crk sarcoma virus CT10 oncogene homolog (avian)-like   CRKL
5 1834                                       feline sarcoma oncogene    FES
6 1859 Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog    FGR
   _id gene_id
1  314     369
2  559     673
3 1156    1398
4 1157    1399
5 1834    2242
6 1859    2268
  • find the LYN gene
sops[grep("LYN",sops$symbol),]
  • so we see it's in the 13th row and the gene id is 4067
sops[grep("LYN",sops$symbol),]
  • the we can use get() to get the probe set identifier
get("4067",revmap(hu6800ENTREZID))
[1] "M16038_at"
  • display the expression data of the gene by disease group
boxplot(exprs(Golub_Merge)["M16038_at",]~Golub_Merge$ALL.AML,col=c("darkgreen","gold"),notch=T)

  • and the ggplot2 version
require(ggplot2)
gmdata <- data.frame(expr=exprs(Golub_Merge)["M16038_at",],ALL.AML=Golub_Merge$ALL.AML)
ggplot(gmdata,aes(x=ALL.AML,y=expr,fill=ALL.AML)) +
    geom_boxplot(notch=T) +
    scale_fill_manual(values=c("darkgreen","gold")) +
    xlab("")
ggsave("ggboxpl.png")

Friday, July 5, 2013