my personal notepad - for all the things I used to write on pieces of paper ( which where never seen again esp. when I needed them ) ...
Friday, July 26, 2013
cdc - now available: 2012 BRFSS data
Data and documentation are available here:
http://www.cdc.gov/brfss/annual_data/annual_2012.html
And this is the link to the
maps 1984 - 2012
Monday, July 22, 2013
R ggbio - visualize a single chromosome
plotIdeogram()
is a wrapper function, synonyms toplotSingleChrom()
- requires packages
rtracklayer, GenomicRanges, BioGenerics, parallel
- takes a name of a genome as argument and downloads the cytoband table from UCSC
require(ggbio) p <- plotIdeogram()
1: hg19 2: hg18 3: hg17 4: hg16 5: vicPac1 6: dasNov3 7: otoGar3 8: papHam1 9: papAnu2 10: felCat5 11: felCat4 12: felCat3 13: panTro4 14: panTro3 15: panTro2 16: panTro1 17: bosTau7 18: bosTau6 19: bosTau4 20: bosTau3 21: bosTau2 22: canFam3 23: canFam2 24: canFam1 25: turTru2 26: loxAfr3 27: musFur1 28: nomLeu3 29: nomLeu2 30: nomLeu1 31: gorGor3 32: cavPor3 33: eriEur1 34: equCab2 35: equCab1 36: dipOrd1 37: triMan1 38: calJac3 39: calJac1 40: pteVam1 41: myoLuc2 42: mm10 43: mm9 44: mm8 45: mm7 46: micMur1 47: hetGla2 48: hetGla1 49: monDom5 50: monDom4 51: monDom1 52: ponAbe2 53: ailMel1 54: susScr3 55: susScr2 56: ochPri2 57: ornAna1 58: oryCun2 59: rn5 60: rn4 61: rn3 62: rheMac3 63: rheMac2 64: proCap1 65: oviAri1 66: sorAra1 67: choHof1 68: speTri2 69: saiBol1 70: sorAra1 71: sarHar1 72: echTel1 73: tupBel1 74: macEug2 75: cerSim1 76: gadMor1 77: melUnd1 78: galGal4 79: galGal3 80: galGal2 81: latCha1 82: fr3 83: fr2 84: fr1 85: petMar2 86: petMar1 87: anoCar2 88: anoCar1 89: oryLat2 90: geoFor1 91: oreNil2 92: chrPic1 93: gasAcu1 94: tetNig2 95: tetNig1 96: melGal1 97: xenTro3 98: xenTro2 99: xenTro1 100: taeGut1 101: danRer7 102: danRer6 103: danRer5 104: danRer4 105: danRer3 106: ci2 107: ci1 108: braFlo1 109: strPur2 110: strPur1 111: apiMel2 112: apiMel1 113: anoGam1 114: droAna2 115: droAna1 116: droEre1 117: droGri1 118: dm3 119: dm2 120: dm1 121: droMoj2 122: droMoj1 123: droPer1 124: dp3 125: dp2 126: droSec1 127: droSim1 128: droVir2 129: droVir1 130: droYak2 131: droYak1 132: caePb2 133: caePb1 134: cb3 135: cb1 136: ce10 137: ce6 138: ce4 139: ce2 140: caeJap1 141: caeRem3 142: caeRem2 143: priPac1 144: aplCal1 145: sacCer3 146: sacCer2 147: sacCer1
Selection: 1
Loading…
Done
use chr1 automatically
- plot chromosom 1 from the hg19 genome (chromosom 1 is the default)
p <- plotIdeogram(genome="hg19") p
- plot second chromosome
p <- plotIdeogram(genome="hg19",subchr="chr2" ) p
- get rid of the cytoband
p <- plotIdeogram(genome="hg19",subchr="chr2",cytoband=F) p
- or change aspect ratio (you can change it also afterwards via
+ theme(aspect.ratio= )
p <- plotIdeogram(genome="hg19",subchr="chr2",aspect.ratio=1/10 ) p
- have a closer look at the structure of
p
attributes(p)
$.S3Class [1] "gg" "ggplot" $class [1] "ideogram" attr(,"package") [1] "ggbio" $names [1] "data" "layers" "scales" "mapping" "theme" [6] "coordinates" "facet" "plot_env" "labels" $xlab [1] "" $main [1] "" $ideogram $xlabel [1] FALSE $ideogram.data GRanges with 62 ranges and 2 metadata columns: seqnames ranges strand | name gieStain <Rle> <IRanges> <Rle> | <factor> <factor> [1] chr2 [ 0, 4400000] * | p25.3 gneg [2] chr2 [ 4400000, 7100000] * | p25.2 gpos50 [3] chr2 [ 7100000, 12200000] * | p25.1 gneg [4] chr2 [12200000, 16700000] * | p24.3 gpos75 [5] chr2 [16700000, 19200000] * | p24.2 gneg ... ... ... ... ... ... ... [58] chr2 [225200000, 226100000] * | q36.2 gneg [59] chr2 [226100000, 231000000] * | q36.3 gpos100 [60] chr2 [231000000, 235600000] * | q37.1 gneg [61] chr2 [235600000, 237300000] * | q37.2 gpos50 [62] chr2 [237300000, 243199373] * | q37.3 gneg --- seqlengths: chr1 chr10 chr11 chr12 chr13 chr14 ... chr6 chr7 chr8 chr9 chrX chrY NA NA NA NA NA NA ... NA NA NA NA NA NA $subchr [1] "chr2" $aspect.ratio [1] 0.1 $color [1] "red" $fill [1] "red" $alpha [1] 0.7 $size [1] 1 $zoom.offset [1] 0.1
- so we see: data are stored within
p
, so it is not necessary to download them again
- change aspect ratio (you can change it also afterwards via
+ theme(aspect.ratio= )
md <- attr(p,"ideogram.data") plotIdeogram(md, aspect.ratio = 0.3)
- so you can always download data and load it later using the
getIdeogram()
function, which has three arguments:
genome
must be one of the result fromucscGenomes()$db
subchr
defines a subset (character vectors)
cytobands
must beTRUE
orFALSE
- you can also use the underlying functions available in the
rtracklayer
package
ucscGenomes()
returns a data frame with information about the available UCSC genomes (identifier, specie, date, official name)
head(ucscGenomes())
db species date name 1 hg19 Human Feb. 2009 Genome Reference Consortium GRCh37 2 hg18 Human Mar. 2006 NCBI Build 36.1 3 hg17 Human May 2004 NCBI Build 35 4 hg16 Human Jul. 2003 NCBI Build 34 5 vicPac1 Alpaca Jul. 2008 Broad Institute VicPac1.0 6 dasNov3 Armadillo Dec. 2011 Broad Institute DasNov3
- plot ideogram of hg19 (data are contained in the
biovizBase
package)
data(hg19IdeogramCyto, package="biovizBase" )
plotIdeogram(hg19IdeogramCyto)
- add
zoom.region
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7) )
- change color of the zoom region (border and filling)
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue" )
- change zoom offset (
zoom.offset
) and size of the frame (size
)
p <- plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue",zoom.offset=4,size=2) p
- visualize chromosome 2 instead
p + xlim(GRanges("chr2", IRanges(1e7, 5e7)))
- add labels on the x-axis
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue",zoom.offset=4,size=2,xlabel=T)
- plot overview
autoplot(seqinfo(hg19Ideogram)[paste("chr",c(14:20,"X","Y"),sep="")])
Friday, July 19, 2013
R ggbio - tracks
- container for a plot you want to align
- constructor
tracks()
- first we produce two plots:
require(ggbio) require(gridExtra) df1 <- data.frame(time = 1:100, score = sin((1:100)/20)*10) p1 <- ggplot(data = df1, aes(x = time, y = score)) + geom_line() df2 <- data.frame(time = 30:120, score = sin((30:120)/20)*10, value = rnorm(120-30 + 1)) p2 <- ggplot(data = df2, aes(x = time, y = score)) + geom_line() + geom_point(size = 2, aes(color = value)) grid.arrange(p1,p2)
- the two plots have different scale on the x-axis
- we want to align them on exactly the same x-axis to make it more easy to compare each other
tracks(p1,p2)
- now add a
theme
(included in theggbio
package)
tracks(time1=p1,time2=p2) + xlim(1,40) + theme_tracks_sunset()
- add a title
tracks(time1=p1,time2=p2,title="My title")
- set background colors (for each of the single plots)
tracks(time1=p1,time2=p2,track.plot.color=c("darkred","darkgreen"))
- set background color (for the whole tracks)
tracks(time1=p1,time2=p2,track.bg.color="darkred")
- set color of the label borders and the filling
tracks(time1=p1,time2=p2,label.bg.color="darkred",label.bg.fill="lightblue")
- set color and size of the label text and the width of the labels
tracks(time1=p1,time2=p2,label.text.color="midnightblue",label.text.cex=2,label.width=unit(2.5,"cm"))
- other zoom methods:
- scale along with the limits of a
GRange
- scale along with the limits of a
require(GenomicRanges) gr <- GRanges("chr", IRanges(1, 40)) tracks(time1 = p1, time2 = p2) + xlim(gr)
- scale along with the limits of a
IRange
require(GenomicRanges) gr <- GRanges("chr", IRanges(1, 40)) tracks(time1 = p1, time2 = p2) + xlim(ranges(gr))
- change limits afterwards
require(GenomicRanges) gr <- GRanges("chr", IRanges(1, 40)) trks <- tracks(time1 = p1, time2 = p2) + xlim(ranges(gr)) xlim(trks)
[1] 1 40
xlim(trks) <- c(1,30)
trks
Thursday, July 18, 2013
R - multiple comparisons
along with chapter 16 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor
- main function in R to obtain corrected p-vals
p.adjust
- to get available methods type:
p.adjust.methods
[1] "holm" "hochberg" "hommel" "bonferroni" "BH" [6] "BY" "fdr" "none"
- simulate 10 raw p-vals on gene level
raw.p.vals <- runif(10,0,0.5) names(raw.p.vals) <- paste("gene",1:length(raw.p.vals),sep="") raw.p.vals
gene1 gene2 gene3 gene4 gene5 gene6 gene7 0.30128864 0.45386933 0.11451989 0.17423164 0.36300090 0.32262681 0.24504471 gene8 gene9 gene10 0.27808264 0.08773618 0.49969999
- get the corrected vals for example
- Holm's stepwise correction
- Holm's stepwise correction
p.adjust(raw.p.vals,method="holm")
gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 gene9 gene10 0.8773618 1.0000000
- False Discovery Rate
p.adjust(raw.p.vals,method="fdr")
gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 0.4537511 0.4997000 0.4537511 0.4537511 0.4537511 0.4537511 0.4537511 0.4537511 gene9 gene10 0.4537511 0.4997000
- the
multtest
package (available from bioconductor) contains also a function for correction (mt.rawp2adjp()
)
require(multtest) mymethods <- c("Bonferroni", "Holm", "Hochberg", "SidakSS", "SidakSD","BH", "BY","ABH","TSBH") mt.rawp2adjp(raw.p.vals,proc=mymethods)
$adjp rawp Bonferroni Holm Hochberg SidakSS SidakSD BH BY [1,] 0.08773618 0.8773618 0.8773618 0.4997 0.6007872 0.6007872 0.4537511 1 [2,] 0.11451989 1.0000000 1.0000000 0.4997 0.7036615 0.6653358 0.4537511 1 [3,] 0.17423164 1.0000000 1.0000000 0.4997 0.8525712 0.7837949 0.4537511 1 [4,] 0.24504471 1.0000000 1.0000000 0.4997 0.9398532 0.8602188 0.4537511 1 [5,] 0.27808264 1.0000000 1.0000000 0.4997 0.9615519 0.8602188 0.4537511 1 [6,] 0.30128864 1.0000000 1.0000000 0.4997 0.9722682 0.8602188 0.4537511 1 [7,] 0.32262681 1.0000000 1.0000000 0.4997 0.9796633 0.8602188 0.4537511 1 [8,] 0.36300090 1.0000000 1.0000000 0.4997 0.9890001 0.8602188 0.4537511 1 [9,] 0.45386933 1.0000000 1.0000000 0.4997 0.9976397 0.8602188 0.4997000 1 [10,] 0.49969999 1.0000000 1.0000000 0.4997 0.9990176 0.8602188 0.4997000 1 ABH TSBH_0.05 [1,] NA 0.4537511 [2,] NA 0.4537511 [3,] NA 0.4537511 [4,] NA 0.4537511 [5,] NA 0.4537511 [6,] NA 0.4537511 [7,] NA 0.4537511 [8,] NA 0.4537511 [9,] NA 0.4997000 [10,] NA 0.4997000 $index [1] 9 3 4 7 8 1 6 5 2 10 $h0.ABH [1] NA $h0.TSBH h0.TSBH_0.05 10 Warnmeldung: In min(which(diff(h0.m, na.rm = TRUE) > 0), na.rm = TRUE) : kein nicht-fehlendes Argument für min; gebe Inf zurück
Wednesday, July 17, 2013
Ubuntu - merge several pdf
gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=app.pdf -dBATCH application.pdf 201302cv.pdf
Sunday, July 7, 2013
R - Representation of microarray data
along with chapter 7 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor
- using the data of T. Golub's paper from 1999 on leukemia classification
- the data is contained in a package and can be installed
source("http://bioconductor.org/biocLite.R") biocLite("golubEsets")
- then load the package and the data
require(golubEsets)
data(Golub_Merge)
Golub_Merge
ExpressionSet (storageMode: lockedEnvironment) assayData: 7129 features, 72 samples element names: exprs protocolData: none phenoData sampleNames: 39 40 ... 33 (72 total) varLabels: Samples ALL.AML ... Source (11 total) varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' pubMedIds: 10521349 Annotation: hu6800
Golub_Merge
is of classExpressionSet
- the
ExpressionSet
class is part of theBiobase
package
- this class is designed to combine several different sources of information into one single structure
- is input for many Bioconductor functions
- it consists of:
- expression data from microarray experiments (
assayData
)
- meta data describing samples in experiments (
phenoData
)
- annotations and meta-data about the features on the chip or technology used for the experiment (
featureData,annotation
)
- information related to the protocol used for processing each sample (and usually extracted from manufacturer files,
protocolData
)
- and a exible structure to describe the experiment (
experimentData
)
- expression data from microarray experiments (
- so we can get experiment-level metadata along with the pubmed ID
experimentData(Golub_Merge)
Experiment data Experimenter name: Golub TR et al. Laboratory: Whitehead Contact information: Title: ALL/AML discrimination URL: www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html PMIDs: 10521349 Abstract: A 133 word abstract is available. Use 'abstract' method.
- show the first part of the abstract
substr(abstract(Golub_Merge),1,102)
[1] "Although cancer classification has improved over the past 30 years, there has been no general approach"
- one get get the dimension of the expression data
dim(exprs(Golub_Merge))
- look at the five rows of the first 5 columns
exprs(Golub_Merge)[1:5,1:5]
39 40 42 47 48 AFFX-BioB-5_at -342 -87 22 -243 -130 AFFX-BioB-M_at -200 -248 -153 -218 -177 AFFX-BioB-3_at 41 262 17 -163 -28 AFFX-BioC-5_at 328 295 276 182 266 AFFX-BioC-3_at -224 -226 -211 -289 -170
- retrieve information on experimental phenotypes (again we look only at the first five samples/rows)
pData(Golub_Merge)[1:5,]
Samples ALL.AML BM.PB T.B.cell FAB Date Gender pctBlasts Treatment 39 39 ALL BM B-cell <NA> F NA <NA> 40 40 ALL BM B-cell <NA> 5/16/1980 F NA <NA> 42 42 ALL BM B-cell <NA> <NA> F NA <NA> 47 47 ALL BM B-cell <NA> 9/5/1986 M NA <NA> 48 48 ALL BM B-cell <NA> 2/28/1992 F NA <NA> PS Source 39 0.78 DFCI 40 0.68 DFCI 42 0.42 DFCI 47 0.81 DFCI 48 0.94 DFCI
- how are the assay reporters named?
featureNames(Golub_Merge)[1:5]
[1] "AFFX-BioB-5_at" "AFFX-BioB-M_at" "AFFX-BioB-3_at" "AFFX-BioC-5_at" [5] "AFFX-BioC-3_at"
- how are the samples named?
sampleNames(Golub_Merge)
[1] "39" "40" "42" "47" "48" "49" "41" "43" "44" "45" "46" "70" "71" "72" "68" [16] "69" "67" "55" "56" "59" "52" "53" "51" "50" "54" "57" "58" "60" "61" "65" [31] "66" "63" "64" "62" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" [46] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" [61] "27" "34" "35" "36" "37" "38" "28" "29" "30" "31" "32" "33"
- show the distribution of the primary outcome
table(Golub_Merge$ALL.AML)
ALL AML 47 25
R - annotation of a microarray platform
along with chapter 7 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor
- use the
annotation()
accessor to get the type of microarray used
annotation(Golub_Merge)
- so we use the
hu6800.db
package, which provides detailed information about the hu6800 platform. The package is updated biannually.
- install the package via
biocLite("hu6800.db")
- if you want to see what objects this package supports type
ls("package:hu6800.db")
[1] "hu6800" "hu6800ACCNUM" "hu6800ALIAS2PROBE" [4] "hu6800CHR" "hu6800CHRLENGTHS" "hu6800CHRLOC" [7] "hu6800CHRLOCEND" "hu6800.db" "hu6800_dbconn" [10] "hu6800_dbfile" "hu6800_dbInfo" "hu6800_dbschema" [13] "hu6800ENSEMBL" "hu6800ENSEMBL2PROBE" "hu6800ENTREZID" [16] "hu6800ENZYME" "hu6800ENZYME2PROBE" "hu6800GENENAME" [19] "hu6800GO" "hu6800GO2ALLPROBES" "hu6800GO2PROBE" [22] "hu6800MAP" "hu6800MAPCOUNTS" "hu6800OMIM" [25] "hu6800ORGANISM" "hu6800ORGPKG" "hu6800PATH" [28] "hu6800PATH2PROBE" "hu6800PFAM" "hu6800PMID" [31] "hu6800PMID2PROBE" "hu6800PROSITE" "hu6800REFSEQ" [34] "hu6800SYMBOL" "hu6800UNIGENE" "hu6800UNIPROT"
- to get the mappings between manufacturer identifiers and gene abbreviations you can use
hu6800SYMBOL
andmget()
mget(ls(hu6800SYMBOL)[1:5],hu6800SYMBOL )
$A28102_at [1] "GABRA3" $AB000114_at [1] "OMD" $AB000115_at [1] "IFI44L" $AB000220_at [1] "SEMA3C" $AB000381_s_at [1] "GML"
- you can also use the
DBI
package infrastructure
- list the tables available for NCBI Entrez Gene metadata
- list the tables available for NCBI Entrez Gene metadata
require(DBI) hh <- org.Hs.eg_dbconn() dbListTables(hh)
[1] "accessions" "alias" "chrlengths" [4] "chromosome_locations" "chromosomes" "cytogenetic_locations" [7] "ec" "ensembl" "ensembl2ncbi" [10] "ensembl_prot" "ensembl_trans" "gene_info" [13] "genes" "go" "go_all" [16] "go_bp" "go_bp_all" "go_cc" [19] "go_cc_all" "go_mf" "go_mf_all" [22] "kegg" "map_counts" "map_metadata" [25] "metadata" "ncbi2ensembl" "omim" [28] "pfam" "prosite" "pubmed" [31] "refseq" "sqlite_stat1" "ucsc" [34] "unigene" "uniprot"
- query for genes annotated as sarcoma oncogenes
SQL <- "select * from gene_info gi, genes g where gi._id=g._id and gene_name like '%arcoma%oncogene%'" sops <- dbGetQuery(hh,SQL) head(sops)
_id gene_name symbol 1 314 v-raf murine sarcoma 3611 viral oncogene homolog ARAF 2 559 v-raf murine sarcoma viral oncogene homolog B1 BRAF 3 1156 v-crk sarcoma virus CT10 oncogene homolog (avian) CRK 4 1157 v-crk sarcoma virus CT10 oncogene homolog (avian)-like CRKL 5 1834 feline sarcoma oncogene FES 6 1859 Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog FGR _id gene_id 1 314 369 2 559 673 3 1156 1398 4 1157 1399 5 1834 2242 6 1859 2268
- find the LYN gene
sops[grep("LYN",sops$symbol),]
- so we see it's in the 13th row and the gene id is 4067
sops[grep("LYN",sops$symbol),]
- the we can use
get()
to get the probe set identifier
get("4067",revmap(hu6800ENTREZID))
[1] "M16038_at"
- display the expression data of the gene by disease group
boxplot(exprs(Golub_Merge)["M16038_at",]~Golub_Merge$ALL.AML,col=c("darkgreen","gold"),notch=T)
- and the ggplot2 version
require(ggplot2) gmdata <- data.frame(expr=exprs(Golub_Merge)["M16038_at",],ALL.AML=Golub_Merge$ALL.AML) ggplot(gmdata,aes(x=ALL.AML,y=expr,fill=ALL.AML)) + geom_boxplot(notch=T) + scale_fill_manual(values=c("darkgreen","gold")) + xlab("") ggsave("ggboxpl.png")
Friday, July 5, 2013
Ubuntu - Replace string in all files in one directory
use du
grep -rl 'string' somedir/ | xargs sed -i 's/string/string2/g'
Monday, July 1, 2013
Subscribe to:
Posts
(
Atom
)