my personal notepad - for all the things I used to write on pieces of paper ( which where never seen again esp. when I needed them ) ...
Monday, October 21, 2013
Sunday, October 20, 2013
Thursday, August 8, 2013
Friday, July 26, 2013
cdc - now available: 2012 BRFSS data
Data and documentation are available here:
http://www.cdc.gov/brfss/annual_data/annual_2012.html
And this is the link to the
maps 1984 - 2012
Monday, July 22, 2013
R ggbio - visualize a single chromosome
plotIdeogram()
is a wrapper function, synonyms toplotSingleChrom()
- requires packages
rtracklayer, GenomicRanges, BioGenerics, parallel
- takes a name of a genome as argument and downloads the cytoband table from UCSC
require(ggbio) p <- plotIdeogram()
1: hg19 2: hg18 3: hg17 4: hg16 5: vicPac1 6: dasNov3 7: otoGar3 8: papHam1 9: papAnu2 10: felCat5 11: felCat4 12: felCat3 13: panTro4 14: panTro3 15: panTro2 16: panTro1 17: bosTau7 18: bosTau6 19: bosTau4 20: bosTau3 21: bosTau2 22: canFam3 23: canFam2 24: canFam1 25: turTru2 26: loxAfr3 27: musFur1 28: nomLeu3 29: nomLeu2 30: nomLeu1 31: gorGor3 32: cavPor3 33: eriEur1 34: equCab2 35: equCab1 36: dipOrd1 37: triMan1 38: calJac3 39: calJac1 40: pteVam1 41: myoLuc2 42: mm10 43: mm9 44: mm8 45: mm7 46: micMur1 47: hetGla2 48: hetGla1 49: monDom5 50: monDom4 51: monDom1 52: ponAbe2 53: ailMel1 54: susScr3 55: susScr2 56: ochPri2 57: ornAna1 58: oryCun2 59: rn5 60: rn4 61: rn3 62: rheMac3 63: rheMac2 64: proCap1 65: oviAri1 66: sorAra1 67: choHof1 68: speTri2 69: saiBol1 70: sorAra1 71: sarHar1 72: echTel1 73: tupBel1 74: macEug2 75: cerSim1 76: gadMor1 77: melUnd1 78: galGal4 79: galGal3 80: galGal2 81: latCha1 82: fr3 83: fr2 84: fr1 85: petMar2 86: petMar1 87: anoCar2 88: anoCar1 89: oryLat2 90: geoFor1 91: oreNil2 92: chrPic1 93: gasAcu1 94: tetNig2 95: tetNig1 96: melGal1 97: xenTro3 98: xenTro2 99: xenTro1 100: taeGut1 101: danRer7 102: danRer6 103: danRer5 104: danRer4 105: danRer3 106: ci2 107: ci1 108: braFlo1 109: strPur2 110: strPur1 111: apiMel2 112: apiMel1 113: anoGam1 114: droAna2 115: droAna1 116: droEre1 117: droGri1 118: dm3 119: dm2 120: dm1 121: droMoj2 122: droMoj1 123: droPer1 124: dp3 125: dp2 126: droSec1 127: droSim1 128: droVir2 129: droVir1 130: droYak2 131: droYak1 132: caePb2 133: caePb1 134: cb3 135: cb1 136: ce10 137: ce6 138: ce4 139: ce2 140: caeJap1 141: caeRem3 142: caeRem2 143: priPac1 144: aplCal1 145: sacCer3 146: sacCer2 147: sacCer1
Selection: 1
Loading…
Done
use chr1 automatically
- plot chromosom 1 from the hg19 genome (chromosom 1 is the default)
p <- plotIdeogram(genome="hg19") p
- plot second chromosome
p <- plotIdeogram(genome="hg19",subchr="chr2" ) p
- get rid of the cytoband
p <- plotIdeogram(genome="hg19",subchr="chr2",cytoband=F) p
- or change aspect ratio (you can change it also afterwards via
+ theme(aspect.ratio= )
p <- plotIdeogram(genome="hg19",subchr="chr2",aspect.ratio=1/10 ) p
- have a closer look at the structure of
p
attributes(p)
$.S3Class [1] "gg" "ggplot" $class [1] "ideogram" attr(,"package") [1] "ggbio" $names [1] "data" "layers" "scales" "mapping" "theme" [6] "coordinates" "facet" "plot_env" "labels" $xlab [1] "" $main [1] "" $ideogram $xlabel [1] FALSE $ideogram.data GRanges with 62 ranges and 2 metadata columns: seqnames ranges strand | name gieStain <Rle> <IRanges> <Rle> | <factor> <factor> [1] chr2 [ 0, 4400000] * | p25.3 gneg [2] chr2 [ 4400000, 7100000] * | p25.2 gpos50 [3] chr2 [ 7100000, 12200000] * | p25.1 gneg [4] chr2 [12200000, 16700000] * | p24.3 gpos75 [5] chr2 [16700000, 19200000] * | p24.2 gneg ... ... ... ... ... ... ... [58] chr2 [225200000, 226100000] * | q36.2 gneg [59] chr2 [226100000, 231000000] * | q36.3 gpos100 [60] chr2 [231000000, 235600000] * | q37.1 gneg [61] chr2 [235600000, 237300000] * | q37.2 gpos50 [62] chr2 [237300000, 243199373] * | q37.3 gneg --- seqlengths: chr1 chr10 chr11 chr12 chr13 chr14 ... chr6 chr7 chr8 chr9 chrX chrY NA NA NA NA NA NA ... NA NA NA NA NA NA $subchr [1] "chr2" $aspect.ratio [1] 0.1 $color [1] "red" $fill [1] "red" $alpha [1] 0.7 $size [1] 1 $zoom.offset [1] 0.1
- so we see: data are stored within
p
, so it is not necessary to download them again
- change aspect ratio (you can change it also afterwards via
+ theme(aspect.ratio= )
md <- attr(p,"ideogram.data") plotIdeogram(md, aspect.ratio = 0.3)
- so you can always download data and load it later using the
getIdeogram()
function, which has three arguments:
genome
must be one of the result fromucscGenomes()$db
subchr
defines a subset (character vectors)
cytobands
must beTRUE
orFALSE
- you can also use the underlying functions available in the
rtracklayer
package
ucscGenomes()
returns a data frame with information about the available UCSC genomes (identifier, specie, date, official name)
head(ucscGenomes())
db species date name 1 hg19 Human Feb. 2009 Genome Reference Consortium GRCh37 2 hg18 Human Mar. 2006 NCBI Build 36.1 3 hg17 Human May 2004 NCBI Build 35 4 hg16 Human Jul. 2003 NCBI Build 34 5 vicPac1 Alpaca Jul. 2008 Broad Institute VicPac1.0 6 dasNov3 Armadillo Dec. 2011 Broad Institute DasNov3
- plot ideogram of hg19 (data are contained in the
biovizBase
package)
data(hg19IdeogramCyto, package="biovizBase" )
plotIdeogram(hg19IdeogramCyto)
- add
zoom.region
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7) )
- change color of the zoom region (border and filling)
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue" )
- change zoom offset (
zoom.offset
) and size of the frame (size
)
p <- plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue",zoom.offset=4,size=2) p
- visualize chromosome 2 instead
p + xlim(GRanges("chr2", IRanges(1e7, 5e7)))
- add labels on the x-axis
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue",zoom.offset=4,size=2,xlabel=T)
- plot overview
autoplot(seqinfo(hg19Ideogram)[paste("chr",c(14:20,"X","Y"),sep="")])
Friday, July 19, 2013
R ggbio - tracks
- container for a plot you want to align
- constructor
tracks()
- first we produce two plots:
require(ggbio) require(gridExtra) df1 <- data.frame(time = 1:100, score = sin((1:100)/20)*10) p1 <- ggplot(data = df1, aes(x = time, y = score)) + geom_line() df2 <- data.frame(time = 30:120, score = sin((30:120)/20)*10, value = rnorm(120-30 + 1)) p2 <- ggplot(data = df2, aes(x = time, y = score)) + geom_line() + geom_point(size = 2, aes(color = value)) grid.arrange(p1,p2)
- the two plots have different scale on the x-axis
- we want to align them on exactly the same x-axis to make it more easy to compare each other
tracks(p1,p2)
- now add a
theme
(included in theggbio
package)
tracks(time1=p1,time2=p2) + xlim(1,40) + theme_tracks_sunset()
- add a title
tracks(time1=p1,time2=p2,title="My title")
- set background colors (for each of the single plots)
tracks(time1=p1,time2=p2,track.plot.color=c("darkred","darkgreen"))
- set background color (for the whole tracks)
tracks(time1=p1,time2=p2,track.bg.color="darkred")
- set color of the label borders and the filling
tracks(time1=p1,time2=p2,label.bg.color="darkred",label.bg.fill="lightblue")
- set color and size of the label text and the width of the labels
tracks(time1=p1,time2=p2,label.text.color="midnightblue",label.text.cex=2,label.width=unit(2.5,"cm"))
- other zoom methods:
- scale along with the limits of a
GRange
- scale along with the limits of a
require(GenomicRanges) gr <- GRanges("chr", IRanges(1, 40)) tracks(time1 = p1, time2 = p2) + xlim(gr)
- scale along with the limits of a
IRange
require(GenomicRanges) gr <- GRanges("chr", IRanges(1, 40)) tracks(time1 = p1, time2 = p2) + xlim(ranges(gr))
- change limits afterwards
require(GenomicRanges) gr <- GRanges("chr", IRanges(1, 40)) trks <- tracks(time1 = p1, time2 = p2) + xlim(ranges(gr)) xlim(trks)
[1] 1 40
xlim(trks) <- c(1,30)
trks
Thursday, July 18, 2013
R - multiple comparisons
along with chapter 16 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor
- main function in R to obtain corrected p-vals
p.adjust
- to get available methods type:
p.adjust.methods
[1] "holm" "hochberg" "hommel" "bonferroni" "BH" [6] "BY" "fdr" "none"
- simulate 10 raw p-vals on gene level
raw.p.vals <- runif(10,0,0.5) names(raw.p.vals) <- paste("gene",1:length(raw.p.vals),sep="") raw.p.vals
gene1 gene2 gene3 gene4 gene5 gene6 gene7 0.30128864 0.45386933 0.11451989 0.17423164 0.36300090 0.32262681 0.24504471 gene8 gene9 gene10 0.27808264 0.08773618 0.49969999
- get the corrected vals for example
- Holm's stepwise correction
- Holm's stepwise correction
p.adjust(raw.p.vals,method="holm")
gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 gene9 gene10 0.8773618 1.0000000
- False Discovery Rate
p.adjust(raw.p.vals,method="fdr")
gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 0.4537511 0.4997000 0.4537511 0.4537511 0.4537511 0.4537511 0.4537511 0.4537511 gene9 gene10 0.4537511 0.4997000
- the
multtest
package (available from bioconductor) contains also a function for correction (mt.rawp2adjp()
)
require(multtest) mymethods <- c("Bonferroni", "Holm", "Hochberg", "SidakSS", "SidakSD","BH", "BY","ABH","TSBH") mt.rawp2adjp(raw.p.vals,proc=mymethods)
$adjp rawp Bonferroni Holm Hochberg SidakSS SidakSD BH BY [1,] 0.08773618 0.8773618 0.8773618 0.4997 0.6007872 0.6007872 0.4537511 1 [2,] 0.11451989 1.0000000 1.0000000 0.4997 0.7036615 0.6653358 0.4537511 1 [3,] 0.17423164 1.0000000 1.0000000 0.4997 0.8525712 0.7837949 0.4537511 1 [4,] 0.24504471 1.0000000 1.0000000 0.4997 0.9398532 0.8602188 0.4537511 1 [5,] 0.27808264 1.0000000 1.0000000 0.4997 0.9615519 0.8602188 0.4537511 1 [6,] 0.30128864 1.0000000 1.0000000 0.4997 0.9722682 0.8602188 0.4537511 1 [7,] 0.32262681 1.0000000 1.0000000 0.4997 0.9796633 0.8602188 0.4537511 1 [8,] 0.36300090 1.0000000 1.0000000 0.4997 0.9890001 0.8602188 0.4537511 1 [9,] 0.45386933 1.0000000 1.0000000 0.4997 0.9976397 0.8602188 0.4997000 1 [10,] 0.49969999 1.0000000 1.0000000 0.4997 0.9990176 0.8602188 0.4997000 1 ABH TSBH_0.05 [1,] NA 0.4537511 [2,] NA 0.4537511 [3,] NA 0.4537511 [4,] NA 0.4537511 [5,] NA 0.4537511 [6,] NA 0.4537511 [7,] NA 0.4537511 [8,] NA 0.4537511 [9,] NA 0.4997000 [10,] NA 0.4997000 $index [1] 9 3 4 7 8 1 6 5 2 10 $h0.ABH [1] NA $h0.TSBH h0.TSBH_0.05 10 Warnmeldung: In min(which(diff(h0.m, na.rm = TRUE) > 0), na.rm = TRUE) : kein nicht-fehlendes Argument für min; gebe Inf zurück
Wednesday, July 17, 2013
Ubuntu - merge several pdf
gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=app.pdf -dBATCH application.pdf 201302cv.pdf
Sunday, July 7, 2013
R - Representation of microarray data
along with chapter 7 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor
- using the data of T. Golub's paper from 1999 on leukemia classification
- the data is contained in a package and can be installed
source("http://bioconductor.org/biocLite.R") biocLite("golubEsets")
- then load the package and the data
require(golubEsets)
data(Golub_Merge)
Golub_Merge
ExpressionSet (storageMode: lockedEnvironment) assayData: 7129 features, 72 samples element names: exprs protocolData: none phenoData sampleNames: 39 40 ... 33 (72 total) varLabels: Samples ALL.AML ... Source (11 total) varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' pubMedIds: 10521349 Annotation: hu6800
Golub_Merge
is of classExpressionSet
- the
ExpressionSet
class is part of theBiobase
package
- this class is designed to combine several different sources of information into one single structure
- is input for many Bioconductor functions
- it consists of:
- expression data from microarray experiments (
assayData
)
- meta data describing samples in experiments (
phenoData
)
- annotations and meta-data about the features on the chip or technology used for the experiment (
featureData,annotation
)
- information related to the protocol used for processing each sample (and usually extracted from manufacturer files,
protocolData
)
- and a exible structure to describe the experiment (
experimentData
)
- expression data from microarray experiments (
- so we can get experiment-level metadata along with the pubmed ID
experimentData(Golub_Merge)
Experiment data Experimenter name: Golub TR et al. Laboratory: Whitehead Contact information: Title: ALL/AML discrimination URL: www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html PMIDs: 10521349 Abstract: A 133 word abstract is available. Use 'abstract' method.
- show the first part of the abstract
substr(abstract(Golub_Merge),1,102)
[1] "Although cancer classification has improved over the past 30 years, there has been no general approach"
- one get get the dimension of the expression data
dim(exprs(Golub_Merge))
- look at the five rows of the first 5 columns
exprs(Golub_Merge)[1:5,1:5]
39 40 42 47 48 AFFX-BioB-5_at -342 -87 22 -243 -130 AFFX-BioB-M_at -200 -248 -153 -218 -177 AFFX-BioB-3_at 41 262 17 -163 -28 AFFX-BioC-5_at 328 295 276 182 266 AFFX-BioC-3_at -224 -226 -211 -289 -170
- retrieve information on experimental phenotypes (again we look only at the first five samples/rows)
pData(Golub_Merge)[1:5,]
Samples ALL.AML BM.PB T.B.cell FAB Date Gender pctBlasts Treatment 39 39 ALL BM B-cell <NA> F NA <NA> 40 40 ALL BM B-cell <NA> 5/16/1980 F NA <NA> 42 42 ALL BM B-cell <NA> <NA> F NA <NA> 47 47 ALL BM B-cell <NA> 9/5/1986 M NA <NA> 48 48 ALL BM B-cell <NA> 2/28/1992 F NA <NA> PS Source 39 0.78 DFCI 40 0.68 DFCI 42 0.42 DFCI 47 0.81 DFCI 48 0.94 DFCI
- how are the assay reporters named?
featureNames(Golub_Merge)[1:5]
[1] "AFFX-BioB-5_at" "AFFX-BioB-M_at" "AFFX-BioB-3_at" "AFFX-BioC-5_at" [5] "AFFX-BioC-3_at"
- how are the samples named?
sampleNames(Golub_Merge)
[1] "39" "40" "42" "47" "48" "49" "41" "43" "44" "45" "46" "70" "71" "72" "68" [16] "69" "67" "55" "56" "59" "52" "53" "51" "50" "54" "57" "58" "60" "61" "65" [31] "66" "63" "64" "62" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" [46] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" [61] "27" "34" "35" "36" "37" "38" "28" "29" "30" "31" "32" "33"
- show the distribution of the primary outcome
table(Golub_Merge$ALL.AML)
ALL AML 47 25
R - annotation of a microarray platform
along with chapter 7 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor
- use the
annotation()
accessor to get the type of microarray used
annotation(Golub_Merge)
- so we use the
hu6800.db
package, which provides detailed information about the hu6800 platform. The package is updated biannually.
- install the package via
biocLite("hu6800.db")
- if you want to see what objects this package supports type
ls("package:hu6800.db")
[1] "hu6800" "hu6800ACCNUM" "hu6800ALIAS2PROBE" [4] "hu6800CHR" "hu6800CHRLENGTHS" "hu6800CHRLOC" [7] "hu6800CHRLOCEND" "hu6800.db" "hu6800_dbconn" [10] "hu6800_dbfile" "hu6800_dbInfo" "hu6800_dbschema" [13] "hu6800ENSEMBL" "hu6800ENSEMBL2PROBE" "hu6800ENTREZID" [16] "hu6800ENZYME" "hu6800ENZYME2PROBE" "hu6800GENENAME" [19] "hu6800GO" "hu6800GO2ALLPROBES" "hu6800GO2PROBE" [22] "hu6800MAP" "hu6800MAPCOUNTS" "hu6800OMIM" [25] "hu6800ORGANISM" "hu6800ORGPKG" "hu6800PATH" [28] "hu6800PATH2PROBE" "hu6800PFAM" "hu6800PMID" [31] "hu6800PMID2PROBE" "hu6800PROSITE" "hu6800REFSEQ" [34] "hu6800SYMBOL" "hu6800UNIGENE" "hu6800UNIPROT"
- to get the mappings between manufacturer identifiers and gene abbreviations you can use
hu6800SYMBOL
andmget()
mget(ls(hu6800SYMBOL)[1:5],hu6800SYMBOL )
$A28102_at [1] "GABRA3" $AB000114_at [1] "OMD" $AB000115_at [1] "IFI44L" $AB000220_at [1] "SEMA3C" $AB000381_s_at [1] "GML"
- you can also use the
DBI
package infrastructure
- list the tables available for NCBI Entrez Gene metadata
- list the tables available for NCBI Entrez Gene metadata
require(DBI) hh <- org.Hs.eg_dbconn() dbListTables(hh)
[1] "accessions" "alias" "chrlengths" [4] "chromosome_locations" "chromosomes" "cytogenetic_locations" [7] "ec" "ensembl" "ensembl2ncbi" [10] "ensembl_prot" "ensembl_trans" "gene_info" [13] "genes" "go" "go_all" [16] "go_bp" "go_bp_all" "go_cc" [19] "go_cc_all" "go_mf" "go_mf_all" [22] "kegg" "map_counts" "map_metadata" [25] "metadata" "ncbi2ensembl" "omim" [28] "pfam" "prosite" "pubmed" [31] "refseq" "sqlite_stat1" "ucsc" [34] "unigene" "uniprot"
- query for genes annotated as sarcoma oncogenes
SQL <- "select * from gene_info gi, genes g where gi._id=g._id and gene_name like '%arcoma%oncogene%'" sops <- dbGetQuery(hh,SQL) head(sops)
_id gene_name symbol 1 314 v-raf murine sarcoma 3611 viral oncogene homolog ARAF 2 559 v-raf murine sarcoma viral oncogene homolog B1 BRAF 3 1156 v-crk sarcoma virus CT10 oncogene homolog (avian) CRK 4 1157 v-crk sarcoma virus CT10 oncogene homolog (avian)-like CRKL 5 1834 feline sarcoma oncogene FES 6 1859 Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog FGR _id gene_id 1 314 369 2 559 673 3 1156 1398 4 1157 1399 5 1834 2242 6 1859 2268
- find the LYN gene
sops[grep("LYN",sops$symbol),]
- so we see it's in the 13th row and the gene id is 4067
sops[grep("LYN",sops$symbol),]
- the we can use
get()
to get the probe set identifier
get("4067",revmap(hu6800ENTREZID))
[1] "M16038_at"
- display the expression data of the gene by disease group
boxplot(exprs(Golub_Merge)["M16038_at",]~Golub_Merge$ALL.AML,col=c("darkgreen","gold"),notch=T)
- and the ggplot2 version
require(ggplot2) gmdata <- data.frame(expr=exprs(Golub_Merge)["M16038_at",],ALL.AML=Golub_Merge$ALL.AML) ggplot(gmdata,aes(x=ALL.AML,y=expr,fill=ALL.AML)) + geom_boxplot(notch=T) + scale_fill_manual(values=c("darkgreen","gold")) + xlab("") ggsave("ggboxpl.png")
Friday, July 5, 2013
Ubuntu - Replace string in all files in one directory
use du
grep -rl 'string' somedir/ | xargs sed -i 's/string/string2/g'
Monday, July 1, 2013
Thursday, June 20, 2013
ls - show only folders
use du
du ~/ --max-depth=1
22576 /home/mandy/.mozilla 4 /home/mandy/Öffentlich 2956 /home/mandy/.config 28 /home/mandy/.icons 16 /home/mandy/.adobe 4714980 /home/mandy/klinik 4 /home/mandy/Vorlagen 4 /home/mandy/fserver 12 /home/mandy/.dbus 4 /home/mandy/Dokumente 19628 /home/mandy/Downloads 120 /home/mandy/.rstudio-desktop 36 /home/mandy/.compiz 5258420 /home/mandy/ggmbh 2940 /home/mandy/.local 44 /home/mandy/.emacs.d 8 /home/mandy/.ssh 8 /home/mandy/.gnome2 44 /home/mandy/.gnupg 4 /home/mandy/Videos 356 /home/mandy/Arbeitsfläche 52 /home/mandy/.gconf 15245532 /home/mandy/fileserver 4 /home/mandy/.hplip 95084 /home/mandy/.cpan 4 /home/mandy/Musik 64 /home/mandy/.macromedia 4 /home/mandy/Ubuntu One 8590040 /home/mandy/servercn 628 /home/mandy/Bilder 4 /home/mandy/.gnome2_private 4 /home/mandy/server170 110976 /home/mandy/R 107880 /home/mandy/.cache 36 /home/mandy/.cpanm 28 /home/mandy/.shutter 92 /home/mandy/.thumbnails 34173652 /home/mandy/
Wednesday, June 19, 2013
R - Grammar of Graphics Figure 2.1
par(mar=c(1,1,1,1)) openplotmat() elpos <- coordinates(3) fromto <- matrix(ncol=2,byrow = T,data=c(1,2,2,3)) nr <- nrow(fromto) arrpos <- matrix(ncol=2,nrow=nr) for(i in 1:nr){ arrpos[i,] <- straightarrow(to=elpos[fromto[i,2],], from=elpos[fromto[i,1],], lwd = 2, arr.pos = 0.68, arr.length = 0.5) } textrect(elpos[1,],0.1,0.09,lab="Source",shadow.col = NULL) textellipse(elpos[2,],0.1,0.09,lab="Make a pie",shadow.col = NULL) textrect(elpos[3,],0.1,0.09,lab="Render",shadow.col = NULL) text(arrpos[1,1]-0.07,arrpos[1,2]-0.08,"Data") text(arrpos[2,1]-0.06,arrpos[2,2]-0.08,"Graphics")
if installing perl modules via CPAN does not work
- check if proxy configured correctly(o conf init /proxy/)
- maybe you should remove files in ~/.cpan/sources/modules
Tuesday, June 18, 2013
R - Grammar of Graphics Figure 1.1 redone with ggplot
An Example (1.4)
- first we have to get the data which are birth and death rates of the year 1990, therefore we use the world bank data (and of course, there is a package
WDI
providing direct access)
- so load the package and download a list of indicators (
WDIcache
)
- the resulting data frame contains indicator and name, and additional information (data description, source)
require(WDI) indicators <- as.data.frame(WDIcache()$series) names(indicators)
[1] "indicator" "name" "description" [4] "sourceDatabase" "sourceOrganization"
- then we have to find our two parameters of interest: the crude birth and the crude death rate:
- we use
grep
on the column name to find them
- we use
grep("crude",indicators$name)
[1] 6553 6554
- so we know that there are two lines containing the string crude in the name variable, so let's show them
indicators[grep("crude",indicators$name),]
indicator name 6553 SP.DYN.CBRT.IN Birth rate, crude (per 1,000 people) 6554 SP.DYN.CDRT.IN Death rate, crude (per 1,000 people) description 6553 Crude birth rate indicates the number of live births occurring during the year, per 1,000 population estimated at midyear. Subtracting the crude death rate from the crude birth rate provides the rate of natural increase, which is equal to the rate of population change in the absence of migration. 6554 Crude death rate indicates the number of deaths occurring during the year, per 1,000 population estimated at midyear. Subtracting the crude death rate from the crude birth rate provides the rate of natural increase, which is equal to the rate of population change in the absence of migration. sourceDatabase 6553 World Development Indicators 6554 World Development Indicators sourceOrganization 6553 (1) United Nations Population Division. World Population Prospects, (2) United Nations Statistical Division. Population and Vital Statistics Reprot (various years), (3) Census reports and other statistical publications from national statistical offices, (4) Eurostat: Demographic Statistics, (5) Secretariat of the Pacific Community: Statistics and Demography Programme, and (6) U.S. Census Bureau: International Database. 6554 (1) United Nations Population Division. World Population Prospects, (2) United Nations Statistical Division. Population and Vital Statistics Reprot (various years), (3) Census reports and other statistical publications from national statistical offices, (4) Eurostat: Demographic Statistics, (5) Secretariat of the Pacific Community: Statistics and Demography Programme, and (6) U.S. Census Bureau: International Database.
- now we now the indicators - and we can download the desired data (and just in case: we download them for the years 1980–2012)
rate.data <- WDI(country="all",indicator=indicators[grep("crude",indicators$name),1],start=1980,end=2012) rate.data.1990 <- rate.data[rate.data$year==1990,] head(rate.data.1990)
iso2c country year SP.DYN.CBRT.IN 11 1A Arab World 1990 34.71824 44 1W World 1990 25.85034 77 4E East Asia & Pacific (developing only) 1990 22.84601 110 7E Europe & Central Asia (developing only) 1990 17.90782 143 8S South Asia 1990 32.91213 176 AD Andorra 1990 NA SP.DYN.CDRT.IN 11 8.262026 44 9.266023 77 6.997506 110 10.065765 143 10.703840 176 NA
- now we have to remove aggregations from these data (such as the European Union, Middle East, etc)
- we use
grepl
on theiso2c
column (which does the same asgrep
but returns logical values (of course you can also usegrep
again))
- the
iso2c
is a two character, standardized country code
- we eliminate all rows with
iso2c
starting with X, equals EU, containing a number, or equals ZQ, ZJ, ZG, or ZF
- we use
rate.data.1990 <- rate.data.1990[!grepl("^X|EU|\\d|Z[QJGF]",rate.data.1990$iso2c,perl=T),] head(rate.data.1990)
iso2c country year SP.DYN.CBRT.IN SP.DYN.CDRT.IN 176 AD Andorra 1990 NA NA 209 AE United Arab Emirates 1990 25.916 2.794 242 AF Afghanistan 1990 52.449 22.062 275 AG Antigua and Barbuda 1990 20.100 6.800 308 AL Albania 1990 24.610 5.909 341 AM Armenia 1990 21.215 7.738
- so that's the data frame
- now the graphics
require(ggplot2) require(gridExtra) require(scales) rate.data.1990$lab <- ifelse(rbinom(size=1,n=nrow(rate.data.1990),prob=0.2)==1,rate.data.1990$country,"") ggplot(rate.data.1990,aes(x=SP.DYN.CBRT.IN,y=SP.DYN.CDRT.IN)) + stat_density2d(aes(colour= ..level..),bins=6,h=c(11,9),geom="density2d") + scale_x_continuous("Birth Rate", limits = c(0,60),breaks=seq(0,60,by=10),expand=c(0,1)) + scale_y_continuous("Death Rate", limits = c(0,30),breaks=seq(0,30,by=10),expand=c(0,1)) + scale_colour_gradientn(colours=c("blue","green","lightgreen","red"),guide="none") + geom_text(aes(label=lab),size=3,position = "jitter") + geom_abline(intersect=0,slope=1) + coord_fixed() + annotate("text",x=20,y=20,angle=45, label="Zero Population Growth",vjust=-0.5,size=4) + theme( panel.background=element_blank(), panel.border=element_rect(colour="black",fill="transparent"), axis.text = element_text(colour="black"), axis.ticks = element_line(colour="black"), axis.line = element_line(colour="black") )
- note:
- the plot in chapter 1.4 of the book uses an epanechnikov kernel, whereas ggplot uses a normal one
- I choose randomly about 20% of the country names as labels (because I was to lazy to pick up exactly those used in the book)
- the plot in chapter 1.4 of the book uses an epanechnikov kernel, whereas ggplot uses a normal one
Sunday, June 16, 2013
R - First steps working with microarray data
Working with Microarray Data
along with chapter 19 Lewis, R for Medicine and Biologyraw CEL files
- the bioconductor packages affy and simpleaffy allow to work with raw microarray data
- also carry out quality control and simple data analysis
#source("http://bioconductor.org/biocLite.R") #biocLite("affy") #biocLite("simpleaffy")
org_babel_R_eoe
- get some raw data
download.file(url="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5090/suppl/GSE5090_RAW.tar",destfile="arraydata.tar")
versuche URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5090/suppl/GSE5090_RAW.tar' ftp data connection made, file length 63569920 bytes URL geöffnet ================================================= downloaded 60.6 Mb
- read in data
setwd("~/affyfiles") data.raw <- read.affy("covdesc") setwd("~/")
- the function read.affy() gives back a object of class AffyBatch (which extends the eSet class - Biobase package)
- the cdfName slot: Affymetrix array plotform used
- nrow slot: number of rows in the Affymetrix chip
data.raw@cdfName
[1] "HG-U133A"
data.raw@nrow
Rows 712
data.raw@ncol
Cols 712
- so the HG-U133A array has 712 rows and 712 columns -> 506944 spots per chip
- obtain probe-level data from each individual spot using the exprs() function
- e.g. intesity data for the last four spots on the first array of the data.raw object
- e.g. intesity data for the last four spots on the first array of the data.raw object
intensity(data.raw)[506940:506944,1]
506940 506941 506942 506943 506944 109.0 10380.5 175.3 10044.3 158.0
Normalizing/Quality Control
- QC methods:
- density plots of PM intensities
- RNA degradation plots
- MvA plots
- density plots of PM intensities
- MvA plots
- assess how similar probe-level data is between samples in the same subgroup, detection of any bias in intensities between samples by assessment of the curve produced
- M represents log fold change between two samples (y-axis)
- A represents the mean absolute log expression between samples (x-axis)
setwd("~/affyfiles") require(simpleaffy) require(affy) data.raw <- read.affy("covdesc") setwd("../")
MAplot(data.raw[c(1,4)],pairs=T)
- normalize data and plot again
data.norm <- normalize(data.raw) MAplot(data.norm[c(1,4)],pairs=T)
- assess how similar probe-level data is between samples in the same subgroup, detection of any bias in intensities between samples by assessment of the curve produced
From probe-level data to expression data
- create expression levels from the raw data
- create a PMA detection call and the associated p-values from the first array
# data.mas5 <- call.exprs(data.raw,"mas5") # sample1 <- detection.p.val(data.raw[,1],calls=T) cbind(sample1$call[1:10],sample1$pval[1:10])
Warnmeldung: In data.raw[c(1, 4)] : The use of abatch[i,] and abatch[i] is deprecated. Please use abatch[,i] instead. X11cairo 2 [,1] [,2] [1,] "P" "0.00261701434183198" [2,] "P" "0.00486279317241156" [3,] "P" "0.00564280932389143" [4,] "P" "0.00418036711173471" [5,] "A" "0.378184514727312" [6,] "P" "0.00306673915877408" [7,] "P" "0.00564280932389143" [8,] "A" "0.0894050784997029" [9,] "P" "0.00116478895452843" [10,] "A" "0.0813367142140589"
qc()
generates commonly used quality control metrics incl:
- average background
- scale factor
- number of genes called present
- 3' to 5' end ratios of beta-actin and GADPH genes
- average background
data.qc <- qc(data.raw,data.mas5)
plot(data.qc)
create a list of differentially expressed genes
data.mas5
is an instance of theAffyBatch
class, so we could applyAffyBatch
functions
exprs()
returns expression-level data different probes
- e.g. expression levels computed by the Mas 5.0 algorithm for the first 10 probes on the array for four control samples
- e.g. expression levels computed by the Mas 5.0 algorithm for the first 10 probes on the array for four control samples
exprs(data.mas5)[1:10,1:4]
GSM114834.CEL GSM114840.CEL GSM114841.CEL GSM114842.CEL 1007_s_at 7.623914 7.767622 7.799031 7.991894 1053_at 4.748147 5.127184 5.149059 4.975323 117_at 5.552385 6.005922 5.376999 5.947452 121_at 7.789410 7.900269 7.845280 8.188128 1255_g_at 1.254626 3.157348 3.842650 3.800311 1294_at 6.594926 6.445978 7.237356 6.657093 1316_at 5.118859 5.160522 4.894248 4.714638 1320_at 5.110600 5.188031 5.133438 5.185311 1405_i_at 6.766398 6.799344 5.578731 5.805039 1431_at 2.900998 3.487406 4.493393 3.641552
- the function
parwise.comparison()
generates fold changes, t-tests, and means for pairs of experimental groups
pairwise.comparison()
takes as input anexprSet
object, value is an instance of thePairComp
class
pairwise.filter()
takes as input anPairComp
object, filter regarding to:
- minimum expression level
- minimum number of arrays in which a gene is called present across all groups or by group
- minimum fold change
- maximum t-test p-val for a gene
- minimum expression level
- example: filter results of pairwise comparison using:
- genes must be called present in at least three samples in one of the groups
- t-test p-val must be less then 0.01
- fold change between means of the two groups must be at least 1.5
- genes must be called present in at least three samples in one of the groups
pair <- pairwise.comparison(data.mas5,group="disease",members=c("control", "polycystic_ovary_syndrome"),spots=data.raw) pair.filt <- pairwise.filter(pair,min.present.no=3,present.by.group=T,fc=log2(1.5),tt=0.05)
[1] "Checking member control in group: ' disease '" [1] "Checking member polycystic_ovary_syndrome in group: ' disease '"
- the
pair.filt
object is also an instance of thePairComp
class
- how many genes have been returned?:
nrow(pair.filt@means)
[1] 54
pair.filt@means[1:10,1:2]
control polycystic_ovary_syndrome 200879_s_at 7.198598 6.585865 200951_s_at 4.417044 5.126597 200974_at 10.134294 9.549187 201242_s_at 7.949515 8.556699 201468_s_at 8.909398 9.676864 201496_x_at 6.486854 5.289261 201497_x_at 9.179405 8.257583 202040_s_at 6.770394 7.379680 202104_s_at 6.888542 6.274534 202274_at 6.554181 5.746637
- view the PMA detection calls using the
calls
slot
pair.filt@calls[1:10,]
GSM114834.CEL.present GSM114840.CEL.present GSM114841.CEL.present 200879_s_at "P" "P" "P" 200951_s_at "P" "P" "A" 200974_at "P" "P" "P" 201242_s_at "P" "P" "P" 201468_s_at "P" "P" "P" 201496_x_at "P" "P" "P" 201497_x_at "P" "P" "P" 202040_s_at "P" "P" "P" 202104_s_at "P" "P" "P" 202274_at "P" "A" "P" GSM114842.CEL.present GSM114843.CEL.present GSM114844.CEL.present 200879_s_at "P" "P" "P" 200951_s_at "P" "A" "A" 200974_at "P" "P" "P" 201242_s_at "P" "P" "P" 201468_s_at "P" "P" "P" 201496_x_at "P" "P" "P" 201497_x_at "P" "P" "P" 202040_s_at "P" "P" "P" 202104_s_at "P" "A" "P" 202274_at "A" "P" "P" GSM114845.CEL.present GSM114846.CEL.present GSM114847.CEL.present 200879_s_at "P" "P" "P" 200951_s_at "A" "P" "A" 200974_at "P" "P" "P" 201242_s_at "P" "P" "P" 201468_s_at "P" "P" "P" 201496_x_at "P" "P" "A" 201497_x_at "P" "P" "P" 202040_s_at "P" "P" "P" 202104_s_at "M" "P" "P" 202274_at "P" "P" "A" GSM114848.CEL.present GSM114849.CEL.present GSM114850.CEL.present 200879_s_at "P" "P" "P" 200951_s_at "A" "A" "A" 200974_at "P" "P" "P" 201242_s_at "P" "P" "P" 201468_s_at "P" "P" "P" 201496_x_at "P" "P" "P" 201497_x_at "P" "P" "P" 202040_s_at "P" "P" "P" 202104_s_at "M" "P" "P" 202274_at "A" "P" "A" GSM114851.CEL.present GSM114852.CEL.present GSM114853.CEL.present 200879_s_at "P" "P" "P" 200951_s_at "A" "A" "P" 200974_at "P" "P" "P" 201242_s_at "P" "P" "P" 201468_s_at "P" "P" "P" 201496_x_at "P" "P" "A" 201497_x_at "P" "P" "A" 202040_s_at "P" "P" "P" 202104_s_at "P" "P" "P" 202274_at "P" "P" "A" GSM114854.CEL.present GSM114855.CEL.present 200879_s_at "P" "P" 200951_s_at "A" "A" 200974_at "P" "P" 201242_s_at "P" "P" 201468_s_at "P" "P" 201496_x_at "M" "P" 201497_x_at "M" "P" 202040_s_at "P" "P" 202104_s_at "P" "M" 202274_at "A" "P"
- fold change between the two groups for each gene
fc
slot
pair.filt@fc[1:10]
200879_s_at 200951_s_at 200974_at 201242_s_at 201468_s_at 201496_x_at 0.6127330 -0.7095537 0.5851071 -0.6071841 -0.7674657 1.1975929 201497_x_at 202040_s_at 202104_s_at 202274_at 0.9218226 -0.6092852 0.6140086 0.8075434
- t-test p-vals
pair.filt@tt[1:10]
200879_s_at 200951_s_at 200974_at 201242_s_at 201468_s_at 201496_x_at 0.045650045 0.023617284 0.044967617 0.046132070 0.037101434 0.024593573 201497_x_at 202040_s_at 202104_s_at 202274_at 0.047164613 0.001084647 0.017975104 0.033348173
Wednesday, June 12, 2013
R - change wilcard pattern into regular expression
- the command glob2rx takes a string containing wildcards (such as * or ?) into an equivalent regular expression
glob2rx("pa??ern")
^pa..ern$
- the arguments trim.head and trim.tail could be set to determine whether or not the leading "^" or the trailing "$" should be trimmed from the result
glob2rx("*pa??ern",trim.head=TRUE)
pa..ern$
glob2rx("pa??e.n*",trim.tail=TRUE)
Monday, June 10, 2013
R - first steps with GEOquery
Along with chapter 18 of Lewis, R for Medicine and Biology
- Installing the packages
source("http://bioconductor.org/biocLite.R") biocLite("simpleaffy") biocLite("GEOquery")
GEOquery
- Gene Expression Omnibus Repository (GEO)
- public repository
- allows submitting of high-throughput experimental data for free access by others
- includes single- and dual-channel microarrays, measuring mRNA, miRNA, genomic DNA (incl. arrayCGH and SNP) and protein abundance
- also contains a collection of web-based tools for querying and downloading of datasets
- for R there exists the GEOquery package
- public repository
- sample dataset: Polycystic ovary syndrome: adipose tissue
- get data
library(GEOquery) gds <- getGEO("GDS2084") Meta(gds)
Using locally cached version of GDS2084 found here: /tmp/Rtmpa2mlCM/GDS2084.soft.gz $channel_count [1] "1" $dataset_id [1] "GDS2084" "GDS2084" $description [1] "Analysis of omental adipose tissues of morbidly obese patients with polycystic ovary syndrome (PCOS). PCOS is a common hormonal disorder among women of reproductive age, and is characterized by hyperandrogenism and chronic anovulation. PCOS is associated with obesity." [2] "control" [3] "polycystic ovary syndrome" $email [1] "geo@ncbi.nlm.nih.gov" $feature_count [1] "22283" $institute [1] "NCBI NLM NIH" $name [1] "Gene Expression Omnibus (GEO)" $order [1] "none" $platform [1] "GPL96" $platform_organism [1] "Homo sapiens" $platform_technology_type [1] "in situ oligonucleotide" $pubmed_id [1] "17062763" $ref [1] "Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6" $reference_series [1] "GSE5090" $sample_count [1] "15" $sample_id [1] "GSM114841,GSM114844,GSM114845,GSM114849,GSM114851,GSM114854,GSM114855" [2] "GSM114834,GSM114842,GSM114843,GSM114847,GSM114848,GSM114850,GSM114852,GSM114853" $sample_organism [1] "Homo sapiens" $sample_type [1] "RNA" $title [1] "Polycystic ovary syndrome: adipose tissue" $type [1] "Expression profiling by array" "disease state" [3] "disease state" $update_date [1] "Mar 21 2007" $value_type [1] "count" $web_link [1] "http://www.ncbi.nlm.nih.gov/geo"
- display individual expression values
- first column (ID_REF): probe ID assigned by Affymetrix for each probe
- second column (IDENTIFIER): ID for the corresponding transcript
- remaining columns: expression values returned for each array
- first column (ID_REF): probe ID assigned by Affymetrix for each probe
Table(gds)[1:10,1:5]
ID_REF IDENTIFIER GSM114841 GSM114844 GSM114845 1 1007_s_at DDR1 222.6 252.7 219.3 2 1053_at RFC2 35.5 24.5 23.4 3 117_at HSPA6 41.5 53.3 31.3 4 121_at PAX8 229.8 419.6 274.5 5 1255_g_at GUCA1A 14.3 13 29.6 6 1294_at UBA7 150.8 116 89.9 7 1316_at THRA 29.7 35.4 53 8 1320_at PTPN21 35.1 44.8 28.8 9 1405_i_at CCL5 47.8 53.2 41.7 10 1431_at CYP2E1 22.5 24.9 38.7
- get disease status associated with each sample
Columns(gds)[,1:2]
sample disease.state 1 GSM114841 control 2 GSM114844 control 3 GSM114845 control 4 GSM114849 control 5 GSM114851 control 6 GSM114854 control 7 GSM114855 control 8 GSM114834 polycystic ovary syndrome 9 GSM114842 polycystic ovary syndrome 10 GSM114843 polycystic ovary syndrome 11 GSM114847 polycystic ovary syndrome 12 GSM114848 polycystic ovary syndrome 13 GSM114850 polycystic ovary syndrome 14 GSM114852 polycystic ovary syndrome 15 GSM114853 polycystic ovary syndrome
- information about source of samples
Columns(gds)[,3]
[1] "Value for GSM114841: EP3_adipose_control; src: Omental adipose tissue" [2] "Value for GSM114844: EP23_adipose_control; src: Omental adipose tissue" [3] "Value for GSM114845: EP31_adipose_control_rep1; src: Omental adipose tissue" [4] "Value for GSM114849: EP37_adipose_control; src: Omental adipose tissue" [5] "Value for GSM114851: EP49_adipose_control; src: Omental adipose tissue" [6] "Value for GSM114854: EP69_adipose_control; src: Omental adipose tissue" [7] "Value for GSM114855: EP71_adipose_control; src: Omental adipose tissue" [8] "Value for GSM114834: EP1_adipose_pcos_rep1; src: Omental adipose tissue" [9] "Value for GSM114842: EP10_adipose_pcos; src: Omental adipose tissue" [10] "Value for GSM114843: EP18_adipose_pcos; src: Omental adipose tissue" [11] "Value for GSM114847: EP32_adipose_pcos; src: Omental adipose tissue" [12] "Value for GSM114848: EP34_adipose_pcos; src: Omental adipose tissue" [13] "Value for GSM114850: EP47_adipose_pcos; src: Omental adipose tissue" [14] "Value for GSM114852: EP55_adipose_pcos; src: Omental adipose tissue" [15] "Value for GSM114853: EP66_adipose_pcos; src: Omental adipose tissue"
- get data
- get information in more detail for the first sample (GSM114841)
gsm <- getGEO("GSM114841") Meta(gsm)
Using locally cached version of GSM114841 found here: /tmp/Rtmpa2mlCM/GSM114841.soft $biomaterial_provider_ch1 [1] "Ramón y Cajal Hospital, Madrid, Spain" $channel_count [1] "1" $characteristics_ch1 [1] "Morbidly obese control subject" $contact_address [1] "ARTURO DUPERIER" $contact_city [1] "MADRID" $contact_country [1] "Spain" $contact_email [1] "bperal@iib.uam.es" $contact_fax [1] "34 91 5854401" $contact_institute [1] "INSTITUTO DE INVESTIGACIONES BIOMEDICAS, CSIC-UAM" $contact_name [1] "BELEN,,PERAL" $contact_phone [1] "34 91 5854478" $contact_state [1] "MADRID" $`contact_zip/postal_code` [1] "28029" $data_processing [1] "MAS 5.0, scaled to 100 and RMA" $data_row_count [1] "22283" $description [1] "Total RNA was extracted from omental adipose tissue from a control subject" $geo_accession [1] "GSM114841" $label_ch1 [1] "Biotin" $last_update_date [1] "Jun 16 2006" $molecule_ch1 [1] "total RNA" $organism_ch1 [1] "Homo sapiens" $platform_id [1] "GPL96" $series_id [1] "GSE5090" $source_name_ch1 [1] "Omental adipose tissue" $status [1] "Public on Jun 17 2006" $submission_date [1] "Jun 16 2006" $supplementary_file [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM114nnn/GSM114841/suppl/GSM114841.CEL.gz" [2] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM114nnn/GSM114841/suppl/GSM114841.EXP.gz" $taxid_ch1 [1] "9606" $title [1] "EP3_adipose_control" $type [1] "RNA"
- look at the additional data
Columns(gsm)
ID_REF VALUE Signal intensity - MAS 5.0, scaled to 100 and RMA ABS_CALL Presence/absence of gene transcript in sample; the call in an absolute analysis that indicates if the transcript was present (P), absent (A), marginal (M), or no call (NC) Detection p-value p-value that indicates the significance level of the detection call - so there are four columns
- probe ID
- expression value (output from a Mas 5.0 analysis software)
- Detection Call
- p-val (Detection p-val)
- probe ID
Table(gsm)[500:510,]
Column 1 ID_REF 2 VALUE 3 ABS_CALL 4 Detection p-value Description 1 2 Signal intensity - MAS 5.0, scaled to 100 and RMA 3 Presence/absence of gene transcript in sample; the call in an absolute analysis that indicates if the transcript was present (P), absent (A), marginal (M), or no call (NC) 4 p-value that indicates the significance level of the detection call ID_REF VALUE ABS_CALL Detection p-value 500 33307_at 47.8 P 0.039365 501 33304_at 35.3 A 0.339558 502 33197_at 58.4 P 0.017001 503 33148_at 21.4 P 0.019304 504 33132_at 15.8 A 0.189687 505 32837_at 274 P 0.000959 506 32836_at 319.8 P 0.006532 507 32811_at 213.3 P 0.024711 508 32723_at 35 P 0.004863 509 32699_s_at 43.9 A 0.162935 510 32625_at 67.1 A 0.11716
- look at the additional data
- get a series record
- contains lists for GPL platform object and the GSM sample record object
- retrieve GSE data structure for our example in one go
gse <- getGEO("GSE5090") gse
Found 1 file(s) GSE5090_series_matrix.txt.gz Using locally cached version: /tmp/Rtmpa2mlCM/GSE5090_series_matrix.txt.gz Using locally cached version of GPL96 found here: /tmp/Rtmpa2mlCM/GPL96.soft $GSE5090_series_matrix.txt.gz ExpressionSet (storageMode: lockedEnvironment) assayData: 22283 features, 17 samples element names: exprs protocolData: none phenoData sampleNames: GSM114834 GSM114840 ... GSM114855 (17 total) varLabels: title geo_accession ... data_row_count (30 total) varMetadata: labelDescription featureData featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283 total) fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total) fvarMetadata: Column Description labelDescription experimentData: use 'experimentData(object)' Annotation: GPL96
- contains lists for GPL platform object and the GSM sample record object
Ubuntu - Samba file sharing
- install samba
- config file in: etc/samba (examples)
- create the corresponding smb user: sudo smbpasswd -a user
Simple Example (German) here
Saturday, April 13, 2013
R - Hardin/Hilbe Generalized Linear Models and Extensions Chapter2 - R code
various choices of variance functions
- function in 2.2 to illustrate the the effect of the assumed variance function (using the identity link for all functions)
x <- 1:3 y <- c(1,2,9) glm(y~x,family=gaussian(link=identity))
Call: glm(formula = y ~ x, family = gaussian(link = identity)) Coefficients: (Intercept) x -4 4 Degrees of Freedom: 2 Total (i.e. Null); 1 Residual Null Deviance: 38 Residual Deviance: 6 AIC: 16.59
glm(y~x,family=poisson(link=identity))
Call: glm(formula = y ~ x, family = poisson(link = identity)) Coefficients: (Intercept) x -2.4 3.2 Degrees of Freedom: 2 Total (i.e. Null); 1 Residual Null Deviance: 9.052 Residual Deviance: 1.69 AIC: 14.36
glm(y~x,family=Gamma(link=identity))
Call: glm(formula = y ~ x, family = Gamma(link = identity)) Coefficients: (Intercept) x -1.799 2.744 Degrees of Freedom: 2 Total (i.e. Null); 1 Residual Null Deviance: 2.537 Residual Deviance: 0.4385 AIC: 14.6
glm(y~x,family=inverse.gaussian(link=identity))
Call: glm(formula = y ~ x, family = inverse.gaussian(link = identity)) Coefficients: (Intercept) x -1.370 2.353 Degrees of Freedom: 2 Total (i.e. Null); 1 Residual Null Deviance: 0.8611 Residual Deviance: 0.1181 AIC: 13.48
model on page 16 to illustrate the use of an offset
- data can be downloaded here as zip http://stata-press.com/data/hh3/glmext3.zip or here as tar.Z http://stata-press.com/data/hh3/glmext3.tar.Z
- first build the model
require(foreign) medpar <- read.dta("glmext3/medpar.dta") m <- glm(los ~ hmo + white + type2 + type3, data=medpar, family=poisson(link="log")) summary(m)
Warnmeldung: In read.dta("glmext3/medpar.dta") : Faktorbeschriftungen von Stata-5-Dateien können nicht gelesen werden Call: glm(formula = los ~ hmo + white + type2 + type3, family = poisson(link = "log"), data = medpar) Deviance Residuals: Min 1Q Median 3Q Max -5.3063 -1.8259 -0.6319 1.0081 15.3827 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.33293 0.02721 85.744 < 2e-16 *** hmo -0.07155 0.02394 -2.988 0.00281 ** white -0.15387 0.02741 -5.613 1.99e-08 *** type2 0.22165 0.02105 10.529 < 2e-16 *** type3 0.70948 0.02614 27.146 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 8901.1 on 1494 degrees of freedom Residual deviance: 8142.7 on 1490 degrees of freedom AIC: 13868 Number of Fisher Scoring iterations: 5
- Wald test whether the coefficient on white is equal to -0.2
require(survey) regTermTest(m,test.terms = "white",null=-0.2,method="Wald")
Wald test for white in glm(formula = los ~ hmo + white + type2 + type3, family = poisson(link = "log"), data = medpar) F = 2.831664 on 1 and 1490 df: p= 0.092632
- define model with offset variable white=-0.2
- do the likelihood-ratio test
m1 <- glm(los ~ hmo + type2 + type3 + offset(-0.2*white), data=medpar, family=poisson(link="log")) anova(m1,m,test="Chisq")
Analysis of Deviance Table Model 1: los ~ hmo + type2 + type3 + offset(-0.2 * white) Model 2: los ~ hmo + white + type2 + type3 Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 1491 8145.5 2 1490 8142.7 1 2.8657 0.09049 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Sunday, April 7, 2013
R - drop unused levels from factors
- drops unused levels from factors
- can be used on a factor or a data frame (which makes just sense when the data frame contains any factor)
- syntax: droplevels(df)
- create an example data frame
exdf <- data.frame(x=factor(1:10,labels=letters[1:10]),y=rbinom(10,1,0.5))
exdf
x y 1 a 0 2 b 1 3 c 1 4 d 0 5 e 1 6 f 1 7 g 0 8 h 0 9 i 1 10 j 1
- subset exdf by y==1
exdf <- exdf[exdf$y==1,]
summary(exdf$x)
a b c d e f g h i j 0 1 1 0 1 1 0 0 1 1
- we see that all levels of y are present in the summary although several are not used anymore
exdf <- droplevels(exdf)
summary(exdf$x)
b c e f i j 1 1 1 1 1 1
R - abbreviate strings automatically
- abbreviate strings automatically
- option minlength sets the number of characters
- abbreviate in a way that they remain unique
data(armd.wide,package="nlmeU")
names(armd.wide)
[1] "subject" "lesion" "line0" "visual0" "visual4" "visual12" [7] "visual24" "visual52" "treat.f" "miss.pat"
abbreviate(names(armd.wide))
subject lesion line0 visual0 visual4 visual12 visual24 visual52 "sbjc" "lesn" "lin0" "vsl0" "vsl4" "vs12" "vs24" "vs52" treat.f miss.pat "trt." "mss."
Sunday, March 17, 2013
R Knoblauch/Maloney - MPDiR - Figure 2.1
- rebuild some graphics from the book Knoblauch/Maloney: Modeling Psychophysical Data
- from chapter 2 (Modeling), plotting age dependence of threshold for each level combination of Mtype and Sex using ggplot2 (page 27)
- instead of the two panels we use two smooth() layers, loess is the default method for group sizes < 1000, so we only have to tell ggplot to use "lm" as method for one layer
- we set se to F (se=F) in both of them which prevents the convidence intervals from being plotted
- size, linetype, and colour are the corresponding parameters to lwd, lty, and col respectively
- a nice list of line types and shapes of points
- to plot a separate figure for every combination we use facet_wrap()
library(MPDiR) data(Motion) library(ggplot2) ggplot(Motion, aes(x=LnAge,y=LnThresh)) + geom_point() + geom_smooth(method="lm",se=F) + geom_smooth(se=F, size=2, linetype=2, colour="black") + facet_grid(Sex ~ Mtype,as.table=T)
Subscribe to:
Posts
(
Atom
)