Sunday, July 7, 2013

R - Representation of microarray data

Notes to accompany chapter 7 of Draghici, Statistics and Data Analysis for Microarrays Using R and Bioconductor
  • using the data of T. Golub's paper from 1999 on leukemia classification
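  • the data ship in the golubEsets Bioconductor package, which can be installed as follows (biocLite() has since been retired in favor of BiocManager)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("golubEsets")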
  • then load the package and the data
require(golubEsets)
data(Golub_Merge)
Golub_Merge  
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7129 features, 72 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: 39 40 ... 33 (72 total)
  varLabels: Samples ALL.AML ... Source (11 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 10521349 
Annotation: hu6800

  • Golub_Merge is of class ExpressionSet
  • the ExpressionSet class is part of the Biobase package
  • this class is designed to combine several different sources of information into one single structure
  • it is the input for many Bioconductor functions
  • it consists of:
    • expression data from microarray experiments (assayData)
    • metadata describing the samples in the experiment (phenoData)
    • annotations and metadata about the features on the chip or technology used for the experiment (featureData, annotation)
    • information related to the protocol used for processing each sample, usually extracted from manufacturer files (protocolData)
    • and a flexible structure to describe the experiment (experimentData)
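
To make the slots listed above concrete, here is a minimal sketch that builds a tiny ExpressionSet by hand. The matrix values, the probe and sample names, and the object name toy_eset are made up for illustration; only the Biobase constructors ExpressionSet() and AnnotatedDataFrame() are real.

```r
library(Biobase)

# Toy expression matrix: 3 features (rows) x 4 samples (columns).
# The numbers and names are arbitrary and only serve as an illustration.
expr_mat <- matrix(
  c(-342,  -87,   22, -243,
    -200, -248, -153, -218,
      41,  262,   17, -163),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("probe_1", "probe_2", "probe_3"),
                  c("s1", "s2", "s3", "s4"))
)

# Sample-level metadata; its row names must match the column names of expr_mat.
pheno <- data.frame(
  ALL.AML = factor(c("ALL", "ALL", "AML", "AML")),
  Gender  = factor(c("F", "M", "F", "M")),
  row.names = colnames(expr_mat)
)

# Combine assay data, sample metadata and the chip annotation in one object.
toy_eset <- ExpressionSet(
  assayData  = expr_mat,
  phenoData  = AnnotatedDataFrame(pheno),
  annotation = "hu6800"
)

toy_eset
```

Slots that are not supplied (featureData, protocolData, experimentData) are simply left empty, which is why the printout of such an object reports them as "none".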

  • so we can retrieve experiment-level metadata, including the PubMed ID
experimentData(Golub_Merge)
Experiment data
  Experimenter name: Golub TR et al. 
  Laboratory: Whitehead 
  Contact information: 
 
  Title: ALL/AML discrimination 
  URL: www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html 
  PMIDs: 10521349 

  Abstract: A 133 word abstract is available. Use 'abstract' method.

  • show the first part of the abstract
substr(abstract(Golub_Merge),1,102)
[1] "Although cancer classification has improved over the past 30 years, there has been no general approach"
  • one can get the dimensions of the expression data
dim(exprs(Golub_Merge))
  • look at the first five rows and the first five columns
exprs(Golub_Merge)[1:5,1:5]
                 39   40   42   47   48
AFFX-BioB-5_at -342  -87   22 -243 -130
AFFX-BioB-M_at -200 -248 -153 -218 -177
AFFX-BioB-3_at   41  262   17 -163  -28
AFFX-BioC-5_at  328  295  276  182  266
AFFX-BioC-3_at -224 -226 -211 -289 -170
  • retrieve information on experimental phenotypes (again we look only at the first five samples/rows)
pData(Golub_Merge)[1:5,]
   Samples ALL.AML BM.PB T.B.cell  FAB      Date Gender pctBlasts Treatment
39      39     ALL    BM   B-cell <NA>                F        NA      <NA>
40      40     ALL    BM   B-cell <NA> 5/16/1980      F        NA      <NA>
42      42     ALL    BM   B-cell <NA>      <NA>      F        NA      <NA>
47      47     ALL    BM   B-cell <NA>  9/5/1986      M        NA      <NA>
48      48     ALL    BM   B-cell <NA> 2/28/1992      F        NA      <NA>
     PS Source
39 0.78   DFCI
40 0.68   DFCI
42 0.42   DFCI
47 0.81   DFCI
48 0.94   DFCI
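
The phenotype table can also be explored by column. The following is a small sketch, assuming the golubEsets package is installed; it lists the phenotype variables, shows their descriptions, and cross-tabulates two of the columns that appear in the output above.

```r
library(Biobase)
library(golubEsets)
data(Golub_Merge)

# Names of the phenotype variables (the columns of pData).
varLabels(Golub_Merge)

# Description attached to each phenotype variable.
varMetadata(Golub_Merge)

# Cross-tabulate leukemia type against sample source (bone marrow / peripheral blood).
table(Golub_Merge$ALL.AML, Golub_Merge$BM.PB)
```

The `$` operator on an ExpressionSet is a shortcut for the corresponding pData column, so Golub_Merge$ALL.AML and pData(Golub_Merge)$ALL.AML return the same vector.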

  • how are the assay reporters named?
featureNames(Golub_Merge)[1:5]
[1] "AFFX-BioB-5_at" "AFFX-BioB-M_at" "AFFX-BioB-3_at" "AFFX-BioC-5_at"
[5] "AFFX-BioC-3_at"
  • how are the samples named?
sampleNames(Golub_Merge)
 [1] "39" "40" "42" "47" "48" "49" "41" "43" "44" "45" "46" "70" "71" "72" "68"
[16] "69" "67" "55" "56" "59" "52" "53" "51" "50" "54" "57" "58" "60" "61" "65"
[31] "66" "63" "64" "62" "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"
[46] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
[61] "27" "34" "35" "36" "37" "38" "28" "29" "30" "31" "32" "33"
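
Because the ExpressionSet keeps its parts in sync, the feature and sample names should agree across the expression matrix and the phenotype table. A quick sketch of that check, assuming golubEsets is installed:

```r
library(Biobase)
library(golubEsets)
data(Golub_Merge)

# Sample names are the column names of the expression matrix ...
identical(sampleNames(Golub_Merge), colnames(exprs(Golub_Merge)))

# ... and the row names of the phenotype table.
identical(sampleNames(Golub_Merge), rownames(pData(Golub_Merge)))

# Feature names are the row names of the expression matrix.
identical(featureNames(Golub_Merge), rownames(exprs(Golub_Merge)))
```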
  • show the distribution of the primary outcome
table(Golub_Merge$ALL.AML)
ALL AML 
 47  25
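
A phenotype column such as the outcome can be used to subset the whole object at once. The sketch below (the variable name aml is just an illustration) keeps only the AML samples; rows index features and columns index samples, as in a matrix.

```r
library(Biobase)
library(golubEsets)
data(Golub_Merge)

# Keep all features, but only the samples labelled AML.
aml <- Golub_Merge[, Golub_Merge$ALL.AML == "AML"]

# The expression matrix and the phenotype table are subset together.
dim(exprs(aml))
nrow(pData(aml))
```

Subsetting the object rather than the raw matrix keeps the expression values and the sample metadata aligned, which is the main reason to carry the data around as an ExpressionSet.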
