Sunday, June 16, 2013

R - First steps working with microarray data

Working with Microarray Data

along with chapter 19 Lewis, R for Medicine and Biology

raw CEL files

  • the bioconductor packages affy and simpleaffy allow to work with raw microarray data
  • also carry out quality control and simple data analysis
#source("http://bioconductor.org/biocLite.R")
#biocLite("affy")
#biocLite("simpleaffy")
org_babel_R_eoe
  • get some raw data

download.file(url="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5090/suppl/GSE5090_RAW.tar",destfile="arraydata.tar")
versuche URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5090/suppl/GSE5090_RAW.tar'
ftp data connection made, file length 63569920 bytes
URL geƶffnet
=================================================
downloaded 60.6 Mb
  • read in data
setwd("~/affyfiles")
 data.raw <- read.affy("covdesc")
setwd("~/")
  • the function read.affy() gives back a object of class AffyBatch (which extends the eSet class - Biobase package)
  • the cdfName slot: Affymetrix array plotform used
  • nrow slot: number of rows in the Affymetrix chip
data.raw@cdfName
[1] "HG-U133A"
data.raw@nrow
Rows 
 712
data.raw@ncol

Cols 
 712
  • so the HG-U133A array has 712 rows and 712 columns -> 506944 spots per chip
  • obtain probe-level data from each individual spot using the exprs() function
    • e.g. intesity data for the last four spots on the first array of the data.raw object

intensity(data.raw)[506940:506944,1]
506940  506941  506942  506943  506944 
  109.0 10380.5   175.3 10044.3   158.0

Normalizing/Quality Control

  • QC methods:
    • density plots of PM intensities
    • RNA degradation plots
    • MvA plots
  • MvA plots

    • assess how similar probe-level data is between samples in the same subgroup, detection of any bias in intensities between samples by assessment of the curve produced
    • M represents log fold change between two samples (y-axis)
    • A represents the mean absolute log expression between samples (x-axis)
    setwd("~/affyfiles")
    require(simpleaffy)
    require(affy)
    data.raw <- read.affy("covdesc")
    setwd("../")
    
    MAplot(data.raw[c(1,4)],pairs=T)
    

    • normalize data and plot again
    data.norm <- normalize(data.raw)
    MAplot(data.norm[c(1,4)],pairs=T)
    

From probe-level data to expression data

  • create expression levels from the raw data
  • create a PMA detection call and the associated p-values from the first array
#  data.mas5 <- call.exprs(data.raw,"mas5")
#  sample1 <- detection.p.val(data.raw[,1],calls=T)
  cbind(sample1$call[1:10],sample1$pval[1:10])
Warnmeldung:
In data.raw[c(1, 4)] :
  The use of abatch[i,] and abatch[i] is deprecated. Please use abatch[,i] instead.
X11cairo 
       2
      [,1] [,2]                 
 [1,] "P"  "0.00261701434183198"
 [2,] "P"  "0.00486279317241156"
 [3,] "P"  "0.00564280932389143"
 [4,] "P"  "0.00418036711173471"
 [5,] "A"  "0.378184514727312"  
 [6,] "P"  "0.00306673915877408"
 [7,] "P"  "0.00564280932389143"
 [8,] "A"  "0.0894050784997029" 
 [9,] "P"  "0.00116478895452843"
[10,] "A"  "0.0813367142140589"
  • qc() generates commonly used quality control metrics incl:
    • average background
    • scale factor
    • number of genes called present
    • 3' to 5' end ratios of beta-actin and GADPH genes

data.qc <- qc(data.raw,data.mas5)
plot(data.qc)



create a list of differentially expressed genes


  • data.mas5 is an instance of the AffyBatch class, so we could apply AffyBatch functions
  • exprs() returns expression-level data different probes
    • e.g. expression levels computed by the Mas 5.0 algorithm for the first 10 probes on the array for four control samples
exprs(data.mas5)[1:10,1:4]
GSM114834.CEL GSM114840.CEL GSM114841.CEL GSM114842.CEL
1007_s_at      7.623914      7.767622      7.799031      7.991894
1053_at        4.748147      5.127184      5.149059      4.975323
117_at         5.552385      6.005922      5.376999      5.947452
121_at         7.789410      7.900269      7.845280      8.188128
1255_g_at      1.254626      3.157348      3.842650      3.800311
1294_at        6.594926      6.445978      7.237356      6.657093
1316_at        5.118859      5.160522      4.894248      4.714638
1320_at        5.110600      5.188031      5.133438      5.185311
1405_i_at      6.766398      6.799344      5.578731      5.805039
1431_at        2.900998      3.487406      4.493393      3.641552
  • the function parwise.comparison() generates fold changes, t-tests, and means for pairs of experimental groups
  • pairwise.comparison() takes as input an exprSet object, value is an instance of the PairComp class
  • pairwise.filter() takes as input an PairComp object, filter regarding to:
    • minimum expression level
    • minimum number of arrays in which a gene is called present across all groups or by group
    • minimum fold change
    • maximum t-test p-val for a gene

  • example: filter results of pairwise comparison using:
    • genes must be called present in at least three samples in one of the groups
    • t-test p-val must be less then 0.01
    • fold change between means of the two groups must be at least 1.5
pair <- pairwise.comparison(data.mas5,group="disease",members=c("control", "polycystic_ovary_syndrome"),spots=data.raw)
pair.filt <- pairwise.filter(pair,min.present.no=3,present.by.group=T,fc=log2(1.5),tt=0.05)
[1] "Checking member control in group: ' disease '"
[1] "Checking member polycystic_ovary_syndrome in group: ' disease '"
  • the pair.filt object is also an instance of the PairComp class
  • how many genes have been returned?:
nrow(pair.filt@means)
[1] 54
pair.filt@means[1:10,1:2]
control polycystic_ovary_syndrome
200879_s_at  7.198598                  6.585865
200951_s_at  4.417044                  5.126597
200974_at   10.134294                  9.549187
201242_s_at  7.949515                  8.556699
201468_s_at  8.909398                  9.676864
201496_x_at  6.486854                  5.289261
201497_x_at  9.179405                  8.257583
202040_s_at  6.770394                  7.379680
202104_s_at  6.888542                  6.274534
202274_at    6.554181                  5.746637
  • view the PMA detection calls using the calls slot
pair.filt@calls[1:10,]
GSM114834.CEL.present GSM114840.CEL.present GSM114841.CEL.present
200879_s_at "P"                   "P"                   "P"                  
200951_s_at "P"                   "P"                   "A"                  
200974_at   "P"                   "P"                   "P"                  
201242_s_at "P"                   "P"                   "P"                  
201468_s_at "P"                   "P"                   "P"                  
201496_x_at "P"                   "P"                   "P"                  
201497_x_at "P"                   "P"                   "P"                  
202040_s_at "P"                   "P"                   "P"                  
202104_s_at "P"                   "P"                   "P"                  
202274_at   "P"                   "A"                   "P"                  
            GSM114842.CEL.present GSM114843.CEL.present GSM114844.CEL.present
200879_s_at "P"                   "P"                   "P"                  
200951_s_at "P"                   "A"                   "A"                  
200974_at   "P"                   "P"                   "P"                  
201242_s_at "P"                   "P"                   "P"                  
201468_s_at "P"                   "P"                   "P"                  
201496_x_at "P"                   "P"                   "P"                  
201497_x_at "P"                   "P"                   "P"                  
202040_s_at "P"                   "P"                   "P"                  
202104_s_at "P"                   "A"                   "P"                  
202274_at   "A"                   "P"                   "P"                  
            GSM114845.CEL.present GSM114846.CEL.present GSM114847.CEL.present
200879_s_at "P"                   "P"                   "P"                  
200951_s_at "A"                   "P"                   "A"                  
200974_at   "P"                   "P"                   "P"                  
201242_s_at "P"                   "P"                   "P"                  
201468_s_at "P"                   "P"                   "P"                  
201496_x_at "P"                   "P"                   "A"                  
201497_x_at "P"                   "P"                   "P"                  
202040_s_at "P"                   "P"                   "P"                  
202104_s_at "M"                   "P"                   "P"                  
202274_at   "P"                   "P"                   "A"                  
            GSM114848.CEL.present GSM114849.CEL.present GSM114850.CEL.present
200879_s_at "P"                   "P"                   "P"                  
200951_s_at "A"                   "A"                   "A"                  
200974_at   "P"                   "P"                   "P"                  
201242_s_at "P"                   "P"                   "P"                  
201468_s_at "P"                   "P"                   "P"                  
201496_x_at "P"                   "P"                   "P"                  
201497_x_at "P"                   "P"                   "P"                  
202040_s_at "P"                   "P"                   "P"                  
202104_s_at "M"                   "P"                   "P"                  
202274_at   "A"                   "P"                   "A"                  
            GSM114851.CEL.present GSM114852.CEL.present GSM114853.CEL.present
200879_s_at "P"                   "P"                   "P"                  
200951_s_at "A"                   "A"                   "P"                  
200974_at   "P"                   "P"                   "P"                  
201242_s_at "P"                   "P"                   "P"                  
201468_s_at "P"                   "P"                   "P"                  
201496_x_at "P"                   "P"                   "A"                  
201497_x_at "P"                   "P"                   "A"                  
202040_s_at "P"                   "P"                   "P"                  
202104_s_at "P"                   "P"                   "P"                  
202274_at   "P"                   "P"                   "A"                  
            GSM114854.CEL.present GSM114855.CEL.present
200879_s_at "P"                   "P"                  
200951_s_at "A"                   "A"                  
200974_at   "P"                   "P"                  
201242_s_at "P"                   "P"                  
201468_s_at "P"                   "P"                  
201496_x_at "M"                   "P"                  
201497_x_at "M"                   "P"                  
202040_s_at "P"                   "P"                  
202104_s_at "P"                   "M"                  
202274_at   "A"                   "P"
  • fold change between the two groups for each gene fc slot
pair.filt@fc[1:10]
200879_s_at 200951_s_at   200974_at 201242_s_at 201468_s_at 201496_x_at 
  0.6127330  -0.7095537   0.5851071  -0.6071841  -0.7674657   1.1975929 
201497_x_at 202040_s_at 202104_s_at   202274_at 
  0.9218226  -0.6092852   0.6140086   0.8075434
  • t-test p-vals
pair.filt@tt[1:10]
200879_s_at 200951_s_at   200974_at 201242_s_at 201468_s_at 201496_x_at 
0.045650045 0.023617284 0.044967617 0.046132070 0.037101434 0.024593573 
201497_x_at 202040_s_at 202104_s_at   202274_at 
0.047164613 0.001084647 0.017975104 0.033348173

No comments :

Post a Comment