Working with Microarray Data
along with chapter 19 Lewis, R for Medicine and Biologyraw CEL files
- the bioconductor packages affy and simpleaffy allow to work with raw microarray data
- also carry out quality control and simple data analysis
#source("http://bioconductor.org/biocLite.R") #biocLite("affy") #biocLite("simpleaffy")
org_babel_R_eoe
- get some raw data
download.file(url="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5090/suppl/GSE5090_RAW.tar",destfile="arraydata.tar")
versuche URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5090/suppl/GSE5090_RAW.tar' ftp data connection made, file length 63569920 bytes URL geƶffnet ================================================= downloaded 60.6 Mb
- read in data
setwd("~/affyfiles") data.raw <- read.affy("covdesc") setwd("~/")
- the function read.affy() gives back a object of class AffyBatch (which extends the eSet class - Biobase package)
- the cdfName slot: Affymetrix array plotform used
- nrow slot: number of rows in the Affymetrix chip
data.raw@cdfName
[1] "HG-U133A"
data.raw@nrow
Rows 712
data.raw@ncol
Cols 712
- so the HG-U133A array has 712 rows and 712 columns -> 506944 spots per chip
- obtain probe-level data from each individual spot using the exprs() function
- e.g. intesity data for the last four spots on the first array of the data.raw object
- e.g. intesity data for the last four spots on the first array of the data.raw object
intensity(data.raw)[506940:506944,1]
506940 506941 506942 506943 506944 109.0 10380.5 175.3 10044.3 158.0
Normalizing/Quality Control
- QC methods:
- density plots of PM intensities
- RNA degradation plots
- MvA plots
- density plots of PM intensities
- MvA plots
- assess how similar probe-level data is between samples in the same subgroup, detection of any bias in intensities between samples by assessment of the curve produced
- M represents log fold change between two samples (y-axis)
- A represents the mean absolute log expression between samples (x-axis)
setwd("~/affyfiles") require(simpleaffy) require(affy) data.raw <- read.affy("covdesc") setwd("../")
MAplot(data.raw[c(1,4)],pairs=T)
- normalize data and plot again
data.norm <- normalize(data.raw) MAplot(data.norm[c(1,4)],pairs=T) - assess how similar probe-level data is between samples in the same subgroup, detection of any bias in intensities between samples by assessment of the curve produced
From probe-level data to expression data
- create expression levels from the raw data
- create a PMA detection call and the associated p-values from the first array
# data.mas5 <- call.exprs(data.raw,"mas5") # sample1 <- detection.p.val(data.raw[,1],calls=T) cbind(sample1$call[1:10],sample1$pval[1:10])
Warnmeldung:
In data.raw[c(1, 4)] :
The use of abatch[i,] and abatch[i] is deprecated. Please use abatch[,i] instead.
X11cairo
2
[,1] [,2]
[1,] "P" "0.00261701434183198"
[2,] "P" "0.00486279317241156"
[3,] "P" "0.00564280932389143"
[4,] "P" "0.00418036711173471"
[5,] "A" "0.378184514727312"
[6,] "P" "0.00306673915877408"
[7,] "P" "0.00564280932389143"
[8,] "A" "0.0894050784997029"
[9,] "P" "0.00116478895452843"
[10,] "A" "0.0813367142140589"
qc()generates commonly used quality control metrics incl:
- average background
- scale factor
- number of genes called present
- 3' to 5' end ratios of beta-actin and GADPH genes
- average background
data.qc <- qc(data.raw,data.mas5)
plot(data.qc)
create a list of differentially expressed genes
data.mas5is an instance of theAffyBatchclass, so we could applyAffyBatchfunctions
exprs()returns expression-level data different probes
- e.g. expression levels computed by the Mas 5.0 algorithm for the first 10 probes on the array for four control samples
- e.g. expression levels computed by the Mas 5.0 algorithm for the first 10 probes on the array for four control samples
exprs(data.mas5)[1:10,1:4]
GSM114834.CEL GSM114840.CEL GSM114841.CEL GSM114842.CEL 1007_s_at 7.623914 7.767622 7.799031 7.991894 1053_at 4.748147 5.127184 5.149059 4.975323 117_at 5.552385 6.005922 5.376999 5.947452 121_at 7.789410 7.900269 7.845280 8.188128 1255_g_at 1.254626 3.157348 3.842650 3.800311 1294_at 6.594926 6.445978 7.237356 6.657093 1316_at 5.118859 5.160522 4.894248 4.714638 1320_at 5.110600 5.188031 5.133438 5.185311 1405_i_at 6.766398 6.799344 5.578731 5.805039 1431_at 2.900998 3.487406 4.493393 3.641552
- the function
parwise.comparison()generates fold changes, t-tests, and means for pairs of experimental groups
pairwise.comparison()takes as input anexprSetobject, value is an instance of thePairCompclass
pairwise.filter()takes as input anPairCompobject, filter regarding to:
- minimum expression level
- minimum number of arrays in which a gene is called present across all groups or by group
- minimum fold change
- maximum t-test p-val for a gene
- minimum expression level
- example: filter results of pairwise comparison using:
- genes must be called present in at least three samples in one of the groups
- t-test p-val must be less then 0.01
- fold change between means of the two groups must be at least 1.5
- genes must be called present in at least three samples in one of the groups
pair <- pairwise.comparison(data.mas5,group="disease",members=c("control", "polycystic_ovary_syndrome"),spots=data.raw) pair.filt <- pairwise.filter(pair,min.present.no=3,present.by.group=T,fc=log2(1.5),tt=0.05)
[1] "Checking member control in group: ' disease '" [1] "Checking member polycystic_ovary_syndrome in group: ' disease '"
- the
pair.filtobject is also an instance of thePairCompclass
- how many genes have been returned?:
nrow(pair.filt@means)
[1] 54
pair.filt@means[1:10,1:2]
control polycystic_ovary_syndrome 200879_s_at 7.198598 6.585865 200951_s_at 4.417044 5.126597 200974_at 10.134294 9.549187 201242_s_at 7.949515 8.556699 201468_s_at 8.909398 9.676864 201496_x_at 6.486854 5.289261 201497_x_at 9.179405 8.257583 202040_s_at 6.770394 7.379680 202104_s_at 6.888542 6.274534 202274_at 6.554181 5.746637
- view the PMA detection calls using the
callsslot
pair.filt@calls[1:10,]
GSM114834.CEL.present GSM114840.CEL.present GSM114841.CEL.present
200879_s_at "P" "P" "P"
200951_s_at "P" "P" "A"
200974_at "P" "P" "P"
201242_s_at "P" "P" "P"
201468_s_at "P" "P" "P"
201496_x_at "P" "P" "P"
201497_x_at "P" "P" "P"
202040_s_at "P" "P" "P"
202104_s_at "P" "P" "P"
202274_at "P" "A" "P"
GSM114842.CEL.present GSM114843.CEL.present GSM114844.CEL.present
200879_s_at "P" "P" "P"
200951_s_at "P" "A" "A"
200974_at "P" "P" "P"
201242_s_at "P" "P" "P"
201468_s_at "P" "P" "P"
201496_x_at "P" "P" "P"
201497_x_at "P" "P" "P"
202040_s_at "P" "P" "P"
202104_s_at "P" "A" "P"
202274_at "A" "P" "P"
GSM114845.CEL.present GSM114846.CEL.present GSM114847.CEL.present
200879_s_at "P" "P" "P"
200951_s_at "A" "P" "A"
200974_at "P" "P" "P"
201242_s_at "P" "P" "P"
201468_s_at "P" "P" "P"
201496_x_at "P" "P" "A"
201497_x_at "P" "P" "P"
202040_s_at "P" "P" "P"
202104_s_at "M" "P" "P"
202274_at "P" "P" "A"
GSM114848.CEL.present GSM114849.CEL.present GSM114850.CEL.present
200879_s_at "P" "P" "P"
200951_s_at "A" "A" "A"
200974_at "P" "P" "P"
201242_s_at "P" "P" "P"
201468_s_at "P" "P" "P"
201496_x_at "P" "P" "P"
201497_x_at "P" "P" "P"
202040_s_at "P" "P" "P"
202104_s_at "M" "P" "P"
202274_at "A" "P" "A"
GSM114851.CEL.present GSM114852.CEL.present GSM114853.CEL.present
200879_s_at "P" "P" "P"
200951_s_at "A" "A" "P"
200974_at "P" "P" "P"
201242_s_at "P" "P" "P"
201468_s_at "P" "P" "P"
201496_x_at "P" "P" "A"
201497_x_at "P" "P" "A"
202040_s_at "P" "P" "P"
202104_s_at "P" "P" "P"
202274_at "P" "P" "A"
GSM114854.CEL.present GSM114855.CEL.present
200879_s_at "P" "P"
200951_s_at "A" "A"
200974_at "P" "P"
201242_s_at "P" "P"
201468_s_at "P" "P"
201496_x_at "M" "P"
201497_x_at "M" "P"
202040_s_at "P" "P"
202104_s_at "P" "M"
202274_at "A" "P"
- fold change between the two groups for each gene
fcslot
pair.filt@fc[1:10]
200879_s_at 200951_s_at 200974_at 201242_s_at 201468_s_at 201496_x_at 0.6127330 -0.7095537 0.5851071 -0.6071841 -0.7674657 1.1975929 201497_x_at 202040_s_at 202104_s_at 202274_at 0.9218226 -0.6092852 0.6140086 0.8075434
- t-test p-vals
pair.filt@tt[1:10]
200879_s_at 200951_s_at 200974_at 201242_s_at 201468_s_at 201496_x_at 0.045650045 0.023617284 0.044967617 0.046132070 0.037101434 0.024593573 201497_x_at 202040_s_at 202104_s_at 202274_at 0.047164613 0.001084647 0.017975104 0.033348173



No comments :
Post a Comment