Monday, June 10, 2013

R - first steps with GEOquery

Along with chapter 18 of Lewis, R for Medicine and Biology

  • Installing the packages

    source("http://bioconductor.org/biocLite.R")
    biocLite("simpleaffy")
    biocLite("GEOquery")
    
    GEOquery
    
  • Gene Expression Omnibus Repository (GEO)

    • public repository
    • allows submitting of high-throughput experimental data for free access by others
    • includes single- and dual-channel microarrays, measuring mRNA, miRNA, genomic DNA (incl. arrayCGH and SNP) and protein abundance
    • also contains a collection of web-based tools for querying and downloading of datasets
    • for R there exists the GEOquery package
  • sample dataset: Polycystic ovary syndrome: adipose tissue

    • get data
    library(GEOquery)
    gds <- getGEO("GDS2084")
    Meta(gds)
    
    Using locally cached version of GDS2084 found here:
    /tmp/Rtmpa2mlCM/GDS2084.soft.gz
    $channel_count
    [1] "1"
    
    $dataset_id
    [1] "GDS2084" "GDS2084"
    
    $description
    [1] "Analysis of omental adipose tissues of morbidly obese patients with polycystic ovary syndrome (PCOS). PCOS is a common hormonal disorder among women of reproductive age, and is characterized by hyperandrogenism and chronic anovulation. PCOS is associated with obesity."
    [2] "control"                                                                                                                                                                                                                                                                     
    [3] "polycystic ovary syndrome"                                                                                                                                                                                                                                                   
    
    $email
    [1] "geo@ncbi.nlm.nih.gov"
    
    $feature_count
    [1] "22283"
    
    $institute
    [1] "NCBI NLM NIH"
    
    $name
    [1] "Gene Expression Omnibus (GEO)"
    
    $order
    [1] "none"
    
    $platform
    [1] "GPL96"
    
    $platform_organism
    [1] "Homo sapiens"
    
    $platform_technology_type
    [1] "in situ oligonucleotide"
    
    $pubmed_id
    [1] "17062763"
    
    $ref
    [1] "Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6"
    
    $reference_series
    [1] "GSE5090"
    
    $sample_count
    [1] "15"
    
    $sample_id
    [1] "GSM114841,GSM114844,GSM114845,GSM114849,GSM114851,GSM114854,GSM114855"          
    [2] "GSM114834,GSM114842,GSM114843,GSM114847,GSM114848,GSM114850,GSM114852,GSM114853"
    
    $sample_organism
    [1] "Homo sapiens"
    
    $sample_type
    [1] "RNA"
    
    $title
    [1] "Polycystic ovary syndrome: adipose tissue"
    
    $type
    [1] "Expression profiling by array" "disease state"                
    [3] "disease state"                
    
    $update_date
    [1] "Mar 21 2007"
    
    $value_type
    [1] "count"
    
    $web_link
    [1] "http://www.ncbi.nlm.nih.gov/geo"
    


    • display individual expression values
      • first column (ID_REF): probe ID assigned by Affymetrix for each probe
      • second column (IDENTIFIER): ID for the corresponding transcript
      • remaining columns: expression values returned for each array

    Table(gds)[1:10,1:5]
    
    ID_REF IDENTIFIER GSM114841 GSM114844 GSM114845
    1  1007_s_at       DDR1     222.6     252.7     219.3
    2    1053_at       RFC2      35.5      24.5      23.4
    3     117_at      HSPA6      41.5      53.3      31.3
    4     121_at       PAX8     229.8     419.6     274.5
    5  1255_g_at     GUCA1A      14.3        13      29.6
    6    1294_at       UBA7     150.8       116      89.9
    7    1316_at       THRA      29.7      35.4        53
    8    1320_at     PTPN21      35.1      44.8      28.8
    9  1405_i_at       CCL5      47.8      53.2      41.7
    10   1431_at     CYP2E1      22.5      24.9      38.7
    
    • get disease status associated with each sample
    Columns(gds)[,1:2]
    
    sample             disease.state
    1  GSM114841                   control
    2  GSM114844                   control
    3  GSM114845                   control
    4  GSM114849                   control
    5  GSM114851                   control
    6  GSM114854                   control
    7  GSM114855                   control
    8  GSM114834 polycystic ovary syndrome
    9  GSM114842 polycystic ovary syndrome
    10 GSM114843 polycystic ovary syndrome
    11 GSM114847 polycystic ovary syndrome
    12 GSM114848 polycystic ovary syndrome
    13 GSM114850 polycystic ovary syndrome
    14 GSM114852 polycystic ovary syndrome
    15 GSM114853 polycystic ovary syndrome
    
    • information about source of samples
    Columns(gds)[,3]  
    
    [1] "Value for GSM114841: EP3_adipose_control; src: Omental adipose tissue"      
     [2] "Value for GSM114844: EP23_adipose_control; src: Omental adipose tissue"     
     [3] "Value for GSM114845: EP31_adipose_control_rep1; src: Omental adipose tissue"
     [4] "Value for GSM114849: EP37_adipose_control; src: Omental adipose tissue"     
     [5] "Value for GSM114851: EP49_adipose_control; src: Omental adipose tissue"     
     [6] "Value for GSM114854: EP69_adipose_control; src: Omental adipose tissue"     
     [7] "Value for GSM114855: EP71_adipose_control; src: Omental adipose tissue"     
     [8] "Value for GSM114834: EP1_adipose_pcos_rep1; src: Omental adipose tissue"    
     [9] "Value for GSM114842: EP10_adipose_pcos; src: Omental adipose tissue"        
    [10] "Value for GSM114843: EP18_adipose_pcos; src: Omental adipose tissue"        
    [11] "Value for GSM114847: EP32_adipose_pcos; src: Omental adipose tissue"        
    [12] "Value for GSM114848: EP34_adipose_pcos; src: Omental adipose tissue"        
    [13] "Value for GSM114850: EP47_adipose_pcos; src: Omental adipose tissue"        
    [14] "Value for GSM114852: EP55_adipose_pcos; src: Omental adipose tissue"        
    [15] "Value for GSM114853: EP66_adipose_pcos; src: Omental adipose tissue"
    
  • get information in more detail for the first sample (GSM114841)

    gsm <- getGEO("GSM114841")
    Meta(gsm)
    
    Using locally cached version of GSM114841 found here:
    /tmp/Rtmpa2mlCM/GSM114841.soft
    $biomaterial_provider_ch1
    [1] "Ramón y Cajal Hospital, Madrid, Spain"
    
    $channel_count
    [1] "1"
    
    $characteristics_ch1
    [1] "Morbidly obese control subject"
    
    $contact_address
    [1] "ARTURO DUPERIER"
    
    $contact_city
    [1] "MADRID"
    
    $contact_country
    [1] "Spain"
    
    $contact_email
    [1] "bperal@iib.uam.es"
    
    $contact_fax
    [1] "34 91 5854401"
    
    $contact_institute
    [1] "INSTITUTO DE INVESTIGACIONES BIOMEDICAS, CSIC-UAM"
    
    $contact_name
    [1] "BELEN,,PERAL"
    
    $contact_phone
    [1] "34 91 5854478"
    
    $contact_state
    [1] "MADRID"
    
    $`contact_zip/postal_code`
    [1] "28029"
    
    $data_processing
    [1] "MAS 5.0, scaled to 100 and RMA"
    
    $data_row_count
    [1] "22283"
    
    $description
    [1] "Total RNA was extracted from omental  adipose tissue from a control subject"
    
    $geo_accession
    [1] "GSM114841"
    
    $label_ch1
    [1] "Biotin"
    
    $last_update_date
    [1] "Jun 16 2006"
    
    $molecule_ch1
    [1] "total RNA"
    
    $organism_ch1
    [1] "Homo sapiens"
    
    $platform_id
    [1] "GPL96"
    
    $series_id
    [1] "GSE5090"
    
    $source_name_ch1
    [1] "Omental adipose tissue"
    
    $status
    [1] "Public on Jun 17 2006"
    
    $submission_date
    [1] "Jun 16 2006"
    
    $supplementary_file
    [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM114nnn/GSM114841/suppl/GSM114841.CEL.gz"
    [2] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM114nnn/GSM114841/suppl/GSM114841.EXP.gz"
    
    $taxid_ch1
    [1] "9606"
    
    $title
    [1] "EP3_adipose_control"
    
    $type
    [1] "RNA"
    
    • look at the additional data
    Columns(gsm)
    
    ID_REF
    VALUESignal intensity - MAS 5.0, scaled to 100 and RMA
    ABS_CALLPresence/absence of gene transcript in sample; the call in an absolute analysis that indicates if the transcript was present (P), absent (A), marginal (M), or no call (NC)
    Detection p-valuep-value that indicates the significance level of the detection call
    • so there are four columns
      • probe ID
      • expression value (output from a Mas 5.0 analysis software)
      • Detection Call
      • p-val (Detection p-val)
    Table(gsm)[500:510,]
    
    Column
    1            ID_REF
    2             VALUE
    3          ABS_CALL
    4 Detection p-value
                                                                                                                                                                      Description
    1                                                                                                                                                                            
    2                                                                                                                           Signal intensity - MAS 5.0, scaled to 100 and RMA
    3 Presence/absence of gene transcript in sample; the call in an absolute analysis that indicates if the transcript was present (P), absent (A), marginal (M), or no call (NC)
    4                                                                                                         p-value that indicates the significance level of the detection call
            ID_REF VALUE ABS_CALL Detection p-value
    500   33307_at  47.8        P          0.039365
    501   33304_at  35.3        A          0.339558
    502   33197_at  58.4        P          0.017001
    503   33148_at  21.4        P          0.019304
    504   33132_at  15.8        A          0.189687
    505   32837_at   274        P          0.000959
    506   32836_at 319.8        P          0.006532
    507   32811_at 213.3        P          0.024711
    508   32723_at    35        P          0.004863
    509 32699_s_at  43.9        A          0.162935
    510   32625_at  67.1        A           0.11716
    
  • get a series record

    • contains lists for GPL platform object and the GSM sample record object
    • retrieve GSE data structure for our example in one go
    gse <- getGEO("GSE5090")
     gse
    
    Found 1 file(s)
    GSE5090_series_matrix.txt.gz
    Using locally cached version: /tmp/Rtmpa2mlCM/GSE5090_series_matrix.txt.gz
    Using locally cached version of GPL96 found here:
    /tmp/Rtmpa2mlCM/GPL96.soft
    $GSE5090_series_matrix.txt.gz
    ExpressionSet (storageMode: lockedEnvironment)
    assayData: 22283 features, 17 samples 
      element names: exprs 
    protocolData: none
    phenoData
      sampleNames: GSM114834 GSM114840 ... GSM114855 (17 total)
      varLabels: title geo_accession ... data_row_count (30 total)
      varMetadata: labelDescription
    featureData
      featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283 total)
      fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
      fvarMetadata: Column Description labelDescription
    experimentData: use 'experimentData(object)'
    Annotation: GPL96
    

No comments :

Post a Comment