How To Correctly Do Go Analysis Of Rnaseq Data

Install and load packages
Mapping reads to a reference genome
Count reads for each feature
QC and stats
Differential Expression
Gene Note
Gene Ready Enrichment

Learning Objectives

This course is an introduction to differential expression analysis from RNAseq data. It volition take you lot from the raw fastq files all the style to the list of differentially expressed genes, via the mapping of the reads to a reference genome and statistical assay using the limma package.

The Monash Bioinformatics Platform thanks and acknowledges QFAB (http://www.qfab.org) for original materials in this section.

Install and load packages

Most generic R packages are hosted on the Comprehensive R Archive Network (CRAN, http://cran.us.r-project.org/). To install ane of these packages, you lot would utilise install.packages("packagename"). You lot only demand to install a package once, then load it each time using library(packagename).

Bioconductor packages work a bit differently, and are non hosted on CRAN. Get to http://bioconductor.org/ to learn more about the Bioconductor project. To use any Bioconductor parcel, yous'll need a few "core" Bioconductor packages. Run the following commands to (one) download the installer script, and (2) install some core Bioconductor packages. Y'all'll need internet connectivity to do this, and it'll accept a few minutes, but information technology only needs to be done once.

            # Download the installer script source("http://bioconductor.org/biocLite.R")  # biocLite() is the bioconductor installer function. Run information technology without whatever # arguments to install the core packages or update any installed packages. This # requires internet connectivity and will take some time! biocLite()

Annotation

All packages required for this course have already been installed on our grooming server so you lot won't need to run this part today.

Mapping reads to a reference genome

Data pre-processing

To start with, delete all previously saved R objects and define your working directory for the RNAseq data analysis.

              # Delete all previously saved R objects rm(list=ls())

The information considered for the RNAseq function of the form have been downloaded from ArrayExpress (http://www.ebi.ac.britain/arrayexpress) and stand for to viii RNA sequencing libraries from Man brain and liver.

Raw sequencing information are commonly available in FASTQ format which is a well defined text-based format for storing both biological sequences (usually nucleotide sequences) and their respective quality scores. The raw data from this report have been downloaded (8Gb / fastq file) into the shared directory "~/data/RNAseq/raw_data".

              # define shared directory for RNAseq data RNAseqDATADIR <- "data/raw_data" #list the fastq files in the raw data directory dir(RNAseqDATADIR)

              ## [1] "ERR420388_mini_1.fastq.gz"    "ERR420388_mini_2.fastq.gz"    ## [three] "ERR420388_subsamp_1.fastq.gz" "ERR420388_subsamp_2.fastq.gz" ## [five] "experiment_design.txt"

The first step in a RNAseq assay is to run a quick quality check on your data, this volition give you an thought of the quality of your raw data in terms of number of reads per library, read length, average quality score along the reads, GC content, sequence duplication level, adaptors that might have not been removed correctly from the information etc.

The fastQC tool is quick and piece of cake to run and can exist downloaded from here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

To ensure highest quality of the sequences for subsequent mapping and differential expression analysis steps, the reads can also be trimmed using the Trimmomatic tool (Lohse et al. 2012, http://www.usadellab.org/cms/?page=trimmomatic).

NOTE

For the scope of this course we will focus on the R-based steps and will assume that the data are fit for purpose.

Also the adjacent two sections (Mapping and Read counts) will be performed on small subsets of those files due to time and I/O limitations.

Mapping

Once the reads have been quality checked and trimmed, the next step is to map the reads to the reference genome (in our case the human genome "hg19"). This can be done with the Bioconductor parcel Rsubread (Y et al. 2013).

              library(Rsubread)

NOTE

Delight think that you lot can also perform those steps using command line tools (such as BWA and featureCounts) as described in the Unix crush class.

If the packet is not already installed on your organization, you can install it by typing:

                source("http://bioconductor.org/biocLite.R") biocLite("Rsubread")

Rsubread provides reference genome indices for the most common organisms: human and mouse. If you are working with a different organism you can build your own index using the buildindex command.

                # define the reference genome fasta file REF_GENOME <- "hg19.fa" # define the output directory for the Rsubread index # (admin notation: requires data/ref_data/download_hg19.sh to be run first) RSUBREAD_INDEX_PATH <- "data/ref_data" # ascertain the basename for the index RSUBREAD_INDEX_BASE <- "hg19" # check what is in the reference directory dir(RSUBREAD_INDEX_PATH)

NOTE

For the purpose of this course the alphabetize has already been built and the mapping will be done on small-scale subset of reads since this step tin can take up to a couple of hours per library. Please practise not run the command below.

                # build the index buildindex(basename=file.path(RSUBREAD_INDEX_PATH,RSUBREAD_INDEX_BASE), reference=REF_GENOME)

In one case the Rsubread alphabetize has been created yous tin map your reads to the genome by running the align command.

The code beneath volition exist used to map the reads for a specific library against the genome for which the alphabetize has been built.

NOTE

For the purpose of this course the mapping will be performed merely by the instructor and on small subset of reads since this step can take up to a couple of hours per library. Please do not run the command below.

                # list files in the raw data directory dir(RNAseqDATADIR) # define the fastq file with forward reads inputfilefwd <- file.path(RNAseqDATADIR,"ERR420388_subsamp_1.fastq.gz") # define the fastq file with reverse reads inputfilervs <- file.path(RNAseqDATADIR,"ERR420388_subsamp_2.fastq.gz")  # run the align control to map the reads align(alphabetize=file.path(RSUBREAD_INDEX_PATH,RSUBREAD_INDEX_BASE), readfile1=inputfilefwd, readfile2=inputfilervs, output_file="ERR420388.sam", output_format="SAM")

Note:

The nthreads parameter can be used in the marshal control to speed up the process and run the alignment using several CPUs in parallel.

The function propmapped returns the proportion of mapped reads in the output SAM file: total number of input reads, number of mapped reads and proportion of mapped reads. Let'south have a expect at a small SAM file as an instance:

                # ascertain the path to SAM file   outputsamfile <- "data/mapping/ERR420388.sam" propmapped(outputsamfile)

Count reads for each feature

Rsubread provides a read summarization function featureCounts, which takes as input the SAM or BAM files and assigns them to genomic features. This gives the number of reads mapped per factor, which can then be transformed into RPKM values (Read Per Killobase per Meg), normalised and tested for differential expression.

For the purpose of this class we volition use the alphabetize previously built and its corresponding GTF file to identify and count the reads mapped to each feature (gene).

            # Getting read counts using the index previously built mycounts<-featureCounts(outputsamfile, annot.ext=file.path(RSUBREAD_INDEX_PATH,"hg19.genes.gtf"), isGTFAnnotationFile=Truthful, isPairedEnd=Truthful)  # Checking your output object summary(mycounts) dim(mycounts$counts) caput(mycounts$annotation) mycounts$targets mycounts$stat

NOTE:

In-built annotations for mm9, mm10 and hg19 are likewise provided for users' convenience, but nosotros won't be using information technology for this course. Please do not run the control below

Getting read counts using the inbuilt hg19 notation:

mycounts.inbuilt <- featureCounts(myoutputfile, annot.inbuilt="hg19", isGTFAnnotationFile=FALSE, isPairedEnd=True)

For the purpose of this course the read summarisation footstep has already been performed for all libraries. You will need to load the corresponding RData file to become these read counts.

              MAPPINGDIR <- "data/mapping" # load the counts previously calculated load(file.path(MAPPINGDIR,"RawCounts.RData")) # check the presence of read counts for the 8 libraries summary(counts)

              ##            Length Class      Mode      ## counts     205616 -none-     numeric   ## annotation      2 data.frame list      ## targets         viii -none-     character

              counts$targets

              ## [ane] "/information/Intro_to_R/RNAseq/mapping/ERR420386.sam" ## [2] "/data/Intro_to_R/RNAseq/mapping/ERR420387.sam" ## [3] "/data/Intro_to_R/RNAseq/mapping/ERR420388.sam" ## [4] "/data/Intro_to_R/RNAseq/mapping/ERR420389.sam" ## [5] "/information/Intro_to_R/RNAseq/mapping/ERR420390.sam" ## [6] "/information/Intro_to_R/RNAseq/mapping/ERR420391.sam" ## [vii] "/data/Intro_to_R/RNAseq/mapping/ERR420392.sam" ## [8] "/data/Intro_to_R/RNAseq/mapping/ERR420393.sam"

You lot can and so print out these counts in a text file for time to come reference.

              # impress out counts table for every sample write.tabular array(counts$counts,file="~/raw_read_counts.txt",sep="\t", quote=F,append=F)

Note:

To save disk space on your machine, recollect to convert your SAM files to BAM format using samtools and to compress the fastq files. This tin be washed out of Rstudio, in a shell window:

samtools view -Shb myfile.sam -o myfile.bam

gzip myfile.fq

Visualisation of the BAM files yous created can also be done via a genome browser such every bit IGV (http://www.broadinstitute.org/igv/) after sorting and indexing of those files.

QC and stats

Rsubread provides the number of reads mapped to each factor which can then exist used for ploting quality control figures and for differential expression analysis.

QC figures of the mapped read counts tin can be plotted and investigated for potential outlier libraries and to confirm group of samples.

Before plotting QC figures information technology is useful to get the experiment design. This will allow labeling the information with the sample groups they belong to, or any other parameter of interest.

The experiment pattern file respective to this study has been downloaded from the ArrayExpress webpage and formatted as a tab separated file for this analysis purposes. You lot tin can detect information technology in the shared information directory.

            # define the experiment design file (tab separated text file is best) EXPMT_DESIGN_FILE <- file.path(RNAseqDATADIR,'experiment_design.txt') # read the experiment design file and salvage it into retention experiment_design<-read.tabular array(EXPMT_DESIGN_FILE,header=T,sep="\t") # # prepare the rownames to the sampleID to permit for ordering rownames(experiment_design) <- experiment_design$SampleID # society the design following the counts sample social club experiment_design.ord <- experiment_design[colnames(counts$counts),] # look at the design experiment_design.ord

            ##            SampleID    Source_Name     organism  sexual activity age tissue ## ERR420386 ERR420386 brain_sample_1 Homo_sapiens male  26  brain ## ERR420387 ERR420387 brain_sample_1 Homo_sapiens male  26  brain ## ERR420388 ERR420388 liver_sample_1 Homo_sapiens male  xxx  liver ## ERR420389 ERR420389 liver_sample_1 Homo_sapiens male  30  liver ## ERR420390 ERR420390 liver_sample_1 Homo_sapiens male  30  liver ## ERR420391 ERR420391 brain_sample_1 Homo_sapiens male  26  brain ## ERR420392 ERR420392 brain_sample_1 Homo_sapiens male  26  brain ## ERR420393 ERR420393 liver_sample_1 Homo_sapiens male  30  liver ##           Extract_Name Material_Type Assay_Name technical_replicate_group ## ERR420386       GCCAAT           RNA     Assay4                   group_2 ## ERR420387       ACAGTG           RNA     Assay2                   group_1 ## ERR420388       GTGAAA           RNA     Assay7                   group_4 ## ERR420389       GTGAAA           RNA     Assay8                   group_4 ## ERR420390       CTTGTA           RNA     Assay6                   group_3 ## ERR420391       ACAGTG           RNA     Assay1                   group_1 ## ERR420392       GCCAAT           RNA     Assay3                   group_2 ## ERR420393       CTTGTA           RNA     Assay5                   group_3

            # list the ordered samples for future use samples <- as.character(experiment_design.ord$SampleID) # create factors for time to come plotting group<-factor(experiment_design.ord$tissue) group

            ## [1] brain encephalon liver liver liver brain brain liver ## Levels: brain liver

            historic period<-factor(experiment_design.ord$historic period) age

            ## [ane] 26 26 thirty 30 thirty 26 26 30 ## Levels: 26 30

Basic QC plots

Density plots of log-intensity distribution of each library tin can be superposed on a single graph for a amend comparison between libraries and for identification of libraries with weird distribution. On the boxplots the density distributions of raw log-intensities are not expected to be identical but still non totally different.

Notation:

The png function volition create a png file where to salvage the plots created direct after, and will close this file when dev.off() is called.

To see your plots interactively, simply ommit those ii lines.

                # density plot of raw read counts (log10) png(file="~/Raw_read_counts_per_gene.density.png") logcounts <- log(counts$counts[,1],x)  d <- density(logcounts) plot(d,xlim=c(1,8),primary="",ylim=c(0,.45),xlab="Raw read counts per gene (log10)", ylab="Density") for (s in 2:length(samples)){   logcounts <- log(counts$counts[,s],10)    d <- density(logcounts)   lines(d) } dev.off()

Boxplots of the raw read counts after log10 transformation.

                ## box plots of raw read counts (log10) png(file="~/Raw_read_counts_per_gene.boxplot.png") logcounts <- log(counts$counts,10) boxplot(logcounts, main="", xlab="", ylab="Raw read counts per gene (log10)",axes=Faux) axis(2) axis(ane,at=c(i:length(samples)),labels=colnames(logcounts),las=two,cex.centrality=0.8) dev.off()

In club to investigate the relationship betwixt samples, hierarchical clustering can be performed using the heatmap function from the stats packet. In this example heatmap calculates a matrix of euclidean distances from the mapped read counts for the 100 most highly expressed genes.

                # select data for the 100 nigh highly expressed genes select = society(rowMeans(counts$counts), decreasing=Truthful)[1:100] highexprgenes_counts <- counts$counts[select,]  # heatmap with sample proper name on Ten-axis png(file="~/High_expr_genes.heatmap.png") heatmap(highexprgenes_counts, col=topo.colors(50), margin=c(10,6)) dev.off()

Yous will notice that the samples clustering does not follow the original club in the data matrix (alphabetical order "ERR420386" to "ERR420393"). They have been re-ordered according to the similarity of the 100 genes expression profiles. To sympathise what biological effect lies nether this clustering, one can use the samples notation for labeling (samples group, age, sexual activity etc).

                # heatmap with condition grouping as labels colnames(highexprgenes_counts)<- group # plot png(file="~/High_exprs_genes.heatmap.grouping.png") heatmap(highexprgenes_counts, col = topo.colors(50), margin=c(ten,half-dozen)) dev.off()

Exercise: Heatmap

Produce a heatmap for the 50 most highly expressed genes and annotate the samples with with their historic period.

Subset the read counts object for the 50 about highly expressed genes
Annotate the samples in the subset with their historic period (check order with design!)
Plot a heatmap with this subset of data, scaling genes and ordering both genes and samples

Principal Component Analysis

A Principal Component Analysis (PCA) tin can as well exist performed with these data using the cmdscale function (from the stats bundle) which performs a classical multidimensional scaling of a data matrix.

Reads counts demand to be transposed before existence analysed with the cmdscale functions, i.e. genes should be in columns and samples should exist in rows. This is the code for transposing and checking the data before further steps:

              # select information for the 1000 near highly expressed genes select = order(rowMeans(counts$counts), decreasing=True)[1:100] highexprgenes_counts <- counts$counts[select,] # annotate the data with condition group as labels colnames(highexprgenes_counts)<- group # transpose the information to accept variables (genes) equally columns data_for_PCA <- t(highexprgenes_counts) dim(data_for_PCA)

              ## [one]   viii 100

The cmdscale role will calculate a matrix of dissimilarities from your transposed data and will too provide information about the proportion of explained variance by computing Eigen values.

              ## calculate MDS (matrix of dissimilarities) mds <- cmdscale(dist(data_for_PCA), k=3, eig=TRUE)   # k = the maximum dimension of the space which the information are to exist represented in # eig = indicates whether eigenvalues should exist returned

The variable mds$eig provides the Eigen values for the start eight chief components:

              mds$eig

              ## [1] 9.490938e+13 1.099639e+13 ane.125271e+11 1.026586e+ten 1.500500e+07 ## [vi] 6.240239e+06 iii.206875e+06 2.285361e-03

Plotting this variable every bit a percentage volition help you lot determine how many components can explain the variability in your dataset and thus how many dimensions yous should exist looking at.

              # transform the Eigen values into percentage eig_pc <- mds$eig * 100 / sum(mds$eig) # plot the PCA png(file="~/PCA_PropExplainedVariance.png") barplot(eig_pc,      las=one,      xlab="Dimensions",       ylab="Proportion of explained variance (%)", y.axis=Aught,      col="darkgrey") dev.off()

In virtually cases, the first 2 components explain more than half the variability in the dataset and tin can be used for plotting. The cmdscale role run with default parameters volition perform a primary components analysis on the given data matrix and the plot function volition provide scatter plots for individuals representation.

              ## summate MDS mds <- cmdscale(dist(data_for_PCA)) # Performs MDS analysis

              #Samples representation png(file="~/PCA_Dim1vsDim2.png") plot(mds[,i], -mds[,2], type="n", xlab="Dimension i", ylab="Dimension two", master="") text(mds[,1], -mds[,ii], rownames(mds), cex=0.8)  dev.off()

The PCA plot of the first ii components evidence a clear separation of the Brain and Liver samples across the 1st dimension. Within each sample group we can also discover a split between the iv samples of each group, which seem to cluster in pair. This observation can be explained by some other gene of variability in the data, ordinarily batch effect or some other biological biais such equally age or sexual activity.

Practice: PCA

Produce a PCA plot from the read counts of the 500 most highly expressed genes and change the labels until you tin identify the reason for the dissever betwixt samples from the same tissue.

Get the read counts for the 500 most highly expressed genes
Transpose this matrix of read counts
Check the number of dimensions explaining the variability in the dataset
Run the PCA with an appropriate number of components
Annotate the samples with their age & re-run the PCA & plot the main components
Comment the samples with other clinical data & re-run the PCA & plot the master components until yous can split up the samples within each tissue group

Differential Expression

Before proceeding with differential expression analysis, it is useful to filter out very lowly expressed genes. This will assistance increasing the statistical power of the analysi while keeping genes of involvement. A common style to do this is past filtering out genes having less than ane count-per-million reads (cpm) in half the samples. The "edgeR" library provides the cpm function which can be used hither.

            # load required libraries library(edgeR)

            ## Loading required package: limma

            # get the expression counts from previous alignment pace mycounts <- counts$counts dim(mycounts)

            ## [1] 25702     8

            mycounts[one:five,i:3]

            ##    ERR420386 ERR420387 ERR420388 ## 1          2        56      5461 ## 2       3723      2270     27665 ## nine         14         3       148 ## 10         ane         1       373 ## 12        42        34     20969

            # filtering #Keep genes with least 1 count-per-one thousand thousand reads (cpm) in at least 4 samples isexpr <- rowSums(cpm(mycounts)>i) >= 4 tabular array(isexpr)

            ## isexpr ## Fake  True  ## 10686 15016

            mycounts <- mycounts[isexpr,] genes <- rownames(mycounts)  dim(mycounts)

            ## [1] 15016     viii

The limma package (since version iii.16.0) offers the voom function that will normalise read counts and apply a linear model to the normalised data before computing chastened t-statistics of differential expression.

            # load required libraries library(limma)  # check your samples grouping experiment_design.ord[colnames(mycounts),]$tissue == group

            ## [1] True TRUE True TRUE TRUE TRUE Truthful Truthful

            # create design matrix for limma design <- model.matrix(~0+group) # substitute "group" from the design cavalcade names colnames(pattern)<- gsub("group","",colnames(blueprint)) # check your design matrix blueprint

            ##   brain liver ## 1     ane     0 ## two     1     0 ## 3     0     1 ## 4     0     1 ## 5     0     1 ## 6     1     0 ## 7     1     0 ## viii     0     1 ## attr(,"assign") ## [1] 1 1 ## attr(,"contrasts") ## attr(,"contrasts")$group ## [1] "contr.treatment"

            # calculate normalization factors betwixt libraries nf <- calcNormFactors(mycounts) # # normalise the read counts with 'voom' office y <- voom(mycounts,design,lib.size=colSums(mycounts)*nf) # extract the normalised read counts counts.voom <- y$Eastward  # relieve normalised expression information into output dir write.table(counts.voom,file="~/counts.voom.txt",row.names=T,quote=F,sep="\t")  # fit linear model for each gene given a series of libraries fit <- lmFit(y,blueprint) # construct the dissimilarity matrix respective to specified contrasts of a set of parameters cont.matrix <- makeContrasts(liver-brain,levels=design) cont.matrix

            ##        Contrasts ## Levels  liver - brain ##   brain            -one ##   liver             1

            # compute estimated coefficients and standard errors for a given set of contrasts fit <- contrasts.fit(fit, cont.matrix)  # compute moderated t-statistics of differential expression past empirical Bayes moderation of the standard errors fit <- eBayes(fit) options(digits=3)  # check the output fit dim(fit)

            ## [one] 15016     1

The topTable function summarises the output from limma in a table format. Pregnant DE genes for a detail comparing can exist identified by selecting genes with a p-value smaller than a chosen cut-off value and/or a fold change greater than a chosen value in this table. By default the table will be sorted by increasing adjusted p-value, showing the most pregnant DE genes at the top.

            # gear up adapted pvalue threshold and log fold change threshold mypval=0.01 myfc=3  # get the coefficient proper name for the comparison  of interest colnames(fit$coefficients)

            ## [one] "liver - brain"

            mycoef="liver - brain" # become the output table for the 10 near meaning DE genes for this comparison topTable(fit,coef=mycoef)

            ##      logFC AveExpr    t  P.Value adj.P.Val    B ## 5265 10.45    7.33 86.7 1.89e-thirteen  1.37e-09 18.half dozen ## 3240 eleven.28    7.62 89.4 ane.46e-thirteen  ane.37e-09 18.5 ## 2335  6.82    eight.56 67.2 1.53e-12  2.93e-09 18.4 ## 213  11.77   x.69 69.3 1.19e-12  2.93e-09 xviii.3 ## 338  11.51    8.31 70.four 1.04e-12  2.93e-09 xviii.0 ## 2243 11.56    7.17 fourscore.0 3.64e-13  ane.37e-09 17.seven ## 716   7.49    6.75 61.0 3.38e-12  4.22e-09 17.vii ## 229  10.xc    vi.45 82.0 2.98e-thirteen  1.37e-09 17.4 ## 1571  9.64    6.62 63.8 2.33e-12  3.18e-09 17.iii ## 125  xi.54    vii.08 64.3 2.19e-12  3.18e-09 16.9

            # get the full table ("n = number of genes in the fit") limma.res <- topTable(fit,coef=mycoef,n=dim(fit)[1])  # get significant DE genes just (adjusted p-value < mypval) limma.res.pval <- topTable(fit,coef=mycoef,n=dim(fit)[one],p.val=mypval) dim(limma.res.pval)

            ## [one] 8183    6

            # become significant DE genes with low adjusted p-value high fold change limma.res.pval.FC <- limma.res.pval[which(abs(limma.res.pval$logFC)>myfc),] dim(limma.res.pval.FC)

            ## [1] 3044    half-dozen

            # write limma output table for significant genes into a tab delimited file filename = paste("~/DEgenes_",mycoef,"_pval",mypval,"_logFC",myfc,".txt",sep="") write.tabular array(limma.res.pval.FC,file=filename,row.names=T,quote=F,sep="\t")

Exercise: Limma

Get the number of DE genes between technical grouping 1 and technical group 2 (all Brain samples) with adj pvalue<0.01.

Create a new design matrix for limma with the technical replicate groups
Re-normalise the read counts with 'voom' part with new design matrix
Fit a linear model on these normalised data
Make the dissimilarity matrix respective to the new set up of parameters
Fit the contrast matrix to the linear model
Compute moderated t-statistics of differential expression
Become the output tabular array for the ten almost meaning DE genes for this comparison

Factor Notation

The note of EntrezGene IDs from RNAseq data can be washed using the BioMart database which contains many species including Human, Mouse, Zebrafish, Chicken and Rat.

            # get the Ensembl annotation for man genome library(biomaRt)

            ## Loading required packet: methods

            mart<- useDataset("hsapiens_gene_ensembl", useMart("ENSEMBL_MART_ENSEMBL",host="www.ensembl.org"))  # get entrez gene IDs from limma output table entrez_genes <- equally.character(rownames(limma.res.pval.FC)) length(entrez_genes)

            ## [1] 3044

            # interrogate the BioMart database to become gene symbol and clarification for these genes  detags.IDs <- getBM(  filters= "entrezgene",  attributes= c("entrezgene","hgnc_symbol","clarification"),  values= entrez_genes,  mart= mart )  dim(detags.IDs)

            ## [i] 2876    3

            head(detags.IDs)

            ##   entrezgene hgnc_symbol ## 1         10        NAT2 ## 2      10000        AKT3 ## three  100033415  SNORD116-3 ## 4  100033421  SNORD116-9 ## 5  100033427 SNORD116-15 ## 6  100033442  SNORD115-five ##                                                               clarification ## 1                N-acetyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:7646] ## 2         AKT serine/threonine kinase iii [Source:HGNC Symbol;Acc:HGNC:393] ## 3  small nucleolar RNA, C/D box 116-3 [Source:HGNC Symbol;Acc:HGNC:33069] ## 4  small nucleolar RNA, C/D box 116-9 [Source:HGNC Symbol;Acc:HGNC:33075] ## five small nucleolar RNA, C/D box 116-15 [Source:HGNC Symbol;Acc:HGNC:33081] ## 6  small nucleolar RNA, C/D box 115-5 [Source:HGNC Symbol;Acc:HGNC:33024]

In many cases, several annotations ar available per entrez factor ID. This results in duplicate entries in the output table from getBM. The simplest manner to bargain with this result is to remove duplicates, although they can also be concatenated in some means.

Once the note has been obtained for all DE genes, this table can be merged with the output tabular array from limma for a complete result and an easier interpretation.

            # remove duplicates detags.IDs.matrix<-detags.IDs[-which(duplicated(detags.IDs$entrezgene)),] # select genes of interest merely rownames(detags.IDs.matrix)<-detags.IDs.matrix$entrezgene entrez_genes.annot <- detags.IDs.matrix[every bit.character(entrez_genes),] # join the two tables rownames(limma.res.pval.FC) <- limma.res.pval.FC$ID limma.res.pval.FC.annot <- cbind(entrez_genes.annot,limma.res.pval.FC) # check the annotated tabular array head(limma.res.pval.FC.annot)

            ##      entrezgene hgnc_symbol ## 5265       5265    SERPINA1 ## 3240       3240          HP ## 2335       2335         FN1 ## 213         213         ALB ## 338         338        APOB ## 2243       2243         FGA ##                                                      description logFC ## 5265 serpin family A member ane [Source:HGNC Symbol;Acc:HGNC:8941] 10.45 ## 3240              haptoglobin [Source:HGNC Symbol;Acc:HGNC:5141] 11.28 ## 2335            fibronectin 1 [Source:HGNC Symbol;Acc:HGNC:3778]  six.82 ## 213                    albumin [Source:HGNC Symbol;Acc:HGNC:399] xi.77 ## 338           apolipoprotein B [Source:HGNC Symbol;Acc:HGNC:603] 11.51 ## 2243   fibrinogen alpha chain [Source:HGNC Symbol;Acc:HGNC:3661] 11.56 ##      AveExpr    t  P.Value adj.P.Val    B ## 5265    7.33 86.vii i.89e-13  i.37e-09 eighteen.6 ## 3240    vii.62 89.4 1.46e-13  1.37e-09 eighteen.5 ## 2335    eight.56 67.2 1.53e-12  2.93e-09 18.4 ## 213    10.69 69.3 i.19e-12  2.93e-09 18.iii ## 338     viii.31 70.4 i.04e-12  two.93e-09 18.0 ## 2243    vii.17 80.0 3.64e-13  1.37e-09 17.7

Factor Set up Enrichment

Cistron Ontology (GO) enrichment is a method for investigating sets of genes using the Gene Ontology system of classification, in which genes are assigned to a particular set of terms for three major domains: cellular component, biological procedure and molecular function.

The GOstats bundle can test for both over and under representation of GO terms using the Hypergeometric examination. The output of the analysis is typically a ranked list of GO terms, each associated with a p-value.

The Hypergeometric test will crave both a list of selected genes (i.e. your DE genes) and a "universe" list (e.thousand. all genes annotated in the genome you lot are working with), all represented by their "EntrezGene" ID.

            # load the library library(GOstats)

            ## Loading required package: Biobase

            ## Loading required parcel: BiocGenerics

            ## Loading required package: parallel

            ##  ## Attaching parcel: 'BiocGenerics'

            ## The post-obit objects are masked from 'package:parallel': ##  ##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, ##     clusterExport, clusterMap, parApply, parCapply, parLapply, ##     parLapplyLB, parRapply, parSapply, parSapplyLB

            ## The following object is masked from 'packet:limma': ##  ##     plotMA

            ## The following objects are masked from 'package:stats': ##  ##     IQR, mad, xtabs

            ## The following objects are masked from 'packet:base': ##  ##     anyDuplicated, suspend, as.data.frame, cbind, colnames, ##     practise.telephone call, duplicated, eval, evalq, Filter, Find, get, grep, ##     grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, ##     match, mget, social club, paste, pmax, pmax.int, pmin, pmin.int, ##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, ##     sort, tabular array, tapply, union, unique, unsplit, which, which.max, ##     which.min

            ## Welcome to Bioconductor ##  ##     Vignettes comprise introductory material; view with ##     'browseVignettes()'. To cite Bioconductor, see ##     'commendation("Biobase")', and for packages 'citation("pkgname")'.

            ## Loading required package: Category

            ## Loading required package: stats4

            ## Loading required package: AnnotationDbi

            ## Loading required package: IRanges

            ## Loading required packet: S4Vectors

            ##  ## Attaching parcel: 'S4Vectors'

            ## The following objects are masked from 'packet:base': ##  ##     colMeans, colSums, expand.grid, rowMeans, rowSums

            ## Loading required package: Matrix

            ##  ## Attaching parcel: 'Matrix'

            ## The following object is masked from 'packet:S4Vectors': ##  ##     expand

            ## Loading required package: graph

##

            ##  ## Attaching package: 'GOstats'

            ## The following object is masked from 'package:AnnotationDbi': ##  ##     makeGOGraph

            # Ascertain list of genes of involvement (DE genes - EntrezGene IDs) entrezgeneids <- equally.character(rownames(limma.res.pval.FC)) length(entrezgeneids)

            ## [one] 3044

            # Ascertain the universe universeids <- rownames(mycounts) length(universeids)

            ## [1] 15016

Before running the hypergeometric test with the hyperGTest function, you need to define the parameters for the examination (cistron lists, ontology, examination management) also as the annotation database to be used. The ontology to be tested can exist any of the three GO domains: biological process ("BP"), cellular component ("CC") or molecular function ("MF").

In the instance below we will test for over-represented biological processes in our list of differentially expressed genes.

            # define the p-value cut off for the hypergeometric test hgCutoff <- 0.05 params <- new("GOHyperGParams",annotation="org.Hs.eg",geneIds=entrezgeneids,universeGeneIds=universeids,ontology="BP",pvalueCutoff=hgCutoff,testDirection="over")

            ## Loading required package: org.Hs.eg.db

##

            ## Alarm in makeValidParams(.Object): removing geneIds not in ## universeGeneIds

            #  Run the test hg <- hyperGTest(params) # Check results hg

            ## Gene to Become BP  test for over-representation  ## 9211 Go BP ids tested (3527 accept p < 0.05) ## Selected gene set size: 1530  ##     Gene universe size: 12138  ##     Note package: org.Hs.eg

You lot can go the output table from the test for significant GO terms merely past adjusting the pvalues with the p.arrange function.

            ## Get the p-values of the test hg.pv <- pvalues(hg) ## Adjust p-values for multiple examination (FDR) hg.pv.fdr <- p.adjust(hg.pv,'fdr') ## select the GO terms with adjusted p-value less than the cut off sigGO.ID <- names(hg.pv.fdr[hg.pv.fdr < hgCutoff]) length(sigGO.ID)

            ## [1] 2518

            # get table from HyperG test result df <- summary(hg) # keep but significant Go terms in the table GOannot.table <- df[df[,ane] %in% sigGO.ID,] caput(GOannot.tabular array)

            ##       GOBPID   Pvalue OddsRatio ExpCount Count Size ## i GO:0042221 6.43e-79      iii.01      351   658 2783 ## ii GO:0050896 one.43e-69      ii.69      721  1041 5723 ## three GO:0044699 ane.12e-65      4.14     1174  1413 9312 ## 4 Get:0032501 3.05e-64      2.54      573   877 4542 ## 5 GO:0044707 6.88e-63      2.53      533   831 4226 ## vi Get:0044710 2.06e-lx      2.56      406   683 3224 ##                                    Term ## 1                  response to chemical ## 2                  response to stimulus ## three               single-organism procedure ## four      multicellular organismal process ## 5 unmarried-multicellular organism process ## half dozen     single-organism metabolic process

The Gene Ontology enrichment outcome can be saved in a text file or an html file for future reference.

            # Create text report of the significantly over-represented Become terms write.table(GOannot.table,file="~/GOterms_OverRep_BP.txt",sep="\t",row.names=F) # Create html report of all over-represented GO terms htmlReport(hg, file="~/GOterms_OverRep_BP.html")

Exercise: GOstats

Identify the GO terms in the Molecular Function domain that are over-represented (pvalue<0.01) in your list of DE genes.

Get your list of DE genes (Entrez Gene IDs)
Set the new parameters for the hypergeometric test
Run the exam and adjust the pvalues in the output object
Identify the significant Go terms at pvalue 0.01

Other softwares tin exist used to investigate over-represented pathways, such equally GeneGO (https://portal.genego.com/) and Ingenuity (http://world wide web.ingenuity.com/products/ipa). The reward of these softwares is that they maintain curated and upwards-to-date extensive databases. They also provide intuitive visualisation and network modelling tools.

You can relieve an image of your RNAseq assay before moving on to the next part of this grade.

              RDataFile <- "~/RNAseq_DE_analysis_with_R.RData" relieve.image(RDataFile)

Record package and version info with `sessionInfo()`

The sessionInfo part prints version information about R and any attached packages. Information technology'south a skilful do to always run this command at the end of your R session and tape it for the sake of reproducibility in the futurity.

              sessionInfo()

              ## R version 3.3.two (2016-ten-31) ## Platform: x86_64-pc-linux-gnu (64-chip) ## Running nether: Ubuntu 16.04.one LTS ##  ## locale: ##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               ##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8     ##  [v] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8    ##  [7] LC_PAPER=en_AU.UTF-viii       LC_NAME=C                  ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C             ## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C        ##  ## attached base packages: ## [one] stats4    parallel  methods   stats     graphics  grDevices utils     ## [viii] datasets  base of operations      ##  ## other attached packages: ##  [1] Go.db_3.4.0          org.Hs.eg.db_3.four.0   GOstats_2.twoscore.0       ##  [4] graph_1.52.0         Category_2.xl.0      Matrix_1.2-eight         ##  [seven] AnnotationDbi_1.36.2 IRanges_2.8.ane        S4Vectors_0.12.1     ## [10] Biobase_2.34.0       BiocGenerics_0.20.0  biomaRt_2.30.0       ## [13] edgeR_3.16.v         limma_3.xxx.11        Rsubread_1.24.1      ##  ## loaded via a namespace (and not attached): ##  [i] Rcpp_0.12.9            bitops_1.0-6           tools_3.iii.2            ##  [4] digest_0.6.12          annotate_1.52.1        RSQLite_1.1-two          ##  [vii] evaluate_0.x          memoise_1.0.0          lattice_0.xx-34        ## [10] DBI_0.v-i              yaml_2.1.14            genefilter_1.56.0      ## [thirteen] stringr_1.i.0          knitr_1.15.one           locfit_1.5-9.1         ## [sixteen] rprojroot_1.2          grid_3.3.2             GSEABase_1.36.0        ## [xix] XML_3.98-1.5           RBGL_1.50.0            survival_2.40-one        ## [22] rmarkdown_1.iii          magrittr_1.5           backports_1.0.five        ## [25] htmltools_0.3.v        splines_3.iii.two          AnnotationForge_1.xvi.1 ## [28] xtable_1.8-two           stringi_1.1.2          RCurl_1.95-4.8