Smoking cigarettes and exposure to cigarette smoke are known to negatively affect a person's health.
To learn about expression microarray analysis, we analyzed a publicly available dataset from the Gene Expression Omnibus (GEO), accession GSE8823. In this experiment, investigators compared gene expression profiles of bronchoalveolar lavage fluid obtained from 11 non-smokers versus 13 smokers (mean smoking history of 36 pack-years). The scientific paper reporting the results of this experiment can be found here.
We used the information that the study authors deposited in GEO to learn some basics about the experiment's study design, including what tissue was studied and the characteristics of the individuals selected for the study.
Download Phenotype Data
From GEO, we also found out which microarray chip was used to obtain the gene expression data (the Affymetrix HG-U133 Plus 2.0). Next, we obtained 24 files, one per individual, each containing the image intensities captured across that individual's chip.
For this tutorial, we have already analyzed the data and provide the results so that you can explore gene expression data yourself. If you are interested in analyzing gene expression microarray data on your own, take a look at raved, the pipeline we used to analyze the data shown here. You can perform a search in GEO and choose any microarray dataset that is of interest to you.
This app was created with RStudio's Shiny.
Univariate Analysis
Categorical Variables
Here, we can explore the phenotype data attributes. First, let's look at the distribution of all the variables using barplots. A barplot (or bar chart) is one of the most common types of plot. It shows the relationship between a numerical variable and a categorical variable.
Using the dropdown menu, you can select the variable of choice to generate different barplots.
In the barplot loaded below, we can see the distribution of the main outcome variable, treatment, shown both as counts and as percentages of the dataset. There are slightly more smokers than non-smokers in the dataset.
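The counts and percentages behind such a barplot can be sketched in a few lines of Python. This is a minimal illustration, not the app's own code: the `treatment` list is a hypothetical stand-in built from the group sizes stated above (11 non-smokers, 13 smokers), and `summarize` is a helper name chosen for this example.

```python
from collections import Counter

# Hypothetical labels that mirror the group sizes given in the text
# (11 non-smokers, 13 smokers); they are not read from the real phenotype file.
treatment = ["non-smoker"] * 11 + ["smoker"] * 13


def summarize(labels):
    """Return {category: (count, percentage of all labels)}."""
    counts = Counter(labels)
    total = len(labels)
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}


summary = summarize(treatment)
for cat, (n, pct) in sorted(summary.items()):
    # Each bar in the barplot has a height equal to n (or pct).
    print(f"{cat}: {n} samples ({pct}%)")
```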
Next, we can explore the phenotype data attributes of the selected variable with respect to the outcome variable groups - smoker versus non-smoker.
Using the dropdown menu, you can select the variable of choice to generate different barplots.
In the barplots loaded below, we can see the distribution of the treatment groups split by the sex of the samples. Most of the females are non-smokers, while among the males there are slightly more smokers than non-smokers.
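Splitting one categorical variable by another amounts to a cross-tabulation. The sketch below shows the idea under stated assumptions: the `samples` pairs are invented for illustration and do not reproduce the real per-sample values, and `cross_tabulate` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical (sex, treatment) pairs for illustration only; the real
# per-sample values come from the phenotype data.
samples = [
    ("female", "non-smoker"), ("female", "non-smoker"), ("female", "smoker"),
    ("male", "smoker"), ("male", "smoker"), ("male", "non-smoker"),
]


def cross_tabulate(pairs):
    """Count how often each (row category, column category) pair occurs."""
    table = defaultdict(lambda: defaultdict(int))
    for row_key, col_key in pairs:
        table[row_key][col_key] += 1
    return {row: dict(cols) for row, cols in table.items()}


table = cross_tabulate(samples)
for sex, groups in sorted(table.items()):
    # One group of bars per sex, one bar per treatment group.
    print(sex, groups)
```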
Continuous Variables
Next, we can explore the distribution of continuous variables using histograms.
A histogram shows the distribution of numerical data using a single variable as input. The variable is cut into multiple bins, and the height of each bar represents the number of observations in that bin.
In the plot loaded below, we are looking at the distribution of the only continuous variable available in the dataset, age. The majority of samples are from donors between 40 and 50 years old.
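The binning step behind a histogram can be sketched as follows. The `ages` values are hypothetical, and the bin start, width, and number of bins are choices made for this example rather than the settings the app uses.

```python
def histogram(values, start, width, n_bins):
    """Count values in n_bins equal-width bins.

    Bin i covers the half-open interval [start + i*width, start + (i+1)*width).
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


# Hypothetical donor ages, for illustration only.
ages = [23, 31, 38, 41, 44, 45, 47, 49, 52, 58]
counts = histogram(ages, start=20, width=10, n_bins=4)

for i, n in enumerate(counts):
    low = 20 + 10 * i
    # The bar height for each bin is the number of observations in it.
    print(f"{low}-{low + 10}: {'#' * n}")
```

Changing the bin width changes the shape of the histogram, so it is worth trying a few widths before drawing conclusions from the plot.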
Bivariate Analysis
Continuous Variable vs. Categorical Variable
Next, we can explore the relationship between continuous and categorical variables of interest using boxplots.
Using the dropdown menu, you can select the variable of choice to generate different boxplots.
A boxplot summarizes a set of numerical values. The line in the middle of the box denotes the median, while the upper and lower edges of the box denote the upper (75th percentile) and lower (25th percentile) quartiles.
In the loaded boxplot below, we are looking at the distribution of age split by the treatment groups. Here, we can see that the median age of smokers is greater than that of non-smokers.
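The three values that define the box can be computed directly. This sketch uses Python's `statistics.quantiles` with the `inclusive` method; plotting tools differ slightly in how they interpolate quartiles, so the numbers may not match a given plot exactly. The ages below are hypothetical.

```python
import statistics


def box_summary(values):
    """Return (lower quartile, median, upper quartile) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3


# Hypothetical ages per treatment group, for illustration only.
ages_by_group = {
    "non-smoker": [30, 35, 40, 45, 50],
    "smoker": [38, 42, 46, 50, 54],
}

summaries = {group: box_summary(ages) for group, ages in ages_by_group.items()}
for group, (q1, median, q3) in summaries.items():
    print(f"{group}: Q1={q1}, median={median}, Q3={q3}")
```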
Normalize the raw gene expression data using the robust multi-array average (RMA) method.
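RMA combines background correction, quantile normalization across arrays, and summarization of probe intensities into one log2 value per probe set. The sketch below illustrates only the quantile normalization step followed by a log2 transform; it is not a full RMA implementation, it breaks ties by original position rather than averaging them, and the intensities are made up.

```python
import math


def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equal-length lists, one list of intensities per array.
    The k-th smallest value of each array is replaced by the mean of the
    k-th smallest values across all arrays.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(a[k] for a in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical raw intensities for two arrays and three probes.
raw = [[100.0, 400.0, 200.0], [300.0, 150.0, 600.0]]
normalized = quantile_normalize(raw)
log2_normalized = [[math.log2(v) for v in a] for a in normalized]
print(log2_normalized)
```

After this step every array contains the same set of values, while the ranking of probes within each array is preserved.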
Principal component analysis (PCA) represents the information in the expression dataset in a reduced number of dimensions. Clustering and PCA plots make it possible to assess to what extent arrays resemble each other, and whether this corresponds to the known resemblances of the samples.
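To make the idea of dimension reduction concrete, the sketch below finds the first principal component of a tiny expression matrix using power iteration on the covariance matrix. This is a teaching sketch using only the standard library; real analyses use an optimized PCA routine. The expression values are invented, and the fixed starting vector is an assumption that would fail if it happened to be orthogonal to the leading direction.

```python
import math


def first_principal_component(samples, n_iter=200):
    """Return (direction, scores) for the first principal component.

    samples: list of rows, one row per array and one column per probe.
    """
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Sample covariance matrix between probes.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Power iteration: repeated multiplication converges to the direction
    # of largest variance (assumes the start vector is not orthogonal to it).
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the component.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores


# Hypothetical log2 expression values: 4 arrays (rows) x 3 probes (columns).
expr = [
    [8.0, 5.0, 3.0],
    [8.2, 5.1, 3.1],
    [5.0, 8.0, 3.0],
    [5.1, 8.2, 2.9],
]
direction, scores = first_principal_component(expr)
print(scores)
```

In this toy example the first two arrays receive scores of one sign and the last two the opposite sign, which is how similar arrays end up close together on a PCA plot.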
The log2-transformed/normalized intensity distributions of all samples (arrays) are expected to have similar scales (i.e. similar positions and widths of the boxes). Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ka) between the log-intensity distribution of one array and that of the pooled array data; an array whose Ka lies beyond the upper whisker of the Ka values across all arrays is designated as an outlier.
The intensity curves of all samples (arrays) are expected to have similar shapes and ranges. Samples whose curves deviate from the rest are likely to come from problematic experiments. For example, high levels of background will shift an array’s distribution to the right. Lack of signal diminishes its right tail. A bulge at the upper end of the intensity range often indicates signal saturation.
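The outlier rule described above can be sketched as follows. The assumptions are that the Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution of one array and that of the pooled data, and that the upper whisker is approximated by the usual boxplot fence of Q3 + 1.5 × IQR. The array names and log2 intensities are hypothetical.

```python
import bisect
import statistics


def ks_statistic(sample, pooled_sorted):
    """Largest absolute gap between two empirical cumulative distributions."""
    sample_sorted = sorted(sample)
    n, m = len(sample_sorted), len(pooled_sorted)
    gap = 0.0
    for x in sorted(set(sample_sorted) | set(pooled_sorted)):
        f_sample = bisect.bisect_right(sample_sorted, x) / n
        f_pooled = bisect.bisect_right(pooled_sorted, x) / m
        gap = max(gap, abs(f_sample - f_pooled))
    return gap


def find_outliers(arrays):
    """Return the names of arrays whose statistic exceeds Q3 + 1.5 * IQR."""
    pooled = sorted(v for values in arrays.values() for v in values)
    ka = {name: ks_statistic(values, pooled) for name, values in arrays.items()}
    q1, _, q3 = statistics.quantiles(ka.values(), n=4, method="inclusive")
    fence = q3 + 1.5 * (q3 - q1)
    return [name for name, value in ka.items() if value > fence]


# Hypothetical log2 intensities; array F is shifted to the right,
# as might happen with a high background.
arrays = {
    "A": [6.0, 7.0, 8.0, 9.0, 10.0],
    "B": [6.1, 7.1, 8.1, 9.1, 10.1],
    "C": [5.9, 6.9, 7.9, 8.9, 9.9],
    "D": [6.2, 7.2, 8.2, 9.2, 10.2],
    "E": [5.8, 6.8, 7.8, 8.8, 9.8],
    "F": [9.0, 10.0, 11.0, 12.0, 13.0],
}
outliers = find_outliers(arrays)
print(outliers)
```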
Volcano plot (probes with an adjusted p-value < 0.05 are shown in red)
Show top 50 probes sorted by p-value
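The text does not say how the p-values were adjusted. As a sketch, assume the commonly used Benjamini-Hochberg procedure: the code below adjusts a few made-up p-values, computes the conventional volcano plot coordinates (log2 fold change on the x-axis, -log10 of the p-value on the y-axis), and flags the probes that would be colored red. The probe names and values are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical results: (probe name, log2 fold change, raw p-value).
results = [
    ("probe_a", 2.1, 0.001),
    ("probe_b", -1.4, 0.02),
    ("probe_c", 0.9, 0.03),
    ("probe_d", 0.1, 0.5),
]
adjusted = benjamini_hochberg([p for _, _, p in results])

points = []
for (name, log2_fc, p), adj_p in zip(results, adjusted):
    color = "red" if adj_p < 0.05 else "grey"
    points.append((name, log2_fc, -math.log10(p), color))

red_probes = [name for name, _, _, color in points if color == "red"]
print(red_probes)
```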