epiChoose: Data-driven cell model choice: Beta

epiChoose is an application for quantifying the relatedness between cell lines and primary cells. For a Quick Start, have a look at the tutorial, or see the points here:

  1. See what cell lines and primary cells have been profiled in Data Overview.
  2. Explore relatedness between samples for specific gene selections (gene lists, ontologies, etc.) in Data Exploration.
  3. Select the data type and samples to download (BAM or BigWig) in Download.
  4. Explore integrating across epigenetic and expression data to define relatedness in Data Integration (this section is still to come).
  5. Explore pairwise analysis for THP-1/U937 conditions in Pairwise Comparisons.

For some background, please read below …


Overview

In an effort to make choosing a cell line model for a particular cell type or biology a more data-driven decision, Open Targets have profiled a number of commonly used cell lines and relevant primary cells. The following assays have been used for this profiling:

  • RNA-seq
  • ChIP-seq (H3K27ac, H3K4me3, H3K27me3, CTCF)
  • ATAC-seq
  • SNP6

The cell lines and cell types that have been profiled have been sub-divided by tissue:

  • Project 1 - Haematopoietic models: K562, KU812, MEG01, F36P, UT7, HEL9217
  • Project 2 - Monocyte-macrophage differentiation models: THP-1 and U937 (with PMA/VD3/LPS stimulation)
  • Project 3 - Lung models: A549 and BEAS-2B, along with NHBEs (human bronchial epithelial cells) from 3 donors
  • Project 4 - Liver models: HEP3B, HEPARG, HEPG2 (2D and 3D cultures), along with primary hepatocytes
  • Project 5 - T-cell models: Jurkat and HUT-78 (+/- CD3 and CD28)
  • Project 6 - B-cell models: Ramos, Toledo, Raji, ARH77, Namalwa
  • Blueprint - For comparison, a large number of haematopoietic primary samples have been imported from Blueprint
  • ENCODE - Cell lines and primary cells that were profiled with the relevant assays were also imported

Method

One of the goals of this project is to integrate across epigenetic and expression data to better understand relatedness between cell types and their potential cell line models. To perform this integration, each data type had to be processed so that information can be compared across data types. The workflow above shows how this was performed for the epigenetic (non-expression) data:

  1. For each data type, the cell lines and cell types to compare are chosen …
  2. Using the signal file for that particular data type and cell, the Area Under the Curve (AUC) is calculated across regions of interest (ROI).
  3. When this is done across all cells of interest, a matrix is produced for that data type where each row is a cell type / cell line, each column is an ROI, and each entry contains the AUC for that data type.
  4. Hence, each data type produces a matrix that can be compared and contrasted using further statistical and analysis techniques (e.g. PCA).
  5. The epigenetic data types will produce differently dimensioned matrices compared to RNA-seq (each column in an epigenetic matrix will be a regulatory region in the genome, whereas each column will be a gene for the RNA-seq matrix). Hence, in order to compare and/or combine epigenetic and expression data, the epigenetic data needs to be centred on genes. This process is outlined in Expression and Epigenetic Comparisons below.
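Steps 2 and 3 above can be sketched in a few lines. This is a minimal illustration, not the epiChoose implementation: it assumes the signal is available as one value per base pair (so the AUC of a region is the sum of its values), and the names `region_auc`, `auc_matrix` and the toy tracks are hypothetical.

```python
from typing import Dict, List, Tuple

# A region of interest as a half-open interval [start, end) in base pairs.
Region = Tuple[int, int]


def region_auc(signal: List[float], region: Region) -> float:
    """Area under a per-base signal track across one region.

    Assumption: one signal value per base pair, so the area is the sum.
    """
    start, end = region
    return float(sum(signal[start:end]))


def auc_matrix(
    signals: Dict[str, List[float]], rois: List[Region]
) -> Tuple[List[str], List[List[float]]]:
    """Build the sample-by-ROI matrix: one row per sample, one column per ROI."""
    samples = list(signals)  # keep the order in which samples were given
    matrix = [[region_auc(signals[s], roi) for roi in rois] for s in samples]
    return samples, matrix


# Toy signal tracks for one data type (values are made up for illustration).
signals = {
    "THP-1": [0, 1, 2, 3, 0, 0, 4, 4, 0, 0],
    "monocyte": [0, 2, 2, 2, 0, 0, 1, 1, 0, 0],
}
rois = [(1, 4), (6, 8)]

samples, matrix = auc_matrix(signals, rois)
for name, row in zip(samples, matrix):
    print(name, row)
```

Each data type would yield one such matrix over the same rows, which is what makes downstream techniques such as PCA applicable.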

Expression and Epigenetic Comparisons

Using the method above produces a set of matrices that are comparable across the epigenetic datasets but not against the expression (RNA-seq) data. That is, the ROIs for the epigenetic data types could be inter- and intragenic regions (e.g. promoters and enhancers), whereas, by definition, the ROIs for the RNA-seq can only be the genes themselves.

In order to compare the epigenetic and expression data, each column of every data type matrix must be a gene, hence the epigenetic ROIs must be mapped to genes. The epigenetic assays that we have used (H3K27ac, H3K4me3, H3K27me3, CTCF, ATAC-seq) are the “footprint”, or markers, of regulatory activity in that cell type. These features can affect genes that are either in close proximity or distal to them. Ideally, functional assays such as Chromosome Conformation Capture would be used to map the regulatory regions pinpointed by these epigenetic assays to their target genes. However, a good approximation can be made by summarising the data type signal across the (Ensembl) regulatory regions around a gene. Different ways of summarising the signal will suit different data types (see the next section, Selecting collapse-to-gene method per data type). Whatever the summarisation method used, the end result is that each gene is given an epigenetic score, which can then be compared and contrasted with the expression value for that gene.
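Once every gene has an epigenetic score, one simple way to compare it with expression is a rank correlation over the genes present in both datasets. The sketch below uses Spearman correlation purely as an assumed example of such a comparison; it is not stated to be what epiChoose does, and the function names and toy values are hypothetical.

```python
from typing import Dict, List


def ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_by_gene(epi: Dict[str, float], expr: Dict[str, float]) -> float:
    """Spearman correlation across the genes shared by both datasets."""
    genes = sorted(set(epi) & set(expr))
    return pearson(ranks([epi[g] for g in genes]), ranks([expr[g] for g in genes]))


# Toy per-gene values for a single sample (made up for illustration).
epi_score = {"GATA1": 9.0, "SPI1": 4.0, "CD19": 1.0, "ALB": 0.5}
expression = {"GATA1": 120.0, "SPI1": 30.0, "CD19": 2.0, "ALB": 0.1, "TP53": 50.0}

rho = spearman_by_gene(epi_score, expression)
print(f"Spearman rho across shared genes: {rho:.2f}")
```

A rank-based measure is used here because AUC-derived scores and expression values sit on different scales; any other comparison could be substituted.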


Selecting collapse-to-gene method per data type

We have implemented four different methods to summarise the epigenetic signal around a gene:

  1. Max region across gene body - Using a 2Kb window either side of the gene, which is the regulatory region with the greatest AUC value?
  2. TSS only - The AUC across the transcriptional start site (TSS-/+2Kb) only
  3. Sum of regions across gene body - The sum of the AUC for each regulatory region in the gene body (-/+2Kb either side of the gene)
  4. Sum of 10 closest regions - The sum of the AUC for the 10 closest regulatory regions to the gene
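The four methods above can be sketched as follows. This is an illustrative outline under stated assumptions, not the epiChoose code: the `Region` and `Gene` classes and all function names are hypothetical, coordinates are half-open intervals, and the TSS-only method is approximated by summing the AUC of regulatory regions that overlap the TSS window rather than integrating the raw signal across it.

```python
from dataclasses import dataclass
from typing import List

WINDOW = 2000  # 2 Kb flank, as used in the methods above


@dataclass
class Region:
    start: int
    end: int
    auc: float  # AUC already computed for this regulatory region


@dataclass
class Gene:
    start: int
    end: int
    tss: int


def overlapping(regions: List[Region], lo: int, hi: int) -> List[Region]:
    """Regions that overlap the half-open interval [lo, hi)."""
    return [r for r in regions if r.start < hi and r.end > lo]


def distance(region: Region, gene: Gene) -> int:
    """Gap in base pairs between a region and the gene body (0 if they overlap)."""
    return max(0, gene.start - region.end, region.start - gene.end)


def max_region_gene_body(gene: Gene, regions: List[Region]) -> float:
    hits = overlapping(regions, gene.start - WINDOW, gene.end + WINDOW)
    return max((r.auc for r in hits), default=0.0)


def tss_only(gene: Gene, regions: List[Region]) -> float:
    # Approximation (assumption): sum region AUCs overlapping TSS -/+ 2 Kb.
    return float(sum(r.auc for r in overlapping(regions, gene.tss - WINDOW, gene.tss + WINDOW)))


def sum_regions_gene_body(gene: Gene, regions: List[Region]) -> float:
    return float(sum(r.auc for r in overlapping(regions, gene.start - WINDOW, gene.end + WINDOW)))


def sum_closest_regions(gene: Gene, regions: List[Region], n: int = 10) -> float:
    closest = sorted(regions, key=lambda r: distance(r, gene))[:n]
    return float(sum(r.auc for r in closest))


# Toy gene and regulatory regions (coordinates and AUCs are made up).
gene = Gene(start=10_000, end=20_000, tss=10_000)
regions = [
    Region(8_500, 9_000, 3.0),     # upstream, inside the 2 Kb flank
    Region(15_000, 15_500, 7.0),   # inside the gene body
    Region(21_000, 21_500, 2.0),   # downstream, inside the 2 Kb flank
    Region(30_000, 30_500, 10.0),  # distal
]

print(max_region_gene_body(gene, regions))
print(tss_only(gene, regions))
print(sum_regions_gene_body(gene, regions))
print(sum_closest_regions(gene, regions))
```

Note that only the last method can pick up distal regions such as the one at 30 Kb, since the others are restricted to the gene body and its flanks.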

By default, Max region across gene body is chosen for H3K27ac, H3K4me3, and H3K27me3. Because ATAC and CTCF are more diffuse signals around the gene, Sum of 10 closest regions is chosen as default for these two data types. Each collapse-to-gene method can be changed for each data type below:

This is an overview of all the tissue-types profiled as part of epiChoose. As well as the 6 projects (detailed in 'Introduction'), relevant cell lines and cell types have been imported from Blueprint and ENCODE. Either PCA or MDS can be used to examine genome-wide relatedness between the cell types.

This tab enables gene/gene-set analysis of cell line / cell type differences. More details about each parameter can be viewed on mouse-over. New plots are generated only when 'Update Parameters' (at bottom) is selected.