Next Generation Sequencing Data Analysis

Recent advances in Next Generation Sequencing (NGS) technology have made it possible to sequence the genomes of many individuals in less time and at lower cost than ever before. This has given rise to several large-scale sequencing projects, including the 1000 Genomes Project, The Cancer Genome Atlas, the International Cancer Genome Consortium and ENCODE.

At the other end of the spectrum, even relatively small labs can nowadays afford to sequence their samples. Together, these developments have led to a dramatic increase in the amount of NGS data, and appropriate bioinformatics approaches are needed to analyze the wealth of data available.

We develop and apply tools for the analysis of high-throughput sequencing data, from processing of raw data and mapping of reads to downstream statistical and bioinformatics analysis. Below are several examples of projects analyzed in the group:

Exome sequencing

Sequencing only the exonic regions of genes reduces sequencing costs. Since mutations within these regions are very likely to affect normal gene function, the technology provides a great opportunity to search for SNPs responsible for diseases. This approach has already proven effective in detecting potential causes of Mendelian disorders, for example Miller Syndrome (Ng SB et al., 2010), Freeman-Sheldon Syndrome (Ng SB et al., 2009) and Kabuki Syndrome (Ng SB et al., 2010b). Despite the high number of hallmark studies on exome sequencing data published in the last two years, computational methods and pipelines for exome NGS data analysis still lack “gold standards” for particular platforms and for concrete biological applications.

We develop a pipeline to analyze exome NGS data obtained using the ABI SOLiD and Illumina platforms. The aim of the project is to detect mutations causing a rare Mendelian disorder. The pipeline takes the raw reads as input and returns a short list of candidate genes. The following steps of the data analysis are carried out:

  1. Read processing and quality control
  2. Sequence alignment in the color space, alignment quality statistics calculation
  3. Variant detection, annotation and filtering
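
The filtering part of step 3 can be illustrated with a minimal Python sketch. It is not the group's pipeline: the Variant record, the consequence labels, the thresholds and the function name candidate_genes are all hypothetical, and the data are made up. The sketch assumes that variants have already been called and annotated, and it keeps only genes in which every affected individual carries a rare, potentially damaging variant of sufficient call quality, which is one common way to shorten a candidate list for a rare Mendelian disorder.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical consequence labels; a real annotation tool defines its own vocabulary.
DAMAGING = {"missense", "nonsense", "splice_site", "frameshift"}


@dataclass(frozen=True)
class Variant:
    sample: str        # affected individual carrying the variant
    gene: str          # gene the variant was annotated to
    chrom: str
    pos: int
    alt: str
    consequence: str   # predicted effect on the protein
    pop_freq: float    # allele frequency in a reference population (0.0 if unseen)
    quality: float     # quality score reported by the variant caller


def candidate_genes(variants, affected, max_freq=0.01, min_quality=30.0):
    """Return genes with a rare, potentially damaging variant in every affected sample.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    carriers = defaultdict(set)
    for v in variants:
        if v.quality < min_quality:        # drop low-confidence calls
            continue
        if v.pop_freq > max_freq:          # drop common polymorphisms
            continue
        if v.consequence not in DAMAGING:  # drop e.g. synonymous changes
            continue
        carriers[v.gene].add(v.sample)
    required = set(affected)
    return sorted(gene for gene, samples in carriers.items() if required <= samples)


# Made-up example data for two affected individuals, P1 and P2.
variants = [
    Variant("P1", "GENE_A", "chr1", 100, "T", "missense", 0.0, 50.0),
    Variant("P2", "GENE_A", "chr1", 250, "A", "nonsense", 0.001, 45.0),
    Variant("P1", "GENE_B", "chr2", 300, "G", "synonymous", 0.0, 60.0),
    Variant("P2", "GENE_B", "chr2", 300, "G", "synonymous", 0.0, 60.0),
    Variant("P1", "GENE_C", "chr3", 400, "C", "missense", 0.20, 55.0),
    Variant("P2", "GENE_C", "chr3", 400, "C", "missense", 0.20, 55.0),
    Variant("P1", "GENE_D", "chr4", 500, "A", "frameshift", 0.0, 40.0),
]

print(candidate_genes(variants, ["P1", "P2"]))  # prints ['GENE_A']
```

Here GENE_B is dropped because its variants are synonymous, GENE_C because the variant is common in the reference population, and GENE_D because only one of the two affected individuals carries a qualifying variant. Requiring a hit in the same gene rather than at the same position allows different causal mutations in different individuals.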

Microbiome profiling

Deep sequencing can also be used to analyze multimicrobial colonization and to establish a microbiome profile of individual patients. This plays an important role, for example, in studies of the oral cavity, of vaginal infections and of cystic fibrosis, a genetic disease affecting the lungs that is characterized by abnormal ion transport, which leads to viscous secretions that can easily harbour microbial infections. Microbiome profiling through NGS uses the variability in certain regions of microbial genomes, specifically the ribosomal DNA, to identify different species or genera. In the first step, microbial nucleic acids are extracted from patient samples; then only the variable region is amplified and sequenced using NGS. With this method, the microbial species contributing to the infection at different stages of disease or under different treatments can be analyzed. We develop workflows to process NGS sequences, including quality control, and to map sequences onto reference databases. Furthermore, we develop methods to statistically analyze, classify and compare microbiome profiles of different patient groups, based on Illumina data.
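
As an illustration of how microbiome profiles can be compared, the following Python sketch computes two widely used summary measures from per-taxon read counts: the Shannon diversity of a single sample and the Bray-Curtis dissimilarity between two samples. It is a sketch under stated assumptions, not the group's method: the function names and the read counts are hypothetical, and it assumes reads have already been assigned to taxa by mapping onto a reference database.

```python
import math


def relative_abundance(counts):
    """Convert per-taxon read counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no assigned reads")
    return {taxon: n / total for taxon, n in counts.items() if n > 0}


def shannon_diversity(counts):
    """Shannon index H = -sum(p * ln p); 0 for a single taxon, higher for richer, more even profiles."""
    return -sum(p * math.log(p) for p in relative_abundance(counts).values())


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity on relative abundances: 0 = identical profiles, 1 = no shared taxa."""
    pa = relative_abundance(counts_a)
    pb = relative_abundance(counts_b)
    taxa = set(pa) | set(pb)
    # With both profiles normalised to 1, the dissimilarity reduces to 1 - sum of minima.
    shared = sum(min(pa.get(t, 0.0), pb.get(t, 0.0)) for t in taxa)
    return 1.0 - shared


# Made-up genus-level read counts for two hypothetical patients.
patient_1 = {"Pseudomonas": 80, "Staphylococcus": 20}
patient_2 = {"Pseudomonas": 20, "Staphylococcus": 20, "Streptococcus": 60}

print("Shannon diversity, patient 1:", round(shannon_diversity(patient_1), 3))
print("Shannon diversity, patient 2:", round(shannon_diversity(patient_2), 3))
print("Bray-Curtis dissimilarity:", round(bray_curtis(patient_1, patient_2), 3))
```

Normalising to relative abundances before comparison is a simple way to account for different sequencing depths between samples; other normalisation schemes exist and may be preferable depending on the data.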

Epigenetic changes in ageing skin

In this project, we use genome-wide ChIP-Seq and RNA-Seq to identify epigenetic changes in ageing skin samples and to study the effect of methylation changes on the expression of genes in cis and in trans. The project is a collaboration with Beiersdorf AG and the German Cancer Research Center; its ultimate aim is to identify alterations associated with skin ageing in order to develop new compounds that reverse or at least delay the ageing process.