Bioinformatics

Applied bioinformatics is one of the key scientific disciplines of the 21st century. State-of-the-art bioinformatic approaches are proving indispensable for developing innovative therapies that target the disease-specific molecular alterations of each individual patient. TRON's bioinformatics department is a multidisciplinary team of bioinformaticians, biotechnologists, mathematicians, physicists, and scientists from other related fields.

The Computational Medicine (CompMed) and Personalized Integrative Computational Genomics (PICG) teams develop novel methods and algorithms for processing and analyzing sequencing data in the context of cancer, autoimmunity, and infectious diseases. The teams work on internal research projects, on collaborative projects with external partners, and on applications in GxP-regulated areas.

The Data Management team, with expertise in high-performance computing (HPC) cluster infrastructures and database systems, builds database infrastructures and ensures smooth NGS operation and traceability through an in-house laboratory information management system (LIMS). Through our cooperation with the data center (Zentrum für Datenverarbeitung) of Johannes Gutenberg University Mainz, we have access to one of the most powerful HPC clusters in the world.

The Single Cell Genomics (SCG) team specializes in establishing standardized workflows for reproducible research at the single-cell level. Together with research groups from other specializations, the team uses state-of-the-art methods to generate precise new immunological insights. As with the activities of CompMed and PICG, the SCG team's research and development is used for internal studies but is also available to collaboration partners.

Tools developed and published by our group are available from our GitHub repository.

Our Technologies

ArtiFuse

Validation of fusion gene detection tools without simulated reads

Fusion genes, resulting from larger chromosomal rearrangements, can play an important role in the development of cancer. Investigating such events is hence not only essential in understanding cancer biology but may help identify therapeutic targets. Unfortunately, the performance of existing fusion detection tools cannot be evaluated due to the lack of known fusion events. In the past, simulated reads that form such fusion events during alignment have been used to assess the performance of the tools. However, read simulation cannot represent the biological complexity of RNA-seq data.

In this article, we present a method to introduce artificial fusion events into the chromosomal sequences of the human reference genome. Using a dedicated set of fusion detection tools on MCF7 samples, we compared our approach with read simulation data and show that only our tool, ArtiFuse, incorporates the biological variety of sequencing data.

The ArtiFuse approach can be used to benchmark the performance of published fusion detection tools and helps to build up a repertoire of high-quality tools for upcoming analyses.
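
As a rough illustration of the idea, an artificial fusion can be pictured as a reference sequence in which everything downstream of a breakpoint in one gene is replaced by the sequence downstream of a breakpoint in another gene. The sketch below is a toy under stated assumptions and not the ArtiFuse implementation: the function name, sequences, and coordinates are hypothetical, and strand orientation, annotation updates, and genome-scale file handling are ignored.

```python
def make_artificial_fusion(seq_5p: str, break_5p: int, seq_3p: str, break_3p: int) -> str:
    """Join the part of seq_5p before break_5p with the part of seq_3p from break_3p on.

    Coordinates are 0-based; this is a hypothetical helper for illustration only.
    """
    if not 0 < break_5p <= len(seq_5p):
        raise ValueError("break_5p must lie inside seq_5p")
    if not 0 <= break_3p < len(seq_3p):
        raise ValueError("break_3p must lie inside seq_3p")
    return seq_5p[:break_5p] + seq_3p[break_3p:]


# Toy sequences standing in for two gene regions of a reference genome.
gene_a = "ATGGCCAAGT"
gene_b = "TTTCCCGGGA"
fusion = make_artificial_fusion(gene_a, 6, gene_b, 4)
```

Because the modification is made to the reference rather than to the reads, real RNA-seq reads can then be aligned against the altered sequence, which is the property the text above contrasts with read simulation.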


Sorn P, Holtsträter C, Löwer M, Sahin U, Weber D: ArtiFuse—computational validation of fusion gene detection tools without relying on simulated reads. Bioinformatics 2019, btz613
DOI; Europe PMC

Download the code from our Github repository

TRON CELL LINE PORTAL

TCLP: an online cancer cell line catalogue integrating HLA type, neo-epitope prediction, virus and gene expression

Human cancer cell lines are an important resource for research and drug development. However, the available annotations of cell lines are sparse, incomplete, and distributed in multiple repositories. Re-analyzing publicly available raw RNA-Seq data, we determined the human leukocyte antigen (HLA) type and abundance, identified expressed viruses and calculated gene expression of 1,082 cancer cell lines. Using the determined HLA types, public databases of cell line mutations, and existing HLA binding prediction algorithms, we predicted antigenic mutations in each cell line. We integrated the results into a comprehensive knowledgebase. Using the Django web framework, we provide an interactive user interface with advanced search capabilities to find and explore cell lines and an application-programming interface to extract cell line information. The portal is available at http://celllines.tron-mainz.de.
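
One step that typically precedes HLA binding prediction is listing the peptides that contain the mutated residue. The following is a minimal sketch of that step only, assuming a single amino-acid substitution and a fixed peptide length of 9. The function name and the example protein are hypothetical, and the binding prediction itself is not shown.

```python
from typing import List


def mutated_peptides(protein: str, pos: int, alt: str, length: int = 9) -> List[str]:
    """Return all windows of the given length that cover the substituted residue.

    pos is the 0-based position of the substitution; alt is the new amino acid.
    """
    if not 0 <= pos < len(protein):
        raise ValueError("pos must lie inside the protein")
    mutated = protein[:pos] + alt + protein[pos + 1:]
    first_start = max(0, pos - length + 1)
    last_start = min(pos, len(mutated) - length)
    return [mutated[s:s + length] for s in range(first_start, last_start + 1)]


# Hypothetical 16-residue protein with a Q-to-L substitution at position 8.
peptides = mutated_peptides("MKTAYIAKQRQISFVK", 8, "L")
```

Each returned peptide would then be scored against the HLA alleles determined for the cell line by whichever binding predictor is in use.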

Scholtalbers J, Boegel S, Bukur T, Byl M, Goerges S, Sorn P, Loewer M, Sahin U, Castle JC: TCLP: an online cancer cell line catalogue integrating HLA type, predicted neo-epitopes, virus and gene expression. Genome medicine 2015, 7:118.
DOI; PubMed

SEQ2HLA

HLA typing from RNA-Seq sequence reads.

Boegel S, Löwer M, Schäfer M, Bukur T, Graaf J de, Boisguérin V, Türeci O, Diken M, Castle JC, Sahin U: HLA typing from RNA-Seq sequence reads. Genome medicine 2012, 4:102.
DOI; PubMed

We present a method, seq2HLA, for obtaining an individual’s HLA class I and II type and expression using standard NGS RNA-Seq data. It comprises mapping RNA-Seq reads against a reference database of HLA alleles, determining and reporting HLA type, confidence score and locus-specific expression level. We successfully applied seq2HLA to 50 CEU HapMap individuals previously HLA-typed, yielding 100% specificity and 94% sensitivity at a p-value of 0.1 for 2-digit HLA types. We determine HLA-type and expression for the previously un-typed Illumina Body Map tissues and a cohort of Korean lung cancer patients. Because the algorithm uses standard RNA-Seq reads and requires no change to lab protocols, it can be used for both existing datasets and future studies, thus adding a new dimension for HLA typing and biomarker studies.
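
The core counting idea can be sketched as follows: once reads have been mapped to HLA alleles, hits are tallied per locus and per 2-digit allele group, and the best-supported groups are reported. This is a simplified illustration and not the seq2HLA code. The function names and the toy hit list are hypothetical, and the confidence score, homozygosity handling, and expression estimate described above are omitted.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def two_digit_group(allele: str) -> str:
    """Reduce an allele name such as 'A*02:01' to its 2-digit group 'A*02'."""
    return allele.split(":")[0]


def call_top_groups(read_hits: Iterable[str]) -> Dict[str, List[str]]:
    """Count mapped reads per locus and 2-digit group; keep the two best groups."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for allele in read_hits:
        locus = allele.split("*")[0]
        counts[locus][two_digit_group(allele)] += 1
    return {locus: [g for g, _ in c.most_common(2)] for locus, c in counts.items()}


# Toy input: one entry per read, naming the allele the read mapped to.
hits = ["A*02:01", "A*02:01", "A*02:05", "A*24:02", "A*24:02", "A*01:01",
        "B*07:02", "B*07:02", "B*44:02"]
calls = call_top_groups(hits)
```

A real caller would also need a statistical test to decide whether the second group is genuine or whether the sample is homozygous at that locus.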

Boegel S, Löwer M, Bukur T, Sahin U, Castle JC: A catalog of HLA type, HLA expression, and neo-epitope candidates in human cancer cell lines. Oncoimmunology 2014, 3:e954893.
DOI; PubMed

Cancer cell lines are a tremendous resource for cancer biology and therapy development. These multipurpose tools are commonly used to examine the genetic origin of cancers, to identify potential novel tumor targets, such as tumor antigens for vaccine development, and to screen potential therapies in preclinical studies. Mutations, gene expression, and drug sensitivity have been determined for many cell lines using next-generation sequencing (NGS). However, the human leukocyte antigen (HLA) type and HLA expression of tumor cell lines, characterizations necessary for the development of cancer vaccines, have remained largely incomplete, and such information, when available, has been distributed in many publications. Here, we determine the 4-digit HLA type and HLA expression of 167 cancer and 10 non-cancer cell lines from publicly available RNA-Seq data. We use standard NGS RNA-Seq short reads from “whole transcriptome” sequencing, map reads to known HLA types, and statistically determine HLA type, heterozygosity, and expression. First, we present previously unreported HLA Class I and II genotypes. Second, we determine HLA expression levels in each cancer cell line, providing insights into HLA downregulation and loss in cancer. Third, using these results, we provide a fundamental cell line “barcode” to track samples and prevent sample annotation swaps and contamination. Fourth, we integrate the cancer cell-line specific HLA types and HLA expression with available cell-line specific mutation information and existing HLA binding prediction algorithms to make a catalog of predicted antigenic mutations in each cell line. The compilation of our results is a fundamental resource for all researchers selecting specific cancer cell lines based on the HLA type and HLA expression, as well as for the development of immunotherapeutic tools for novel cancer treatment modalities.

Boegel S, Scholtalbers J, Löwer M, Sahin U, Castle JC: In Silico HLA Typing Using Standard RNA-Seq Sequence Reads. Methods in molecular biology (Clifton, N.J.) 2015, 1310:247-258.
DOI; PubMed

Next-generation sequencing (NGS) enables high-throughput transcriptome profiling using the RNA-Seq assay, resulting in billions of short sequence reads. Worldwide adoption has been rapid: many laboratories worldwide generate transcriptome sequence reads daily. Here, we describe methods for obtaining a sample’s human leukocyte antigen (HLA) class I and II types and HLA expression using standard NGS RNA-Seq sequence reads. We demonstrate the application using our algorithm, seq2HLA, and a publicly available RNA-Seq dataset from the Burkitt lymphoma cell line Raji.

Download

Galaxy workflow

The latest seq2HLA code is available in our GitHub repository.

GALAXY LIMS

Galaxy LIMS for next-generation sequencing

We have developed a laboratory information management system (LIMS) for a next-generation sequencing (NGS) laboratory within the existing Galaxy platform. The system provides lab technicians with standard and customizable sample information forms, barcoded submission forms, tracking of input sample quality, multiplex-capable automatic flow cell design and automatically generated sample sheets to aid physical flow cell preparation. In addition, the platform provides the researcher with a user-friendly interface to create a request, submit accompanying samples, upload sample quality measurements and access the sequencing results. As the LIMS is within the Galaxy platform, the researcher has access to all Galaxy analysis tools and workflows. The system reports requests and associated information to a message queuing system, such that information can be posted and stored in external systems, such as a wiki. Through an API, raw sequencing results can be automatically pre-processed and uploaded to the appropriate request folder. Developed for the Illumina HiSeq 2500 instrument, many features are directly applicable to other instruments.
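
To make the multiplex-capable flow cell design more concrete, the sketch below assigns barcoded samples to lanes so that no lane holds the same index barcode twice, then writes a simple sample sheet. It is an assumption-laden illustration and not the Galaxy LIMS code: the function names, the greedy assignment strategy, the lane capacity, and the three-column CSV layout are all hypothetical.

```python
import csv
import io
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, str]  # (sample name, index barcode)


def design_flow_cell(samples: Sequence[Sample], n_lanes: int = 8,
                     max_per_lane: int = 4) -> Dict[int, List[Sample]]:
    """Greedily place each sample in the first lane with room and no barcode clash."""
    lanes: Dict[int, List[Sample]] = {lane: [] for lane in range(1, n_lanes + 1)}
    for name, barcode in samples:
        for members in lanes.values():
            if len(members) < max_per_lane and all(b != barcode for _, b in members):
                members.append((name, barcode))
                break
        else:
            raise ValueError(f"no lane available for sample {name}")
    return lanes


def sample_sheet_csv(lanes: Dict[int, List[Sample]]) -> str:
    """Render the lane assignment as CSV text (hypothetical column layout)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Lane", "SampleID", "Index"])
    for lane, members in lanes.items():
        for name, barcode in members:
            writer.writerow([lane, name, barcode])
    return buf.getvalue()


# S1 and S3 share a barcode, so they must end up in different lanes.
samples = [("S1", "ATCACG"), ("S2", "CGATGT"), ("S3", "ATCACG")]
lanes = design_flow_cell(samples, n_lanes=2, max_per_lane=2)
sheet = sample_sheet_csv(lanes)
```

A greedy first-fit assignment keeps the example short. A production system would likely also balance read depth across lanes and check barcode compatibility beyond exact duplicates.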

Scholtalbers J, Rößler J, Sorn P, Graaf J de, Boisguérin V, Castle J, Sahin U: Galaxy LIMS for next-generation sequencing. Bioinformatics (Oxford, England) 2013, 29:1233-1234.
DOI; PubMed

Download

The latest Galaxy LIMS code is available here:

SOMATIC MUTATION FDR

Confidence-based evaluation and prioritization of somatic mutations.

Next generation sequencing (NGS) has enabled high throughput discovery of somatic mutations. Detection depends on experimental design, lab platforms, parameters and analysis algorithms. However, NGS-based somatic mutation detection is prone to erroneous calls, with reported validation rates near 54% and congruence between algorithms less than 50%. Here, we developed an algorithm to assign a single statistic, a false discovery rate (FDR), to each somatic mutation identified by NGS. This FDR confidence value accurately discriminates true mutations from erroneous calls. Using sequencing data generated from triplicate exome profiling of C57BL/6 mice and B16-F10 melanoma cells, we used the existing algorithms GATK, SAMtools and SomaticSNiPer to identify somatic mutations. For each identified mutation, our algorithm assigned an FDR. We selected 139 mutations for validation, including 50 somatic mutations assigned a low FDR (high confidence) and 44 mutations assigned a high FDR (low confidence). All of the high confidence somatic mutations validated (50 of 50), none of the 44 low confidence somatic mutations validated, and 15 of 45 mutations with an intermediate FDR validated. Furthermore, the assignment of a single FDR to individual mutations enables statistical comparisons of lab and computation methodologies, including ROC curves and AUC metrics. Using the HiSeq 2500, single end 50 nt reads from replicates generate the highest confidence somatic mutation call set.
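
The abstract does not spell out the estimator, so the following is only a sketch of one common way to attach an FDR to a call. It assumes each call has a quality score and that calls from a comparison in which no true somatic mutations are expected (for example, replicate versus replicate of the same sample) serve as a proxy for false positives. The function name and all scores are hypothetical.

```python
from typing import Sequence


def fdr_at_threshold(score: float,
                     tumor_vs_normal_scores: Sequence[float],
                     same_vs_same_scores: Sequence[float]) -> float:
    """Estimate the FDR among tumor-vs-normal calls scoring at least `score`.

    Calls from the same-vs-same comparison that reach the threshold are
    treated as the expected number of false positives (an assumption).
    """
    n_candidates = sum(s >= score for s in tumor_vs_normal_scores)
    n_false = sum(s >= score for s in same_vs_same_scores)
    if n_candidates == 0:
        return 0.0
    return min(1.0, n_false / n_candidates)


# Toy scores: higher means a more confident call.
tumor_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
replicate_scores = [0.50, 0.35, 0.20]

fdr_strict = fdr_at_threshold(0.90, tumor_scores, replicate_scores)
fdr_loose = fdr_at_threshold(0.30, tumor_scores, replicate_scores)
```

In this toy data the strict threshold admits no replicate-only calls, whereas the loose threshold admits two of them against six candidates, which illustrates how a single per-mutation FDR allows call sets to be ranked and compared.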

Löwer M, Renard BY, Graaf J de, Wagner M, Paret C, Kneip C, Türeci O, Diken M, Britten C, Kreiter S, Koslowski M, Castle JC, Sahin U: Confidence-based somatic mutation evaluation and prioritization. PLoS computational biology 2012, 8:e1002714.
DOI; PubMed

rapmad

Robust analysis of peptide microarray data.

Peptide microarrays offer an enormous potential as a screening tool for peptidomics experiments and have recently seen an increased field of application ranging from immunological studies to systems biology. By allowing the parallel analysis of thousands of peptides in a single run, they are suitable for high-throughput settings. Since data characteristics of peptide microarrays differ from DNA oligonucleotide microarrays, computational methods need to be tailored to these specifications to allow a robust and automated data analysis. While follow-up experiments can ensure the specificity of results, sensitivity cannot be recovered in later steps. Providing sensitivity is thus a primary goal of data analysis procedures. To this end, we created rapmad (Robust Alignment of Peptide MicroArray Data), a novel computational tool implemented in R. We evaluated rapmad in antibody reactivity experiments for several thousand peptide spots and compared it to two existing algorithms for the analysis of peptide microarrays. rapmad displays competitive and superior behavior to existing software solutions. Particularly, it shows substantially improved sensitivity for low intensity settings without sacrificing specificity. It thereby contributes to increasing the effectiveness of high throughput screening experiments. rapmad allows the robust and sensitive, automated analysis of high-throughput peptide array data.

Renard BY, Löwer M, Kühne Y, Reimer U, Rothermel A, Türeci O, Castle JC, Sahin U: rapmad: Robust analysis of peptide microarray data. BMC bioinformatics 2011, 12:324.
DOI; PubMed

Download the rapmad R package and the data sets

More technologies and platforms