Prof. Dr. Artur Andrzejak
  AG Parallel and Distributed Systems (PVS)

Facilitating automation of scalable data analysis

Data-centric studies in engineering and the natural sciences consist of many steps of data manipulation and analysis. To automate such studies, a lot of project-specific scripting is needed to integrate, clean, transform, and analyze data. Implementing such scripts is time-consuming and can be challenging for programmers, and even more so for domain specialists, especially if very large datasets require parallel or distributed processing.

 
We approach this problem by exploiting and extending methods for interactive data transformation and analysis (e.g. context-sensitive code recommendations) in connection with established frameworks for big-data processing and analysis.  
 
For example, in one of our projects we combined OpenRefine (http://openrefine.org/), a popular, easy-to-use yet not scalable tool for data cleaning and preprocessing, with Apache Spark (https://spark.apache.org/), a state-of-the-art framework for processing massive data sets. As a result, users can quickly and interactively create a script for cleaning/preprocessing a data sample and then use it (without code rewriting) to process terabytes of data on a distributed/parallel cluster.
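As an illustration of this combination, the following minimal PySpark sketch shows how a cleaning function developed interactively on a small sample can later be applied unchanged to the full data set on a cluster; the file paths and cleaning steps are hypothetical and not part of the actual project code.

    # Minimal sketch (not the project's actual code): a cleaning step prototyped
    # on a small sample is later applied unchanged to the full data set.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cleaning-sketch").getOrCreate()

    def clean(df):
        """Illustrative cleaning: trim and lowercase a text column,
        drop rows with missing keys, remove duplicate records."""
        return (df
                .withColumn("city", F.trim(F.lower(F.col("city"))))
                .dropna(subset=["id"])
                .dropDuplicates(["id"]))

    # Develop and test the cleaning interactively on a small sample ...
    sample = spark.read.csv("sample.csv", header=True).limit(1000)
    clean(sample).show()

    # ... then run the very same code on the full data set on the cluster.
    full = spark.read.csv("hdfs:///data/full/*.csv", header=True)
    clean(full).write.parquet("hdfs:///data/cleaned")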
In another project we are developing a tool (an IDE plugin) which supports Python NumPy/Pandas programmers by recommending context-sensitive code fragments covering a wide spectrum of data processing tasks.
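To give an idea of the kind of context-sensitive fragment such a plugin might recommend, here is a hypothetical example for a common Pandas task (the snippet is illustrative and not actual output of the tool).

    # Hypothetical example of a recommended fragment for a common Pandas task:
    # group a data frame by a categorical column and aggregate a numeric column.
    import pandas as pd

    df = pd.read_csv("measurements.csv")          # file name is illustrative
    summary = (df
               .dropna(subset=["sensor", "value"])
               .groupby("sensor")["value"]
               .agg(["mean", "std", "count"])
               .reset_index())
    print(summary.head())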
 
We target as outcomes approaches and prototypical tools for creating, in an interactive and user-friendly way, workflows for scalable data analysis studies in the natural sciences and engineering disciplines. We further envision transferring our results to other domains, in particular analyzing data and creating models in the areas of software dependability and IT security.

Prof. Till Bärnighausen
  Institute for Public Health

Data analysis for non-communicable diseases

Population aging, economic development, and urbanization have led to an epidemiological transition in low- and middle-income countries (LMICs) that has been characterized by a rapid rise in the prevalence of non-communicable diseases (NCDs). As NCD risk factors, including diabetes, hypertension, and hyperlipidemia, can be treated effectively at relatively low cost, an in-depth understanding of the relationship between socioeconomic factors and outcomes also requires a better understanding of how met need for care for these conditions varies across socioeconomic gradients. This is the first study to systematically collate and analyze individual-level data from large, nationally representative population-based surveys to ascertain how the prevalence of NCD risk factors and the health system performance for these conditions vary according to socioeconomic characteristics. We have cleaned and pooled micro-data (including both questionnaire and biomarker data) from over 40 LMICs with a total of over two million participants. This project will be carried out in collaboration with colleagues from Harvard University and the University of Göttingen.
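As a rough illustration of the kind of pooled analysis involved (hypothetical column names, ignoring survey weights, and not the study's actual code), prevalence by wealth quintile could be computed with Pandas as follows.

    # Illustrative sketch: pool harmonized country files and compute diabetes
    # prevalence by wealth quintile (column and file names are hypothetical,
    # and survey sampling weights are ignored for brevity).
    import glob
    import pandas as pd

    frames = [pd.read_csv(path) for path in glob.glob("harmonized/*.csv")]
    pooled = pd.concat(frames, ignore_index=True)

    prevalence = (pooled
                  .groupby(["country", "wealth_quintile"])["diabetes"]
                  .mean()                      # fraction of participants with diabetes
                  .rename("prevalence")
                  .reset_index())
    print(prevalence.head())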

Cellphone Data from Health and Demographic Surveillance Sites

The Heidelberg Institute of Public Health collaborates closely with a health and demographic surveillance site (HDSS) in rural Burkina Faso (see http://www.crsn-nouna.bf). A team around Profs. Frank Tanser and Till Bärnighausen is planning to collect cell phone data from people living in the HDSS in early 2018 to study migration (including commuting) patterns in the area and their association with health and economic outcomes.

Prof. Roland Eils
  Eilslabs

Integrative Bioinformatics and Systems Biology

With the advent of next-generation sequencing technologies, it is today feasible to determine the sequence of a human individual (e.g. a patient suffering from cancer) in a matter of days. Other, complementary technologies such as transcriptome, proteome or metabolome analysis deliver further huge amounts of data with an enormous potential value for precise diagnostics. Despite the enormous technological advances in data generation, the integration of these data in order to generate new insights into complex biological function is still a major challenge and can only be achieved with interdisciplinary approaches.

Our division is developing computer-assisted methods for interpreting complex genomic and other biological data as well as methods for modeling and simulation of biological processes. Major activities include the development of integrated bioinformatics approaches for the interpretation and management of cancer genome and clinical data, the application of state-of-the-art technologies in automated live-cell imaging and image analysis, experimental and theoretical systems biology approaches addressing key cellular mechanisms and their distortions in cancer cells as well as the development of new synthetic biology tools to manipulate cellular processes.
The overarching aim is the development of new insights linking alterations in cancer or other diseased cells with biological functions in order to find new potential targets for improved diagnostics and treatment. To achieve this, the eilslabs are actively contributing to DKFZ-HIPO, the Heidelberg Center for Personalized Oncology, which is co-directed by Roland Eils.

Prof. Michael Gertz
  Database Systems Research

Density-based Clustering

Density-based clustering continues to be an important research domain. Several new methods have been proposed, such as the AnyDBC clustering algorithm and CFSFDP clustering. In this practical, your task is to implement these algorithms carefully within the ELKI data mining framework, to allow a fair benchmark comparison with alternative algorithms that already exist in this tool. Careful programming is necessary to choose memory-efficient data structures; where possible, the implementation must use the query API of ELKI for acceleration and allow any distance function of ELKI to be used. Last but not least, this practical is expected to study the clustering runtime and quality of different related methods on large data sets that need to be carefully chosen (as no suitable benchmarking set exists yet).
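To make the CFSFDP idea concrete, the following standalone NumPy sketch computes the two quantities the algorithm is built on, local density and the distance to the nearest denser point; in the practical, this logic would instead be implemented in Java against ELKI's query API and distance functions, and without materializing a full distance matrix.

    # Standalone NumPy sketch of the core CFSFDP quantities (Rodriguez & Laio 2014).
    import numpy as np

    def cfsfdp_scores(X, d_c):
        """Return local density rho and delta, the distance to the nearest denser point."""
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        rho = (dist < d_c).sum(axis=1) - 1            # neighbors within d_c, excluding self
        n = len(X)
        delta = np.empty(n)
        nearest_denser = np.full(n, -1)
        for i in range(n):
            denser = np.where(rho > rho[i])[0]
            if denser.size == 0:                      # point of highest density
                delta[i] = dist[i].max()
            else:
                j = denser[np.argmin(dist[i, denser])]
                delta[i] = dist[i, j]
                nearest_denser[i] = j
        return rho, delta, nearest_denser

    # Points with both large rho and large delta are candidate cluster centers;
    # every other point is assigned to the cluster of its nearest denser neighbor.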

JavaFX Frontend for Data Mining

ELKI is a large open-source Java framework for data mining with index acceleration, which contains implementations of many algorithms for clustering, outlier detection, and data indexing. The current user interface is a minimalistic command line builder, automatically generated from metadata of algorithm parameters, and not very accessible to beginners.


This project aims at developing a JavaFX frontend for ELKI to make it easier to parameterize the data mining algorithms. Yet, the actual UI must continue to be generated from metadata, so that new algorithms can be added without having to modify the UI code every time. Therefore, this practical needs someone experienced with automatic UI generation, as we cannot use FXML UI builders here.
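To illustrate the metadata-driven approach independently of the toolkit (the sketch below uses Tkinter purely for brevity, whereas the project itself targets JavaFX, and the parameter metadata shown is hypothetical), a UI can be generated by iterating over parameter descriptions instead of hand-coding widgets per algorithm.

    # Toolkit-agnostic illustration of metadata-driven UI generation:
    # widgets are created from parameter descriptions, not hand-coded per algorithm.
    import tkinter as tk

    # Hypothetical parameter metadata, as an algorithm might export it.
    PARAMS = [
        {"name": "k",       "type": int,   "default": 10,    "doc": "number of neighbors"},
        {"name": "epsilon", "type": float, "default": 0.5,   "doc": "radius threshold"},
        {"name": "verbose", "type": bool,  "default": False, "doc": "print progress"},
    ]

    root = tk.Tk()
    inputs = {}                      # keep references so values can be read back later
    for row, p in enumerate(PARAMS):
        tk.Label(root, text=f'{p["name"]} ({p["doc"]})').grid(row=row, column=0, sticky="w")
        if p["type"] is bool:
            var = tk.BooleanVar(value=p["default"])
            tk.Checkbutton(root, variable=var).grid(row=row, column=1, sticky="w")
            inputs[p["name"]] = var
        else:
            entry = tk.Entry(root)
            entry.insert(0, str(p["default"]))
            entry.grid(row=row, column=1)
            inputs[p["name"]] = entry
    root.mainloop()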

Incremental Nearest Neighbor Search

The ELKI data mining framework contains various data index structures to facilitate efficient nearest neighbor search, such as R*-trees, M-trees, and cover trees, which can be used to accelerate many data mining algorithms in clustering and outlier detection. Currently, the search requires the developer to specify the number of neighbors to find. For particular algorithms to be added to ELKI in the future, we will need to modify the existing code to allow an incremental nearest neighbor search, where additional neighbors can be requested efficiently. In this practical, your responsibility is to understand the existing nearest-neighbor search code and rewrite it in a way that efficiently allows the algorithms to continue searching for further neighbors. Ideally, you will develop an API that makes this easy to use, and an abstract (but efficient) base implementation that can be shared by different indexes.
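The following sketch illustrates the underlying best-first idea (in the style of Hjaltason and Samet) independently of ELKI's actual API: a priority queue holds index buckets keyed by a distance lower bound and points keyed by their exact distance, so the consumer can request one additional neighbor at a time; the flat bucket structure is only a stand-in for a real hierarchical index.

    # Minimal sketch of incremental (best-first) nearest-neighbor search,
    # independent of ELKI's actual API.
    import heapq
    import math

    def incremental_nn(query, buckets):
        """buckets: list of (center, radius, points); yields (point, distance)
        in order of increasing distance, one neighbor per request."""
        dist = lambda a, b: math.dist(a, b)
        heap = []
        for center, radius, points in buckets:
            lower = max(dist(query, center) - radius, 0.0)   # lower bound for the bucket
            heapq.heappush(heap, (lower, False, points))
        while heap:
            key, is_point, payload = heapq.heappop(heap)
            if is_point:
                yield payload, key                            # next-closest neighbor
            else:
                for p in payload:                             # expand bucket into points
                    heapq.heappush(heap, (dist(query, p), True, p))

    # Usage: the consumer pulls neighbors lazily, without fixing k in advance, e.g.
    #   nn = incremental_nn((0.0, 0.0), buckets)
    #   first, d1 = next(nn); second, d2 = next(nn)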

Prof. Dieter Heermann
  AG Statistical Physics and Theoretical Biophysics

Project topic

The group pursues research at the forefront of physics and quantitative biology, with emphasis on mathematical modelling and analysis of biological data, development of computational methods, systems biology, biophysics, and biomathematics. We develop and apply predictive models for biological and biophysical systems and their interactions at multiple scales, and create statistical methods for the analysis of complex, correlated data. We are actively engaged in joint projects with experimental biologists and physicists producing such data. Much of our current research is directed at combining genomic sequence, expression level and regulatory network information with structural information such as high-resolution microscopy and chromosome conformation capture data to develop models that predict biological function. Our efforts have focused on the development and application of biophysical and bioinformatics methods aimed at understanding the structural and energetic origins of chromosome interactions, to reveal the underlying physical folding principles. Our work includes fundamental theoretical research and applications to problems of biological importance, as well as the development of appropriate software to handle the vast amounts of data.

Prof. Vincent Heuveline
  Engineering Mathematics and Computing Lab

Data Mining and Uncertainty Quantification for Medical Engineering

Advances in sensor technology and high-performance computing enable scientists to collect and generate extremely large data sets, usually measured in terabytes and petabytes. These data sets, obtained by means of observation, experiment, or numerical simulation, are not only very large but also highly complex in their structure. Exploring these data sets and discovering patterns and significant structures in them is a critical and highly challenging task that can only be addressed in an interdisciplinary framework combining mathematical modeling, numerical simulation and optimization, statistics, high-performance computing, and scientific visualization.

Besides the size and complexity of these data, quality is another crucial issue in guaranteeing reliable insights into the physical processes under consideration. The associated demands on the quality and reliability of experiments and numerical simulations necessitate the development of models and methods from mathematics and computer science that are able to quantify uncertainties for large amounts of data. Such uncertainties may derive, for example, from measurement errors, lack of knowledge about model parameters or inaccuracy in data processing.
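As a toy illustration of uncertainty propagation (a minimal Monte Carlo sketch with an invented model, not the group's actual methods), uncertain parameters can be sampled and pushed through a model to obtain a distribution over the quantity of interest.

    # Toy Monte Carlo sketch of uncertainty propagation: uncertain model
    # parameters are sampled and pushed through an illustrative model.
    import numpy as np

    rng = np.random.default_rng(0)

    def model(k, c0, t=5.0):
        """Illustrative model: exponential decay c(t) = c0 * exp(-k t)."""
        return c0 * np.exp(-k * t)

    # Parameters known only up to measurement uncertainty.
    k_samples  = rng.normal(loc=0.3,  scale=0.05, size=100_000)
    c0_samples = rng.normal(loc=10.0, scale=0.5,  size=100_000)

    output = model(k_samples, c0_samples)
    print(f"mean = {output.mean():.3f}, std = {output.std():.3f}, "
          f"95% interval = {np.percentile(output, [2.5, 97.5])}")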

In his group, Prof. Dr. Vincent Heuveline makes use of stochastic mathematical models, high-performance computing, and hardware-aware computing to quantify the impact of uncertainties in large data sets and/or the associated mathematical models, and thus helps to establish reliable insights in data mining. The main fields of application are medical engineering and the life sciences.

Prof. Ekaterina Kostina
  AG Numerical Optimization

Biological Modelling, Optimization & Model Discrimination

Mathematical models are of great importance in quantitative approaches in molecular and cell biology. They provide scientific insight into processes, help to understand underlying biochemical phenomena and the functioning of biological systems, and may be used to identify ways for a possible re-design of systems. However, the results from simulation and optimization are only reliable if the underlying model precisely describes the given process. This requires models validated by experimental data, with sufficiently good estimates for the model parameters. The development and quantitative validation of complex nonlinear models is a difficult task that requires support by numerical methods for parameter estimation and the optimal design of experiments.

The aim of the possible projects will be the development of optimization-based methods for the modelling of biological systems, including methods for parameter estimation and the design of optimal experiments for information gain about parameters and for model discrimination.

The topics under research include robust techniques able to deal with data corrupted by outliers, the treatment of switching phenomena caused by modeling fast time scales, the treatment of large data sets, sparse parameter recovery to avoid over-parametrization, and methods for extracting information-rich data from large data sets.
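As a small, purely illustrative example of one of these topics, robust parameter estimation for a nonlinear model can be sketched with SciPy's least-squares solver and a robust loss, so that outliers have limited influence on the estimates; the model and data below are invented.

    # Illustrative sketch of robust nonlinear parameter estimation:
    # fit a logistic growth model to noisy data containing a few outliers.
    import numpy as np
    from scipy.optimize import least_squares

    def logistic(t, theta):
        K, r, t0 = theta                     # carrying capacity, rate, midpoint
        return K / (1.0 + np.exp(-r * (t - t0)))

    rng = np.random.default_rng(1)
    t = np.linspace(0, 10, 50)
    y = logistic(t, (5.0, 1.2, 4.0)) + rng.normal(0, 0.1, t.size)
    y[::10] += 3.0                           # inject a few outliers

    residuals = lambda theta: logistic(t, theta) - y
    fit = least_squares(residuals, x0=[y.max(), 1.0, t.mean()], loss="soft_l1")
    print("estimated parameters:", fit.x)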

Prof. Anna Marciniak-Czochra
  AG Applied Analysis and Modelling in Biosciences

Current Research devoted to Biological Applications

  • Self-organisation and regeneration in developmental biology systems (with Thomas Holstein)
  • Early carcinogenesis, the role of growth factors and mutualism (with Marek Kimmel)
  • Stem cell differentiation, in particular the influence of aging processes on the dynamics of stem cells and hematopoietic reconstitution after chemotherapy and transplantation of stem cells (with Wolfgang Wagner and Anthony Ho)
  • Influence of Heat Shock Proteins (HSPs) on neoplastic cell transformations (with Zuzanna Szymanska and Maciej Zylicz)
  • Dynamics of Tumour Necrosis Factor (TNF alpha) in macrophages (with Alexei Gratchev)
  • Innate immunity response and its influence on viral replication and spread in laboratory systems (with Marek Kimmel and Philipp Getto)

Prof. Karsten Rippe
  AG Chromatin Networks

Analyzing single cell sequencing data to model the pathophenotype of leukemias

Cells in a tumor sample from a patient frequently display heterogeneous phenotypes. One source of heterogeneity lies in differences of the epigenetic programs that are active in a given cell. The resulting cell types respond differently (or not at all) to a particular set of mutations, environmental signals or therapeutic drugs (1). However, in transcriptome and (epi)genome data obtained from bulk cell populations these biologically important differences among individual cells are averaged out. Accordingly, single cell sequencing (sc-seq) methods are emerging as an important new approach to dissect the cellular heterogeneity of tumor samples (2).


We have started to determine genome-wide activity patterns (promoters, enhancers, transcribed genes) by single-cell RNA-seq and single-cell ATAC-seq analysis of blood samples from healthy donors and patient samples for chronic lymphocytic leukemia, acute myeloid leukemia and multiple myeloma. To interpret these data, we develop and apply computational methods to define the cell type populations of a given sample, comprising single-cell sequencing profiles from 5,000-10,000 cells, based on the molecular signatures determined by sc-seq. The resulting cell population maps are exploited to characterize deregulation, disease progression and drug response in leukemia. The lab project will be part of this work and address one or more of the following objectives: (i) Identifying cell types based on single-cell activity patterns. (ii) Identifying a selected set of molecular markers that represent relevant cell states and that can be read out by immunostaining (for intracellular markers) or cell sorting (for surface markers) for further analysis. (iii) Linking the different cell types identified based on their molecular signatures to functionally relevant leukemic cell states, e.g., responding vs. non-responding cells after drug treatment.
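As an illustration of a typical single-cell clustering workflow (the choice of Scanpy, the public example data set, and all parameters are assumptions for this sketch, not the lab's actual pipeline), cells can be normalized, embedded and grouped into candidate populations as follows.

    # Sketch of a standard single-cell RNA-seq clustering workflow with Scanpy.
    import scanpy as sc

    adata = sc.datasets.pbmc3k()                     # public example data set
    sc.pp.filter_cells(adata, min_genes=200)         # basic quality filtering
    sc.pp.filter_genes(adata, min_cells=3)
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata, n_neighbors=15)           # cell-cell neighbor graph
    sc.tl.leiden(adata)                              # graph-based clustering (needs leidenalg)
    sc.tl.umap(adata)
    sc.pl.umap(adata, color="leiden")                # visualize candidate cell populations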

References:

(1) Easwaran, H., Tsai, H. C. & Baylin, S. B. Cancer epigenetics: tumor heterogeneity, plasticity of stem-like
states, and drug resistance. Mol Cell 54, 716-727 (2014).

(2) Navin, N. E. The first five years of single-cell cancer genomics and beyond. Genome Res 25, 1499-1507
(2015).

Prof. Filip Sadlo
  AG Visual Computing

Data Analysis by means of Feature Extraction, Direct Representation, and Exploration
 

There is hardly any activity in science and engineering that does not involve the generation, processing, and analysis of data. Such data can be divided into two categories: discrete data and continuous (field) data. Examples of the former type include text and networks, whereas prominent examples of the latter are flow fields and, more generally, phenomena that can be described with differential equations.
 
The analysis of continuous field data is a central focus of scientific visualization. Visualization plays an important role in many areas that involve simulation and data gathering, by developing concepts and techniques that help reveal the essential structure of the resulting data, in particular when the fields become large, complex, and high-dimensional, and when the research questions to be answered on the basis of the data are complex or not clearly defined. Examples of such concepts and techniques are direct visualization, feature extraction, and exploration. Features are often based on (local) mathematical descriptions and identify structures in the fields. Due to their automatic extraction, they enable the analysis of large data, in particular in cases where the researcher already has a hypothesis or research question, or aims at comparing different data, e.g., originating from different simulation runs or from simulation and measurement. If, on the other hand, the researcher needs to obtain an overview or to look at the continuous properties, direct visualization can be a prominent tool. Finally, and importantly in many cases, data need to be explored to understand their interrelations, to come up with hypotheses, and to bring them into relation with already extracted features.
 
Our research group is active in all three areas. For example, we develop feature extraction techniques to understand fluid flow, including vortex analysis, topological analysis that reveals different flow regimes, and analysis of coupled physical processes such as diffusion. Such techniques can help understand blood flow, the transport of oxygen and nutrients, the development of tissue, and many other processes. In direct visualization, we are active in the development of volume rendering techniques, e.g., for revealing structure in computed tomography scans, and in extending these to physical mechanisms such as stresses and deformation. Finally, in exploratory (interactive) visualization, we, for example, develop techniques to analyze the dynamics of inertial particles, such as dust in airways, and to help reveal the structure of higher-dimensional data, such as phase spaces of differential equations describing chemical processes. Overall, we focus on the data with respect to the underlying physics, their spatiotemporal structure, and a concise but exact representation.
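As a small, illustrative example of such a feature extraction step (not the group's actual tools), the vorticity of a two-dimensional velocity field sampled on a regular grid, a basic ingredient of vortex analysis, can be computed as follows.

    # Sketch of a basic flow feature: vorticity of a 2D velocity field on a grid.
    import numpy as np

    # Synthetic velocity field (u, v) on a regular grid, resembling a single vortex.
    x, y = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
    u = -y / (x**2 + y**2 + 0.1)
    v =  x / (x**2 + y**2 + 0.1)

    dx = x[0, 1] - x[0, 0]
    dy = y[1, 0] - y[0, 0]
    dv_dx = np.gradient(v, dx, axis=1)
    du_dy = np.gradient(u, dy, axis=0)
    vorticity = dv_dx - du_dy             # out-of-plane component of the curl
    print("max |vorticity|:", np.abs(vorticity).max())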

Prof. Alexander Zipf
  AG Geoinformatics/GIScience

 

(1) Predicting OSM landuse through VGI and remote sensing data using machine learning
(2) Data Mining for global OSM History Quality Analytics based on big data technology

Geographic information has become big data in recent years, as the spatial and temporal resolution of data has increased considerably and global coverage is the goal. Vast amounts of unstructured and spatially attributed data are continuously generated and made available on the web, coming from technical sensors on the earth, from remote sensing, and from humans. Prominent examples include Volunteered Geographic Information such as OpenStreetMap (OSM) or data from social media.

With our background in GIScience, and using methods such as geocomputation, data mining and machine learning, we extract precious knowledge from such datasets, e.g. by finding latent patterns and regularities to answer research questions on geographical phenomena relevant for society or the environment. In combination with official geographical information from public administrations, these data have become an important asset, e.g. for disaster management purposes. Processing, analyzing and aggregating these different kinds of data sources enables, for example, humanitarian aid organizations and emergency responders to obtain a comprehensive view of a specific catastrophe on site, to name one application. Our overall goal is to integrate, improve and enrich geographic information such as OSM, or to derive new information layers, e.g. through data fusion and machine learning.
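As a hedged sketch of project (1), with hypothetical feature and file names rather than the project's actual data, a land-use classifier could be trained on per-area features derived from remote sensing and OSM using scikit-learn.

    # Illustrative sketch: train a random forest on per-area features derived
    # from remote sensing and OSM, then evaluate the predicted land-use classes.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    data = pd.read_csv("landuse_features.csv")       # hypothetical feature table
    features = ["ndvi_mean", "building_density", "road_length", "night_lights"]
    X_train, X_test, y_train, y_test = train_test_split(
        data[features], data["landuse_class"], test_size=0.3, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))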

The candidates may develop and apply methods from spatial data mining and machine learning in selected domains and research questions related to our projects, for example the work at HeiGIT on the OSM History Analytics Platform (http://ohsome.org), which is based on big data technology such as cloud-based processing and analysis using frameworks like Apache Spark or Ignite, as well as DeepVGI (Deep Learning Volunteered Geographic Information) and LandSense (A Citizen Observatory and Innovation Marketplace for Land Use and Land Cover Monitoring).