Genomic data management research at Politecnico di Milano



Genomics big data challenge

NGS


In the coming decade, Next Generation Sequencing (NGS) will offer a fast and inexpensive technology (a few hours and a few hundred dollars) to read the whole human genome.


The data management community is not ready for NGS: data are managed by a variety of tools focused on specific data extractions and transformations, with specific physical formats and no focus on interoperability.


The human genome is a sequence of about 3 billion DNA nucleotides.

Querying, analyzing and sharing genomic data may be considered the biggest and most important "big data problem" of mankind.

Distributed heterogeneous data



Genometric Query Language

GMQL


Genometric Query Language (GMQL), a biologist-focused query language

Enabling queries over hundreds of datasets and thousands of samples


A new holistic and abstract approach that uses cloud computing technologies and an interoperable data format to store and query tens of datasets, thousands of samples and several millions of DNA regions in order to discover interesting DNA regions and their relationships. Thanks to algebraic operations on both DNA regions and metadata, GMQL genome-wide queries can find interesting regions by combining mutation, expression or regulation experiments.

GMQL features

GMQL


Co-existence of multiple types of regions and their metadata.

Easy composition of high-level operations.

Powerful genome-wide processing.

Co-existence of multiple types of experimental and annotation regions and their metadata


GMQL supports efficient high-level query processing of thousands of experimental data samples, produced with a variety of experimental methods and encoded in a variety of data formats, together with their biological and clinical metadata descriptions as well as multiple annotation data. It supports big data analysis combining hundreds of samples with millions of regions.




Easy expression and composition of high-level operations


Current tools for “big data” (e.g. BioPig, GQL) operate on reads and typically work “sample by sample”. Other tools that combine samples and work on their regions (e.g. BEDTools, BEDOPS) operate only on BED files, with a “scripting style” of programming.


GMQL, instead, focuses on assisting knowledge extraction: it is meant to operate on higher-level data obtained after raw data preprocessing and feature calling, rather than on raw data directly. This has the advantage of not interfering with the variety of data preprocessing tools and pipelines already in place in the different research centers, and of directly benefiting from their output, thanks to interoperability and data integration support.



How a standard GMQL query works

Example: Identification of distal bindings in transcription regulatory regions

GMQL Query image
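
The idea behind this example can be sketched outside GMQL in plain Python. The following is a minimal illustration, not the query shown in the figure: it assumes that a binding is "distal" when it lies farther than a chosen threshold from every transcription start site (TSS) on the same chromosome. The `Region` type, the `distal_bindings` function and the 50,000-base threshold are hypothetical names and values introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # 0-based, inclusive
    stop: int   # exclusive


def distance(a: Region, b: Region) -> int:
    """Gap in bases between two regions on the same chromosome (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.stop, b.stop))


def distal_bindings(peaks, tss_sites, min_dist=50_000):
    """Return the peaks farther than min_dist from every TSS on their chromosome.

    Assumption of this sketch: a peak on a chromosome with no TSS counts as distal.
    """
    result = []
    for peak in peaks:
        same_chrom = [t for t in tss_sites if t.chrom == peak.chrom]
        if all(distance(peak, t) > min_dist for t in same_chrom):
            result.append(peak)
    return result


# Illustrative coordinates only, not real annotations.
tss = [Region("chr1", 100_000, 100_001)]
peaks = [Region("chr1", 100_500, 100_700), Region("chr1", 400_000, 400_200)]
distal = distal_bindings(peaks, tss)
```

Here the first peak lies a few hundred bases from the TSS and is discarded, while the second is kept as distal.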

Powerful genome-wide processing


Public data can be used not only by themselves, but also together with experimental data produced in house, for integrated evaluations and comparisons with increased support. GMQL enables powerful genome-wide processing and leverages high-level data, including variant calling, gene expression and region enrichment (i.e. peak) calling data, which are increasingly available within large public data collections (e.g. 1000 Genomes Project, TCGA, and ENCODE).

Genomic space abstraction

Genomic data model (GDM)


Our Genomic Data Model (GDM) provides abstractions for the DNA regions of each sample and for the metadata describing the sample's properties, yielding a simple, structured format well suited to data analysis. MAP operations, through reference regions R, extract and standardize genomic features expressed in distinct datasets.

Abstraction image
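
A minimal Python sketch of these two ideas follows. It assumes that a sample is a list of regions plus a dictionary of metadata, and that a MAP operation counts, for each reference region, the sample regions overlapping it. The class and function names (`Region`, `Sample`, `map_count`) are illustrative and are not part of GMQL or GDM.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # 0-based, inclusive
    stop: int   # exclusive
    strand: str = "*"


@dataclass
class Sample:
    regions: list                                 # regions observed in the sample
    metadata: dict = field(default_factory=dict)  # attribute -> value description


def overlaps(a: Region, b: Region) -> bool:
    """True when the two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.stop and b.start < a.stop


def map_count(reference, sample):
    """For each reference region, count the sample regions that overlap it."""
    return [sum(overlaps(ref, r) for r in sample.regions) for ref in reference]


# Illustrative values only.
reference = [Region("chr1", 0, 1000), Region("chr1", 5000, 6000)]
sample = Sample(
    regions=[Region("chr1", 100, 200), Region("chr1", 900, 1100), Region("chr2", 0, 50)],
    metadata={"cell": "HeLa", "assay": "ChIP-seq"},
)
counts = map_count(reference, sample)
```

Applying `map_count` to several samples against the same reference regions gives one list of values per sample, all aligned on the same regions, which is one way to read the standardization described above.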

Architecture

Genometric query system architecture


The architecture runs on Hadoop-based cloud computing systems and is organized in four layers: services; orchestration and translation; engines; and management of files and data sources. It includes several integrated components for data management.

A service-oriented API

for submitting queries, monitoring their execution and retrieving results.

The orchestrator

submitting Pig instructions with appropriate settings so as to optimize performance.

The indexer

providing metadata indexing using Lucene.

The compiler

translating GMQL code into Apache Pig scripts.

The repository

supporting the ad-hoc transformation of data from native formats to GDM after sample selection through indexing.
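
The role of indexing in sample selection can be illustrated with a toy inverted index. This sketch does not use Lucene and does not reproduce the system's code; it only assumes that each sample carries attribute-value metadata and that selecting samples means intersecting the sets of samples matching each condition. `build_index` and `select` are hypothetical names.

```python
from collections import defaultdict


def build_index(samples):
    """Map each (attribute, value) pair to the set of sample ids carrying it."""
    index = defaultdict(set)
    for sample_id, metadata in samples.items():
        for attribute, value in metadata.items():
            index[(attribute, value)].add(sample_id)
    return index


def select(index, **conditions):
    """Return the ids of the samples matching every attribute=value condition.

    With no conditions, this sketch returns an empty set.
    """
    matches = [index.get(pair, set()) for pair in conditions.items()]
    return set.intersection(*matches) if matches else set()


# Illustrative metadata only.
samples = {
    "s1": {"assay": "ChIP-seq", "cell": "HeLa"},
    "s2": {"assay": "ChIP-seq", "cell": "K562"},
    "s3": {"assay": "RNA-seq", "cell": "HeLa"},
}
index = build_index(samples)
selected = select(index, assay="ChIP-seq", cell="HeLa")
```

Only the selected samples would then need to be converted from their native format, which is the point of selecting before transforming.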

gendata2020 image

Input viewer

Genome space analysis


We are studying user-friendly interfaces that make it easier to express GMQL queries and to search for genomic data patterns of interest by drawing them visually in a genome browser. We developed an input tool for browsing metadata that previews the effect of sample selections by showing the number and metadata of the selected samples.

Input Viewer image


Output viewer

Genome space analysis


We also developed a client-based output tool for analyzing genome spaces, i.e. tables of region and sample data generated through GMQL. The tool supports heat map visualization, ordering, and clustering or bi-clustering.
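
To make the notion of a genome space concrete, the sketch below builds a small region-by-sample table and reorders its rows as a heat map viewer might. It is an illustration only: the labels and values are invented for the example, and sorting rows by total signal merely stands in for the ordering and clustering methods of the tool, which are not detailed here.

```python
# Rows are regions, columns are samples; each cell holds a value such as a count.
regions = ["geneA", "geneB", "geneC"]
samples = ["s1", "s2", "s3"]
genome_space = [
    [0, 1, 0],
    [5, 4, 6],
    [2, 0, 3],
]


def order_rows_by_total(labels, table):
    """Sort rows by decreasing row sum, keeping each label aligned with its row."""
    paired = sorted(zip(labels, table), key=lambda item: sum(item[1]), reverse=True)
    return [label for label, _ in paired], [row for _, row in paired]


ordered_regions, ordered_table = order_rows_by_total(regions, genome_space)
```

After ordering, the regions with the strongest overall signal appear at the top of the table, which is the kind of arrangement a heat map makes visible at a glance.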

GeCo

Data-Driven Genomic Computing


Data-Driven Genomic Computing (GeCo) is focused on genomic data tertiary analysis and integration, as a new data-driven science based on a simple driving principle: data should express high-level properties of DNA regions and samples, and high-level data management languages should support answering biological questions expressed with simple, powerful, orthogonal abstractions.

GenData 2020


GenData 2020 was a PRIN Project financed by MIUR (March 2013 - February 2016), coordinated by Politecnico di Milano, and involving 9 research centers in Italian universities (Politecnico di Milano, Università di Bergamo, Università di Milano, Università di Torino, Università di Bologna, Università di Roma 1, Università di Roma 3, Università di Salerno, Università della Calabria) to explore research problems in data-centered genomic computing.

About us

The team until 2016. For the current team, please see the web page "The GeCo team".


Stefano Ceri

Professor

DEIB, Politecnico di Milano

Marco Masseroli

Associate Professor

DEIB, Politecnico di Milano

Vahid Jalili

Completed PhD cum Laude in 2015

DEIB, Politecnico di Milano

Abdulrahman Kaitoua

PhD Student

DEIB, Politecnico di Milano

Fernando Paluzzi

PhD Student

will complete PhD at IEO

Pietro Pinoli

PhD Student

DEIB, Politecnico di Milano

Yuriy Vaskin

PhD Student

will complete PhD at Novosibirsk State University

Francesco Venco

Completed PhD cum Laude in 2015

DEIB, Politecnico di Milano

Collaborators


Gianpaolo Cugola

Professor

DEIB, Politecnico di Milano

Matteo Matteucci

Associate Professor

DEIB, Politecnico di Milano

Heiko Muller

Senior Researcher

IIT@SEMM

Chiara Leonardi

Research Fellow

DEIB, Politecnico di Milano

Partners


Polimi
DEIM
GenData 2020
IIT
IEO

Contacts


Stefano Ceri: ceri AT elet DOT polimi DOT it

Marco Masseroli: masseroli AT elet DOT polimi DOT it

via Ponzio 34/5, Milan

+39 02 23993400