GenoMetric Query Language

GenoMetric Query Language (GMQL): a biologist-focused query language

Enabling queries over hundreds of datasets and thousands of samples

GMQL takes a holistic, abstract approach: it uses cloud computing technologies and an interoperable data format to store and query tens of datasets, thousands of samples and several million DNA regions, in order to discover interesting DNA regions and their relationships. Thanks to algebraic operations on both DNA regions and metadata, genome-wide GMQL queries can find interesting regions by combining mutation, expression or regulation experiments.

GMQL features

Co-existence of multiple types of regions and their meta-data.

Easy composition of high-level operations.

Powerful genome-wide processing.

Co-existence of multiple types of experimental and annotation regions and their meta-data

GMQL supports efficient high-level query processing of thousands of experimental data samples, produced with a variety of experimental methods and encoded in a variety of data formats, together with their biological and clinical metadata descriptions as well as multiple annotation data. It supports big data analysis combining hundreds of samples with millions of regions.

Easy expression and composition of high-level operations

Current tools for “big data” (e.g. BioPig, GQL) operate on reads, typically “sample by sample”. Other tools that combine samples and work on their regions (e.g. BEDTools, BEDOPS) operate only on BED files, with a “scripting style” of programming.

GMQL, instead, focuses on assisting knowledge extraction: it is meant to operate on higher-level data obtained after raw data preprocessing and feature calling, rather than on raw data directly. This avoids interfering with the variety of data preprocessing tools and pipelines already in place in different research centers, and lets GMQL benefit directly from their output, thanks to its interoperability and data integration support.

How a standard GMQL query works

Example: Identification of distal bindings in transcription regulatory regions

Powerful genome-wide processing

Public data can be used not only by themselves, but also together with experimental data produced in house, for integrated evaluations and comparisons with increased support. GMQL enables powerful genome-wide processing and leverages high-level data, including variant calling, gene expression and region enrichment (i.e. peak) calling data, which are increasingly available within large public data collections (e.g. 1000 Genomes Project, TCGA, and ENCODE).

Genomic space abstraction

Genomic data model (GDM)

Our Genomic Data Model (GDM) provides abstractions for the DNA regions of a sample and for the metadata describing the sample's properties, yielding a simple, structured outcome in a format well suited to data analysis. MAP operations, through reference regions R, extract and standardize genomic features expressed in distinct datasets.
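To make the idea concrete, here is a minimal Python sketch of a sample that carries both regions and metadata, and of a MAP-like step that counts, for each reference region, the experiment regions overlapping it. This is not GMQL syntax or the GMQL implementation: the names `Region`, `Sample` and `map_count`, the half-open coordinate convention, and the choice to ignore strand are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """A genomic region (hypothetical layout; half-open [start, stop) is assumed)."""
    chrom: str
    start: int
    stop: int
    strand: str = "*"


@dataclass
class Sample:
    """A sample: a list of regions plus attribute-value metadata."""
    sample_id: str
    regions: list
    metadata: dict = field(default_factory=dict)


def overlaps(a: Region, b: Region) -> bool:
    """True if the two regions share at least one base (strand is ignored here)."""
    return a.chrom == b.chrom and a.start < b.stop and b.start < a.stop


def map_count(reference: Sample, experiment: Sample) -> list:
    """For each reference region, count the experiment regions that overlap it."""
    return [
        (ref, sum(overlaps(ref, r) for r in experiment.regions))
        for ref in reference.regions
    ]


# Illustrative data only; coordinates and metadata values are made up.
promoters = Sample(
    "ref",
    [Region("chr1", 100, 200), Region("chr1", 500, 600), Region("chr2", 100, 200)],
    {"annotation": "promoter"},
)
peaks = Sample(
    "exp1",
    [Region("chr1", 150, 180), Region("chr1", 190, 520), Region("chr2", 300, 400)],
    {"assay": "ChIP-seq"},
)

for ref, count in map_count(promoters, peaks):
    print(ref.chrom, ref.start, ref.stop, count)
```

Because every experiment sample is summarized over the same reference regions, the resulting counts are directly comparable across samples, which is the sense in which such a step standardizes features from distinct datasets.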

Input viewer

Genome space analysis

We are studying user-friendly interfaces that further facilitate user interaction with our system, both to express GMQL queries and to search for genomic data patterns of interest by visually drawing them in a genome browser. We developed an input tool for browsing metadata, which anticipates the effect of sample selections by showing the number and metadata of the selected samples.
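The idea of previewing a selection before running a query can be sketched in a few lines of Python. This is only an illustration of the concept, not the input tool's code: the function name `preview_selection`, the dictionary layout, and the exact-match criteria are assumptions.

```python
def preview_selection(samples_metadata: dict, **criteria) -> tuple:
    """Return (count, selected) for samples whose metadata match all criteria.

    samples_metadata maps a sample id to its attribute-value metadata.
    """
    selected = {
        sample_id: metadata
        for sample_id, metadata in samples_metadata.items()
        if all(metadata.get(key) == value for key, value in criteria.items())
    }
    return len(selected), selected


# Illustrative metadata only; the attribute names and values are made up.
catalog = {
    "s1": {"assay": "ChIP-seq", "cell": "K562"},
    "s2": {"assay": "ChIP-seq", "cell": "HeLa"},
    "s3": {"assay": "RNA-seq", "cell": "K562"},
}

count, selected = preview_selection(catalog, assay="ChIP-seq")
print(count, sorted(selected))
```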

Output viewer

Genome space analysis

We also developed a client-based output tool for analyzing genome spaces, i.e. tables of region and sample data generated through GMQL. The tool supports heat map visualization, ordering, and clustering or bi-clustering.
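As a rough illustration of what a genome space looks like as data, the sketch below arranges per-sample values into a table with one row per region and one column per sample, then orders the rows by their total. It is a plain-Python sketch under assumed names (`build_genome_space`, `order_rows_by_total`) and an assumed input layout, not the output tool's implementation; clustering and heat map rendering are left out.

```python
def build_genome_space(region_labels: list, sample_values: dict) -> tuple:
    """Build a regions x samples table.

    sample_values maps a sample id to a list of values aligned with region_labels.
    Returns (sample_ids, rows), where rows[i][j] is the value of region i in sample j.
    """
    for sample_id, values in sample_values.items():
        if len(values) != len(region_labels):
            raise ValueError(f"sample {sample_id!r} has {len(values)} values, "
                             f"expected {len(region_labels)}")
    sample_ids = sorted(sample_values)
    rows = [
        [sample_values[sample_id][i] for sample_id in sample_ids]
        for i in range(len(region_labels))
    ]
    return sample_ids, rows


def order_rows_by_total(region_labels: list, rows: list) -> tuple:
    """Reorder regions so that rows with the largest total come first."""
    order = sorted(range(len(rows)), key=lambda i: sum(rows[i]), reverse=True)
    return [region_labels[i] for i in order], [rows[i] for i in order]


# Illustrative values only.
labels = ["r1", "r2", "r3"]
values = {"s2": [0, 5, 1], "s1": [2, 1, 0]}

sample_ids, rows = build_genome_space(labels, values)
ordered_labels, ordered_rows = order_rows_by_total(labels, rows)
print(sample_ids)
for label, row in zip(ordered_labels, ordered_rows):
    print(label, row)
```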

GeCo

Data-Driven Genomic Computing

Data-Driven Genomic Computing (GeCo) is focused on genomic data tertiary analysis and integration, as a new data-driven science based on a simple driving principle: data should express high-level properties of DNA regions and samples, and high-level data management languages should support answering biological questions expressed with simple, powerful, orthogonal abstractions.
The GeCo project is supported by the ERC Advanced Grant 693174 "Data-Driven Genomic Computing (GeCo)".

GenData 2020

GenData 2020 was a PRIN Project financed by MIUR (March 2013 - February 2016), coordinated by Politecnico di Milano, and involving 9 research centers in Italian universities (Politecnico di Milano, Università di Bergamo, Università di Milano, Università di Torino, Università di Bologna, Università di Roma 1, Università di Roma 3, Università di Salerno, Università della Calabria) to explore research problems in data-centered genomic computing.