GenoMetric Query Language (GMQL)

Next Generation Sequencing (NGS) technologies are producing data at increasing speed and decreasing cost; as a result, managing NGS data is quickly becoming the biggest "big data" problem of mankind. In this context, GMQL provides a next-generation query language for NGS data.

The GenoMetric Query Language operates on aligned genomic data in a variety of data formats. It provides parallel computation in the cloud, thereby supporting queries over thousands of samples, such as those provided by the ENCODE and TCGA consortia. The language's name reflects its ability to perform large-scale operations on genomic regions that take into account the regions' relative positions and distances.
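
As an illustration of what a distance-aware operation on genomic regions involves, the following is a minimal Python sketch, not GMQL code. The `Region` class, the half-open coordinate convention, and the `distance` and `within` helpers are all assumptions introduced here for illustration; GMQL's own distance definition and query syntax may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    """A genomic region (hypothetical type, for illustration only)."""
    chrom: str
    start: int  # assumed 0-based, inclusive
    stop: int   # assumed exclusive


def distance(a: Region, b: Region) -> Optional[int]:
    """Gap in bases between two regions on the same chromosome.

    Returns None for regions on different chromosomes and 0 for
    overlapping or adjacent regions (a convention assumed here).
    """
    if a.chrom != b.chrom:
        return None
    return max(0, max(a.start, b.start) - min(a.stop, b.stop))


def within(anchor: Region, candidates: List[Region], max_dist: int) -> List[Region]:
    """Keep the candidates lying at most max_dist bases from the anchor."""
    kept = []
    for candidate in candidates:
        d = distance(anchor, candidate)
        if d is not None and d <= max_dist:
            kept.append(candidate)
    return kept


gene = Region("chr1", 1000, 2000)
peaks = [
    Region("chr1", 2050, 2100),  # 50 bases downstream of the gene
    Region("chr1", 9000, 9100),  # far away on the same chromosome
    Region("chr2", 1000, 2000),  # different chromosome
]
nearby = within(gene, peaks, max_dist=100)
```

Here `nearby` keeps only the first peak. A real engine would use indexed or sorted inputs and parallel execution rather than this linear scan over candidates.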

GMQL can be used independently or within a server-based architecture built on Apache Hadoop and Apache Pig, which runs on cloud-based systems and has several components, including an orchestrator and a language compiler. From the GMQL toolkit, it is possible to download a GMQL implementation and run GMQL queries either locally or within a Hadoop Distributed File System.

GMQL is supported by the PRIN project GenData 2020.

GMQL's potential can be tested here through a set of predefined parametric queries on ENCODE and Epigenomics Roadmap data.

A newer, enhanced version of GMQL is available here.