Portable Microhaplotype Object (PMO)
Website describing Portable Microhaplotype Object development
Targeted Amplicon Analysis
Introduction
An amplicon is a piece of DNA or RNA that has been amplified. Here, "amplicon data" means targeted amplicon data: a specific target of interest has been amplified with a set of primers and sequenced across a set of samples. The sequence recovered from an amplicon can be referred to as a microhaplotype, as opposed to a full haplotype, which would span a whole chromosome. A microhaplotype typically covers more than one variant and is approximately <=300 bp, following its original definition (Oldoni, Kidd, and Podini 2019), although it does not have to meet those criteria precisely.
Goals
Targeted amplicon sequencing is a well-established technique, made most popular by microbiome analysis, where a piece of the 16S rRNA gene is amplified to help classify a microbiome mixture by defining taxon data and abundance.
The goal of developing an amplicon file format, especially with Plasmodium in mind, is to ease comparison between experimental runs within and across labs.
Sources of data
Data can be generated by targeted amplicon sequencing experiments or by doing local assembly on WGS data.
Proposed standards for targeted amplicon data and metadata
With the generation of amplicon sequencing data for Plasmodium accelerating, there is a timely opportunity to create shared resources to disseminate, reuse, and analyze these data. However, there is currently no standard for the lossless representation of microhaplotypes derived from these approaches, nor for the associated laboratory, bioinformatic, and clinical metadata.
A standardized format for microhaplotype data would facilitate data sharing, including the development of appropriate repositories, along with transparency and reproducibility of analysis. Standardization at this central step in analysis would also allow downstream tools to align on one input, increasing incentives to develop robust, reusable software and enabling cross-study analyses.
We propose a single, relational data structure using JSON as a portable file. This approach allows for a design that is efficient, lightweight, and flexible, organizing metadata together with genetic data.
See the current alpha development version of the file format here
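To make the idea of a single relational structure stored as JSON more concrete, here is a minimal Python sketch. Every table name, field name, sample, target, and sequence in it is hypothetical and chosen only for illustration; it is not the actual PMO schema. It shows metadata tables (samples, targets) sitting next to the genetic data (unique microhaplotype sequences) in one object, with a detection table that links them by index so that no sequence or metadata record is repeated.

```python
import json
from collections import defaultdict

# Hypothetical layout for illustration only; NOT the actual PMO schema.
# Each top-level key acts like a table; rows refer to each other by list index.
pmo_sketch = {
    "samples": [
        {"sample_name": "sampleA"},
        {"sample_name": "sampleB"},
    ],
    "targets": [
        {"target_name": "target1"},
        {"target_name": "target2"},
    ],
    # Unique sequences, stored once per target (made-up sequences).
    "microhaplotypes": [
        {"target_index": 0, "seq": "ACGTACGT"},
        {"target_index": 0, "seq": "ACGTACGA"},
        {"target_index": 1, "seq": "TTGACCA"},
    ],
    # Which sample contained which microhaplotype, and with how many reads.
    "detections": [
        {"sample_index": 0, "microhaplotype_index": 0, "reads": 120},
        {"sample_index": 0, "microhaplotype_index": 1, "reads": 30},
        {"sample_index": 0, "microhaplotype_index": 2, "reads": 80},
        {"sample_index": 1, "microhaplotype_index": 0, "reads": 200},
    ],
}


def relative_abundance(pmo):
    """Map (sample_name, target_name, seq) to its read fraction within that sample and target."""
    # First pass: total reads per (sample, target) pair.
    totals = defaultdict(int)
    for det in pmo["detections"]:
        hap = pmo["microhaplotypes"][det["microhaplotype_index"]]
        totals[(det["sample_index"], hap["target_index"])] += det["reads"]

    # Second pass: resolve indices to names and compute fractions.
    fractions = {}
    for det in pmo["detections"]:
        hap = pmo["microhaplotypes"][det["microhaplotype_index"]]
        sample = pmo["samples"][det["sample_index"]]["sample_name"]
        target = pmo["targets"][hap["target_index"]]["target_name"]
        total = totals[(det["sample_index"], hap["target_index"])]
        fractions[(sample, target, hap["seq"])] = det["reads"] / total
    return fractions


# The whole structure round-trips through a single portable JSON document.
text = json.dumps(pmo_sketch, indent=2)
restored = json.loads(text)
freqs = relative_abundance(restored)
```

Because the rows reference each other by index, a downstream tool can join the tables it needs (as `relative_abundance` does here) without reshaping the file, and the same JSON text can be exchanged between labs unchanged.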
Goals of file format
- Help individual researchers / groups organize their data
- Format for sharing and reporting data to aid in standardization, transparency, and reproducibility
- Align downstream analysis tools so there is no need to reshape data differently for each application
- Make data publicly available for analysis
- Facilitate cross-study analysis and tools e.g. “next-malaria”
Standards
Standards have already been defined for genomic data, and the goal of the file format is to abide by them, both to make the format easier to use and to comply with existing standardization efforts.
The Genomic Standards Consortium
The Genomic Standards Consortium (GSC) is an open-membership working body formed in September 2005. The aim of the GSC is making genomic data discoverable. The GSC enables genomic data integration, discovery and comparison through international community-driven standards.
The GSC is listed on FAIRsharing, a registry of standards organized around the findability, accessibility, interoperability, and reusability (FAIR) principles:
https://fairsharing.org/GSC
MIxS
The GSC has developed several standards to standardize the way genomics and sequencing data are described, including the “Minimum Information about any (X) Sequence” (MIxS) specification:
https://github.com/GenomicsStandardsConsortium/mixs/
Without specific guidelines, most genomic, metagenomic and marker gene sequences in databases are sparsely annotated with the information required to guide data integration, comparative studies and knowledge generation. Even with complex keyword searches, it is currently impossible to reliably retrieve sequences that have originated from certain environments or particular locations on Earth (for example, all sequences from ‘soil’ or ‘freshwater lakes’ in a certain region of the world). Because public databases of the International Nucleotide Sequence Database Collaboration (INSDC; comprising DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (EBI-ENA) and GenBank) depend on author-submitted information to enrich the value of sequence data sets, we argue that the only way to change the current practice is to establish a standard of reporting that requires contextual data to be deposited at the time of sequence submission. The adoption of such a standard would elevate the quality, accessibility and utility of information that can be collected from INSDC or any other data repository.
Microbial specific
The microbiome field has long dealt with targeted amplicon analysis, primarily on the 16S rRNA gene and more recently on multilocus sequence typing (MLST).
ESS-DIVE amplicon file formatting
The microbiome community has therefore already developed standards for a generalized targeted amplicon file. One set comes from the Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE): https://github.com/ess-dive-community, https://ess-dive.lbl.gov/
These standards try to closely follow the Minimum Information about any (X) Sequence (MIxS) standards.
biom amplicon file formatting
Another consideration is the biom format, which uses HDF5 to store data in binary form; it is used by QIIME along with several other common bacterial genomics tools.