Portable Microhaplotype Object (PMO)
Multiplexed targeted sequencing is now widely used to generate data for the most informative genomic regions of organisms, but the lack of an appropriate data standard has hindered data sharing, reuse, and downstream analysis. Here, we provide details for an extensible standard and related convenience utilities to store lossless, compact representations of phased, processed target sequences (microhaplotypes) along with an efficient relational ontology in a portable JSON file.
Why targeted sequencing needs a data standard
Targeted amplicon sequencing is now established as a sensitive and efficient means of obtaining relevant information about a wide variety of organisms. Applications are broad and expanding, including microbiome analysis, pathogen identification, detection of antimicrobial resistance, and tracking the spread of viruses, bacteria, and eukaryotic pathogens.
Many of these applications use the full sequences provided by individual reads because they contain multiple phased variants (microhaplotypes) (Oldoni, Kidd, and Podini 2019), information that is lost when these data are decomposed into independent variants such as SNPs. This information is particularly valuable when sequencing samples containing organisms with more than one sequence per target, such as mixed bacterial samples, commonly polyclonal pathogens (e.g., Plasmodium) (Tessema et al. 2022; LaVerriere et al. 2022; Jacob et al. 2021; Kattenberg et al. 2023; Aranda-Díaz et al. 2025; Sadler et al. 2024), and diploid or polyploid organisms. Thus, data formats that are designed for small variants and do not preserve full sequences, such as the popular variant call format (VCF), are not well suited to storing microhaplotype data.
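The loss of phase information can be illustrated with a small sketch. The reference sequence, the two-strain sample, and the function names below are hypothetical and chosen only for illustration; they are not part of PMO or of any existing tool. The sketch decomposes two phased sequences into per-position alleles and then enumerates every haplotype that is compatible with that unphased view.

```python
from itertools import product


def snp_alleles(haplotypes, reference):
    """Return {position: set of bases} for positions where any haplotype differs from the reference."""
    variable = {}
    for pos, ref_base in enumerate(reference):
        alleles = {hap[pos] for hap in haplotypes}
        if alleles != {ref_base}:
            variable[pos] = alleles
    return variable


def consistent_haplotypes(snps, reference):
    """Enumerate every sequence compatible with an unphased, per-position allele view."""
    positions = sorted(snps)
    candidates = set()
    for combo in product(*(sorted(snps[p]) for p in positions)):
        seq = list(reference)
        for pos, base in zip(positions, combo):
            seq[pos] = base
        candidates.add("".join(seq))
    return candidates


# Hypothetical 8-bp target and a sample containing two strains:
# one matches the reference, the other carries variants at positions 1 and 6.
reference = "ACGTACGT"
sample = {"ACGTACGT", "ATGTACCT"}

snps = snp_alleles(sample, reference)
candidates = consistent_haplotypes(snps, reference)
```

Here the per-position view is compatible with four candidate sequences, of which only two are actually present in the sample. Storing the full microhaplotypes avoids this ambiguity.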
Goals of the PMO data standard
- Provide a structured and flexible framework to help individual researchers and groups to organize their data in a findable and accessible way.
- Create a standard for data sharing, including for repositories, academic reports, and public health entities, to aid in interoperability, transparency, and reproducibility.
- Maximize data reuse by lowering the barriers for making data publicly available in a standardized format.
- Provide a consistent format to allow harmonization of downstream analysis tools across data sets and minimize the need for tedious and error-prone tasks such as data reshaping.
Our approach to development
The microbiome community has created data standards for a single locus, including BIOM and ESS-DIVE. Here, we extend these standards to an arbitrary number of loci in a framework extensible to any type of targeted sequence data. The format is lossless, allowing recovery of full sequence data, while compressing the data ~6x on its own and up to ~80x when further compressed with standard tools (e.g., gzip). Optional fields allow data generators with domain expertise to include additionally processed sequence data, such as variants with masked domains (e.g., for highly error-prone regions such as tandem repeats). Notably, the framework provides a robust relational ontology for sample, laboratory, and bioinformatic metadata in addition to sequence data, mitigating the common problem of partially or completely orphaned data. A full ontology has been built out for Plasmodium, leveraging existing fields where available, and the modular structure can be flexibly extended to other biological systems, including those containing multiple types of organisms. All data are encoded in a standard JSON file, enhancing portability and ease of interpretation. The result is a design that is efficient, lightweight, and flexible, and that organizes metadata together with genetic data. Finally, we have created a set of convenience utilities to make it easy to create, manipulate, share, import, and export PMO files.
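The general idea of a relational, lossless layout can be sketched as follows. This is a minimal illustration under stated assumptions, not the PMO schema: the table names (`samples`, `targets`, `sequences`, `detected`), the helper functions, and the example rows are all hypothetical. Each distinct sample, target, and sequence is stored once, and each detection refers to them by index, so that the flat records can be recovered exactly after a round trip through JSON and gzip.

```python
import gzip
import json


def _intern(value, table, lookup):
    """Store each distinct value once and return its index in the table."""
    if value not in lookup:
        lookup[value] = len(table)
        table.append(value)
    return lookup[value]


def build_tables(rows):
    """Convert flat (sample, target, sequence, reads) rows into index-linked tables."""
    samples, targets, sequences = [], [], []
    s_idx, t_idx, q_idx = {}, {}, {}
    detected = []
    for sample, target, seq, reads in rows:
        detected.append([
            _intern(sample, samples, s_idx),
            _intern(target, targets, t_idx),
            _intern(seq, sequences, q_idx),
            reads,
        ])
    # Hypothetical table names, used here for illustration only.
    return {"samples": samples, "targets": targets,
            "sequences": sequences, "detected": detected}


def expand_tables(doc):
    """Recover the original flat rows from the index-linked tables."""
    return [
        (doc["samples"][s], doc["targets"][t], doc["sequences"][q], reads)
        for s, t, q, reads in doc["detected"]
    ]


# Hypothetical detections: one sequence is shared by two samples.
rows = [
    ("sample_A", "target_1", "ACGTACGTAC", 120),
    ("sample_A", "target_1", "ACGTTCGTAC", 30),
    ("sample_B", "target_1", "ACGTACGTAC", 95),
    ("sample_B", "target_2", "GGCATTGCAA", 210),
]

doc = build_tables(rows)
packed = gzip.compress(json.dumps(doc).encode("utf-8"))
restored = expand_tables(json.loads(gzip.decompress(packed).decode("utf-8")))
```

In this sketch the shared sequence is stored only once, and `restored` is identical to `rows`. Any size reduction obtained this way depends on how often sequences recur across samples, so the toy example says nothing about the compression ratios reported above.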