Portable Microhaplotype Object (PMO)
Contents

  • Why targeted sequencing needs a data standard
  • Goals of the PMO data standard
  • Our approach to development

Multiplexed targeted sequencing is now widely used to generate data for the most informative genomic regions of organisms, but the lack of an appropriate data standard has hindered data sharing, reuse, and downstream analysis. Here, we provide details for an extensible standard and related convenience utilities to store lossless, compact representations of phased, processed target sequences (microhaplotypes) along with an efficient relational ontology in a portable JSON file.

Why targeted sequencing needs a data standard

Targeted amplicon sequencing is now established as a sensitive and efficient means of obtaining relevant information about a wide variety of organisms. Applications are broad and expanding, including microbiome analysis, pathogen identification, detection of antimicrobial resistance, and tracking the spread of viruses, bacteria, and eukaryotic pathogens.

Many of these applications use the full sequences provided by individual reads because they contain multiple phased variants (microhaplotypes) (Oldoni, Kidd, and Podini 2019), information that is lost when these data are decomposed into independent variants such as SNPs. This information is particularly valuable when sequencing samples that contain organisms with more than one sequence per target, such as mixed bacterial samples, commonly polyclonal pathogens (e.g., Plasmodium) (Tessema et al. 2022; LaVerriere et al. 2022; Jacob et al. 2021; Kattenberg et al. 2023; Aranda-Díaz et al. 2025; Sadler et al. 2024), and diploid or polyploid organisms. Thus, data formats that are designed for small variants and do not preserve full sequences, such as the popular variant call format (VCF), are not well suited to storing microhaplotype data.
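
The loss of phase information can be illustrated with a toy example. The sketch below is not part of the PMO tooling; the reference sequence, the two samples, and the `snp_table` helper are all hypothetical and exist only to show that two different mixtures of microhaplotypes can collapse to the same list of independent SNPs.

```python
def snp_table(haplotypes, reference):
    """Decompose haplotypes into independent (position, alternate base) calls."""
    calls = set()
    for hap in haplotypes:
        for pos, (ref_base, base) in enumerate(zip(reference, hap)):
            if base != ref_base:
                calls.add((pos, base))
    return sorted(calls)


# Hypothetical 8-base target used only for illustration.
reference = "ACGTACGT"

# Sample A: two strains, each carrying one of the two variants.
sample_a = ["ATGTACGT", "ACGTACTT"]
# Sample B: one strain carrying both variants, plus a reference-like strain.
sample_b = ["ATGTACTT", "ACGTACGT"]

snps_a = snp_table(sample_a, reference)
snps_b = snp_table(sample_b, reference)

# The per-position view cannot tell the two samples apart ...
print("same SNP calls:", snps_a == snps_b)
# ... while the full microhaplotypes differ.
print("same microhaplotypes:", set(sample_a) == set(sample_b))
```

In this toy case both samples yield the same two SNP calls, even though no microhaplotype is shared between them, which is the kind of information a full-sequence format is meant to preserve.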

Goals of the PMO data standard

  • Provide a structured and flexible framework that helps individual researchers and groups organize their data in a findable and accessible way.
  • Create a standard for data sharing, including for repositories, academic reports, and public health entities, to aid interoperability, transparency, and reproducibility.
  • Maximize data reuse by lowering the barriers to making data publicly available in a standardized format.
  • Provide a consistent format that allows harmonization of downstream analysis tools across data sets and minimizes the need for tedious and error-prone tasks such as data reshaping.

Our approach to development

The microbiome community has created data standards for a single locus, including BIOM and ESS-DIVE. Here, we extend these standards to an arbitrary number of loci in a framework extensible to any type of targeted sequence data. The format is lossless, allowing recovery of full sequence data, while achieving data compression of ~6x, and up to ~80x with additional compression using standard tools (e.g., gzip). Optional fields allow data generators with domain expertise to include additionally processed sequence data, such as variants with masked domains (e.g., for highly error-prone areas such as tandem repeats). Notably, the framework provides a robust relational ontology for sample, laboratory, and bioinformatic metadata in addition to sequence data, mitigating the common problem of partially or completely orphaned data. A full ontology has been built out for Plasmodium, leveraging existing fields where available, and the modular structure can be flexibly extended to other biological systems, including those containing multiple types of organisms. All data are encoded in a standard JSON file, enhancing portability and ease of interpretation. The end result is a design that is efficient, lightweight, and flexible, and that organizes metadata together with genetic data. Finally, we have created a set of convenience utilities that make it easy to create, manipulate, share, import, and export PMO files.
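
As a rough illustration of how a relational layout in JSON can be both lossless and compact, the sketch below stores each unique sample name, target name, and sequence once and has every detection refer to them by index. The field names (`samples`, `targets`, `seqs`, `detections`) and the synthetic data are assumptions made for this example and are not the actual PMO schema; see the PMO fields overview for the real fields. The sizes printed for this toy data should not be read as the ~6x and ~80x figures quoted above.

```python
import gzip
import json

# Hypothetical flat table: one row per (sample, target, sequence, read count).
flat_rows = [
    {
        "sample": f"sample_{i:03d}",
        "target": "target_1",
        "seq": ("ACGTTGCA" * 20) if i % 2 else ("ACGTTGCT" * 20),
        "reads": 100 + i,
    }
    for i in range(200)
]


def to_compact(rows):
    """Store each unique value once and reference it by integer index."""
    tables = {"sample": [], "target": [], "seq": []}
    lookup = {"sample": {}, "target": {}, "seq": {}}

    def intern(kind, value):
        # Return the index of value in its table, appending it if it is new.
        if value not in lookup[kind]:
            lookup[kind][value] = len(tables[kind])
            tables[kind].append(value)
        return lookup[kind][value]

    detections = [
        [intern("sample", r["sample"]), intern("target", r["target"]),
         intern("seq", r["seq"]), r["reads"]]
        for r in rows
    ]
    return {
        "samples": tables["sample"],
        "targets": tables["target"],
        "seqs": tables["seq"],
        "detections": detections,
    }


def to_flat(compact):
    """Rebuild the original flat rows from the indexed representation."""
    return [
        {"sample": compact["samples"][s], "target": compact["targets"][t],
         "seq": compact["seqs"][q], "reads": n}
        for s, t, q, n in compact["detections"]
    ]


compact = to_compact(flat_rows)
flat_bytes = json.dumps(flat_rows).encode("utf-8")
compact_bytes = json.dumps(compact).encode("utf-8")
compact_gz = gzip.compress(compact_bytes)

# Lossless: the flat table is fully recoverable, also after gzip and JSON parsing.
restored = json.loads(gzip.decompress(compact_gz).decode("utf-8"))
print("round trip ok:", to_flat(restored) == flat_rows)
print("bytes (flat, compact, compact+gzip):",
      len(flat_bytes), len(compact_bytes), len(compact_gz))
```

Referencing sequences by index removes the repetition of full sequences across samples, and a general-purpose compressor such as gzip then shrinks the remaining JSON text further.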

References

Aranda-Díaz, Andrés, Eric Neubauer Vickers, Kathryn Murie, Brian Palmer, Nicholas Hathaway, Inna Gerlovina, Simone Boene, et al. 2025. “Sensitive and Modular Amplicon Sequencing of Plasmodium Falciparum Diversity and Resistance for Research and Public Health.” Sci. Rep. 15 (March): 10737.
Jacob, Christopher G, Nguyen Thuy-Nhien, Mayfong Mayxay, Richard J Maude, Huynh Hong Quang, Bouasy Hongvanthong, Viengxay Vanisaveth, et al. 2021. “Genetic Surveillance in the Greater Mekong Subregion and South Asia to Support Malaria Control and Elimination.” eLife 10 (August).
Kattenberg, Johanna Helena, Carlos Fernandez-Miñope, Norbert J van Dijk, Lidia Llacsahuanga Allcca, Pieter Guetens, Hugo O Valdivia, Jean-Pierre Van geertruyden, et al. 2023. “Malaria Molecular Surveillance in the Peruvian Amazon with a Novel Highly Multiplexed Plasmodium Falciparum AmpliSeq Assay.” Microbiology Spectrum 0 (0): e00960–22.
LaVerriere, Emily, Philipp Schwabl, Manuela Carrasquilla, Aimee R Taylor, Zachary M Johnson, Meg Shieh, Ruchit Panchal, et al. 2022. “Design and Implementation of Multiplexed Amplicon Sequencing Panels to Serve Genomic Epidemiology of Infectious Disease: A Malaria Case Study.” Mol. Ecol. Resour. 22 (6): 2285–2303.
Oldoni, Fabio, Kenneth K Kidd, and Daniele Podini. 2019. “Microhaplotypes in Forensic Genetics.” Forensic Sci. Int. Genet. 38 (January): 54–69.
Sadler, Jacob M, Alfred Simkin, Valery P K Tchuenkam, Isabela Gerdes Gyuricza, Abebe A Fola, Kevin Wamae, Ashenafi Assefa, et al. 2024. “Application of a New Highly Multiplexed Amplicon Sequencing Tool to Evaluate Plasmodium Falciparum Antimalarial Resistance and Relatedness in Individual and Pooled Samples from Dschang, Cameroon.” Front. Parasitol. 3: 1509261.
Tessema, Sofonias K, Nicholas J Hathaway, Noam B Teyssier, Maxwell Murphy, Anna Chen, Ozkan Aydemir, Elias M Duarte, et al. 2022. “Sensitive, Highly Multiplexed Sequencing of Microhaplotypes from the Plasmodium Falciparum Heterozygome.” J. Infect. Dis. 225 (April): 1227–37.