Portable Microhaplotype Object (PMO)

Contents

  • Setup
  • Read in Data
  • Project Information
  • Specimen Information
  • Panel Information
  • Library Sample Information
  • Sequencing Information
  • Bioinformatics Method
  • Bioinformatics Runs
  • Microhaplotype Info
  • Merge

Public Genomic Surveillance Data of Plasmodium falciparum from Five Countries


This notebook demonstrates how publicly available genomic surveillance data from four countries (Eswatini, Namibia, South Africa, and Zambia) are reformatted into a Portable Microhaplotype Object (PMO) file for downstream analysis.

Data sources: Aranda-Díaz et al. (2025); Eloff et al. (2025); Nhlengethwa et al. (2025); Raman et al. (2025).

Setup

Here we import the necessary libraries, including those from the pmotools package that we will use to build the PMO.

Code
import pandas as pd
Code
from pmotools.pmo_builder.panel_information_to_pmo import panel_info_table_to_pmo, merge_panel_info_dicts
from pmotools.pmo_builder.metatable_to_pmo import library_sample_info_table_to_pmo, specimen_info_table_to_pmo
from pmotools.pmo_builder.mhap_table_to_pmo import mhap_table_to_pmo
from pmotools.pmo_builder.merge_to_pmo import merge_to_pmo
from pmotools.pmo_engine.pmo_writer import PMOWriter
import numpy as np

Read in Data

To build a PMO we need to assemble several pieces of information. This notebook walks through that process step-by-step, including:

  • Project Information
  • Specimen Information
  • Panel Information
  • Library Sample Information
  • Sequencing Information
  • Bioinformatics Information
  • Microhaplotype Information

We use pmotools to construct each section and then merge them at the end to create the final PMO.

For more details on any of the functions used, call help() as in the example below:

Code
help(merge_to_pmo)
Help on function merge_to_pmo in module pmotools.pmo_builder.merge_to_pmo:

merge_to_pmo(specimen_info: list, library_sample_info: list, sequencing_info: list, panel_info: dict, mhap_info: dict, bioinfo_method_info: list, bioinfo_run_info: list, project_info: list, read_counts_by_stage_info: list | None = None)
    Merge components into PMO, replacing names with indeces.
    
    :param specimen_info (list): a list of all the specimens within this project
    :param library_sample_info (list) : a list of library samples within this project
    :param sequencing_info (list) : a list of sequencing info for this project
    :param panel_info (list) : a dictionary containing the panel and target information for this project
    :param mhap_info (list) : a dictionary containing the microhaplotypes within this project, both detected and representative
    :param bioinfo_method_info (list) : the bioinformatics pipeline/methods used to generated the amplicon analysis for this project
    :param bioinfo_run_info (list) : the runtime info for the bioinformatics pipeline used to generated the amplicon analysis for this project
    :param project_info (list) : the information about the projects stored in this PMO
    :param read_counts_by_stage_info (Optional[list]) : the read counts by stage information for this project
    
    :return: a json formatted PMO string.

First we need to read in the data as it was stored for the project. In this case, the data were stored in three separate tables:

  • allele_data.txt.gz : this file contains the microhaplotype information
  • specimen_info.tsv : this file contains details of the specimens collected within the project
  • library_sample_info.tsv : this file contains the information on the sequencing performed on the specimens
Code
mhap_info_df = pd.read_csv('allele_data.txt.gz', sep='\t')
specimen_info_df = pd.read_csv('specimen_info.tsv', sep='\t')
library_sample_info = pd.read_csv('library_sample_info.tsv', sep='\t')

Below, we summarize the data by examining how many specimens were sequenced in replicate and therefore have multiple library samples associated with them.

Code
print("Number of specimens:", specimen_info_df['form participant_id'].nunique())
print("Number of library samples:", library_sample_info.SampleID.nunique())
print("\nReplicate summary:")
replicate_summary = (
    library_sample_info.groupby('specimen_id').SampleID.size()
    .value_counts().reset_index()
    .rename(columns={'SampleID': 'n_replicates'})
)
print(replicate_summary)
print("Number of specimens sequenced in replicate:", replicate_summary[replicate_summary.n_replicates > 1]['count'].sum())
Number of specimens: 2025
Number of library samples: 2592

Replicate summary:
   n_replicates  count
0             1   1503
1             2    480
2             3     39
3             4      3
Number of specimens sequenced in replicate: 522
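As a quick sanity check, the replicate summary above can be reconciled with the totals: the per-replicate counts should sum to the number of specimens, and, weighted by the number of replicates, to the number of library samples. The minimal sketch below (independent of pmotools, with the counts copied from the output above) verifies this:

```python
# Replicate summary copied from the output above: {n_replicates: count}
replicate_summary = {1: 1503, 2: 480, 3: 39, 4: 3}

# Each specimen appears once, regardless of how many replicates it has.
n_specimens = sum(replicate_summary.values())

# Each specimen contributes one library sample per replicate.
n_library_samples = sum(n * count for n, count in replicate_summary.items())

# Specimens with more than one replicate.
n_replicated = sum(count for n, count in replicate_summary.items() if n > 1)

print(n_specimens)        # 2025 specimens
print(n_library_samples)  # 2592 library samples
print(n_replicated)       # 522 specimens sequenced in replicate
```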

Project Information

In this section we enter information about the project. More information about the fields to include can be found here.

Code
pmo_project_info = [{
    "project_name":"RegGenE8",
    "project_description":"P. falciparum malaria surveillance in the Elimination 8.",
    "project_collector_chief_scientist":"Jennifer Smith",
    "project_type":"cross-sectional"
}]

Specimen Information

Now we transform the specimen table to comply with the PMO format. More information about the fields in this table can be found here.

Code
specimen_info_df.head()
form participant_id facility_district facility_province facility_name form date_diagnosis host_taxon_id specimen_taxon_id project_name country
0 650004 Sithobela Lubombo SITHOBELA RURAL HEALTH CENTER 2023-03-30 9606 5833 RegGenE8 Eswatini
1 650005 Sithobela Lubombo SITHOBELA RURAL HEALTH CENTER 2023-03-22 9606 5833 RegGenE8 Eswatini
2 650007 Sithobela Lubombo SITHOBELA RURAL HEALTH CENTER 2023-03-30 9606 5833 RegGenE8 Eswatini
3 650008 Sithobela Lubombo SITHOBELA RURAL HEALTH CENTER 2023-04-03 9606 5833 RegGenE8 Eswatini
4 650010 Sithobela Lubombo SITHOBELA RURAL HEALTH CENTER 2023-04-04 9606 5833 RegGenE8 Eswatini

Because a specimen could contain multiple taxon IDs, specimen_taxon_id is stored as a list inside the PMO. Therefore, we need to convert each value in this column to a single-element list.

Code
specimen_info_df.specimen_taxon_id = [[taxon_id] for taxon_id in specimen_info_df.specimen_taxon_id]
Code
pmo_spec_info = specimen_info_table_to_pmo(
                            specimen_info_df, 
                            specimen_name_col='form participant_id',
                            specimen_taxon_id_col='specimen_taxon_id', 
                            host_taxon_id_col='host_taxon_id', 
                            collection_date_col='form date_diagnosis',
                            collection_country_col='country',
                            project_name_col='project_name',
                            geo_admin1_col='facility_province',
                            geo_admin2_col='facility_district',
                            geo_admin3_col='facility_name'
                           )

Panel Information

The panel information describes which panel was used to sequence the samples. At the time this data was generated, only the drug resistance data was publicly available, so the dataset has been subset accordingly, and the panel information tables include only those targets. In this project, two different pool combinations from the MAD4HatTeR panel were used. Because they all share the same reference genome, we first generate the genome information here:

Code
genome_info_dict = {
        'name':'3D7',
        'genome_version':'65',
        'taxon_id':[5833],
        'url':'https://plasmodb.org/a/service/raw-files/release-65/Pfalciparum3D7/fasta/data/PlasmoDB-65_Pfalciparum3D7_Genome.fasta'
}

Now we generate the panel information using the panel_info_table_to_pmo function for each of the panels:

  • a52_pools.tsv
  • ab2_pools.tsv
Code
a52panel_info = pd.read_csv('a52_pools.tsv', sep='\t')
ab2panel_info = pd.read_csv('ab2_pools.tsv', sep='\t')
Code
# A52
a52_panel_info = panel_info_table_to_pmo(
    a52panel_info,
    'A52',
    genome_info_dict, 
    target_name_col='amplicon',
)
Code
# AB2
ab2_panel_info = panel_info_table_to_pmo(
    ab2panel_info,
    'AB2',
    genome_info_dict, 
    target_name_col='amplicon',
)

Now that we have generated all of the panels separately, we can merge them together using the merge_panel_info_dicts function.

Code
panels = [a52_panel_info, ab2_panel_info]
Code
pmo_panel_info = merge_panel_info_dicts(panels)

Library Sample Information

Here we compile the information for all library samples from all sequencing or amplification runs of the specimens in this project. More information can be found here.

Code
library_sample_info.head()
SampleID SSPOOL specimen_id Pools
0 MADH100_ES355_8071121168_A52_S365 SSPOOL19 650004 A52
1 MADH100_ES418_8071120168_A52_S369 SSPOOL19 650005 A52
2 MADH052_ES433_8071097924_A52_S69 SSPOOL13 650007 A52
3 MADH052_ES041_8071088821_A52_S18 SSPOOL13 650008 A52
4 MADH084_ES041_8071088821_A52_S290 SSPOOL16 650008 A52
Code
pmo_library_sample = library_sample_info_table_to_pmo(
    library_sample_info, 
    library_sample_name_col='SampleID', 
    sequencing_info_name_col='SSPOOL', 
    specimen_name_col='specimen_id', 
    panel_name_col='Pools',
)

Sequencing Information

Now we put together the sequencing information for every sequencing run that was carried out on the library samples.

Code
seq_platform='Illumina'
seq_instrument_model = 'MiSeq'
library_layout='150 paired-end reads'
library_strategy='AMPLICON'
library_source='GENOMIC'
library_selection='PCR'
seq_center='NICD'

pmo_seq_info = []
for seq_run in library_sample_info.SSPOOL.unique(): 
    seq_info_dict = {
        'sequencing_info_name':(seq_run), 
        'seq_platform':seq_platform, 
        'seq_instrument_model':seq_instrument_model, 
        'library_layout':library_layout, 
        'library_strategy':library_strategy,
        'library_source':library_source,
        'library_selection':library_selection,
        'seq_center':seq_center,
    }
    pmo_seq_info.append(seq_info_dict)

Bioinformatics Method

The Mad4hatter pipeline was used to analyse all of this data. Below we link to this pipeline and provide some details about the programs used within the pipeline.

Code
pmo_bioinfo_method = [
    {
        "methods": [
            {"program": "DADA2", "program_version": "3.17"},
            {"program": "cutadapt", "program_version": "4.4"},
        ],
        "pipeline": {
            "program": "Mad4hatter",
            "program_version": "v0.2.1",
            "program_url": "https://github.com/EPPIcenter/mad4hatter",
        },
    }
]

Bioinformatics Runs

Now we put together information on each of the bioinformatics runs that were used to generate the microhaplotypes.

Code
pmo_bioinfo_runs = []
for run in library_sample_info.SSPOOL.unique(): 
    bioinfo_run = {
        'bioinformatics_run_name':(run),
        'bioinformatics_methods_id':0
    }
    pmo_bioinfo_runs.append(bioinfo_run)

Microhaplotype Info

Finally, we put together the microhaplotype sections of the PMO.

Code
mhap_info_df.head()
SampleID Locus ASV Reads Allele PseudoCIGAR SSPOOL keep
0 MADH001_MPN17383_A52_S75_L001 Pf3D7_01_v3-194742-194973-1B TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... 12 Pf3D7_01_v3-194742-194973-1B.1 14+9N42+17N112+8N SSPOOL1 True
1 MADH001_MPN17383_A52_S75_L001 Pf3D7_01_v3-194742-194973-1B TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... 4154 Pf3D7_01_v3-194742-194973-1B.2 14+9N42+17N88C112+8N SSPOOL1 True
2 MADH001_MPN17383_A52_S75_L001 Pf3D7_01_v3-194742-194973-1B TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... 2056 Pf3D7_01_v3-194742-194973-1B.7 14+9N42+17N79A112+8N SSPOOL1 True
3 MADH001_MPN17383_A52_S75_L001 Pf3D7_01_v3-194742-194973-1B TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... 3 Pf3D7_01_v3-194742-194973-1B.8 14+9N42+17N79A112+8N139D=TGATCCACTTTATGATAATAT... SSPOOL1 True
4 MADH001_KZN6765_A52_S11_L001 Pf3D7_01_v3-194742-194973-1B TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... 593 Pf3D7_01_v3-194742-194973-1B.1 14+9N42+17N112+8N SSPOOL1 True
Code
pmo_mhaps = mhap_table_to_pmo(
                       microhaplotype_table=mhap_info_df, 
                       bioinformatics_run_name='SSPOOL',
                       library_sample_name_col='SampleID',
                       target_name_col='Locus',
                       seq_col='ASV',
                       reads_col='Reads')

Merge

Now that we have all of the sections, we can merge them to form a complete PMO. If there are issues with the data, they will be reported with descriptive errors. To further validate the final PMO you generate, see the validating PMOs page.

Code
dataset1_pmo = merge_to_pmo(
    pmo_spec_info,
    pmo_library_sample,
    pmo_seq_info,
    pmo_panel_info,
    pmo_mhaps,
    pmo_bioinfo_method,
    pmo_bioinfo_runs,
    pmo_project_info
)

We can now write this to a file that can be shared. Notice the '.gz' extension added to the end of the filename; this automatically compresses the output file.

Code
pmowriter = PMOWriter()
pmowriter.write_out_pmo(dataset1_pmo, "dataset1_pmo.json.gz", overwrite=True)
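Assuming the writer produces standard gzip-compressed JSON (as the '.gz' extension suggests), the file can be read back with Python's standard library alone. The helper below is a hypothetical sketch, not part of pmotools, demonstrated here on a toy object standing in for a full PMO:

```python
import gzip
import json

def read_pmo_json_gz(path):
    """Load a gzip-compressed JSON PMO file into a Python dict."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)

# Demonstrate the round trip with a toy object standing in for a PMO.
toy_pmo = {"project_info": [{"project_name": "RegGenE8"}]}
with gzip.open("toy_pmo.json.gz", "wt", encoding="utf-8") as handle:
    json.dump(toy_pmo, handle)

print(read_pmo_json_gz("toy_pmo.json.gz"))
```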

We can also validate our PMO, dataset1_pmo.json.gz, to make sure everything conforms to the schema. Any issues will be reported by the following command.

Code
!pmotools-python validate_pmo --pmo dataset1_pmo.json.gz