This notebook demonstrates how publicly available genomic surveillance data from four countries (Eswatini, Namibia, South Africa, and Zambia) are reformatted into a Portable Microhaplotype Object (PMO) file for downstream analysis.
Data sources: Aranda-Díaz et al. (2025); Eloff et al. (2025); Nhlengethwa et al. (2025); Raman et al. (2025).
Here we import the necessary libraries, including those from the pmotools package that we will use to build the PMO.
```python
import pandas as pd
import numpy as np

from pmotools.pmo_builder.panel_information_to_pmo import panel_info_table_to_pmo, merge_panel_info_dicts
from pmotools.pmo_builder.metatable_to_pmo import library_sample_info_table_to_pmo, specimen_info_table_to_pmo
from pmotools.pmo_builder.mhap_table_to_pmo import mhap_table_to_pmo
from pmotools.pmo_builder.merge_to_pmo import merge_to_pmo
from pmotools.pmo_engine.pmo_writer import *
```

To build a PMO we need to assemble several pieces of information. This notebook walks through that process step-by-step, including:

- project information
- specimen information
- library sample information
- sequencing information
- panel and target information
- bioinformatics method and run information
- microhaplotype information (detected and representative)
We use pmotools to construct each section and then merge them at the end to create the final PMO.
For any of the functions being used, you can see more details by running something like the example below:
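```python
help(merge_to_pmo)
```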
```
Help on function merge_to_pmo in module pmotools.pmo_builder.merge_to_pmo:

merge_to_pmo(specimen_info: list, library_sample_info: list, sequencing_info: list, panel_info: dict, mhap_info: dict, bioinfo_method_info: list, bioinfo_run_info: list, project_info: list, read_counts_by_stage_info: list | None = None)
    Merge components into PMO, replacing names with indeces.

    :param specimen_info (list): a list of all the specimens within this project
    :param library_sample_info (list) : a list of library samples within this project
    :param sequencing_info (list) : a list of sequencing info for this project
    :param panel_info (list) : a dictionary containing the panel and target information for this project
    :param mhap_info (list) : a dictionary containing the microhaplotypes within this project, both detected and representative
    :param bioinfo_method_info (list) : the bioinformatics pipeline/methods used to generated the amplicon analysis for this project
    :param bioinfo_run_info (list) : the runtime info for the bioinformatics pipeline used to generated the amplicon analysis for this project
    :param project_info (list) : the information about the projects stored in this PMO
    :param read_counts_by_stage_info (Optional[list]) : the read counts by stage information for this project
    :return: a json formatted PMO string.
```
First we need to read in the data as it was stored for the project. In this case, there were three separate tables: a specimen information table, a library sample information table, and a microhaplotype (allele) table.
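For example, assuming the three tables are stored as CSV files (the file names below are hypothetical), they can be read in with pandas:

```python
# Hypothetical file names; adjust to wherever the project tables are stored.
specimen_info_df = pd.read_csv('specimen_info.csv')
library_sample_info = pd.read_csv('library_sample_info.csv')
mhap_table = pd.read_csv('allele_data.csv')
```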
Below, we summarize the data by examining how many specimens were sequenced in replicate and therefore have multiple library samples associated with them.
print("Number of specimens:",specimen_info_df['form participant_id'].nunique())
print("Number of library samples:",library_sample_info.SampleID.nunique())
print("\nReplicate summary:")
replicate_summary = library_sample_info.groupby('specimen_id').SampleID.size().value_counts().reset_index().rename(columns={'SampleID':'n_replicates'})
print(replicate_summary)
print("Number of specimens sequenced in replicate:",replicate_summary[replicate_summary.n_replicates>1]['count'].sum())Number of specimens: 2025
Number of library samples: 2592
Replicate summary:
n_replicates count
0 1 1503
1 2 480
2 3 39
3 4 3
Number of specimens sequenced in replicate: 522
In this section we enter information about the project. More information about the fields to include can be found here.
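A minimal sketch of what this could look like is shown below; the exact field names required by the PMO project information section should be taken from the linked documentation, and the description shown here is illustrative rather than the exact text used for this project.

```python
# A sketch only: field names and the description are assumptions.
pmo_project_info = [
    {
        'project_name': 'RegGenE8',  # matches the project_name used in the specimen table below
        'project_description': 'Regional genomic surveillance data from Eswatini, Namibia, South Africa, and Zambia',
    }
]
```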
Now we transform the specimen table to comply with the PMO format. More information about the fields in this table can be found here.
|   | form participant_id | facility_district | facility_province | facility_name | form date_diagnosis | host_taxon_id | specimen_taxon_id | project_name | country |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 650004 | Sithobela | Lubombo | SITHOBELA RURAL HEALTH CENTER | 2023-03-30 | 9606 | 5833 | RegGenE8 | Eswatini |
| 1 | 650005 | Sithobela | Lubombo | SITHOBELA RURAL HEALTH CENTER | 2023-03-22 | 9606 | 5833 | RegGenE8 | Eswatini |
| 2 | 650007 | Sithobela | Lubombo | SITHOBELA RURAL HEALTH CENTER | 2023-03-30 | 9606 | 5833 | RegGenE8 | Eswatini |
| 3 | 650008 | Sithobela | Lubombo | SITHOBELA RURAL HEALTH CENTER | 2023-04-03 | 9606 | 5833 | RegGenE8 | Eswatini |
| 4 | 650010 | Sithobela | Lubombo | SITHOBELA RURAL HEALTH CENTER | 2023-04-04 | 9606 | 5833 | RegGenE8 | Eswatini |
Because a specimen could contain multiple taxon IDs, the specimen_taxon_id is stored as a list inside the PMO. Therefore, we need to convert the values in this column to lists now.
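A minimal sketch of that conversion, assuming each specimen currently has a single taxon ID:

```python
# Wrap each single taxon ID in a one-element list so it matches the PMO format.
specimen_info_df['specimen_taxon_id'] = specimen_info_df['specimen_taxon_id'].apply(lambda x: [x])
```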
```python
pmo_spec_info = specimen_info_table_to_pmo(
    specimen_info_df,
    specimen_name_col='form participant_id',
    specimen_taxon_id_col='specimen_taxon_id',
    host_taxon_id_col='host_taxon_id',
    collection_date_col='form date_diagnosis',
    collection_country_col='country',
    project_name_col='project_name',
    geo_admin1_col='facility_province',
    geo_admin2_col='facility_district',
    geo_admin3_col='facility_name'
)
```

The panel information describes which panel was used to sequence the samples. At the time this data was generated, only the drug resistance data was publicly available, so the dataset has been subset accordingly, and the panel information tables include only those targets. In this project, two different pool combinations from the MAD4HatTeR panel were used. Because they all share the same reference genome, we first generate the genome information here:
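A sketch of what this genome entry could look like; the field names and values below are assumptions (apart from the 3D7 reference implied by the locus names) and should follow the PMO panel information schema.

```python
# Assumed field names and values for the shared reference genome entry.
genome_info = {
    'name': 'Pf3D7',         # P. falciparum 3D7 reference, implied by locus names such as Pf3D7_01_v3-...
    'taxon_id': 5833,        # P. falciparum, matching specimen_taxon_id above
    'genome_version': 'v3',  # assumption based on the _v3 suffix in the locus names
}
```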
Now we generate the panel information using the panel_info_table_to_pmo function for each of the panels.
Now that we have all of the panels generated separately, we can merge them together using the merge_panel_info_dicts function.
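A rough sketch of these two steps is shown below; the panel table variable names are hypothetical and the way arguments are passed to panel_info_table_to_pmo and merge_panel_info_dicts is an assumption (the real signatures can be checked with help(), as shown earlier).

```python
# Hypothetical panel tables, one per pool combination; column-name arguments
# to panel_info_table_to_pmo are omitted here and will depend on the tables.
panel_pmos = []
for panel_table in [panel_table_1_df, panel_table_2_df]:
    panel_pmos.append(panel_info_table_to_pmo(panel_table))

# Assumed usage: combine the per-panel dictionaries into one panel_info object.
pmo_panel_info = merge_panel_info_dicts(panel_pmos)
```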
Here we compile the information for all library samples from all sequencing or amplification runs of the specimens in this project. More information can be found here.
|   | SampleID | SSPOOL | specimen_id | Pools |
|---|---|---|---|---|
| 0 | MADH100_ES355_8071121168_A52_S365 | SSPOOL19 | 650004 | A52 |
| 1 | MADH100_ES418_8071120168_A52_S369 | SSPOOL19 | 650005 | A52 |
| 2 | MADH052_ES433_8071097924_A52_S69 | SSPOOL13 | 650007 | A52 |
| 3 | MADH052_ES041_8071088821_A52_S18 | SSPOOL13 | 650008 | A52 |
| 4 | MADH084_ES041_8071088821_A52_S290 | SSPOOL16 | 650008 | A52 |
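A sketch of converting this table with library_sample_info_table_to_pmo; the argument names are assumptions based on the columns shown above (check help(library_sample_info_table_to_pmo) for the real signature).

```python
# Argument names are assumptions; the columns referenced come from the table above.
pmo_library_sample_info = library_sample_info_table_to_pmo(
    library_sample_info,
    library_sample_name_col='SampleID',
    specimen_name_col='specimen_id',
    sequencing_info_name_col='SSPOOL',
    panel_name_col='Pools',
)
```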
Now we put together the sequencing information for every sequencing run that was carried out on the library samples.
```python
seq_platform = 'Illumina'
seq_instrument_model = 'MiSeq'
library_layout = '150 paired-end reads'
library_strategy = 'AMPLICON'
library_source = 'GENOMIC'
library_selection = 'PCR'
seq_center = 'NICD'

pmo_seq_info = []
for seq_run in library_sample_info.SSPOOL.unique():
    seq_info_dict = {
        'sequencing_info_name': seq_run,
        'seq_platform': seq_platform,
        'seq_instrument_model': seq_instrument_model,
        'library_layout': library_layout,
        'library_strategy': library_strategy,
        'library_source': library_source,
        'library_selection': library_selection,
        'seq_center': seq_center,
    }
    pmo_seq_info.append(seq_info_dict)
```

The Mad4hatter pipeline was used to analyse all of this data. Below we link to this pipeline and provide some details about the programs used within the pipeline.
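A sketch of what a bioinformatics methods entry could look like; the field names and program details below are assumptions rather than the exact values used for this project.

```python
# Field names and values are assumptions; fill in the actual pipeline details.
pmo_bioinfo_method_info = [
    {
        'bioinfo_method_name': 'mad4hatter',
        'program': 'mad4hatter',
        'program_version': '0.0.0',  # placeholder; record the actual pipeline version used
        'purpose': 'amplicon denoising and microhaplotype calling',
    }
]
```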
Now we put together information on each of the bioinformatics runs that were used to generate the microhaplotypes.
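A sketch of the run-level information; the field names are assumptions, and an entry would typically record which method was used for each run and when it was run.

```python
# Field names are assumptions; one entry per bioinformatics run.
pmo_bioinfo_run_info = [
    {
        'bioinfo_run_name': 'mad4hatter_run_1',
        'bioinfo_method_name': 'mad4hatter',  # links back to the methods entry above
        'run_date': '2025-01-01',             # placeholder date
    }
]
```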
Finally, we put together the microhaplotype sections of the PMO.
|   | SampleID | Locus | ASV | Reads | Allele | PseudoCIGAR | SSPOOL | keep |
|---|---|---|---|---|---|---|---|---|
| 0 | MADH001_MPN17383_A52_S75_L001 | Pf3D7_01_v3-194742-194973-1B | TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... | 12 | Pf3D7_01_v3-194742-194973-1B.1 | 14+9N42+17N112+8N | SSPOOL1 | True |
| 1 | MADH001_MPN17383_A52_S75_L001 | Pf3D7_01_v3-194742-194973-1B | TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... | 4154 | Pf3D7_01_v3-194742-194973-1B.2 | 14+9N42+17N88C112+8N | SSPOOL1 | True |
| 2 | MADH001_MPN17383_A52_S75_L001 | Pf3D7_01_v3-194742-194973-1B | TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... | 2056 | Pf3D7_01_v3-194742-194973-1B.7 | 14+9N42+17N79A112+8N | SSPOOL1 | True |
| 3 | MADH001_MPN17383_A52_S75_L001 | Pf3D7_01_v3-194742-194973-1B | TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... | 3 | Pf3D7_01_v3-194742-194973-1B.8 | 14+9N42+17N79A112+8N139D=TGATCCACTTTATGATAATAT... | SSPOOL1 | True |
| 4 | MADH001_KZN6765_A52_S11_L001 | Pf3D7_01_v3-194742-194973-1B | TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... | 593 | Pf3D7_01_v3-194742-194973-1B.1 | 14+9N42+17N112+8N | SSPOOL1 | True |
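A sketch of converting this table with mhap_table_to_pmo; the argument names are assumptions based on the columns shown above, and mhap_table is the hypothetical allele table read in earlier (check help(mhap_table_to_pmo) for the real signature).

```python
# Argument names are assumptions; the columns referenced come from the allele table above.
pmo_mhap_info = mhap_table_to_pmo(
    mhap_table[mhap_table.keep],  # keep only the rows flagged to retain
    library_sample_name_col='SampleID',
    target_name_col='Locus',
    seq_col='ASV',
    reads_col='Reads',
)
```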
Now that we have all of the sections, we can merge them together to form a complete PMO. If there are issues with the data, they will be reported with descriptive errors. To further validate the final PMO you generate, see the validating pmos page.
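Putting the pieces together with merge_to_pmo (its keyword arguments follow the signature shown in the help output earlier); the variable names for sections whose construction is only sketched above are assumptions.

```python
# Swap in the objects actually built for your project for the sketched names.
pmo = merge_to_pmo(
    specimen_info=pmo_spec_info,
    library_sample_info=pmo_library_sample_info,
    sequencing_info=pmo_seq_info,
    panel_info=pmo_panel_info,
    mhap_info=pmo_mhap_info,
    bioinfo_method_info=pmo_bioinfo_method_info,
    bioinfo_run_info=pmo_bioinfo_run_info,
    project_info=pmo_project_info,
)
```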
We can now write this to a file that can be shared. Notice the '.gz' extension added to the end of the filename; this will automatically compress the output file.
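The pmo_writer module imported above provides helpers for this step; as a minimal sketch using only the standard library (and assuming merge_to_pmo returned a JSON-formatted string, as its docstring states), the compressed output could be written like this:

```python
import gzip

# Write the JSON string to a gzip-compressed file; the file name mirrors the one referenced below.
with gzip.open('dataset1_pmo.json.gz', 'wt') as out_file:
    out_file.write(pmo)
```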
We can also validate our PMO, dataset1_pmo.json.gz, and make sure everything is correct according to the schema. If there are any issues, they will be reported by the following command.