This notebook demonstrates how publicly available genomic surveillance data from four countries (Eswatini, Namibia, South Africa, and Zambia) are reformatted into a Portable Microhaplotype Object (PMO) file for downstream analysis.
Data sources: Aranda-Díaz et al. (2025); Eloff et al. (2025); Nhlengethwa et al. (2025); Raman et al. (2025).
Here we import the necessary libraries, including those from the pmotools package that we will use to build the PMO.
```python
import pandas as pd
import numpy as np

from pmotools.pmo_builder.panel_information_to_pmo import panel_info_table_to_pmo, merge_panel_info_dicts
from pmotools.pmo_builder.metatable_to_pmo import library_sample_info_table_to_pmo, specimen_info_table_to_pmo
from pmotools.pmo_builder.mhap_table_to_pmo import mhap_table_to_pmo
from pmotools.pmo_builder.merge_to_pmo import merge_to_pmo
from pmotools.pmo_engine.pmo_writer import *
```

To build a PMO we need to assemble several pieces of information. This notebook walks through that process step-by-step, including:

- project information
- specimen information
- library sample information
- sequencing information
- panel and target information
- bioinformatics method and run information
- microhaplotype information (detected and representative)
We use pmotools to construct each section and then merge them at the end to create the final PMO.
For any of the functions being used, you can see more details by running something like the example below:
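```python
help(merge_to_pmo)
```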
```
Help on function merge_to_pmo in module pmotools.pmo_builder.merge_to_pmo:

merge_to_pmo(specimen_info: list, library_sample_info: list, sequencing_info: list, panel_info: dict, mhap_info: dict, bioinfo_method_info: list, bioinfo_run_info: list, project_info: list, read_counts_by_stage_info: list | None = None)
    Merge components into PMO, replacing names with indeces.

    :param specimen_info (list): a list of all the specimens within this project
    :param library_sample_info (list) : a list of library samples within this project
    :param sequencing_info (list) : a list of sequencing info for this project
    :param panel_info (list) : a dictionary containing the panel and target information for this project
    :param mhap_info (list) : a dictionary containing the microhaplotypes within this project, both detected and representative
    :param bioinfo_method_info (list) : the bioinformatics pipeline/methods used to generated the amplicon analysis for this project
    :param bioinfo_run_info (list) : the runtime info for the bioinformatics pipeline used to generated the amplicon analysis for this project
    :param project_info (list) : the information about the projects stored in this PMO
    :param read_counts_by_stage_info (Optional[list]) : the read counts by stage information for this project
    :return: a json formatted PMO string.
```
First we need to read in the data as it was stored for the project. In this case, there were three separate tables: a specimen information table, a library sample information table, and a microhaplotype (allele) table.
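For example, assuming the three tables are stored as CSV files (the file names below are hypothetical), they can be read in with pandas:

```python
# Hypothetical file names; adjust to wherever the project tables are stored.
specimen_info_df = pd.read_csv('specimen_info.csv')
library_sample_info = pd.read_csv('library_sample_info.csv')
mhap_table = pd.read_csv('allele_data.csv')
```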
Below, we summarize the data by examining how many specimens were sequenced in replicate and therefore have multiple library samples associated with them.
print("Number of specimens:",specimen_info_df['form participant_id'].nunique())
print("Number of library samples:",library_sample_info.SampleID.nunique())
print("\nReplicate summary:")
replicate_summary = library_sample_info.groupby('specimen_id').SampleID.size().value_counts().reset_index().rename(columns={'SampleID':'n_replicates'})
print(replicate_summary)
print("Number of specimens sequenced in replicate:",replicate_summary[replicate_summary.n_replicates>1]['count'].sum())Number of specimens: 2025
Number of library samples: 2592
Replicate summary:
n_replicates count
0 1 1503
1 2 480
2 3 39
3 4 3
Number of specimens sequenced in replicate: 522
In this section we enter information about the project. More information about the fields to include can be found here.
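A minimal sketch of what this could look like is shown below; the exact field names required by the PMO project information section should be taken from the linked documentation, and the description shown here is illustrative rather than the exact text used for this project.

```python
# A sketch only: field names and the description are assumptions.
pmo_project_info = [
    {
        'project_name': 'RegGenE8',  # matches the project_name used in the specimen table below
        'project_description': 'Regional genomic surveillance data from Eswatini, Namibia, South Africa, and Zambia',
    }
]
```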
Now we transform the specimen table to comply with the PMO format. More information about the fields in this table can be found here.
|   | form participant_id | facility_district | facility_province | facility_name | form date_diagnosis | host_taxon_id | specimen_taxon_id | project_name | country |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 650004 | Sithobela | Lubombo | SITHOBELA RURAL HEALTH CENTER | 2023-03-30 | 9606 | 5833 | RegGenE8 | Eswatini |
| 1 | 650005 | Sithobela | Lubombo | SITHOBELA RURAL HEALTH CENTER | 2023-03-22 | 9606 | 5833 | RegGenE8 | Eswatini |
| 2 | 650007 | Sithobela | Lubombo | SITHOBELA RURAL HEALTH CENTER | 2023-03-30 | 9606 | 5833 | RegGenE8 | Eswatini |
| 3 | 650008 | Sithobela | Lubombo | SITHOBELA RURAL HEALTH CENTER | 2023-04-03 | 9606 | 5833 | RegGenE8 | Eswatini |
| 4 | 650010 | Sithobela | Lubombo | SITHOBELA RURAL HEALTH CENTER | 2023-04-04 | 9606 | 5833 | RegGenE8 | Eswatini |
Because a specimen could contain multiple taxon IDs, the specimen_taxon_id is stored as a list inside the PMO. Therefore, we need to convert the values in this column to lists now.
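A minimal sketch of that conversion, assuming each specimen currently has a single taxon ID:

```python
# Wrap each single taxon ID in a one-element list so it matches the PMO format.
specimen_info_df['specimen_taxon_id'] = specimen_info_df['specimen_taxon_id'].apply(lambda x: [x])
```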
```python
pmo_spec_info = specimen_info_table_to_pmo(
    specimen_info_df,
    specimen_name_col='form participant_id',
    specimen_taxon_id_col='specimen_taxon_id',
    host_taxon_id_col='host_taxon_id',
    collection_date_col='form date_diagnosis',
    collection_country_col='country',
    project_name_col='project_name',
    geo_admin1_col='facility_province',
    geo_admin2_col='facility_district',
    geo_admin3_col='facility_name'
)
```

The panel information describes which panel was used to sequence the samples. At the time this data was generated, only the drug resistance data was publicly available, so the dataset has been subset accordingly, and the panel information tables include only those targets. In this project, two different pool combinations from the MAD4HatTeR panel were used. Because they all share the same reference genome, we first generate the genome information here:
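A sketch of what this genome entry could look like; the field names and values below are assumptions (apart from the 3D7 reference implied by the locus names) and should follow the PMO panel information schema.

```python
# Assumed field names and values for the shared reference genome entry.
genome_info = {
    'name': 'Pf3D7',         # P. falciparum 3D7 reference, implied by locus names such as Pf3D7_01_v3-...
    'taxon_id': 5833,        # P. falciparum, matching specimen_taxon_id above
    'genome_version': 'v3',  # assumption based on the _v3 suffix in the locus names
}
```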
Now we generate the panel information using the panel_info_table_to_pmo function for each of the panels.
Now that we have all of the panels generated separately, we can merge them together using the merge_panel_info_dicts function.
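A rough sketch of these two steps is shown below; the panel table variable names are hypothetical and the way arguments are passed to panel_info_table_to_pmo and merge_panel_info_dicts is an assumption (the real signatures can be checked with help(), as shown earlier).

```python
# Hypothetical panel tables, one per pool combination; column-name arguments
# to panel_info_table_to_pmo are omitted here and will depend on the tables.
panel_pmos = []
for panel_table in [panel_table_1_df, panel_table_2_df]:
    panel_pmos.append(panel_info_table_to_pmo(panel_table))

# Assumed usage: combine the per-panel dictionaries into one panel_info object.
pmo_panel_info = merge_panel_info_dicts(panel_pmos)
```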
Here we compile the information for all library samples from all sequencing or amplification runs of the specimens in this project. More information can be found here.
|   | SampleID | SSPOOL | specimen_id | Pools |
|---|---|---|---|---|
| 0 | MADH100_ES355_8071121168_A52_S365 | SSPOOL19 | 650004 | A52 |
| 1 | MADH100_ES418_8071120168_A52_S369 | SSPOOL19 | 650005 | A52 |
| 2 | MADH052_ES433_8071097924_A52_S69 | SSPOOL13 | 650007 | A52 |
| 3 | MADH052_ES041_8071088821_A52_S18 | SSPOOL13 | 650008 | A52 |
| 4 | MADH084_ES041_8071088821_A52_S290 | SSPOOL16 | 650008 | A52 |
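A sketch of converting this table with library_sample_info_table_to_pmo; the argument names are assumptions based on the columns shown above (check help(library_sample_info_table_to_pmo) for the real signature).

```python
# Argument names are assumptions; the columns referenced come from the table above.
pmo_library_sample_info = library_sample_info_table_to_pmo(
    library_sample_info,
    library_sample_name_col='SampleID',
    specimen_name_col='specimen_id',
    sequencing_info_name_col='SSPOOL',
    panel_name_col='Pools',
)
```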
Now we put together the sequencing information for every sequencing run that was carried out on the library samples.
```python
seq_platform = 'Illumina'
seq_instrument_model = 'MiSeq'
library_layout = '150 paired-end reads'
library_strategy = 'AMPLICON'
library_source = 'GENOMIC'
library_selection = 'PCR'
seq_center = 'NICD'

pmo_seq_info = []
for seq_run in library_sample_info.SSPOOL.unique():
    seq_info_dict = {
        'sequencing_info_name': seq_run,
        'seq_platform': seq_platform,
        'seq_instrument_model': seq_instrument_model,
        'library_layout': library_layout,
        'library_strategy': library_strategy,
        'library_source': library_source,
        'library_selection': library_selection,
        'seq_center': seq_center,
    }
    pmo_seq_info.append(seq_info_dict)
```

The Mad4hatter pipeline was used to analyse all of this data. Below we link to this pipeline and provide some details about the programs used within the pipeline.
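A sketch of what a bioinformatics methods entry could look like; the field names and program details below are assumptions rather than the exact values used for this project.

```python
# Field names and values are assumptions; fill in the actual pipeline details.
pmo_bioinfo_method_info = [
    {
        'bioinfo_method_name': 'mad4hatter',
        'program': 'mad4hatter',
        'program_version': '0.0.0',  # placeholder; record the actual pipeline version used
        'purpose': 'amplicon denoising and microhaplotype calling',
    }
]
```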
Now we put together information on each of the bioinformatics runs that were used to generate the microhaplotypes.
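A sketch of the run-level information; the field names are assumptions, and an entry would typically record which method was used for each run and when it was run.

```python
# Field names are assumptions; one entry per bioinformatics run.
pmo_bioinfo_run_info = [
    {
        'bioinfo_run_name': 'mad4hatter_run_1',
        'bioinfo_method_name': 'mad4hatter',  # links back to the methods entry above
        'run_date': '2025-01-01',             # placeholder date
    }
]
```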
Finally, we put together the microhaplotype sections of the PMO.
|   | SampleID | Locus | ASV | Reads | Allele | PseudoCIGAR | SSPOOL | keep |
|---|---|---|---|---|---|---|---|---|
| 0 | MADH001_MPN17383_A52_S75_L001 | Pf3D7_01_v3-194742-194973-1B | TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... | 12 | Pf3D7_01_v3-194742-194973-1B.1 | 14+9N42+17N112+8N | SSPOOL1 | True |
| 1 | MADH001_MPN17383_A52_S75_L001 | Pf3D7_01_v3-194742-194973-1B | TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... | 4154 | Pf3D7_01_v3-194742-194973-1B.2 | 14+9N42+17N88C112+8N | SSPOOL1 | True |
| 2 | MADH001_MPN17383_A52_S75_L001 | Pf3D7_01_v3-194742-194973-1B | TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... | 2056 | Pf3D7_01_v3-194742-194973-1B.7 | 14+9N42+17N79A112+8N | SSPOOL1 | True |
| 3 | MADH001_MPN17383_A52_S75_L001 | Pf3D7_01_v3-194742-194973-1B | TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... | 3 | Pf3D7_01_v3-194742-194973-1B.8 | 14+9N42+17N79A112+8N139D=TGATCCACTTTATGATAATAT... | SSPOOL1 | True |
| 4 | MADH001_KZN6765_A52_S11_L001 | Pf3D7_01_v3-194742-194973-1B | TACCTATAAAAATGAAAAAAATAAAGAAGAAGATAAATATGGAAAA... | 593 | Pf3D7_01_v3-194742-194973-1B.1 | 14+9N42+17N112+8N | SSPOOL1 | True |
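A sketch of converting this table with mhap_table_to_pmo; the argument names are assumptions based on the columns shown above, and mhap_table is the hypothetical allele table read in earlier (check help(mhap_table_to_pmo) for the real signature).

```python
# Argument names are assumptions; the columns referenced come from the allele table above.
pmo_mhap_info = mhap_table_to_pmo(
    mhap_table[mhap_table.keep],  # keep only the rows flagged to retain
    library_sample_name_col='SampleID',
    target_name_col='Locus',
    seq_col='ASV',
    reads_col='Reads',
)
```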
Now that we have all of the sections, we can merge them together to form a complete PMO. If there are issues with the data, they will be reported with descriptive errors. To further validate the final PMO you generate, see the validating pmos page.
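Putting the pieces together with merge_to_pmo (its keyword arguments follow the signature shown in the help output earlier); the variable names for sections whose construction is only sketched above are assumptions.

```python
# Swap in the objects actually built for your project for the sketched names.
pmo = merge_to_pmo(
    specimen_info=pmo_spec_info,
    library_sample_info=pmo_library_sample_info,
    sequencing_info=pmo_seq_info,
    panel_info=pmo_panel_info,
    mhap_info=pmo_mhap_info,
    bioinfo_method_info=pmo_bioinfo_method_info,
    bioinfo_run_info=pmo_bioinfo_run_info,
    project_info=pmo_project_info,
)
```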
We can now write this to a file that can be shared. Notice the '.gz' extension added to the end of the filename; this will automatically compress the output file.
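The pmo_writer module imported above provides helpers for this step; as a minimal sketch using only the standard library (and assuming merge_to_pmo returned a JSON-formatted string, as its docstring states), the compressed output could be written like this:

```python
import gzip

# Write the JSON string to a gzip-compressed file; the file name mirrors the one referenced below.
with gzip.open('dataset1_pmo.json.gz', 'wt') as out_file:
    out_file.write(pmo)
```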
We can also validate our PMO, dataset1_pmo.json.gz, and make sure everything is correct according to the schema. If there are any issues, they will be reported by the following command.