Building a PMO with Minimum Required Fields

In this tutorial, we use pmotools-python to generate a PMO that contains only the minimum required information. The example data were downloaded from the SRA and are associated with the following study:

Furstenau, T. N., Whealy, R., Timm, S., Roberts, A., Maltinsky, S., Wells, S. J., Drake, K., Ross, A., Bolduc, C., Pearson, T., & Fofanov, V. Y. (2025). High-throughput targeted amplicon screening tool for characterizing intrahost diversity in Staphylococcus aureus directly from sample. Microbial Genomics, 11(6). https://doi.org/10.1099/mgen.0.001427

Code

import pandas as pd

# builders that convert input tables into PMO sections
from pmotools.pmo_builder.panel_information_to_pmo import panel_info_table_to_pmo
from pmotools.pmo_builder.mhap_table_to_pmo import (
    mhap_table_to_pmo,
    create_minimum_library_specimen_dict_from_mhap_table,
)
from pmotools.pmo_builder.merge_to_pmo import merge_to_pmo

# writing and validating the finished PMO
from pmotools.pmo_engine.pmo_writer import PMOWriter
from pmotools.pmo_engine.pmo_checker import PMOChecker

Required data

The minimum information needed to create a PMO is the microhaplotype data and information on the panel used (at a minimum, the primers for each target). We use the following files from this study:

allele_data.tsv.gz - the microhaplotype calling results
Furstenau2025_primers.tsv - the primers used in the experiment

Merging into a PMO

First, we read in our data:

Code

mhap_info_df = pd.read_csv("allele_data.tsv.gz", sep='\t')
primers = pd.read_csv("Furstenau2025_primers.tsv", sep='\t')
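
Before converting the tables, it can help to confirm that they contain the columns the later steps refer to. The sketch below uses small made-up tables in place of the real files; the column names are the ones used in this tutorial, while the helper missing_columns and the toy values are hypothetical and not part of pmotools.

```python
import pandas as pd

# Toy stand-ins for the two input tables (values are made up for illustration).
mhap_toy = pd.DataFrame({
    "s_Sample": ["SRR30825770", "SRR30825770"],
    "p_name": ["target_1", "target_2"],
    "h_Consensus": ["ACGT", "GGCA"],
    "c_ReadCnt": [120, 45],
})
primers_toy = pd.DataFrame({
    "target": ["target_1", "target_2"],
    "forward": ["ACGTAC", "TTGACA"],
    "reverse": ["GGTCAA", "CCATGA"],
})


def missing_columns(df, required):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


mhap_missing = missing_columns(mhap_toy, ["s_Sample", "p_name", "h_Consensus", "c_ReadCnt"])
primer_missing = missing_columns(primers_toy, ["target", "forward", "reverse"])
print(mhap_missing, primer_missing)
```

With the real data, the same helper could be called on mhap_info_df and primers; an empty list means every expected column is present.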

Next, we convert our panel information to the corresponding sections of a PMO:

Code

pmo_panel_and_target_info = panel_info_table_to_pmo(primers, 
                                                    panel_name = "staph_aureus_Furstenau2025",
                                                    target_name_col = "target",
                                                    forward_primers_seq_col = "forward", 
                                                    reverse_primers_seq_col = "reverse")

We also convert our microhaplotype data to the corresponding sections of a PMO:

Code

pmo_mhaps = mhap_table_to_pmo(
                       microhaplotype_table=mhap_info_df, 
                       library_sample_name_col='s_Sample',
                       target_name_col='p_name',
                       seq_col='h_Consensus',
                       reads_col='c_ReadCnt')
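
Because the panel and microhaplotype sections are linked by target name, one useful check is that every target in the microhaplotype table also appears in the primer table. The following is a minimal sketch of that comparison on made-up data; it uses only pandas and does not reflect how pmotools itself handles mismatched targets.

```python
import pandas as pd

# Toy tables: target_3 has microhaplotype calls but no primer entry.
mhap_toy = pd.DataFrame({
    "s_Sample": ["run_1", "run_1", "run_2"],
    "p_name": ["target_1", "target_2", "target_3"],
})
primers_toy = pd.DataFrame({"target": ["target_1", "target_2"]})

# Targets seen in the microhaplotype data that are absent from the panel.
targets_without_primers = sorted(set(mhap_toy["p_name"]) - set(primers_toy["target"]))
print(targets_without_primers)
```

An empty result on the real tables (using the p_name and target columns) indicates that the two inputs refer to the same set of targets.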

Now we merge these two sections to create a complete PMO:

Code

# merge into pmo
staph_aureus_pmo = merge_to_pmo(
    panel_target_info = pmo_panel_and_target_info,
    mhap_info = pmo_mhaps
)

We can check that our PMO fully complies with the PMO schema as follows:

Code

# Validate the PMO file against schema 
checker = PMOChecker()
checker.validate_pmo_json(staph_aureus_pmo)

Now that we have merged and validated our PMO, we can write it to a file.

Code

# write out
pmowriter = PMOWriter()
pmowriter.write_out_pmo(staph_aureus_pmo, "minimum_Furstenau2025_PMO.json.gz", overwrite=True)
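
To confirm that a written file is readable, it can be loaded back in. The sketch below assumes, based only on the .json.gz extension, that the output is gzip-compressed JSON; it round-trips a hypothetical toy dictionary using the Python standard library rather than pmotools, whose own reading utilities are preferable where available.

```python
import gzip
import json
import os
import tempfile

# Hypothetical stand-in for a PMO; a real PMO has many more sections.
toy_pmo = {"example_section": [{"name": "a"}, {"name": "b"}]}

# Write the dictionary as gzip-compressed JSON to a temporary location.
path = os.path.join(tempfile.mkdtemp(), "toy_PMO.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    json.dump(toy_pmo, handle)

# Read it back and list the top-level sections.
with gzip.open(path, "rt", encoding="utf-8") as handle:
    reloaded = json.load(handle)
print(sorted(reloaded.keys()))
```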

Adding a key to set a specimen_name for each library_sample_name

If we supply only the panel and microhaplotype information (as in the example above), the specimen_names and library_sample_names are auto-generated from the microhaplotype data, and the specimen_names will be identical to the library_sample_names. This can be changed by supplying a table that gives the specimen_name to use for each library_sample_name. Here we use another file that we generated from the SRA metadata for this dataset:

sra_info_table.tsv - the SRA/ENA metadata for each run

Below we show how specimen_name and library_sample_name currently relate to each other in the PMO we generated above. Note that we use a pmotools function to export this information into a table for inspection:

Code

from pmotools.pmo_engine.pmo_exporter import PMOExporter
lib_to_spec_df = PMOExporter.list_library_sample_names_per_specimen_name(staph_aureus_pmo)
lib_to_spec_df.head()

	specimen_name	library_sample_name	library_sample_count
0	SRR30825770	SRR30825770	1
1	SRR30825771	SRR30825771	1
2	SRR30825772	SRR30825772	1
3	SRR30825773	SRR30825773	1
4	SRR30825774	SRR30825774	1

We want to change how these names are generated, so we first read in the SRA information:

Code

sra_info = pd.read_csv("sra_info_table.tsv", sep = '\t')
sra_info.head()

	run_accession	experiment_title	sample_accession	project_name	submission_accession	library_min_fragment_size	bam_md5	assembly_software	library_prep_longitude	library_selection	...	sequencing_primer_lot	first_public	transposase_protocol	study_alias	library_prep_location	rna_prep_3_protocol	ph	sequencing_longitude	tissue_type	isolation_source
0	SRR31969808	Illumina MiSeq sequencing: AmpSeq of Staphyloc...	SAMN46224567	NaN	SRA2049563	NaN	NaN	NaN	NaN	PCR	...	NaN	2025-01-14	NaN	PRJNA1209594	NaN	NaN	NaN	NaN	NaN	nares
1	SRR31969809	Illumina MiSeq sequencing: AmpSeq of Staphyloc...	SAMN46224567	NaN	SRA2049563	NaN	NaN	NaN	NaN	PCR	...	NaN	2025-01-14	NaN	PRJNA1209594	NaN	NaN	NaN	NaN	NaN	nares
2	SRR31969810	Illumina MiSeq sequencing: AmpSeq of Staphyloc...	SAMN46224566	NaN	SRA2049563	NaN	NaN	NaN	NaN	PCR	...	NaN	2025-01-14	NaN	PRJNA1209594	NaN	NaN	NaN	NaN	NaN	nares
3	SRR31969817	NextSeq 500 sequencing: WGS of Staphylococcus ...	SAMN46224576	NaN	SRA2049563	NaN	NaN	NaN	NaN	size fractionation	...	NaN	2025-01-14	NaN	PRJNA1209594	NaN	NaN	NaN	NaN	NaN	nares
4	SRR31969820	NextSeq 500 sequencing: WGS of Staphylococcus ...	SAMN46224576	NaN	SRA2049563	NaN	NaN	NaN	NaN	size fractionation	...	NaN	2025-01-14	NaN	PRJNA1209594	NaN	NaN	NaN	NaN	NaN	nares

5 rows × 192 columns

We convert this into a dictionary that we can use in pmotools:

Code

# map each library_sample_name (run_accession) to its specimen_name (sample_alias)
lib_to_spec_key = sra_info.set_index('run_accession')['sample_alias'].to_dict()
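
A dictionary built this way silently keeps only the last value when a run_accession appears more than once, and it has no entry for any library sample that is missing from the metadata table. The sketch below shows two quick checks for those cases on made-up data; the accessions, aliases, and the library_names list are hypothetical.

```python
import pandas as pd

# Toy metadata table: two runs share one specimen (sample_alias).
sra_toy = pd.DataFrame({
    "run_accession": ["SRR0000001", "SRR0000002", "SRR0000003"],
    "sample_alias": ["subjA-Wk12-Nasal", "subjA-Wk16-Nasal", "subjA-Wk16-Nasal"],
})
# Library sample names as they might appear in the microhaplotype table.
library_names = ["SRR0000001", "SRR0000002", "SRR0000004"]

# Runs listed more than once would be collapsed to a single dictionary entry.
duplicated_runs = sra_toy.loc[sra_toy["run_accession"].duplicated(), "run_accession"].tolist()

key = sra_toy.set_index("run_accession")["sample_alias"].to_dict()

# Library samples with no specimen name in the key.
unmapped = [name for name in library_names if name not in key]
print(duplicated_runs, unmapped)
```

Here SRR0000004 is reported as unmapped; on the real data, both lists would ideally be empty before the key is passed on.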

Now we build our library_sample_info and specimen_info tables with this information:

Code

# supply key when building library_sample_info and specimen_info 
library_sample_and_spec_renamed_infos = create_minimum_library_specimen_dict_from_mhap_table(
    pmo_mhaps["detected_microhaplotypes"], 
    panel_name = "staph_aureus_Furstenau2025", 
    library_sample_specimen_key = lib_to_spec_key)

Finally, we can merge this information with the sections we already generated above (pmo_panel_and_target_info and pmo_mhaps) into a new PMO:

Code

# now build with renamed 
staph_aureus_pmo_renamed = merge_to_pmo(
    specimen_info = library_sample_and_spec_renamed_infos["specimen_info"],
    library_sample_info = library_sample_and_spec_renamed_infos["library_sample_info"],
    panel_target_info = pmo_panel_and_target_info,
    mhap_info = pmo_mhaps
)

Again, we can validate that this PMO complies with the schema

Code

checker.validate_pmo_json(staph_aureus_pmo_renamed)

Below we can see that the specimen names now differ from the library sample names:

Code

lib_to_spec_renamed_df = PMOExporter.list_library_sample_names_per_specimen_name(staph_aureus_pmo_renamed)
lib_to_spec_renamed_df.head()

	specimen_name	library_sample_name	library_sample_count
0	85b498-Wk16-Nasal	SRR30825770	1
1	85b498-Wk28-Nasal	SRR30825771	1
2	85b498-Wk12-Nasal	SRR30825772	1
3	85b498-Wk20-Nasal	SRR30825773	1
4	85b498-Wk14-Nasal	SRR30825774	1
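
In the rows shown, each specimen has a single library sample. When several runs share a specimen name, the count would be higher. The sketch below shows one way such per-specimen counts can be derived from a key dictionary with pandas; the toy key is hypothetical and this is not the pmotools implementation.

```python
import pandas as pd

# Hypothetical key: library_sample_name -> specimen_name.
toy_key = {
    "SRR0000001": "subjA-Wk12-Nasal",
    "SRR0000002": "subjA-Wk16-Nasal",
    "SRR0000003": "subjA-Wk16-Nasal",
}

pairs = pd.DataFrame(
    {"library_sample_name": list(toy_key.keys()), "specimen_name": list(toy_key.values())}
)

# Count how many library samples map to each specimen.
counts = pairs.groupby("specimen_name")["library_sample_name"].count().to_dict()
print(counts)
```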

Finally, we write this PMO to a new file:

Code

pmowriter.write_out_pmo(staph_aureus_pmo_renamed, "minimum_Furstenau2025_new_names_PMO.json.gz", overwrite=True)

For more information on adding extra information to your minimal PMO, see Update meta in a minimal pmo.

For more information on generating a PMO with extra metadata, see PMO Generation.