Portable Microhaplotype Object (PMO)

Contents

  • Setup
  • Creating PMO
    • Panel Information
    • Metadata
      • Specimen Level Metadata
      • Experiment Level Metadata
    • Microhaplotype Information
    • Demultiplexed Experiment Samples
    • Sequencing info
    • Bioinformatics Info
  • Compose PMO

Creating a PMO File

In this tutorial we will go through the steps to build a PMO using the functions in the pmotools-python package.

For more information on any of the fields mentioned please see the documentation.

Setup

First we will import the functions that we will need to run this notebook.

Code
from pmotools.json_convertors.microhaplotype_table_to_pmo_dict import microhaplotype_table_to_pmo_dict
from pmotools.json_convertors.metatable_to_json_meta import experiment_info_table_to_json, specimen_info_table_to_json
from pmotools.json_convertors.panel_information_to_pmo_dict import panel_info_table_to_pmo_dict
from pmotools.json_convertors.demultiplexed_targets_to_pmo_dict import demultiplexed_targets_to_pmo_dict
Code
import pandas as pd
import json 

Here we define a function that will be used to print a few lines from the data we will be creating.

Code
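def print_json_head(data, n=10):
    """Print the first n lines of data rendered as indented JSON."""
    # 'data' avoids shadowing the built-in dict type
    json_object = json.dumps(data, indent=4)
    for i, line in enumerate(json_object.split('\n')):
        if i >= n:
            break
        print(line)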

Creating PMO

To create the full PMO we will need a few sets of information. These include:

  • Panel Information : A table including data on the targets that make up the panel.
  • Allele table : A table containing the alleles called for each of the samples for each of the targets and the reads associated.
  • Demultiplexed reads : A table containing the raw reads for each sample, for each target after demultiplexing, before any filtering.
  • Experimental metadata : Information on the sequencing run, for example where each sample was located on the plate.
  • Specimen Information : Metadata on the biological samples.

We will specify the paths to the example data we will use below, but if you would like to try and use your own data then replace the following paths:

Code
panel_information_path = 'example_data/mad4hatter_panel_info_example.tsv'
allele_table_path = 'example_data/mad4hatter_allele_data_example.txt'
demultiplexed_reads_path = 'example_data/mad4hatter_amplicon_coverage.txt'
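
If you swap in your own files, it can help to confirm that every path resolves before loading anything. The following is a minimal sketch using only the standard library; the helper name `find_missing_files` is ours and is not part of pmotools.

```python
from pathlib import Path

def find_missing_files(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]

# The example paths used in this tutorial
input_paths = [
    "example_data/mad4hatter_panel_info_example.tsv",
    "example_data/mad4hatter_allele_data_example.txt",
    "example_data/mad4hatter_amplicon_coverage.txt",
]
missing = find_missing_files(input_paths)
if missing:
    print("Missing input files:", missing)
```

An empty `missing` list means all three files were found relative to the current working directory.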

Panel Information

First we will work on putting the panel information into PMO format. Although labs may store this information in a variety of ways and this process may seem cumbersome, you will only have to do this once for each panel that you work with.

The panel information consists of two parts: the panel_targets (information on the targets) and the target_genome (information on the reference genome being targeted).

To include details of the reference genome we need the following information.

  • name : name of the genome
  • version : the genome version
  • taxon_id : the NCBI taxonomy number
  • url : a link to where this genome file can be downloaded

Optionally, you can also include a link to the genome's annotation file, as we have below. Below is an example of compiling this information into the JSON format manually:

Code
target_genome_info = {
            "gff_url" : "https://plasmodb.org/common/downloads/release-65/Pfalciparum3D7/gff/data/PlasmoDB-65_Pfalciparum3D7.gff",
            "name" : "3D7",
            "taxon_id" : 5833,
            "url" : "https://plasmodb.org/common/downloads/release-65/Pfalciparum3D7/fasta/data/PlasmoDB-65_Pfalciparum3D7_Genome.fasta",
            "version" : "2020-09-01"
        }
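
A quick way to catch a forgotten field is to compare the dictionary against the four required genome fields listed above. This is only a sketch: `REQUIRED_GENOME_FIELDS` and `missing_genome_fields` are illustrative names of our own, and pmotools may perform its own validation.

```python
# The four required genome fields listed in this tutorial
REQUIRED_GENOME_FIELDS = ("name", "version", "taxon_id", "url")

def missing_genome_fields(genome_info):
    """Return the required genome fields that are absent from `genome_info`."""
    return [field for field in REQUIRED_GENOME_FIELDS if field not in genome_info]

example_genome_info = {
    "name": "3D7",
    "taxon_id": 5833,
    "url": "https://plasmodb.org/common/downloads/release-65/Pfalciparum3D7/fasta/data/PlasmoDB-65_Pfalciparum3D7_Genome.fasta",
    "version": "2020-09-01",
}
print(missing_genome_fields(example_genome_info))
```

The optional gff_url is deliberately not checked, since a genome entry without it is still complete.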

The fields required to define the target information are:

  • target_id : a unique identifier for each of the targets
  • forward primer sequence : the sequence of the forward primer associated with this target
  • reverse primer sequence : the sequence of the reverse primer associated with this target

Note: in the case that you have multiple primers to target the same region, please include these on separate lines in the table with the same target_id.
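
To illustrate the note above, the sketch below shows one way rows sharing a target_id could be collected into per-target primer lists. It is not the pmotools implementation: the rows and sequences are made up, and dropping repeated primer sequences is our assumption.

```python
from collections import defaultdict

# Made-up rows: target "t1" has two forward primers, so it appears on two lines
rows = [
    {"target_id": "t1", "fwd_primer": "ACGTAC", "rev_primer": "TTGACA"},
    {"target_id": "t1", "fwd_primer": "ACGTAT", "rev_primer": "TTGACA"},
    {"target_id": "t2", "fwd_primer": "GGCATC", "rev_primer": "CCATGA"},
]

def group_primers(rows):
    """Collect the distinct forward and reverse primers listed for each target."""
    grouped = defaultdict(lambda: {"forward_primers": [], "reverse_primers": []})
    for row in rows:
        target = grouped[row["target_id"]]
        for key, col in (("forward_primers", "fwd_primer"), ("reverse_primers", "rev_primer")):
            # Keep each primer sequence once per target (an assumption of this sketch)
            if row[col] not in target[key]:
                target[key].append(row[col])
    return dict(grouped)

primers_by_target = group_primers(rows)
print(primers_by_target["t1"])
```

Here "t1" ends up with two forward primers and a single reverse primer, which mirrors the list-valued forward_primers and reverse_primers entries in the PMO output.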

Optionally you can also include location information for the primers. To include this information you will need to include in the table:

  • chrom : the chromosome name
  • start : the start of the location, 0-based positioning
  • end : the end of the location, 0-based positioning

For more information on optional fields that can be included, check the documentation.

Here we show how to take panel information that is used to run the MAD4HatTeR pipeline and convert it to PMO.

Code
madhatter_panel_info = pd.read_csv(panel_information_path, sep='\t')
madhatter_panel_info.head()
(index) | amplicon | amplicon_start | amplicon_end | ampInsert_start | ampInsert_end | rev_primer | amplicon_length | ampInsert_length | fwd_primer | target_type | strand | gene_id
0 | Pf3D7_01_v3-145388-145662-1A | 145388 | 145662 | 145421 | 145630 | AAAATGTCCAATATGTCAAGGTATATTAAAGT | 274 | 209 | CCTGAGTTTTAAGTGAATGAATATATTTTTGTT | diversity | + | PF3D7_0103300
1 | Pf3D7_01_v3-162867-163115-1A | 162867 | 163115 | 162889 | 163092 | TGTGTGCTTTGTCGTTGATTCAT | 248 | 203 | TACTACCGATCATCAAGCCGAA | diversity | + | PF3D7_0103600
2 | Pf3D7_01_v3-181512-181761-1A | 181512 | 181761 | 181545 | 181729 | TAGTTTAAATCTATACTTGTCTCACCTGAACA | 249 | 184 | CTTTTCATATTTGTCTATTAGCTTTTTCAAACC | diversity | + | PF3D7_0104100
3 | Pf3D7_01_v3-455794-456054-1A | 455794 | 456054 | 455827 | 456021 | GTGTTTCATTATTTTAGACACATTCAGGAATTT | 260 | 194 | ACAATGTAGAACAATATATAAAACTGGAAAAGA | diversity | + | NaN
4 | Pf3D7_01_v3-528859-529104-1A | 528859 | 529104 | 528890 | 529073 | AATCATTTTATCCCACTTATTTATCTCGTCT | 245 | 183 | CTTAGTTTAGATTTGCCTACAATATTTGCAC | diversity | + | PF3D7_0113800

We can use the panel_info_table_to_pmo_dict function to convert this into the correct format for PMO.

Code
print(panel_info_table_to_pmo_dict.__doc__)

    Convert a dataframe containing panel information into dictionary of targets and reference information


    :param target_table: The dataframe containing the target information
    :param panel_id: the panel ID assigned to the panel
    :param genome_info: A dictionary containing the genome information
    :param target_id_col: the name of the column containing the target IDs
    :param forward_primers_seq_col: the name of the column containing the sequence of the forward primer
    :param reverse_primers_seq_col: the name of the column containing the sequence of the reverse primer
    :param forward_primers_start_col (Optional): the name of the column containing the 0-based start coordinate of the forward primer
    :param forward_primers_end_col (Optional): the name of the column containing the 0-based end coordinate of the forward primer
    :param reverse_primers_start_col (Optional): the name of the column containing the 0-based start coordinate of the reverse primer
    :param reverse_primers_end_col (Optional): the name of the column containing the 0-based end coordinate of the reverse primer
    :param insert_start_col (Optional): the name of the column containing the 0-based start coordinate of the insert
    :param insert_end_col (Optional): the name of the column containing the 0-based end coordinate of the insert
    :param chrom_col (Optional): the name of the column containing the chromosome for the target
    :param gene_id_col (Optional): the name of the column containing the gene id
    :param strand_col (Optional): the name of the column containing the strand for the target
    :param target_type_col (Optional): A classification type for the target
    :param additional_target_info_cols (Optional): dictionary of optional additional columns to add to the target information dictionary. Keys are column names and values are the type.
    :return: a dict of the panel information
    

We will use this first just to include the most basic required information.

Code
panel_information_pmo = panel_info_table_to_pmo_dict(
    madhatter_panel_info,
    "mad4hatter_poolsD1R1R2",
    target_genome_info,
    target_id_col="amplicon",
    forward_primers_seq_col="fwd_primer",
    reverse_primers_seq_col="rev_primer",
)

Let’s take a look at the first 30 lines of the information we put together…

Code
print_json_head(panel_information_pmo,30)
{
    "panel_info": {
        "mad4hatter_poolsD1R1R2": {
            "panel_id": "mad4hatter_poolsD1R1R2",
            "target_genome": {
                "gff_url": "https://plasmodb.org/common/downloads/release-65/Pfalciparum3D7/gff/data/PlasmoDB-65_Pfalciparum3D7.gff",
                "name": "3D7",
                "taxon_id": 5833,
                "url": "https://plasmodb.org/common/downloads/release-65/Pfalciparum3D7/fasta/data/PlasmoDB-65_Pfalciparum3D7_Genome.fasta",
                "version": "2020-09-01"
            },
            "targets": {
                "Pf3D7_01_v3-145388-145662-1A": {
                    "target_id": "Pf3D7_01_v3-145388-145662-1A",
                    "forward_primers": [
                        {
                            "seq": "CCTGAGTTTTAAGTGAATGAATATATTTTTGTT"
                        }
                    ],
                    "reverse_primers": [
                        {
                            "seq": "AAAATGTCCAATATGTCAAGGTATATTAAAGT"
                        }
                    ]
                },
                "Pf3D7_01_v3-162867-163115-1A": {
                    "target_id": "Pf3D7_01_v3-162867-163115-1A",
                    "forward_primers": [
                        {
                            "seq": "TACTACCGATCATCAAGCCGAA"

Optionally we can include some more information. You can see that some of the fields in the table above don’t directly match the optional fields. Therefore, we must first wrangle the data slightly to fit the requirements.

Note: You may not have to apply all of the following steps to your panel information, this is just an example and is specific to the MAD4HatTeR panel information.

The chromosome for each target is stored within the locus name, so we extract that and put it in its own column below.

Code
# The chromosome is the first dash-separated field of the locus name
madhatter_panel_info['chrom'] = madhatter_panel_info.amplicon.str.split('-').str[0]
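
For a single locus name, the same split looks like this. The sketch assumes the name has exactly four dash-separated parts, as in the example table; `parse_locus_name` and the label "suffix" for the trailing field are our own naming.

```python
def parse_locus_name(locus):
    """Split a MAD4HatTeR-style locus name into its dash-separated parts."""
    # Raises ValueError if the name does not have exactly four parts
    chrom, start, end, suffix = locus.split("-")
    return {"chrom": chrom, "start": int(start), "end": int(end), "suffix": suffix}

parsed = parse_locus_name("Pf3D7_01_v3-145388-145662-1A")
print(parsed["chrom"])
```

A chromosome name that itself contains a dash would break this simple split, so check your own naming scheme before relying on it.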

Next we need to generate 0-based coordinates for the locations the primers target. The panel information we have only includes the full target start and end (including the primers) and is 1-based, so we do the conversion as follows.

Code
# Create 0-based coordinate for start of forward primer
madhatter_panel_info['fwd_primer_start_0_based'] = madhatter_panel_info.amplicon_start-1
# Calculate the length of the forward primer, add this to the primer start coordinate to get the end coordinate 
madhatter_panel_info['fwd_primer_len'] = [len(p) for p in madhatter_panel_info.fwd_primer]
madhatter_panel_info['fwd_primer_end_0_based'] = madhatter_panel_info.fwd_primer_start_0_based+madhatter_panel_info.fwd_primer_len

# Calculate the length of the reverse primer. Subtract this from the end coordinate of the target to get the start coordinate of the reverse primer
madhatter_panel_info['rev_primer_len'] = [len(p) for p in madhatter_panel_info.rev_primer]
madhatter_panel_info['rev_primer_start_0_based'] = madhatter_panel_info.amplicon_end-madhatter_panel_info.rev_primer_len
# The 0-based reverse primer end would be the same as the amplicon end 
madhatter_panel_info['rev_primer_end_0_based'] = madhatter_panel_info.amplicon_end

In the MAD4HatTeR pipeline, we trim one base from each end of the amplicon insert because the base following a primer is often erroneous. We can create insert coordinates with this adjustment, as shown below. If you choose not to apply this trimming step, you can instead use the coordinate at the end of the forward primer and the beginning of the reverse primer to define the start and end of the insert.

Code
madhatter_panel_info['insert_start_0_based'] = madhatter_panel_info.fwd_primer_end_0_based+1
madhatter_panel_info['insert_end_0_based'] = madhatter_panel_info.rev_primer_start_0_based-1
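
The arithmetic above can be checked by hand for the first target in the table (amplicon_start 145388, amplicon_end 145662). The function below is a plain-Python sketch of the same steps; its name and the `trim` parameter are ours, not part of pmotools.

```python
def primer_and_insert_coords(amplicon_start, amplicon_end, fwd_primer, rev_primer, trim=1):
    """Repeat the tutorial's coordinate arithmetic for a single target.

    `trim` is the number of bases removed from each end of the insert.
    """
    fwd_start = amplicon_start - 1          # 1-based start -> 0-based start
    fwd_end = fwd_start + len(fwd_primer)   # start plus primer length
    rev_end = amplicon_end                  # the tutorial uses the amplicon end unchanged
    rev_start = rev_end - len(rev_primer)   # end minus primer length
    return {
        "fwd": (fwd_start, fwd_end),
        "rev": (rev_start, rev_end),
        "insert": (fwd_end + trim, rev_start - trim),
    }

coords = primer_and_insert_coords(
    145388,
    145662,
    "CCTGAGTTTTAAGTGAATGAATATATTTTTGTT",  # fwd_primer, 33 bases
    "AAAATGTCCAATATGTCAAGGTATATTAAAGT",   # rev_primer, 32 bases
)
print(coords)
```

With these inputs the forward primer spans 145387 to 145420, the reverse primer 145630 to 145662, and the trimmed insert 145421 to 145629. Passing `trim=0` gives the untrimmed insert described in the text.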

Now we can create panel information to go into PMO with all of the optional fields we just created.

Code
panel_information_pmo = panel_info_table_to_pmo_dict(
    madhatter_panel_info, 
    "mad4hatter_poolsD1R1R2", 
    target_genome_info, 
    target_id_col="amplicon",
    forward_primers_seq_col="fwd_primer",
    reverse_primers_seq_col="rev_primer",
    forward_primers_start_col="fwd_primer_start_0_based",
    forward_primers_end_col="fwd_primer_end_0_based",
    reverse_primers_start_col="rev_primer_start_0_based",
    reverse_primers_end_col="rev_primer_end_0_based",
    insert_start_col="insert_start_0_based",
    insert_end_col="insert_end_0_based",
    chrom_col="chrom",
    strand_col="strand",
    gene_id_col="gene_id",
    target_type_col="target_type",
)

You can also add on your own custom fields using the additional_target_info_cols parameter. Below we add on the amplicon insert length information. If there is a field that you want to add and think others would find useful please contact us and we can add it in. This way we can make sure to keep ontologies consistent!

Code
panel_information_pmo = panel_info_table_to_pmo_dict(
    madhatter_panel_info,
    "mad4hatter_poolsD1R1R2",
    target_genome_info,
    target_id_col="amplicon",
    forward_primers_seq_col="fwd_primer",
    reverse_primers_seq_col="rev_primer",
    forward_primers_start_col="fwd_primer_start_0_based",
    forward_primers_end_col="fwd_primer_end_0_based",
    reverse_primers_start_col="rev_primer_start_0_based",
    reverse_primers_end_col="rev_primer_end_0_based",
    insert_start_col="insert_start_0_based",
    insert_end_col="insert_end_0_based",
    chrom_col="chrom",
    strand_col="strand",
    gene_id_col="gene_id",
    target_type_col="target_type",
    additional_target_info_cols=["ampInsert_length"]
)

Let’s have a look at this now, with the extra information added to the panel information.

Code
print_json_head(panel_information_pmo,46)
{
    "panel_info": {
        "mad4hatter_poolsD1R1R2": {
            "panel_id": "mad4hatter_poolsD1R1R2",
            "target_genome": {
                "gff_url": "https://plasmodb.org/common/downloads/release-65/Pfalciparum3D7/gff/data/PlasmoDB-65_Pfalciparum3D7.gff",
                "name": "3D7",
                "taxon_id": 5833,
                "url": "https://plasmodb.org/common/downloads/release-65/Pfalciparum3D7/fasta/data/PlasmoDB-65_Pfalciparum3D7_Genome.fasta",
                "version": "2020-09-01"
            },
            "targets": {
                "Pf3D7_01_v3-145388-145662-1A": {
                    "target_id": "Pf3D7_01_v3-145388-145662-1A",
                    "forward_primers": [
                        {
                            "seq": "CCTGAGTTTTAAGTGAATGAATATATTTTTGTT",
                            "location": {
                                "chrom": "Pf3D7_01_v3",
                                "end": 145420,
                                "start": 145387,
                                "strand": "+"
                            }
                        }
                    ],
                    "reverse_primers": [
                        {
                            "seq": "AAAATGTCCAATATGTCAAGGTATATTAAAGT",
                            "location": {
                                "chrom": "Pf3D7_01_v3",
                                "end": 145662,
                                "start": 145630,
                                "strand": "+"
                            }
                        }
                    ],
                    "gene_id": "PF3D7_0103300",
                    "target_type": "diversity",
                    "ampInsert_length": 209,
                    "insert_location": {
                        "chrom": "Pf3D7_01_v3",
                        "start": 145421,
                        "end": 145629,
                        "strand": "+"
                    }
                },
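
Once the panel information is in this nested form, individual fields can be reached by walking the keys shown in the output. The sketch below does this on a trimmed excerpt of that output; `insert_spans` is an illustrative helper of our own.

```python
# Excerpt copied from the printed output above (one target, a few fields only)
panel_excerpt = {
    "panel_info": {
        "mad4hatter_poolsD1R1R2": {
            "panel_id": "mad4hatter_poolsD1R1R2",
            "targets": {
                "Pf3D7_01_v3-145388-145662-1A": {
                    "target_id": "Pf3D7_01_v3-145388-145662-1A",
                    "gene_id": "PF3D7_0103300",
                    "insert_location": {
                        "chrom": "Pf3D7_01_v3",
                        "start": 145421,
                        "end": 145629,
                        "strand": "+",
                    },
                }
            },
        }
    }
}

def insert_spans(panel_dict, panel_id):
    """Map each target ID to the (chrom, start, end) of its insert, if present."""
    targets = panel_dict["panel_info"][panel_id]["targets"]
    return {
        target_id: (
            target["insert_location"]["chrom"],
            target["insert_location"]["start"],
            target["insert_location"]["end"],
        )
        for target_id, target in targets.items()
        if "insert_location" in target  # insert coordinates are optional
    }

spans = insert_spans(panel_excerpt, "mad4hatter_poolsD1R1R2")
print(spans)
```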

Metadata

This section will compile metadata on two levels:

  • Specimen Level: Information about the specimen that was collected.
  • Experiment Level: Information about the sequencing or amplification runs performed on a specimen.

It’s important to note that a single specimen may be linked to multiple experiments.
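
To make this one-to-many relationship concrete, the sketch below groups experiment records by the specimen they came from. All IDs here are made up for illustration, and the helper is not part of pmotools.

```python
from collections import defaultdict

# Toy records with made-up IDs: SPEC_1 was sequenced in two experiments
experiment_records = [
    {"experiment_sample_id": "EXP_A", "specimen_id": "SPEC_1"},
    {"experiment_sample_id": "EXP_B", "specimen_id": "SPEC_1"},
    {"experiment_sample_id": "EXP_C", "specimen_id": "SPEC_2"},
]

def experiments_per_specimen(records):
    """Map each specimen_id to the list of experiment_sample_ids linked to it."""
    specimen_to_experiments = defaultdict(list)
    for record in records:
        specimen_to_experiments[record["specimen_id"]].append(record["experiment_sample_id"])
    return dict(specimen_to_experiments)

by_specimen = experiments_per_specimen(experiment_records)
print(by_specimen)
```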

In our example both levels of metadata are stored in one table, but in your case they may be stored in multiple places.

Code
metadata = pd.read_excel('example_data/mad4hatter_metadata_example.xlsx')
metadata.head()
(index) | specimen_id | collection_date | collection_country | samp_collect_device | lat_lon | collector | geo_admin3 | host_taxon_id | project_name | samp_store_loc | samp_taxon_id | experiment_sample_id | panel_id | plate_name | plate_row | plate_col | sequencing_info_id
0 | SAMN38241219 | 2019-01 | Mozambique | dried blood spot | 25.58,32.35 | Brokhattingen, Nanna | Maputo | 1758 | PRJNA1040019 | UCSF Greenhouse Lab | 5833 | SRR26819135 | Mad4hatter | plate1 | A | 1 | run1
1 | SAMN38241215 | 2017-02 | Mozambique | dried blood spot | 25.58,32.35 | Brokhattingen, Nanna | Maputo | 1758 | PRJNA1040019 | UCSF Greenhouse Lab | 5833 | SRR26819139 | Mad4hatter | plate1 | A | 2 | run1
2 | SAMN38241214 | 2016-05 | Mozambique | dried blood spot | 25.58,32.35 | Brokhattingen, Nanna | Maputo | 1758 | PRJNA1040019 | UCSF Greenhouse Lab | 5833 | SRR26819141 | Mad4hatter | plate1 | A | 3 | run1
3 | SAMN38241052 | 2015-05 | Mozambique | dried blood spot | 25.58,32.35 | Brokhattingen, Nanna | Maputo | 1758 | PRJNA1040019 | UCSF Greenhouse Lab | 5833 | SRR26819151 | Mad4hatter | plate1 | A | 4 | run1
4 | SAMN38241112 | 2019-06 | Mozambique | dried blood spot | 25.58,32.35 | Brokhattingen, Nanna | Maputo | 1758 | PRJNA1040019 | UCSF Greenhouse Lab | 5833 | SRR26819200 | Mad4hatter | plate1 | A | 5 | run1

Specimen Level Metadata

Now we put together the specimen level metadata. This is the metadata associated with the sample collected from the host. For more information on this section see the documentation.

Code
print(specimen_info_table_to_json.__doc__)

    Converts a DataFrame containing specimen information into JSON.

    :param contents (pd.DataFrame): The input DataFrame containing experiment data.
    :param specimen_id_col (str): The column name for specimen sample IDs. Default: specimen_id
    :param samp_taxon_id (int): NCBI taxonomy number of the organism. Default: samp_taxon_id
    :param collection_date (string): Date of the sample collection. Default: collection_date
    :param collection_country (string): Name of country collected in (admin level 0). Default : collection_country
    :param collector (string): Name of the primary person managing the specimen. Default: collector
    :param samp_store_loc (string): Sample storage site. Default: samp_store_loc
    :param samp_collect_device (string): The way the sample was collected. Default : samp_collect_device
    :param project_name (string): Name of the project. Default : project_name
    :param alternate_identifiers (Optional[str]): List of optional alternative names for the samples
    :param geo_admin1 (Optional[str]): Geographical admin level 1
    :param geo_admin2 (Optional[str]): Geographical admin level 2
    :param geo_admin3 (Optional[str]): Geographical admin level 3
    :param host_taxon_id (Optional[int]): NCBI taxonomy number of the host
    :param individual_id (Optional[str]): ID for the individual a specimen was collected from
    :param lat_lon (Optional[str]): Latitude and longitude of the collection site
    :param parasite_density (Optional[float]): The parasite density
    :param plate_col (Optional[int]): Column the specimen was in in the plate
    :param plate_name (Optional[str]): Name of plate the specimen was in
    :param plate_row (Optional[str]): Row the specimen was in in the plate
    :param sample_comments (Optional[str]): Additional comments about the sample
    :param additional_specimen_cols (Optional[List[str], None]]): Additional column names to include

    :return: JSON format where keys are `specimen_id` and values are corresponding row data.
    
Code
specimen_info_json = specimen_info_table_to_json(
    metadata,
    geo_admin3='geo_admin3',
    host_taxon_id='host_taxon_id',
    lat_lon='lat_lon',
)
print_json_head(specimen_info_json, 20)
{
    "SAMN38241219": {
        "specimen_id": "SAMN38241219",
        "samp_taxon_id": 5833,
        "collection_date": "2019-01",
        "collection_country": "Mozambique",
        "collector": "Brokhattingen, Nanna",
        "samp_store_loc": "UCSF Greenhouse Lab",
        "samp_collect_device": "dried blood spot",
        "project_name": "PRJNA1040019",
        "geo_admin3": "Maputo",
        "host_taxon_id": 1758,
        "lat_lon": "25.58,32.35"
    },
    "SAMN38241215": {
        "specimen_id": "SAMN38241215",
        "samp_taxon_id": 5833,
        "collection_date": "2017-02",
        "collection_country": "Mozambique",
        "collector": "Brokhattingen, Nanna",

Experiment Level Metadata

This section shows how to put together the experiment level metadata. More information on this table can be found in the documentation.

Code
print(experiment_info_table_to_json.__doc__)

    Converts a DataFrame containing experiment information into JSON.

    :param contents (pd.DataFrame): Input DataFrame containing experiment data.
    :param experiment_sample_id_col (str): Column name for experiment sample IDs. Default: experiment_sample_id
    :param sequencing_info_id (str): Column name for sequencing information IDs. Default: sequencing_info_id
    :param specimen_id (str): Column name for specimen IDs. Default: specimen_id
    :param panel_id (str): Column name for panel IDs. Default: panel_id
    :param accession (Optional[str]): Column name for accession information.
    :param plate_col (Optional[int]): Column index for plate information.
    :param plate_name (Optional[str]): Column name for plate names.
    :param plate_row (Optional[str]): Column name for plate rows.
    :param additional_experiment_cols (Optional[List[str], None]]): Additional column names to include.

    :return: JSON format where keys are `experiment_sample_id` and values are corresponding row data.
    
Code
experiment_info_json = experiment_info_table_to_json(
    metadata,
    plate_name='plate_name',
    plate_col='plate_col',
    plate_row='plate_row',
    additional_experiment_cols=['collection_date', 'collection_country'],
)
print_json_head(experiment_info_json, 20)
{
    "SRR26819135": {
        "experiment_sample_id": "SRR26819135",
        "sequencing_info_id": "run1",
        "specimen_id": "SAMN38241219",
        "panel_id": "Mad4hatter",
        "plate_col": 1,
        "plate_name": "plate1",
        "plate_row": "A",
        "collection_date": "2019-01",
        "collection_country": "Mozambique"
    },
    "SRR26819139": {
        "experiment_sample_id": "SRR26819139",
        "sequencing_info_id": "run1",
        "specimen_id": "SAMN38241215",
        "panel_id": "Mad4hatter",
        "plate_col": 2,
        "plate_name": "plate1",
        "plate_row": "A",
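
Because each experiment entry points at a specimen through specimen_id, a simple cross-check can confirm that every referenced specimen exists in the specimen dictionary. This is a sketch on trimmed excerpts of the two outputs above; `unknown_specimens` is a name of our own choosing.

```python
def unknown_specimens(experiment_infos, specimen_infos):
    """Return experiment IDs whose specimen_id is absent from specimen_infos."""
    return sorted(
        exp_id
        for exp_id, info in experiment_infos.items()
        if info["specimen_id"] not in specimen_infos
    )

# Trimmed excerpts; the second specimen is left out on purpose to trigger the check
specimen_excerpt = {"SAMN38241219": {"specimen_id": "SAMN38241219"}}
experiment_excerpt = {
    "SRR26819135": {"experiment_sample_id": "SRR26819135", "specimen_id": "SAMN38241219"},
    "SRR26819139": {"experiment_sample_id": "SRR26819139", "specimen_id": "SAMN38241215"},
}
orphans = unknown_specimens(experiment_excerpt, specimen_excerpt)
print(orphans)
```

On the full dictionaries built in this tutorial the list should come back empty, since both are derived from the same metadata table.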

Microhaplotype Information

Next, we’ll organize the microhaplotype information into the required format.

This involves two components that we will generate from one table (see the documentation for more information about each part):

  • The representative microhaplotype details: A summary of all of the unique microhaplotypes called, for each target, within the population you have included in your PMO. Each unique microhaplotype will be assigned a short ID within PMO to improve the scalability of the format.
  • The detected microhaplotypes: Microhaplotypes called for each sample for each target and the associated reads. This will be linked to the above table using the generated microhaplotype ID instead of the full microhaplotype sequence.
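
The idea behind the two components can be sketched in a few lines: store each unique sequence once per target, and let the per-sample records refer to it by a short ID. The code below is an illustration only. It uses a list index as the ID and invents its own field names, so the actual ID scheme and layout used by PMO may differ.

```python
def assign_haplotype_ids(calls):
    """Split (sample, target, sequence, reads) calls into two linked tables.

    Returns (representative, detected): `representative` maps each target to its
    list of unique sequences, and `detected` refers to a sequence by its index.
    """
    representative = {}
    detected = []
    for sample, target, seq, reads in calls:
        seqs = representative.setdefault(target, [])
        if seq not in seqs:
            seqs.append(seq)  # first time this sequence is seen for the target
        detected.append(
            {"sample": sample, "target": target, "haplotype_index": seqs.index(seq), "reads": reads}
        )
    return representative, detected

# Made-up calls with short toy sequences
calls = [
    ("s1", "t1", "ACGT", 13),
    ("s2", "t1", "ACGA", 4),
    ("s3", "t1", "ACGT", 22),
]
representative, detected = assign_haplotype_ids(calls)
print(representative)
print(detected)
```

Samples s1 and s3 share the same sequence, so both records point at index 0 and the sequence itself is stored only once.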

First we will load an example allele table that may be similar to something you have from your own microhaplotype pipeline. This table includes the sample ID, the target (locus), the ASV, and the number of reads detected for each.

Code
example_allele_table = pd.read_csv(allele_table_path, sep='\t')
example_allele_table.head()
(index) | SampleID | Locus | ASV | Reads | Allele | PseudoCIGAR
0 | SRR26819553 | Pf3D7_01_v3-145388-145662-1A | GATATGTTTAAATATATGATTCTCGAAAAAACTTTTTTTATTTTTT... | 13 | Pf3D7_01_v3-145388-145662-1A.1 | 25+25N169+8N188+9N
1 | SRR26819207 | Pf3D7_01_v3-145388-145662-1A | GATATGTTTAAATATATGATTCTCGAAAAAACTTTTTTTATTTTTT... | 4 | Pf3D7_01_v3-145388-145662-1A.2 | 25+25N94A139T169+8N188+9N
2 | SRR26819545 | Pf3D7_01_v3-145388-145662-1A | GATATGTTTAAATATATGATTCTCGAAAAAACTTTTTTTATTTTTT... | 22 | Pf3D7_01_v3-145388-145662-1A.1 | 25+25N169+8N188+9N
3 | SRR26819527 | Pf3D7_01_v3-145388-145662-1A | GATATGTTTAAATATATGATTCTCGAAAAAACTTTTTTTATTTTTT... | 1 | Pf3D7_01_v3-145388-145662-1A.2 | 25+25N94A139T169+8N188+9N
4 | SRR26819214 | Pf3D7_01_v3-145388-145662-1A | GATATGTTTAAATATATGATTCTCGAAAAAACTTTTTTTATTTTTT... | 14 | Pf3D7_01_v3-145388-145662-1A.3 | 25+25N139T169+8N188+9N

Let’s have a look at the function we will use to create this part of PMO, microhaplotype_table_to_pmo_dict.

Code
print(microhaplotype_table_to_pmo_dict.__doc__)

    Convert a dataframe of a microhaplotype calls into a dictionary containing a dictionary for the haplotypes_detected and a dictionary for the representative_haplotype_sequences.

    :param contents: The dataframe containing microhaplotype calls
    :param bioinfo_id: the bioinformatics ID of the microhaplotype table
    :param sampleID_col: the name of the column containing the sample IDs
    :param locus_col: the name of the column containing the locus IDs
    :param mhap_col: the name of the column containing the microhaplotype sequence
    :param reads_col: the name of the column containing the reads counts
    :param additional_hap_detected_cols: optional additional columns to add to the microhaplotype detected dictionary, the key is the pandas column and the value is what to name it in the output
    :return: a dict of both the haplotypes_detected and representative_haplotype_sequences
    

We can see that we need a dataframe with columns for sample IDs, locus names, microhaplotype sequences, and their corresponding read counts. We also need to supply a unique bioinformatics ID. Including this ID allows us to store results from multiple bioinformatics pipelines run on the same sequencing data in a unified format if necessary.

Here we set a bioinformatics ID, so we can use the same one when generating other tables later on.

Code
bioinfo_id = "Mozambique2018-MAD4HatTeR"

The function has default column names that align with the standard output from DADA2. However, since we’re using MAD4HatTeR data, which has slightly different column headers, we’ll need to specify these column names explicitly in the function.

Code
microhaplotype_info = microhaplotype_table_to_pmo_dict(
    example_allele_table,
    sampleID_col="SampleID",
    locus_col="Locus",
    mhap_col="ASV",
    reads_col="Reads",
    bioinfo_id=bioinfo_id,
)

Demultiplexed Experiment Samples

We also include information on the demultiplexed reads for each sample, for each target, using a function called demultiplexed_targets_to_pmo_dict.

Code
print(demultiplexed_targets_to_pmo_dict.__doc__)

    Convert a dataframe of microhaplotype calls into a dictionary for detected haplotypes 
    and representative haplotype sequences.

    :param contents: DataFrame containing demultiplexed sample information
    :param bioinfo_id: Bioinformatics ID of the demultiplexed targets
    :param sampleID_col: Name of the column containing sample IDs
    :param target_id_col: Name of the column containing locus IDs
    :param read_count_col: Name of the column containing read counts
    :param additional_hap_detected_cols: Optional columns to include in the output,
                                         with keys as column names and values as their output names
    :return: JSON string containing the processed data
    
Code
amplicon_coverage = pd.read_csv(demultiplexed_reads_path, sep='\t')
amplicon_coverage.head()
(index) | SampleID | Locus | Reads | OutputDada2 | OutputPostprocessing
0 | SRR26819135 | Pf3D7_01_v3-145388-145662-1A | 54 | 54 | 54
1 | SRR26819135 | Pf3D7_01_v3-162867-163115-1A | 400 | 398 | 398
2 | SRR26819135 | Pf3D7_01_v3-181512-181761-1A | 266 | 266 | 266
3 | SRR26819135 | Pf3D7_01_v3-455794-456054-1A | 81 | 80 | 80
4 | SRR26819135 | Pf3D7_01_v3-528859-529104-1A | 485 | 485 | 485
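
The rows above also make it easy to see how many reads survive the pipeline. The sketch below assumes that OutputDada2 and OutputPostprocessing hold the read counts remaining after those steps, which is our reading of the column names; only the Reads column is passed to PMO in this tutorial.

```python
# (SampleID, Locus, Reads, OutputDada2, OutputPostprocessing), copied from the table above
coverage_rows = [
    ("SRR26819135", "Pf3D7_01_v3-145388-145662-1A", 54, 54, 54),
    ("SRR26819135", "Pf3D7_01_v3-162867-163115-1A", 400, 398, 398),
    ("SRR26819135", "Pf3D7_01_v3-181512-181761-1A", 266, 266, 266),
    ("SRR26819135", "Pf3D7_01_v3-455794-456054-1A", 81, 80, 80),
    ("SRR26819135", "Pf3D7_01_v3-528859-529104-1A", 485, 485, 485),
]

def fraction_retained(rows):
    """Map (sample, locus) to post-processing reads divided by demultiplexed reads."""
    return {
        (sample, locus): (post / raw if raw else None)  # None when there were no reads
        for sample, locus, raw, _dada2, post in rows
    }

retained = fraction_retained(coverage_rows)
total_raw = sum(row[2] for row in coverage_rows)
print(total_raw)
```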

Note that we use the same bioinfo_id that we set above.

Code
demultiplexed_targets_pmo = demultiplexed_targets_to_pmo_dict(
    amplicon_coverage,
    bioinfo_id,
    sampleID_col='SampleID',
    target_id_col='Locus',
    read_count_col='Reads',
)
Code
print_json_head(demultiplexed_targets_pmo, 15)
{
    "target_demultiplexed_experiment_samples": {
        "Mozambique2018-MAD4HatTeR": {
            "demultiplexed_experiment_samples": {
                "SRR26819135": {
                    "demultiplexed_targets": {
                        "experiment_sample_id": "SRR26819135",
                        "Pf3D7_01_v3-145388-145662-1A": {
                            "raw_read_count": 54,
                            "target_id": "Pf3D7_01_v3-145388-145662-1A"
                        },
                        "Pf3D7_01_v3-162867-163115-1A": {
                            "raw_read_count": 400,
                            "target_id": "Pf3D7_01_v3-162867-163115-1A"
                        },
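
As a sanity check on this structure, the raw read counts can be summed per sample. Note that in the output above the experiment_sample_id string sits alongside the per-target dictionaries, so the sketch skips any value that is not a dictionary. The helper name is ours.

```python
# Excerpt copied from the printed output above
demux_excerpt = {
    "target_demultiplexed_experiment_samples": {
        "Mozambique2018-MAD4HatTeR": {
            "demultiplexed_experiment_samples": {
                "SRR26819135": {
                    "demultiplexed_targets": {
                        "experiment_sample_id": "SRR26819135",
                        "Pf3D7_01_v3-145388-145662-1A": {
                            "raw_read_count": 54,
                            "target_id": "Pf3D7_01_v3-145388-145662-1A",
                        },
                        "Pf3D7_01_v3-162867-163115-1A": {
                            "raw_read_count": 400,
                            "target_id": "Pf3D7_01_v3-162867-163115-1A",
                        },
                    }
                }
            }
        }
    }
}

def raw_reads_per_sample(demux, bioinfo_id):
    """Sum raw_read_count over all targets for each experiment sample."""
    samples = demux["target_demultiplexed_experiment_samples"][bioinfo_id][
        "demultiplexed_experiment_samples"
    ]
    totals = {}
    for sample_id, sample in samples.items():
        totals[sample_id] = sum(
            entry["raw_read_count"]
            for entry in sample["demultiplexed_targets"].values()
            if isinstance(entry, dict)  # skip the experiment_sample_id string
        )
    return totals

totals = raw_reads_per_sample(demux_excerpt, "Mozambique2018-MAD4HatTeR")
print(totals)
```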

Sequencing info

PMO includes details of the sequencing run. Below is an example; more information on the required fields can be found in the documentation.

Code
sequencing_infos = {
        "Mozambique2018" : 
        {
            "lib_kit" : "TruSeq i5/i7 barcode primers",
            "lib_layout" : "paired-end",
            "lib_screen" : "40 µL reaction containing 10 µL of bead purified digested product, 18 µL of nuclease-free water, 8 µL of 5X secondary PCR master mix, and 5 µL of 10 µM TruSeq i5/i7 barcode primers",
            "nucl_acid_amp" : "https://www.paragongenomics.com/targeted-sequencing/amplicon-sequencing/cleanplex-ngs-amplicon-sequencing/",
            "nucl_acid_date" : "2019-07-15",
            "nucl_acid_ext" : "https://www.paragongenomics.com/targeted-sequencing/amplicon-sequencing/cleanplex-ngs-amplicon-sequencing/",
            "pcr_cond" : "10 min at 95°C, 13 cycles for high density samples (or 15 cycles for low density samples) of 15 sec at 98°C and 75 sec at 60°C",
            "seq_center" : "UCSF",
            "seq_date" : "2019-07-15",
            "seq_instrument" : "NextSeq 550 instrument",
            "sequencing_info_id" : "run1"
        }
    }
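
Each experiment entry carries a sequencing_info_id ("run1" in the example metadata), so it is worth confirming that this ID is defined in the sequencing information. The sketch below matches on the sequencing_info_id field inside each entry rather than on the outer key ("Mozambique2018" here). Which of the two PMO tools use for lookups is not covered in this tutorial, so treat that choice as an assumption.

```python
def undefined_sequencing_ids(experiment_infos, sequencing_infos):
    """Return sequencing_info_id values used by experiments but never defined."""
    defined = {info["sequencing_info_id"] for info in sequencing_infos.values()}
    used = {info["sequencing_info_id"] for info in experiment_infos.values()}
    return sorted(used - defined)

# Trimmed excerpts of the dictionaries shown in this tutorial
sequencing_excerpt = {"Mozambique2018": {"sequencing_info_id": "run1", "seq_center": "UCSF"}}
experiment_excerpt = {
    "SRR26819135": {"sequencing_info_id": "run1"},
    "SRR26819139": {"sequencing_info_id": "run1"},
}
undefined = undefined_sequencing_ids(experiment_excerpt, sequencing_excerpt)
print(undefined)
```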

Bioinformatics Info

Now we manually enter some information on the bioinformatics run. More information on the fields can be found in the documentation. Below is an example from the MAD4HatTeR pipeline.

Code
taramp_bioinformatics_infos = {
    bioinfo_id : 
    {
        "demultiplexing_method" : 
        {
            "program" : "Cutadapt extractorPairedEnd",
            "purpose" : "Takes raw paired-end reads and demultiplexes on primers and does QC filtering",
            "version" : "v4.4"
        },
        "denoising_method" : 
        {
            "program" : "DADA2",
            "purpose" : "Takes sequences per sample per target and clusters them",
            "version" : "v3.16"
        },
        "tar_amp_bioinformatics_info_id" : bioinfo_id
    }
}

Compose PMO

To create our final PMO we will put together all of the parts we have created.

Code
# Put together all of the parts created above
format_pmo = {
    "experiment_infos": experiment_info_json,  
    "sequencing_infos": sequencing_infos, 
    "specimen_infos": specimen_info_json, 
    "taramp_bioinformatics_infos": taramp_bioinformatics_infos, 
    **microhaplotype_info, 
    **panel_information_pmo,
    **demultiplexed_targets_pmo,
}
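
The `**` syntax copies the top-level keys of each dictionary into the new one. If two dictionaries share a top-level key, Python keeps the later value without any warning. Where that would be a concern, a stricter merge can be sketched as follows; `merge_sections` is a hypothetical helper, not part of pmotools.

```python
def merge_sections(*sections):
    """Merge dictionaries, raising if two of them share a top-level key."""
    merged = {}
    for section in sections:
        overlap = merged.keys() & section.keys()
        if overlap:
            raise ValueError(f"Duplicate top-level keys: {sorted(overlap)}")
        merged.update(section)
    return merged

# Placeholder sections standing in for the dictionaries built in this tutorial
panel_section = {"panel_info": {}}
demux_section = {"target_demultiplexed_experiment_samples": {}}
combined = merge_sections({"specimen_infos": {}}, panel_section, demux_section)
print(sorted(combined))
```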

Finally we output this to a file.

Code
# Write to a JSON file
output_file = "example_pmo.json"
with open(output_file, "w") as f:
    json.dump(format_pmo, f, indent=4)
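
A quick way to confirm the file was written correctly is to read it back and compare it with what was saved. The sketch below does this with a small placeholder dictionary and a temporary directory, so it does not touch example_pmo.json; `write_and_reload` is our own helper name.

```python
import json
import os
import tempfile

def write_and_reload(pmo, path):
    """Write `pmo` as JSON to `path`, then read it back and return the result."""
    with open(path, "w") as f:
        json.dump(pmo, f, indent=4)
    with open(path) as f:
        return json.load(f)

# Placeholder content; in the tutorial this would be format_pmo
toy_pmo = {"specimen_infos": {"SAMN38241219": {"samp_taxon_id": 5833}}}
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = write_and_reload(toy_pmo, os.path.join(tmp_dir, "toy_pmo.json"))

print(reloaded == toy_pmo)
```

The comparison holds here because all keys are strings. JSON object keys are always strings, so a dictionary with integer keys would not compare equal after a round trip.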

That’s it! You have put together a PMO, congratulations.

Next time you should be able to reuse multiple parts with minor tweaks. See the rest of the documentation for ways that you can work with your new PMO file.