First we will work on putting the panel information into PMO format. Although labs may store this information in a variety of ways and this process may seem cumbersome, you will only have to do this once for each panel that you work with.
The panel information consists of two parts: the panel_targets (information on the targets) and the target_genome (information on the reference genome being targeted).
To include details of the reference genome, we need the following information:
name : name of the genome
version : the genome version
taxon_id : the NCBI taxonomy number
url : a link from which this genome file can be downloaded
Optionally, you can also include a link to the genome's annotation file, as we do here. Below is an example of compiling this information into JSON format manually:
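A minimal sketch of such a dictionary follows. Every value, and the annotation-link field name, is an illustrative placeholder rather than the actual PMO example; substitute the details of the genome your own panel targets.

```python
# Hypothetical genome information compiled manually; all values below are
# placeholders to be replaced with your own genome's details.
genome_info = {
    "name": "PlasmoDB-54_Pfalciparum3D7",  # name of the genome
    "version": "54",                       # the genome version
    "taxon_id": 36329,                     # NCBI taxonomy number (P. falciparum 3D7 shown as an example)
    "url": "https://example.org/Pfalciparum3D7_Genome.fasta",  # placeholder download link
    # Assumed field name for the optional annotation link; check the PMO
    # documentation for the exact key expected.
    "gff_url": "https://example.org/Pfalciparum3D7.gff",
}
print(sorted(genome_info))
```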
The fields required to define the target information are:
target_id : a unique identifier for each of the targets
forward primer sequence : the sequence of the forward primer associated with this target
reverse primer sequence : the sequence of the reverse primer associated with this target
Note: if you have multiple primer pairs targeting the same region, include them on separate rows of the table with the same target_id.
Optionally, you can also include location information for the primers. To do so, include the following columns in the table:
chrom : the chromosome name
start : the start of the location, 0-based positioning
end : the end of the location, 0-based positioning
For more information on optional fields that can be included, check the documentation.
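Putting the required fields together, a minimal panel table could be sketched like this. The column names and primer sequences are hypothetical, chosen only to illustrate the shape of the table; note that target t2 has two primer pairs and therefore occupies two rows with the same target_id.

```python
import pandas as pd

# Hypothetical panel_targets table; names and sequences are invented.
panel_targets = pd.DataFrame({
    "target_id":  ["t1", "t2", "t2"],  # t2 has two primer pairs
    "fwd_primer": ["ACGTACGTAC", "TTGGCCAATT", "TTGGCGAATT"],
    "rev_primer": ["GGCCTTAAGG", "AACCGGTTAA", "AACCGATTAA"],
})
print(panel_targets)
```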
Here we show how to take panel information that is used to run the MAD4HatTeR pipeline and convert it to PMO.
We can use the panel_info_table_to_pmo_dict function to convert this into the correct format for PMO.
Code
print(panel_info_table_to_pmo_dict.__doc__)
Convert a dataframe containing panel information into dictionary of targets and reference information
:param target_table: The dataframe containing the target information
:param panel_id: the panel ID assigned to the panel
:param genome_info: A dictionary containing the genome information
:param target_id_col: the name of the column containing the target IDs
:param forward_primers_seq_col: the name of the column containing the sequence of the forward primer
:param reverse_primers_seq_col: the name of the column containing the sequence of the reverse primer
:param forward_primers_start_col (Optional): the name of the column containing the 0-based start coordinate of the forward primer
:param forward_primers_end_col (Optional): the name of the column containing the 0-based end coordinate of the forward primer
:param reverse_primers_start_col (Optional): the name of the column containing the 0-based start coordinate of the reverse primer
:param reverse_primers_end_col (Optional): the name of the column containing the 0-based end coordinate of the reverse primer
:param insert_start_col (Optional): the name of the column containing the 0-based start coordinate of the insert
:param insert_end_col (Optional): the name of the column containing the 0-based end coordinate of the insert
:param chrom_col (Optional): the name of the column containing the chromosome for the target
:param gene_id_col (Optional): the name of the column containing the gene id
:param strand_col (Optional): the name of the column containing the strand for the target
:param target_type_col (Optional): A classification type for the target
:param additional_target_info_cols (Optional): dictionary of optional additional columns to add to the target information dictionary. Keys are column names and values are the type.
:return: a dict of the panel information
We will first use this to include just the most basic required information.
Optionally we can include some more information. You can see that some of the fields in the table above don’t directly match the optional fields. Therefore, we must first wrangle the data slightly to fit the requirements.
Note: you may not need to apply all of the following steps to your own panel information; this is just an example, specific to the MAD4HatTeR panel information.
The chromosome for each target is stored within the locus name, so below we extract it into its own column.
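That extraction can be sketched as follows, assuming (hypothetically) that locus names carry the chromosome in the first dash-separated field; adjust the split pattern to your own panel's naming scheme.

```python
import pandas as pd

# Toy locus names with the chromosome as the first dash-separated field.
madhatter_panel_info = pd.DataFrame(
    {"locus": ["Pf3D7_05_v3-100-200", "Pf3D7_07_v3-300-400"]}
)
# Keep everything before the first dash as the chromosome name.
madhatter_panel_info["chrom"] = madhatter_panel_info["locus"].str.split("-").str[0]
print(madhatter_panel_info["chrom"].tolist())
```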
Next, we need to generate 0-based coordinates for the locations the primers are targeting. The panel information we have includes only the full target start and end (including the primers) and is 1-based, so we do the conversion as follows.
Code
# Create 0-based coordinate for the start of the forward primer
madhatter_panel_info['fwd_primer_start_0_based'] = madhatter_panel_info.amplicon_start - 1
# Calculate the length of the forward primer; add this to the primer start coordinate to get the end coordinate
madhatter_panel_info['fwd_primer_len'] = [len(p) for p in madhatter_panel_info.fwd_primer]
madhatter_panel_info['fwd_primer_end_0_based'] = madhatter_panel_info.fwd_primer_start_0_based + madhatter_panel_info.fwd_primer_len
# Calculate the length of the reverse primer; subtract this from the end coordinate of the target to get the start coordinate of the reverse primer
madhatter_panel_info['rev_primer_len'] = [len(p) for p in madhatter_panel_info.rev_primer]
madhatter_panel_info['rev_primer_start_0_based'] = madhatter_panel_info.amplicon_end - madhatter_panel_info.rev_primer_len
# The 0-based reverse primer end is the same as the amplicon end
madhatter_panel_info['rev_primer_end_0_based'] = madhatter_panel_info.amplicon_end
In the MAD4HatTeR pipeline, we trim one base from each end of the amplicon insert because the base following a primer is often erroneous. We can create insert coordinates with this adjustment, as shown below. If you choose not to apply this trimming step, you can instead use the coordinate at the end of the forward primer and the beginning of the reverse primer to define the start and end of the insert.
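A self-contained sketch of that adjustment with toy coordinates (the two primer columns stand in for the ones computed from the panel table):

```python
import pandas as pd

# Toy 0-based primer coordinates.
df = pd.DataFrame({
    "fwd_primer_end_0_based":   [25, 30],
    "rev_primer_start_0_based": [210, 305],
})
# Trim one base from each end of the insert, as MAD4HatTeR does; without
# trimming, the primer coordinates themselves would bound the insert.
df["insert_start_0_based"] = df.fwd_primer_end_0_based + 1
df["insert_end_0_based"] = df.rev_primer_start_0_based - 1
print(df[["insert_start_0_based", "insert_end_0_based"]])
```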
You can also add your own custom fields using the additional_target_info_cols parameter. Below we add the amplicon insert length information. If there is a field that you want to add and think others would find useful, please contact us and we can add it in. This way we can make sure to keep ontologies consistent!
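For instance, an insert-length column could be computed and then declared via additional_target_info_cols; the column name here is an assumption, and the mapping format follows the docstring above (keys are column names, values are types).

```python
import pandas as pd

# Toy insert coordinates; compute a hypothetical insert_length column.
df = pd.DataFrame({"insert_start_0_based": [26], "insert_end_0_based": [209]})
df["insert_length"] = df.insert_end_0_based - df.insert_start_0_based
# Per the docstring, keys are column names and values are the type.
additional_target_info_cols = {"insert_length": int}
print(df.insert_length.tolist())
```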
Now we put together the specimen level metadata. This is the metadata associated with the sample collected from the host. For more information on this section see the documentation.
Code
print(specimen_info_table_to_json.__doc__)
Converts a DataFrame containing specimen information into JSON.
:param contents (pd.DataFrame): The input DataFrame containing experiment data.
:param specimen_id_col (str): The column name for specimen sample IDs. Default: specimen_id
:param samp_taxon_id (int): NCBI taxonomy number of the organism. Default: samp_taxon_id
:param collection_date (string): Date of the sample collection. Default: collection_date
:param collection_country (string): Name of country collected in (admin level 0). Default : collection_country
:param collector (string): Name of the primary person managing the specimen. Default: collector
:param samp_store_loc (string): Sample storage site. Default: samp_store_loc
:param samp_collect_device (string): The way the sample was collected. Default : samp_collect_device
:param project_name (string): Name of the project. Default : project_name
:param alternate_identifiers (Optional[str]): List of optional alternative names for the samples
:param geo_admin1 (Optional[str]): Geographical admin level 1
:param geo_admin2 (Optional[str]): Geographical admin level 2
:param geo_admin3 (Optional[str]): Geographical admin level 3
:param host_taxon_id (Optional[int]): NCBI taxonomy number of the host
:param individual_id (Optional[str]): ID for the individual a specimen was collected from
:param lat_lon (Optional[str]): Latitude and longitude of the collection site
:param parasite_density (Optional[float]): The parasite density
:param plate_col (Optional[int]): Column the specimen was in in the plate
:param plate_name (Optional[str]): Name of plate the specimen was in
:param plate_row (Optional[str]): Row the specimen was in in the plate
:param sample_comments (Optional[str]): Additional comments about the sample
:param additional_specimen_cols (Optional[List[str], None]]): Additional column names to include
:return: JSON format where keys are `specimen_id` and values are corresponding row data.
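As a sketch, a specimen table using the function's default column names might look like this; all values are invented placeholders.

```python
import pandas as pd

# Hypothetical specimen metadata with the default column names; every
# value is a placeholder.
specimen_info = pd.DataFrame({
    "specimen_id": ["spec1", "spec2"],
    "samp_taxon_id": [5833, 5833],  # e.g. NCBI taxon ID for P. falciparum
    "collection_date": ["2018-06-01", "2018-06-02"],
    "collection_country": ["Mozambique", "Mozambique"],
    "collector": ["A. Collector", "A. Collector"],
    "samp_store_loc": ["UCSF", "UCSF"],
    "samp_collect_device": ["dried blood spot", "dried blood spot"],
    "project_name": ["Mozambique2018", "Mozambique2018"],
})
print(list(specimen_info.columns))
```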
This section shows how to put together the experiment level metadata. More information on this table can be found in the documentation.
Code
pd.read_excel('example_data/mad4hatter_experiment_info_table_example.xlsx')
Code
print(experiment_info_table_to_json.__doc__)
Converts a DataFrame containing experiment information into JSON.
:param contents (pd.DataFrame): Input DataFrame containing experiment data.
:param experiment_sample_id_col (str): Column name for experiment sample IDs. Default: experiment_sample_id
:param sequencing_info_id (str): Column name for sequencing information IDs. Default: sequencing_info_id
:param specimen_id (str): Column name for specimen IDs. Default: specimen_id
:param panel_id (str): Column name for panel IDs. Default: panel_id
:param accession (Optional[str]): Column name for accession information.
:param plate_col (Optional[int]): Column index for plate information.
:param plate_name (Optional[str]): Column name for plate names.
:param plate_row (Optional[str]): Column name for plate rows.
:param additional_experiment_cols (Optional[List[str], None]]): Additional column names to include.
:return: JSON format where keys are `experiment_sample_id` and values are corresponding row data.
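A sketch of an experiment table with the default column names; the IDs are placeholders, which in a real PMO must match the sequencing, specimen, and panel sections.

```python
import pandas as pd

# Hypothetical experiment metadata; in real data the IDs cross-reference
# the other PMO sections.
experiment_info = pd.DataFrame({
    "experiment_sample_id": ["exp1", "exp2"],
    "sequencing_info_id": ["run1", "run1"],
    "specimen_id": ["spec1", "spec2"],
    "panel_id": ["MAD4HatTeR", "MAD4HatTeR"],
})
print(experiment_info.shape)
```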
Next, we’ll organize the microhaplotype information into the required format.
This involves two components that we will generate from one table (click on the links to find out more about each part):
The representative microhaplotype details: a summary, for each target, of all of the unique microhaplotypes called within the population included in your PMO. Each unique microhaplotype is assigned a short ID within the PMO to improve the scalability of the format.
The detected microhaplotypes: the microhaplotypes called for each sample at each target, with their associated read counts. These are linked to the representative table via the generated microhaplotype ID rather than the full microhaplotype sequence.
First we will load an example allele table that may be similar to something you have from your own microhaplotype pipeline. This table includes a sampleID, the target, and the ASV and number of reads detected for each of these.
Let’s have a look at the function we will use to create this part of the PMO: microhaplotype_table_to_pmo_dict
Code
print(microhaplotype_table_to_pmo_dict.__doc__)
Convert a dataframe of a microhaplotype calls into a dictionary containing a dictionary for the haplotypes_detected and a dictionary for the representative_haplotype_sequences.
:param contents: The dataframe containing microhaplotype calls
:param bioinfo_id: the bioinformatics ID of the microhaplotype table
:param sampleID_col: the name of the column containing the sample IDs
:param locus_col: the name of the column containing the locus IDs
:param mhap_col: the name of the column containing the microhaplotype sequence
:param reads_col: the name of the column containing the reads counts
:param additional_hap_detected_cols: optional additional columns to add to the microhaplotype detected dictionary, the key is the pandas column and the value is what to name it in the output
:return: a dict of both the haplotypes_detected and representative_haplotype_sequences
We can see that we need a dataframe with columns for sample IDs, locus names, microhaplotype sequences, and their corresponding read counts. We also need to supply a unique bioinformatics ID. Including this ID allows us to store results from multiple bioinformatics pipelines run on the same sequencing data in a unified format if necessary.
Here we set a bioinformatics ID, so we can use the same one when generating other tables later on.
Code
bioinfo_id = "Mozambique2018-MAD4HatTeR"
The function has default column names that align with the standard output from DADA2. However, since we’re using MAD4HatTeR data, which has slightly different column headers, we’ll need to specify these column names explicitly in the function.
We also include information on the demultiplexed reads for each sample at each target, using a function called demultiplexed_targets_to_pmo_dict.
Code
print(demultiplexed_targets_to_pmo_dict.__doc__)
Convert a dataframe of microhaplotype calls into a dictionary for detected haplotypes
and representative haplotype sequences.
:param contents: DataFrame containing demultiplexed sample information
:param bioinfo_id: Bioinformatics ID of the demultiplexed targets
:param sampleID_col: Name of the column containing sample IDs
:param target_id_col: Name of the column containing locus IDs
:param read_count_col: Name of the column containing read counts
:param additional_hap_detected_cols: Optional columns to include in the output,
with keys as column names and values as their output names
:return: JSON string containing the processed data
The PMO also includes details of the sequencing run. Below is an example; more information on the required fields can be found in the documentation.
Code
sequencing_infos = {
    "Mozambique2018": {
        "lib_kit": "TruSeq i5/i7 barcode primers",
        "lib_layout": "paired-end",
        "lib_screen": "40 µL reaction containing 10 µL of bead purified digested product, 18 µL of nuclease-free water, 8 µL of 5X secondary PCR master mix, and 5 µL of 10 µM TruSeq i5/i7 barcode primers",
        "nucl_acid_amp": "https://www.paragongenomics.com/targeted-sequencing/amplicon-sequencing/cleanplex-ngs-amplicon-sequencing/",
        "nucl_acid_date": "2019-07-15",
        "nucl_acid_ext": "https://www.paragongenomics.com/targeted-sequencing/amplicon-sequencing/cleanplex-ngs-amplicon-sequencing/",
        "pcr_cond": "10 min at 95°C, 13 cycles for high density samples (or 15 cycles for low density samples) of 15 sec at 98°C and 75 sec at 60°C",
        "seq_center": "UCSF",
        "seq_date": "2019-07-15",
        "seq_instrument": "NextSeq 550 instrument",
        "sequencing_info_id": "run1",
    }
}
Bioinformatics Info
Now we manually enter some information on the bioinformatics run. More information on the fields can be found here. Below is an example from the MAD4HatTeR pipeline.
Code
taramp_bioinformatics_infos = {
    bioinfo_id: {
        "demultiplexing_method": {
            "program": "Cutadapt extractorPairedEnd",
            "purpose": "Takes raw paired-end reads and demultiplexes on primers and does QC filtering",
            "version": "v4.4",
        },
        "denoising_method": {
            "program": "DADA2",
            "purpose": "Takes sequences per sample per target and clusters them",
            "version": "v3.16",
        },
        "tar_amp_bioinformatics_info_id": bioinfo_id,
    }
}
Compose PMO
To create our final PMO, we put together all of the parts we have created.
Code
# Put together the information we have
format_pmo = {
    "experiment_infos": experiment_info_json,
    "sequencing_infos": sequencing_infos,
    "specimen_infos": specimen_info_json,
    "taramp_bioinformatics_infos": taramp_bioinformatics_infos,
    **microhaplotype_info,
    **panel_information_pmo,
    **demultiplexed_targets_pmo,
}
Finally, we output this to a file:
Code
# Write to a JSON file
output_file = "example_pmo.json"
with open(output_file, "w") as f:
    json.dump(format_pmo, f, indent=4)
That’s it! Congratulations, you have put together a PMO.
Next time you should be able to reuse multiple parts with minor tweaks. See the rest of the documentation for ways that you can work with your new PMO file.