First we will work on putting the panel information into PMO format. Although labs may store this information in a variety of ways and this process may seem cumbersome, you will only have to do this once for each panel that you work with.
The panel information consists of two parts: the panel_targets (information on the targets) and the target_genome (information on the reference genome being targeted).
To include details of the reference genome, we need the following information:
name : name of the genome
version : the genome version
taxon_id : the NCBI taxonomy number
url : a link from which this genome file can be downloaded
Optionally, you can also include a link to the genome's annotation file, as we do here. Below is an example of compiling this information into JSON format manually:
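A minimal sketch of such a dictionary follows. Every value, and the annotation-link field name, is an illustrative placeholder rather than the actual PMO example; substitute the details of the genome your own panel targets.

```python
# Hypothetical genome information compiled manually; all values below are
# placeholders to be replaced with your own genome's details.
genome_info = {
    "name": "PlasmoDB-54_Pfalciparum3D7",  # name of the genome
    "version": "54",                       # the genome version
    "taxon_id": 36329,                     # NCBI taxonomy number (P. falciparum 3D7 shown as an example)
    "url": "https://example.org/Pfalciparum3D7_Genome.fasta",  # placeholder download link
    # Assumed field name for the optional annotation link; check the PMO
    # documentation for the exact key expected.
    "gff_url": "https://example.org/Pfalciparum3D7.gff",
}
print(sorted(genome_info))
```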
The fields required to define the target information are:
target_id : a unique identifier for each of the targets
forward primer sequence : the sequence of the forward primer associated with this target
reverse primer sequence : the sequence of the reverse primer associated with this target
Note: if you have multiple primer pairs targeting the same region, include them on separate rows of the table with the same target_id.
Optionally, you can also include location information for the primers. To do so, include the following columns in the table:
chrom : the chromosome name
start : the start of the location, 0-based positioning
end : the end of the location, 0-based positioning
For more information on optional fields that can be included, check the documentation.
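Putting the required fields together, a minimal panel table could be sketched like this. The column names and primer sequences are hypothetical, chosen only to illustrate the shape of the table; note that target t2 has two primer pairs and therefore occupies two rows with the same target_id.

```python
import pandas as pd

# Hypothetical panel_targets table; names and sequences are invented.
panel_targets = pd.DataFrame({
    "target_id":  ["t1", "t2", "t2"],  # t2 has two primer pairs
    "fwd_primer": ["ACGTACGTAC", "TTGGCCAATT", "TTGGCGAATT"],
    "rev_primer": ["GGCCTTAAGG", "AACCGGTTAA", "AACCGATTAA"],
})
print(panel_targets)
```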
Here we show how to take panel information that is used to run the MAD4HatTeR pipeline and convert it to PMO.
We can use the panel_info_table_to_pmo_dict function to convert this into the correct format for PMO.
Code
print(panel_info_table_to_pmo_dict.__doc__)
Convert a dataframe containing panel information into dictionary of targets and reference information
:param target_table: The dataframe containing the target information
:param panel_id: the panel ID assigned to the panel
:param genome_info: A dictionary containing the genome information
:param target_id_col: the name of the column containing the target IDs
:param forward_primers_seq_col: the name of the column containing the sequence of the forward primer
:param reverse_primers_seq_col: the name of the column containing the sequence of the reverse primer
:param forward_primers_start_col (Optional): the name of the column containing the 0-based start coordinate of the forward primer
:param forward_primers_end_col (Optional): the name of the column containing the 0-based end coordinate of the forward primer
:param reverse_primers_start_col (Optional): the name of the column containing the 0-based start coordinate of the reverse primer
:param reverse_primers_end_col (Optional): the name of the column containing the 0-based end coordinate of the reverse primer
:param insert_start_col (Optional): the name of the column containing the 0-based start coordinate of the insert
:param insert_end_col (Optional): the name of the column containing the 0-based end coordinate of the insert
:param chrom_col (Optional): the name of the column containing the chromosome for the target
:param gene_id_col (Optional): the name of the column containing the gene id
:param strand_col (Optional): the name of the column containing the strand for the target
:param target_type_col (Optional): A classification type for the target
:param additional_target_info_cols (Optional): dictionary of optional additional columns to add to the target information dictionary. Keys are column names and values are the type.
:return: a dict of the panel information
We will first use this to include just the most basic required information.
Optionally we can include some more information. You can see that some of the fields in the table above don’t directly match the optional fields. Therefore, we must first wrangle the data slightly to fit the requirements.
Note: you may not need to apply all of the following steps to your own panel information; this is just an example, specific to the MAD4HatTeR panel information.
The chromosome for each target is stored within the locus name, so below we extract it into its own column.
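That extraction can be sketched as follows, assuming (hypothetically) that locus names carry the chromosome in the first dash-separated field; adjust the split pattern to your own panel's naming scheme.

```python
import pandas as pd

# Toy locus names with the chromosome as the first dash-separated field.
madhatter_panel_info = pd.DataFrame(
    {"locus": ["Pf3D7_05_v3-100-200", "Pf3D7_07_v3-300-400"]}
)
# Keep everything before the first dash as the chromosome name.
madhatter_panel_info["chrom"] = madhatter_panel_info["locus"].str.split("-").str[0]
print(madhatter_panel_info["chrom"].tolist())
```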
Next, we need to generate 0-based coordinates for the locations the primers are targeting. The panel information we have includes only the full target start and end (including the primers) and is 1-based, so we do the conversion as follows.
Code
# Create 0-based coordinate for the start of the forward primer
madhatter_panel_info['fwd_primer_start_0_based'] = madhatter_panel_info.amplicon_start - 1
# Calculate the length of the forward primer; add this to the primer start coordinate to get the end coordinate
madhatter_panel_info['fwd_primer_len'] = [len(p) for p in madhatter_panel_info.fwd_primer]
madhatter_panel_info['fwd_primer_end_0_based'] = madhatter_panel_info.fwd_primer_start_0_based + madhatter_panel_info.fwd_primer_len
# Calculate the length of the reverse primer; subtract this from the end coordinate of the target to get the start coordinate of the reverse primer
madhatter_panel_info['rev_primer_len'] = [len(p) for p in madhatter_panel_info.rev_primer]
madhatter_panel_info['rev_primer_start_0_based'] = madhatter_panel_info.amplicon_end - madhatter_panel_info.rev_primer_len
# The 0-based reverse primer end is the same as the amplicon end
madhatter_panel_info['rev_primer_end_0_based'] = madhatter_panel_info.amplicon_end
In the MAD4HatTeR pipeline, we trim one base from each end of the amplicon insert because the base following a primer is often erroneous. We can create insert coordinates with this adjustment, as shown below. If you choose not to apply this trimming step, you can instead use the coordinate at the end of the forward primer and the beginning of the reverse primer to define the start and end of the insert.
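A self-contained sketch of that adjustment with toy coordinates (the two primer columns stand in for the ones computed from the panel table):

```python
import pandas as pd

# Toy 0-based primer coordinates.
df = pd.DataFrame({
    "fwd_primer_end_0_based":   [25, 30],
    "rev_primer_start_0_based": [210, 305],
})
# Trim one base from each end of the insert, as MAD4HatTeR does; without
# trimming, the primer coordinates themselves would bound the insert.
df["insert_start_0_based"] = df.fwd_primer_end_0_based + 1
df["insert_end_0_based"] = df.rev_primer_start_0_based - 1
print(df[["insert_start_0_based", "insert_end_0_based"]])
```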
You can also add your own custom fields using the additional_target_info_cols parameter. Below we add the amplicon insert length information. If there is a field that you want to add and think others would find useful, please contact us and we can add it in. This way we can make sure to keep ontologies consistent!
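For instance, an insert-length column could be computed and then declared via additional_target_info_cols; the column name here is an assumption, and the mapping format follows the docstring above (keys are column names, values are types).

```python
import pandas as pd

# Toy insert coordinates; compute a hypothetical insert_length column.
df = pd.DataFrame({"insert_start_0_based": [26], "insert_end_0_based": [209]})
df["insert_length"] = df.insert_end_0_based - df.insert_start_0_based
# Per the docstring, keys are column names and values are the type.
additional_target_info_cols = {"insert_length": int}
print(df.insert_length.tolist())
```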
Now we put together the specimen level metadata. This is the metadata associated with the sample collected from the host. For more information on this section see the documentation.
Code
print(specimen_info_table_to_json.__doc__)
Converts a DataFrame containing specimen information into JSON.
:param contents (pd.DataFrame): The input DataFrame containing experiment data.
:param specimen_id_col (str): The column name for specimen sample IDs. Default: specimen_id
:param samp_taxon_id (int): NCBI taxonomy number of the organism. Default: samp_taxon_id
:param collection_date (string): Date of the sample collection. Default: collection_date
:param collection_country (string): Name of country collected in (admin level 0). Default : collection_country
:param collector (string): Name of the primary person managing the specimen. Default: collector
:param samp_store_loc (string): Sample storage site. Default: samp_store_loc
:param samp_collect_device (string): The way the sample was collected. Default : samp_collect_device
:param project_name (string): Name of the project. Default : project_name
:param alternate_identifiers (Optional[str]): List of optional alternative names for the samples
:param geo_admin1 (Optional[str]): Geographical admin level 1
:param geo_admin2 (Optional[str]): Geographical admin level 2
:param geo_admin3 (Optional[str]): Geographical admin level 3
:param host_taxon_id (Optional[int]): NCBI taxonomy number of the host
:param individual_id (Optional[str]): ID for the individual a specimen was collected from
:param lat_lon (Optional[str]): Latitude and longitude of the collection site
:param parasite_density (Optional[float]): The parasite density
:param plate_col (Optional[int]): Column the specimen was in in the plate
:param plate_name (Optional[str]): Name of plate the specimen was in
:param plate_row (Optional[str]): Row the specimen was in in the plate
:param sample_comments (Optional[str]): Additional comments about the sample
:param additional_specimen_cols (Optional[List[str], None]]): Additional column names to include
:return: JSON format where keys are `specimen_id` and values are corresponding row data.
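As a sketch, a specimen table using the function's default column names might look like this; all values are invented placeholders.

```python
import pandas as pd

# Hypothetical specimen metadata with the default column names; every
# value is a placeholder.
specimen_info = pd.DataFrame({
    "specimen_id": ["spec1", "spec2"],
    "samp_taxon_id": [5833, 5833],  # e.g. NCBI taxon ID for P. falciparum
    "collection_date": ["2018-06-01", "2018-06-02"],
    "collection_country": ["Mozambique", "Mozambique"],
    "collector": ["A. Collector", "A. Collector"],
    "samp_store_loc": ["UCSF", "UCSF"],
    "samp_collect_device": ["dried blood spot", "dried blood spot"],
    "project_name": ["Mozambique2018", "Mozambique2018"],
})
print(list(specimen_info.columns))
```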
This section shows how to put together the experiment level metadata. More information on this table can be found in the documentation.
Code
pd.read_excel('example_data/mad4hatter_experiment_info_table_example.xlsx')
Code
print(experiment_info_table_to_json.__doc__)
Converts a DataFrame containing experiment information into JSON.
:param contents (pd.DataFrame): Input DataFrame containing experiment data.
:param experiment_sample_id_col (str): Column name for experiment sample IDs. Default: experiment_sample_id
:param sequencing_info_id (str): Column name for sequencing information IDs. Default: sequencing_info_id
:param specimen_id (str): Column name for specimen IDs. Default: specimen_id
:param panel_id (str): Column name for panel IDs. Default: panel_id
:param accession (Optional[str]): Column name for accession information.
:param plate_col (Optional[int]): Column index for plate information.
:param plate_name (Optional[str]): Column name for plate names.
:param plate_row (Optional[str]): Column name for plate rows.
:param additional_experiment_cols (Optional[List[str], None]]): Additional column names to include.
:return: JSON format where keys are `experiment_sample_id` and values are corresponding row data.
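A sketch of an experiment table with the default column names; the IDs are placeholders, which in a real PMO must match the sequencing, specimen, and panel sections.

```python
import pandas as pd

# Hypothetical experiment metadata; in real data the IDs cross-reference
# the other PMO sections.
experiment_info = pd.DataFrame({
    "experiment_sample_id": ["exp1", "exp2"],
    "sequencing_info_id": ["run1", "run1"],
    "specimen_id": ["spec1", "spec2"],
    "panel_id": ["MAD4HatTeR", "MAD4HatTeR"],
})
print(experiment_info.shape)
```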
Next, we’ll organize the microhaplotype information into the required format.
This involves two components that we will generate from one table (click on the links to find out more about each part):
The representative microhaplotype details: a summary, for each target, of all of the unique microhaplotypes called within the population included in your PMO. Each unique microhaplotype is assigned a short ID within the PMO to improve the scalability of the format.
The detected microhaplotypes: the microhaplotypes called for each sample at each target, with their associated read counts. These are linked to the representative table via the generated microhaplotype ID rather than the full microhaplotype sequence.
First we will load an example allele table that may be similar to something you have from your own microhaplotype pipeline. This table includes a sampleID, the target, and the ASV and number of reads detected for each of these.
Let’s have a look at the function we will use to create this part of the PMO: microhaplotype_table_to_pmo_dict
Code
print(microhaplotype_table_to_pmo_dict.__doc__)
Convert a dataframe of a microhaplotype calls into a dictionary containing a dictionary for the haplotypes_detected and a dictionary for the representative_haplotype_sequences.
:param contents: The dataframe containing microhaplotype calls
:param bioinfo_id: the bioinformatics ID of the microhaplotype table
:param sampleID_col: the name of the column containing the sample IDs
:param locus_col: the name of the column containing the locus IDs
:param mhap_col: the name of the column containing the microhaplotype sequence
:param reads_col: the name of the column containing the reads counts
:param additional_hap_detected_cols: optional additional columns to add to the microhaplotype detected dictionary, the key is the pandas column and the value is what to name it in the output
:return: a dict of both the haplotypes_detected and representative_haplotype_sequences
We can see that we need a dataframe with columns for sample IDs, locus names, microhaplotype sequences, and their corresponding read counts. We also need to supply a unique bioinformatics ID. Including this ID allows us to store results from multiple bioinformatics pipelines run on the same sequencing data in a unified format if necessary.
Here we set a bioinformatics ID, so we can use the same one when generating other tables later on.
Code
bioinfo_id = "Mozambique2018-MAD4HatTeR"
The function has default column names that align with the standard output from DADA2. However, since we’re using MAD4HatTeR data, which has slightly different column headers, we’ll need to specify these column names explicitly in the function.
We also include information on the demultiplexed reads for each sample at each target, using a function called demultiplexed_targets_to_pmo_dict.
Code
print(demultiplexed_targets_to_pmo_dict.__doc__)
Convert a dataframe of microhaplotype calls into a dictionary for detected haplotypes
and representative haplotype sequences.
:param contents: DataFrame containing demultiplexed sample information
:param bioinfo_id: Bioinformatics ID of the demultiplexed targets
:param sampleID_col: Name of the column containing sample IDs
:param target_id_col: Name of the column containing locus IDs
:param read_count_col: Name of the column containing read counts
:param additional_hap_detected_cols: Optional columns to include in the output,
with keys as column names and values as their output names
:return: JSON string containing the processed data
The PMO also includes details of the sequencing run. Below is an example; more information on the required fields can be found in the documentation.
Code
sequencing_infos = {
    "Mozambique2018": {
        "lib_kit": "TruSeq i5/i7 barcode primers",
        "lib_layout": "paired-end",
        "lib_screen": "40 µL reaction containing 10 µL of bead purified digested product, 18 µL of nuclease-free water, 8 µL of 5X secondary PCR master mix, and 5 µL of 10 µM TruSeq i5/i7 barcode primers",
        "nucl_acid_amp": "https://www.paragongenomics.com/targeted-sequencing/amplicon-sequencing/cleanplex-ngs-amplicon-sequencing/",
        "nucl_acid_date": "2019-07-15",
        "nucl_acid_ext": "https://www.paragongenomics.com/targeted-sequencing/amplicon-sequencing/cleanplex-ngs-amplicon-sequencing/",
        "pcr_cond": "10 min at 95°C, 13 cycles for high density samples (or 15 cycles for low density samples) of 15 sec at 98°C and 75 sec at 60°C",
        "seq_center": "UCSF",
        "seq_date": "2019-07-15",
        "seq_instrument": "NextSeq 550 instrument",
        "sequencing_info_id": "run1",
    }
}
Bioinformatics Info
Now we manually enter some information on the bioinformatics run. More information on the fields can be found here. Below is an example from the MAD4HatTeR pipeline.
Code
taramp_bioinformatics_infos = {
    bioinfo_id: {
        "demultiplexing_method": {
            "program": "Cutadapt extractorPairedEnd",
            "purpose": "Takes raw paired-end reads and demultiplexes on primers and does QC filtering",
            "version": "v4.4",
        },
        "denoising_method": {
            "program": "DADA2",
            "purpose": "Takes sequences per sample per target and clusters them",
            "version": "v3.16",
        },
        "tar_amp_bioinformatics_info_id": bioinfo_id,
    }
}
Compose PMO
To create our final PMO, we put together all of the parts we have created.
Code
# Put together the information we have
format_pmo = {
    "experiment_infos": experiment_info_json,
    "sequencing_infos": sequencing_infos,
    "specimen_infos": specimen_info_json,
    "taramp_bioinformatics_infos": taramp_bioinformatics_infos,
    **microhaplotype_info,
    **panel_information_pmo,
    **demultiplexed_targets_pmo,
}
Finally, we output this to a file:
Code
# Write to a JSON file
output_file = "example_pmo.json"
with open(output_file, "w") as f:
    json.dump(format_pmo, f, indent=4)
That’s it! Congratulations, you have put together a PMO.
Next time you should be able to reuse multiple parts with minor tweaks. See the rest of the documentation for ways that you can work with your new PMO file.