Getting basic info out of PMO using pmotools-python

Extract basic info counts from PMO

These tools produce simple counts from a PMO file: the number of targets per sample, the number of samples per target, and counts of the specimen meta fields and their values.

Most of these basic info extractors are listed under the extract_basic_info_from_pmo group.

pmotools-runner.py

pmotools v1.0.0 - A suite of tools for interacting with Portable Microhaplotype Object (pmo) file format

Available functions organized by groups are
convertors_to_json
    text_meta_to_json_meta - Convert text file meta to JSON Meta
    excel_meta_to_json_meta - Convert excel file meta to JSON Meta
    microhaplotype_table_to_json_file - Convert microhaplotype table to JSON Meta
    terra_amp_output_to_json - Convert terra output table to JSON seq table

extractors_from_pmo
    extract_pmo_with_selected_meta - Extract from PMO samples and associated haplotypes with selected meta
    extract_pmo_with_select_specimen_ids - Extract from PMO specific samples from the specimens table
    extract_pmo_with_select_experiment_sample_ids - Extract from PMO specific experiment sample ids from the experiment_info table
    extract_pmo_with_select_targets - Extract from PMO specific targets
    extract_pmo_with_read_filter - Extract from PMO with a read filter
    extract_allele_table - Extract allele tables which can be as used as input to such tools as dcifer or moire

working_with_multiple_pmos
    combine_pmos - Combine multiple pmos of the same panel into a single pmo

extract_basic_info_from_pmo
    list_experiment_sample_ids_per_specimen_id - Each specimen_id can have multiple experiment_sample_ids, list out all in a PMO
    list_specimen_meta_fields - List out the specimen meta fields in the specimen_info section
    list_tar_amp_bioinformatics_info_ids - List out all the tar_amp_bioinformatics_info_ids in a PMO file
    count_specimen_meta - Count the values of specific specimen meta fields in the specimen_info section
    count_targets_per_sample - Count the number of targets per sample
    count_samples_per_target - Count the number of samples per target

extract_panel_info_from_pmo
    extract_insert_of_panels - Extract the insert of panels from a PMO
    extract_refseq_of_inserts_of_panels - Extract the ref_seq of panels from a PMO

Getting files for examples

cd example 

wget https://plasmogenepi.github.io/PMO_Docs/format/moz2018_PMO.json.gz
wget https://plasmogenepi.github.io/PMO_Docs/format/PathWeaverHeome1_PMO.json.gz

list_specimen_meta_fields

This lists all the meta fields within the specimen_infos section of a PMO file. Because not every meta field is present in every specimen, the output reports the number of specimens each field appears in alongside the total number of specimens.

pmotools-runner.py list_specimen_meta_fields -h

usage: pmotools-runner.py list_specimen_meta_fields [-h] --file FILE
                                                    [--output OUTPUT]
                                                    [--delim DELIM]
                                                    [--overwrite]

options:
  -h, --help       show this help message and exit
  --file FILE      PMO file
  --output OUTPUT  output file
  --delim DELIM    the delimiter of the output text file, examples input
                   tab,comma but can also be the actual delimiter
  --overwrite      If output file exists, overwrite it

The Python code for the list_specimen_meta_fields script is below.

pmotools-python/scripts/extract_info_from_pmo/list_specimen_meta_fields.py

#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict

import pandas as pd

from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_list_specimen_meta_fields():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, default="STDOUT", required=False, help='output file')
    parser.add_argument('--delim', default="tab", type=str, required=False, help='the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter')
    parser.add_argument('--overwrite', action = 'store_true', help='If output file exists, overwrite it')

    return parser.parse_args()

def list_specimen_meta_fields():
    args = parse_args_list_specimen_meta_fields()

    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(args.delim, gzip=args.output.endswith(".gz"))
    args.output = args.output if "STDOUT" == args.output else Utils.appendStrAsNeeded(args.output, output_extension)
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # count fields
    counts_df = PMOExtractor.count_specimen_meta_fields(pmo)

    # output
    counts_df.to_csv(sys.stdout if "STDOUT" == args.output else args.output, sep = output_delim, index=False)

if __name__ == "__main__":
    list_specimen_meta_fields()

cd example 
pmotools-runner.py list_specimen_meta_fields --file ../../format/moz2018_PMO.json.gz

field   presentInSpecimensCount totalSpecimenCount
collection_country  124 124
collection_date 124 124
collector   124 124
geo_admin3  124 124
host_taxon_id   124 124
lat_lon 124 124
parasite_density    124 124
plate_col   81  124
plate_name  81  124
plate_row   81  124
project_name    124 124
samp_collect_device 124 124
samp_store_loc  124 124
samp_taxon_id   124 124
specimen_id 124 124

cd example 
pmotools-runner.py list_specimen_meta_fields --file ../../format/moz2018_PMO.json.gz --output spec_fields_moz2018_PMO.tsv --overwrite
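
As a rough illustration of what the presentInSpecimensCount and totalSpecimenCount columns report, the sketch below counts field presence over a small, made-up specimen table. The dictionary layout (specimen_id mapped to a dictionary of meta fields) and the helper name count_meta_fields are assumptions for illustration only; this is not the PMOExtractor implementation.

```python
from collections import Counter

# Hypothetical stand-in for a specimen_infos section: specimen_id -> meta fields.
specimen_infos = {
    "spec1": {"specimen_id": "spec1", "collection_country": "Mozambique", "plate_name": "P1"},
    "spec2": {"specimen_id": "spec2", "collection_country": "Mozambique"},
    "spec3": {"specimen_id": "spec3", "collection_country": "Mozambique", "plate_name": "P2"},
}


def count_meta_fields(specimens):
    """Return rows of (field, presentInSpecimensCount, totalSpecimenCount)."""
    field_counts = Counter()
    for meta in specimens.values():
        # each specimen contributes at most one count per field name
        field_counts.update(meta.keys())
    total = len(specimens)
    return [(field, field_counts[field], total) for field in sorted(field_counts)]


rows = count_meta_fields(specimen_infos)
print("field\tpresentInSpecimensCount\ttotalSpecimenCount")
for field, present, total in rows:
    print(f"{field}\t{present}\t{total}")
```

In this toy table plate_name is present in 2 of 3 specimens, which mirrors how plate_col, plate_name and plate_row appear in only 81 of the 124 specimens above.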

count_specimen_meta

This counts the values of the selected meta fields within the specimen_infos section of a PMO file. When several fields are supplied, each combination of their values is counted.

pmotools-runner.py count_specimen_meta -h

usage: pmotools-runner.py count_specimen_meta [-h] --file FILE
                                              [--output OUTPUT]
                                              [--delim DELIM] [--overwrite]
                                              --meta_fields META_FIELDS

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       output file
  --delim DELIM         the delimiter of the output text file, examples input
                        tab,comma but can also be the actual delimiter
  --overwrite           If output file exists, overwrite it
  --meta_fields META_FIELDS
                        the fields to count the subfields of, can supply
                        multiple separated by commas, e.g. --meta_fields
                        collection_country,collection_date

The Python code for the count_specimen_meta script is below.

pmotools-python/scripts/extract_info_from_pmo/count_specimen_meta.py

#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict

import pandas as pd

from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_count_specimen_meta():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, default="STDOUT", required=False, help='output file')
    parser.add_argument('--delim', default="tab", type=str, required=False, help='the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter')
    parser.add_argument('--overwrite', action='store_true', help='If output file exists, overwrite it')
    parser.add_argument('--meta_fields', type=str, required=True, help='the fields to count the subfields of, can supply multiple separated by commas, e.g. --meta_fields collection_country,collection_date')

    return parser.parse_args()


def count_specimen_meta():
    args = parse_args_count_specimen_meta()

    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(args.delim, gzip=args.output.endswith(".gz"))
    args.output = args.output if "STDOUT" == args.output else Utils.appendStrAsNeeded(args.output, output_extension)
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # process the meta_fields argument
    meta_fields_toks = args.meta_fields.split(',')

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # count sub-fields
    counts_df = PMOExtractor.count_specimen_meta_subfields(pmo, meta_fields_toks)

    #write out
    counts_df.to_csv(sys.stdout if "STDOUT" == args.output else args.output, sep = output_delim, index=False)


if __name__ == "__main__":
    count_specimen_meta()

cd example 
pmotools-runner.py count_specimen_meta --file ../../format/moz2018_PMO.json.gz --meta_fields collection_country

collection_country  specimensCount  specimensFreq   totalSpecimenCount
Mozambique  81  0.6532258064516129  124
NA  43  0.3467741935483871  124

cd example 
pmotools-runner.py count_specimen_meta --file ../../format/moz2018_PMO.json.gz --meta_fields collection_country --overwrite --output collection_country_count_moz2018_PMO.tsv.gz

cd example 
pmotools-runner.py count_specimen_meta --file ../../format/moz2018_PMO.json.gz --meta_fields collection_country,geo_admin3

collection_country  geo_admin3  specimensCount  specimensFreq   totalSpecimenCount
Mozambique  Inhassoro   27  0.21774193548387097 124
Mozambique  Mandlakazi  28  0.22580645161290322 124
Mozambique  Namaacha    26  0.20967741935483872 124
NA  NA  43  0.3467741935483871  124

cd example 
pmotools-runner.py count_specimen_meta --file ../../format/PathWeaverHeome1_PMO.json.gz --meta_fields collection_country,collection_date | head

collection_country  collection_date specimensCount  specimensFreq   totalSpecimenCount
Bangladesh  2008    15  0.0007718828796377296   19433
Bangladesh  2009    16  0.000823341738280245    19433
Bangladesh  2012    51  0.002624401790768281    19433
Bangladesh  2015    508 0.026141100190397778    19433
Bangladesh  2016    816 0.041990428652292494    19433
Bangladesh  2017    12  0.0006175063037101837   19433
Benin   2014    41  0.002109813204343128    19433
Benin   2016    117 0.006020686461174291    19433
Brazil  1980    1   5.145885864251531e-05   19433
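
The counting of value combinations can be sketched with the standard library as follows. The specimen records, the helper name count_meta_combinations, and the choice to report a missing field as "NA" are all assumptions made for this example; the real logic lives in PMOExtractor.count_specimen_meta_subfields and may differ.

```python
from collections import Counter

# Hypothetical specimen records; the last one has neither field recorded.
specimens = [
    {"collection_country": "Mozambique", "geo_admin3": "Inhassoro"},
    {"collection_country": "Mozambique", "geo_admin3": "Namaacha"},
    {"collection_country": "Mozambique", "geo_admin3": "Inhassoro"},
    {},
]


def count_meta_combinations(specimens, meta_fields):
    """Return rows of (*values, specimensCount, specimensFreq, totalSpecimenCount)."""
    combos = Counter(
        # assumption: a field missing from a specimen is reported as "NA"
        tuple(str(spec.get(field, "NA")) for field in meta_fields)
        for spec in specimens
    )
    total = len(specimens)
    return [(*combo, count, count / total, total) for combo, count in sorted(combos.items())]


# same comma-separated form as the --meta_fields argument
meta_fields = "collection_country,geo_admin3".split(",")
combo_rows = count_meta_combinations(specimens, meta_fields)
for row in combo_rows:
    print("\t".join(str(value) for value in row))
```

The frequency column is simply the count divided by the total number of specimens, which is consistent with 81 / 124 = 0.653... in the output above.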

list_tar_amp_bioinformatics_info_ids

This lists all the analyses (the tar_amp_bioinformatics_info_ids) stored within a PMO file.

pmotools-runner.py list_tar_amp_bioinformatics_info_ids -h

usage: pmotools-runner.py list_tar_amp_bioinformatics_info_ids
       [-h] --file FILE [--output OUTPUT] [--overwrite]

options:
  -h, --help       show this help message and exit
  --file FILE      PMO file
  --output OUTPUT  output file
  --overwrite      If output file exists, overwrite it

The Python code for the list_tar_amp_bioinformatics_info_ids script is below.

pmotools-python/scripts/extract_info_from_pmo/list_tar_amp_bioinformatics_info_ids.py

#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict

import pandas as pd

from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_list_tar_amp_bioinformatics_info_ids():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, default="STDOUT", required=False, help='output file')
    parser.add_argument('--overwrite', action = 'store_true', help='If output file exists, overwrite it')

    return parser.parse_args()

def list_tar_amp_bioinformatics_info_ids():
    args = parse_args_list_tar_amp_bioinformatics_info_ids()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # extract all taramp_bioinformatics_ids
    bioids = pmo["taramp_bioinformatics_infos"].keys()

    # write
    output_target = sys.stdout if args.output == "STDOUT" else open(args.output, "w")
    with output_target as f:
        f.write("\n".join(bioids) + "\n")




if __name__ == "__main__":
    list_tar_amp_bioinformatics_info_ids()

cd example 
pmotools-runner.py list_tar_amp_bioinformatics_info_ids --file ../../format/moz2018_PMO.json.gz

Mozambique2018-SeekDeep

cd example 
pmotools-runner.py list_tar_amp_bioinformatics_info_ids --file ../../format/PathWeaverHeome1_PMO.json.gz

PathWeaverHeome1

This can be helpful after combining PMOs, since the combined file lists the analyses from every input PMO:

cd example 

pmotools-runner.py combine_pmos --pmo_files ../../format/moz2018_PMO.json.gz,../../format/PathWeaverHeome1_PMO.json.gz --output combined_Heome1_PMO.json.gz --overwrite

pmotools-runner.py list_tar_amp_bioinformatics_info_ids --file combined_Heome1_PMO.json.gz

Mozambique2018-SeekDeep
PathWeaverHeome1
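
For readers who want to peek at the ids without the runner, the sketch below reads a gzipped JSON file with the standard library and lists the keys of the taramp_bioinformatics_infos section, the key used in the script above. It assumes the PMO file is plain (optionally gzipped) JSON with that top-level key; PMOReader.read_in_pmo may do more than this, and the toy file and helper name here are invented for the example.

```python
import gzip
import json
import os
import tempfile


def list_bioinformatics_ids(pmo_path):
    """Return the keys of the taramp_bioinformatics_infos section of a PMO file."""
    # pick the opener from the file extension, as the example files end in .json.gz
    opener = gzip.open if pmo_path.endswith(".gz") else open
    with opener(pmo_path, "rt", encoding="utf-8") as handle:
        pmo = json.load(handle)
    return list(pmo["taramp_bioinformatics_infos"].keys())


# Write a tiny made-up PMO so the example runs without downloading anything.
toy_pmo = {"taramp_bioinformatics_infos": {"analysisA": {}, "analysisB": {}}}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_PMO.json.gz")
    with gzip.open(toy_path, "wt", encoding="utf-8") as handle:
        json.dump(toy_pmo, handle)
    bio_ids = list_bioinformatics_ids(toy_path)

print("\n".join(bio_ids))
```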

count_targets_per_sample

Count the number of targets each experiment_sample_id has. A read filter can be applied to see how many targets would be kept if such a filter were applied.

pmotools-runner.py count_targets_per_sample -h

usage: pmotools-runner.py count_targets_per_sample [-h] --file FILE
                                                   [--output OUTPUT]
                                                   [--delim DELIM]
                                                   [--overwrite]
                                                   [--read_count_minimum READ_COUNT_MINIMUM]

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       output file
  --delim DELIM         the delimiter of the output text file, examples input
                        tab,comma but can also be the actual delimiter
  --overwrite           If output file exists, overwrite it
  --read_count_minimum READ_COUNT_MINIMUM
                        the minimum read count (inclusive) to be counted as
                        covered by sample

The Python code for the count_targets_per_sample script is below.

pmotools-python/scripts/extract_info_from_pmo/count_targets_per_sample.py

#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict

import pandas as pd

from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_count_targets_per_sample():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, default="STDOUT", required=False, help='output file')
    parser.add_argument('--delim', default="tab", type=str, required=False, help='the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter')
    parser.add_argument('--overwrite', action='store_true', help='If output file exists, overwrite it')
    parser.add_argument('--read_count_minimum', default=0.0, type=float, required=False, help='the minimum read count (inclusive) to be counted as covered by sample')

    return parser.parse_args()


def count_targets_per_sample():
    args = parse_args_count_targets_per_sample()

    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(args.delim, gzip=args.output.endswith(".gz"))
    args.output = args.output if "STDOUT" == args.output else Utils.appendStrAsNeeded(args.output, output_extension)
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # count
    counts_df = PMOExtractor.count_targets_per_sample(pmo, args.read_count_minimum)

    #write out
    counts_df.to_csv(sys.stdout if "STDOUT" == args.output else args.output, sep = output_delim, index=False)


if __name__ == "__main__":
    count_targets_per_sample()

cd example 

pmotools-runner.py count_targets_per_sample --file ../../format/moz2018_PMO.json.gz  | head

tar_amp_bioinformatics_info_id  experiment_sample_id    target_number
Mozambique2018-SeekDeep 8025874217  99
Mozambique2018-SeekDeep 8025874231  99
Mozambique2018-SeekDeep 8025874234  97
Mozambique2018-SeekDeep 8025874237  99
Mozambique2018-SeekDeep 8025874250  98
Mozambique2018-SeekDeep 8025874253  99
Mozambique2018-SeekDeep 8025874261  99
Mozambique2018-SeekDeep 8025874266  85
Mozambique2018-SeekDeep 8025874271  99

Apply a read count minimum filter (this is the total read count summed across a target, not a per-haplotype count):

cd example 

pmotools-runner.py count_targets_per_sample --read_count_minimum 3000 --file ../../format/moz2018_PMO.json.gz  | head

tar_amp_bioinformatics_info_id  experiment_sample_id    target_number
Mozambique2018-SeekDeep 8025874217  99
Mozambique2018-SeekDeep 8025874231  73
Mozambique2018-SeekDeep 8025874234  93
Mozambique2018-SeekDeep 8025874237  98
Mozambique2018-SeekDeep 8025874250  68
Mozambique2018-SeekDeep 8025874253  99
Mozambique2018-SeekDeep 8025874261  98
Mozambique2018-SeekDeep 8025874266  37
Mozambique2018-SeekDeep 8025874271  98
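
To make the target-level filter concrete, here is a minimal sketch that sums reads over the haplotypes of each target before applying an inclusive minimum. The flattened record layout and the function body are assumptions for illustration and are not how PMOExtractor.count_targets_per_sample stores or walks the data.

```python
from collections import defaultdict

# Hypothetical flattened records:
# (experiment_sample_id, target_id, haplotype_id, read_count)
records = [
    ("s1", "t1", "h1", 2000),
    ("s1", "t1", "h2", 1500),
    ("s1", "t2", "h1", 500),
    ("s2", "t1", "h1", 3000),
    ("s2", "t2", "h1", 2999),
]


def count_targets_per_sample(records, read_count_minimum=0.0):
    """Count targets per sample whose summed read count meets the minimum."""
    # sum reads over all haplotypes of each (sample, target) pair
    reads = defaultdict(float)
    for sample, target, _haplotype, read_count in records:
        reads[(sample, target)] += read_count
    # count the targets whose total passes the inclusive threshold
    counts = defaultdict(int)
    for (sample, _target), total in reads.items():
        if total >= read_count_minimum:
            counts[sample] += 1
    return dict(counts)


unfiltered = count_targets_per_sample(records)
filtered = count_targets_per_sample(records, read_count_minimum=3000)
print(unfiltered)
print(filtered)
```

In this toy data, target t1 in sample s1 passes a minimum of 3000 because its two haplotypes sum to 3500 reads, even though neither haplotype reaches 3000 on its own. A sample with no passing targets would simply be absent from the returned dictionary in this sketch.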

count_samples_per_target

Count the number of experiment_sample_ids each target has. A read filter can be applied to see how many samples a target would have if such a filter were applied.

pmotools-runner.py count_samples_per_target -h

usage: pmotools-runner.py count_samples_per_target [-h] --file FILE
                                                   [--output OUTPUT]
                                                   [--delim DELIM]
                                                   [--overwrite]
                                                   [--read_count_minimum READ_COUNT_MINIMUM]

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       output file
  --delim DELIM         the delimiter of the output text file, examples input
                        tab,comma but can also be the actual delimiter
  --overwrite           If output file exists, overwrite it
  --read_count_minimum READ_COUNT_MINIMUM
                        the minimum read count (inclusive) to be counted as
                        covered by sample

The Python code for the count_samples_per_target script is below.

pmotools-python/scripts/extract_info_from_pmo/count_samples_per_target.py

#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict

import pandas as pd

from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_count_samples_per_target():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, default="STDOUT", required=False, help='output file')
    parser.add_argument('--delim', default="tab", type=str, required=False, help='the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter')
    parser.add_argument('--overwrite', action='store_true', help='If output file exists, overwrite it')
    parser.add_argument('--read_count_minimum', default=0.0, type=float, required=False, help='the minimum read count (inclusive) to be counted as covered by sample')

    return parser.parse_args()


def count_samples_per_target():
    args = parse_args_count_samples_per_target()

    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(args.delim, gzip=args.output.endswith(".gz"))
    args.output = args.output if "STDOUT" == args.output else Utils.appendStrAsNeeded(args.output, output_extension)
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # count
    counts_df = PMOExtractor.count_samples_per_target(pmo, args.read_count_minimum)

    #write out
    counts_df.to_csv(sys.stdout if "STDOUT" == args.output else args.output, sep = output_delim, index=False)


if __name__ == "__main__":
    count_samples_per_target()

cd example 

pmotools-runner.py count_samples_per_target --file ../../format/moz2018_PMO.json.gz  | head

tar_amp_bioinformatics_info_id  target_id   sample_number
Mozambique2018-SeekDeep t1  119
Mozambique2018-SeekDeep t10 117
Mozambique2018-SeekDeep t100    124
Mozambique2018-SeekDeep t11 120
Mozambique2018-SeekDeep t12 119
Mozambique2018-SeekDeep t13 124
Mozambique2018-SeekDeep t14 118
Mozambique2018-SeekDeep t15 119
Mozambique2018-SeekDeep t16 121

Apply a read count minimum filter (this is the total read count summed across a target, not a per-haplotype count):

cd example 

pmotools-runner.py count_samples_per_target --read_count_minimum 3000 --file ../../format/moz2018_PMO.json.gz  | head

tar_amp_bioinformatics_info_id  target_id   sample_number
Mozambique2018-SeekDeep t1  108
Mozambique2018-SeekDeep t10 107
Mozambique2018-SeekDeep t100    107
Mozambique2018-SeekDeep t11 111
Mozambique2018-SeekDeep t12 104
Mozambique2018-SeekDeep t13 105
Mozambique2018-SeekDeep t14 110
Mozambique2018-SeekDeep t15 110
Mozambique2018-SeekDeep t16 106
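
The per-target view is the transpose of the per-sample count. The sketch below starts from per-target read totals and counts how many samples cover each target at a given inclusive minimum. The input layout and the function body are hypothetical and only illustrate the idea; they are not the PMOExtractor.count_samples_per_target implementation.

```python
from collections import defaultdict

# Hypothetical per-target read totals:
# experiment_sample_id -> {target_id: total reads summed over haplotypes}
target_reads = {
    "s1": {"t1": 3500, "t2": 500},
    "s2": {"t1": 3000, "t2": 2999},
    "s3": {"t2": 4000},
}


def count_samples_per_target(target_reads, read_count_minimum=0.0):
    """Count the samples whose total reads for a target meet the minimum."""
    samples = defaultdict(set)
    for sample, per_target in target_reads.items():
        for target, total in per_target.items():
            if total >= read_count_minimum:
                samples[target].add(sample)
    # sort target ids as strings for a stable, readable order
    return {target: len(ids) for target, ids in sorted(samples.items())}


all_samples = count_samples_per_target(target_reads)
covered_samples = count_samples_per_target(target_reads, read_count_minimum=3000)
print(all_samples)
print(covered_samples)
```

Here t2 drops from three samples to one once the minimum of 3000 reads is applied, in the same way the sample_number column shrinks between the two outputs above.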