pmotools-runner.py
pmotools-python provides simple counts from a PMO file: the number of targets per sample, the number of samples per target, and the counts of meta fields.
Most of these basic info extractors can be found under extract_basic_info_from_pmo
pmotools v1.0.0 - A suite of tools for interacting with Portable Microhaplotype Object (pmo) file format
Available functions are
convertors_to_json
  text_meta_to_json_meta - Convert text file meta to JSON Meta
  excel_meta_to_json_meta - Convert excel file meta to JSON Meta
  microhaplotype_table_to_json_file - Convert microhaplotype table to JSON Meta
  terra_amp_output_to_json - Convert terra output table to JSON seq table
extractors_from_pmo
  extract_pmo_with_selected_meta - Extract from PMO samples and associated haplotypes with selected meta
  extract_pmo_with_select_specimen_ids - Extract from PMO specific samples from the specimens table
  extract_pmo_with_select_experiment_sample_ids - Extract from PMO specific experiment sample ids from the experiment_info table
  extract_pmo_with_select_targets - Extract from PMO specific targets
  extract_allele_table - Extract allele tables which can be as used as input to such tools as dcifer or moire
working_with_multiple_pmos
  combine_pmos - Combine multiple pmos of the same panel into a single pmo
extract_basic_info_from_pmo
  list_experiment_sample_ids_per_specimen_id - Each specimen_id can have multiple experiment_sample_ids, list out all in a PMO
  list_specimen_meta_fields - List out the specimen meta fields in the specimen_info section
  list_tar_amp_bioinformatics_info_ids - List out all the tar_amp_bioinformatics_info_ids in a PMO file
  count_specimen_meta - Count the values of specific specimen meta fields in the specimen_info section
  count_targets_per_sample - Count the number of targets per sample
  count_samples_per_target - Count the number of samples per target
Getting files for examples
This will list all the meta fields within the specimen_infos section of a PMO file. Since not every meta field is present in every specimen, the output reports the number of specimens each field appears in alongside the total number of specimens.
usage: pmotools-runner.py list_specimen_meta_fields [-h] --file FILE
[--output OUTPUT]
[--delim DELIM]
[--overwrite]
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--delim DELIM the delimiter of the output text file, examples input
tab,comma but can also be the actual delimiter
--overwrite If output file exists, overwrite it
The Python code for the list_specimen_meta_fields script is below
pmotools-python/scripts/extract_info_from_pmo/list_specimen_meta_fields.py
#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict
import pandas as pd
from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.utils.small_utils import Utils

def parse_args_list_specimen_meta_fields():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, default="STDOUT", required=False, help='output file')
    parser.add_argument('--delim', default="tab", type=str, required=False, help='the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter')
    parser.add_argument('--overwrite', action = 'store_true', help='If output file exists, overwrite it')
    return parser.parse_args()

def list_specimen_meta_fields():
    args = parse_args_list_specimen_meta_fields()
    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(args.delim, gzip=args.output.endswith(".gz"))
    args.output = args.output if "STDOUT" == args.output else Utils.appendStrAsNeeded(args.output, output_extension)
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)
    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)
    # count fields
    counts_df = PMOExtractor.count_specimen_meta_fields(pmo)
    # output
    counts_df.to_csv(sys.stdout if "STDOUT" == args.output else args.output, sep = output_delim, index=False)

if __name__ == "__main__":
    list_specimen_meta_fields()
field presentInSpecimensCount totalSpecimenCount
collection_country 124 124
collection_date 124 124
collector 124 124
geo_admin3 124 124
host_taxon_id 124 124
lat_lon 124 124
parasite_density 124 124
plate_col 81 124
plate_name 81 124
plate_row 81 124
project_name 124 124
samp_collect_device 124 124
samp_store_loc 124 124
samp_taxon_id 124 124
specimen_id 124 124
This will count the values (and the combinations of values) of the selected meta fields within the specimen_infos section of a PMO file.
usage: pmotools-runner.py count_specimen_meta [-h] --file FILE
[--output OUTPUT]
[--delim DELIM] [--overwrite]
--meta_fields META_FIELDS
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--delim DELIM the delimiter of the output text file, examples input
tab,comma but can also be the actual delimiter
--overwrite If output file exists, overwrite it
--meta_fields META_FIELDS
the fields to count the subfields of, can supply
multiple separated by commas, e.g. --meta_fields
collection_country,collection_date
The Python code for the count_specimen_meta script is below
pmotools-python/scripts/extract_info_from_pmo/count_specimen_meta.py
#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict
import pandas as pd
from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.utils.small_utils import Utils

def parse_args_count_specimen_meta():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, default="STDOUT", required=False, help='output file')
    parser.add_argument('--delim', default="tab", type=str, required=False, help='the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter')
    parser.add_argument('--overwrite', action='store_true', help='If output file exists, overwrite it')
    parser.add_argument('--meta_fields', type=str, required=True, help='the fields to count the subfields of, can supply multiple separated by commas, e.g. --meta_fields collection_country,collection_date')
    return parser.parse_args()

def count_specimen_meta():
    args = parse_args_count_specimen_meta()
    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(args.delim, gzip=args.output.endswith(".gz"))
    args.output = args.output if "STDOUT" == args.output else Utils.appendStrAsNeeded(args.output, output_extension)
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)
    # process the meta_fields argument
    meta_fields_toks = args.meta_fields.split(',')
    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)
    # count sub-fields
    counts_df = PMOExtractor.count_specimen_meta_subfields(pmo, meta_fields_toks)
    # write out
    counts_df.to_csv(sys.stdout if "STDOUT" == args.output else args.output, sep = output_delim, index=False)

if __name__ == "__main__":
    count_specimen_meta()
collection_country specimensCount specimensFreq totalSpecimenCount
Mozambique 81 0.6532258064516129 124
NA 43 0.3467741935483871 124
collection_country geo_admin3 specimensCount specimensFreq totalSpecimenCount
Mozambique Inhassoro 27 0.21774193548387097 124
Mozambique Mandlakazi 28 0.22580645161290322 124
Mozambique Namaacha 26 0.20967741935483872 124
NA NA 43 0.3467741935483871 124
collection_country collection_date specimensCount specimensFreq totalSpecimenCount
Bangladesh 2008 15 0.0007718828796377296 19433
Bangladesh 2009 16 0.000823341738280245 19433
Bangladesh 2012 51 0.002624401790768281 19433
Bangladesh 2015 508 0.026141100190397778 19433
Bangladesh 2016 816 0.041990428652292494 19433
Bangladesh 2017 12 0.0006175063037101837 19433
Benin 2014 41 0.002109813204343128 19433
Benin 2016 117 0.006020686461174291 19433
Brazil 1980 1 5.145885864251531e-05 19433
This will simply list out all the analyses (all the tar_amp_bioinformatics_info_ids) stored within a PMO
usage: pmotools-runner.py list_tar_amp_bioinformatics_info_ids
[-h] --file FILE [--output OUTPUT] [--overwrite]
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--overwrite If output file exists, overwrite it
The Python code for the list_tar_amp_bioinformatics_info_ids script is below
pmotools-python/scripts/extract_info_from_pmo/list_tar_amp_bioinformatics_info_ids.py
#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict
import pandas as pd
from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.utils.small_utils import Utils

def parse_args_list_tar_amp_bioinformatics_info_ids():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, default="STDOUT", required=False, help='output file')
    parser.add_argument('--overwrite', action = 'store_true', help='If output file exists, overwrite it')
    return parser.parse_args()

def list_tar_amp_bioinformatics_info_ids():
    args = parse_args_list_tar_amp_bioinformatics_info_ids()
    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)
    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)
    # extract all taramp_bioinformatics_ids
    bioids = pmo["taramp_bioinformatics_infos"].keys()
    # write
    if "STDOUT" == args.output:
        sys.stdout.write("\n".join(bioids) + "\n")
    else:
        with open(args.output, "w") as f:
            f.write("\n".join(bioids) + "\n")

if __name__ == "__main__":
    list_tar_amp_bioinformatics_info_ids()
Mozambique2018-SeekDeep
PathWeaverHeome1
This can be helpful after combining PMOs
Mozambique2018-SeekDeep
PathWeaverHeome1
Count the number of targets each experiment_sample_id has. A read filter can be applied to see how many targets would be kept if such a filter were applied
usage: pmotools-runner.py count_targets_per_sample [-h] --file FILE
[--output OUTPUT]
[--delim DELIM]
[--overwrite]
[--read_count_minimum READ_COUNT_MINIMUM]
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--delim DELIM the delimiter of the output text file, examples input
tab,comma but can also be the actual delimiter
--overwrite If output file exists, overwrite it
--read_count_minimum READ_COUNT_MINIMUM
the minimum read count (inclusive) to be counted as
covered by sample
The Python code for the count_targets_per_sample script is below
pmotools-python/scripts/extract_info_from_pmo/count_targets_per_sample.py
#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict
import pandas as pd
from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.utils.small_utils import Utils

def parse_args_count_targets_per_sample():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, default="STDOUT", required=False, help='output file')
    parser.add_argument('--delim', default="tab", type=str, required=False, help='the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter')
    parser.add_argument('--overwrite', action='store_true', help='If output file exists, overwrite it')
    parser.add_argument('--read_count_minimum', default=0.0, type=float, required=False, help='the minimum read count (inclusive) to be counted as covered by sample')
    return parser.parse_args()

def count_targets_per_sample():
    args = parse_args_count_targets_per_sample()
    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(args.delim, gzip=args.output.endswith(".gz"))
    args.output = args.output if "STDOUT" == args.output else Utils.appendStrAsNeeded(args.output, output_extension)
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)
    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)
    # count
    counts_df = PMOExtractor.count_targets_per_sample(pmo, args.read_count_minimum)
    # write out
    counts_df.to_csv(sys.stdout if "STDOUT" == args.output else args.output, sep = output_delim, index=False)

if __name__ == "__main__":
    count_targets_per_sample()
tar_amp_bioinformatics_info_id experiment_sample_id target_number
Mozambique2018-SeekDeep 8025874217 99
Mozambique2018-SeekDeep 8025874231 99
Mozambique2018-SeekDeep 8025874234 97
Mozambique2018-SeekDeep 8025874237 99
Mozambique2018-SeekDeep 8025874250 98
Mozambique2018-SeekDeep 8025874253 99
Mozambique2018-SeekDeep 8025874261 99
Mozambique2018-SeekDeep 8025874266 85
Mozambique2018-SeekDeep 8025874271 99
Apply a read count minimum filter (this is the total read count summed for a target, not a per-haplotype count)
tar_amp_bioinformatics_info_id experiment_sample_id target_number
Mozambique2018-SeekDeep 8025874217 99
Mozambique2018-SeekDeep 8025874231 73
Mozambique2018-SeekDeep 8025874234 93
Mozambique2018-SeekDeep 8025874237 98
Mozambique2018-SeekDeep 8025874250 68
Mozambique2018-SeekDeep 8025874253 99
Mozambique2018-SeekDeep 8025874261 98
Mozambique2018-SeekDeep 8025874266 37
Mozambique2018-SeekDeep 8025874271 98
Count the number of experiment_sample_ids each target has. A read filter can be applied to see how many samples a target would have if the filter were applied
usage: pmotools-runner.py count_samples_per_target [-h] --file FILE
[--output OUTPUT]
[--delim DELIM]
[--overwrite]
[--read_count_minimum READ_COUNT_MINIMUM]
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--delim DELIM the delimiter of the output text file, examples input
tab,comma but can also be the actual delimiter
--overwrite If output file exists, overwrite it
--read_count_minimum READ_COUNT_MINIMUM
the minimum read count (inclusive) to be counted as
covered by sample
The Python code for the count_samples_per_target script is below
pmotools-python/scripts/extract_info_from_pmo/count_samples_per_target.py
#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict
import pandas as pd
from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.utils.small_utils import Utils

def parse_args_count_samples_per_target():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, default="STDOUT", required=False, help='output file')
    parser.add_argument('--delim', default="tab", type=str, required=False, help='the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter')
    parser.add_argument('--overwrite', action='store_true', help='If output file exists, overwrite it')
    parser.add_argument('--read_count_minimum', default=0.0, type=float, required=False, help='the minimum read count (inclusive) to be counted as covered by sample')
    return parser.parse_args()

def count_samples_per_target():
    args = parse_args_count_samples_per_target()
    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(args.delim, gzip=args.output.endswith(".gz"))
    args.output = args.output if "STDOUT" == args.output else Utils.appendStrAsNeeded(args.output, output_extension)
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)
    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)
    # count
    counts_df = PMOExtractor.count_samples_per_target(pmo, args.read_count_minimum)
    # write out
    counts_df.to_csv(sys.stdout if "STDOUT" == args.output else args.output, sep = output_delim, index=False)

if __name__ == "__main__":
    count_samples_per_target()
tar_amp_bioinformatics_info_id target_id sample_number
Mozambique2018-SeekDeep t1 119
Mozambique2018-SeekDeep t10 117
Mozambique2018-SeekDeep t100 124
Mozambique2018-SeekDeep t11 120
Mozambique2018-SeekDeep t12 119
Mozambique2018-SeekDeep t13 124
Mozambique2018-SeekDeep t14 118
Mozambique2018-SeekDeep t15 119
Mozambique2018-SeekDeep t16 121
Apply a read count minimum filter (this is the total read count summed for a target, not a per-haplotype count)
tar_amp_bioinformatics_info_id target_id sample_number
Mozambique2018-SeekDeep t1 108
Mozambique2018-SeekDeep t10 107
Mozambique2018-SeekDeep t100 107
Mozambique2018-SeekDeep t11 111
Mozambique2018-SeekDeep t12 104
Mozambique2018-SeekDeep t13 105
Mozambique2018-SeekDeep t14 110
Mozambique2018-SeekDeep t15 110
Mozambique2018-SeekDeep t16 106
---
title: Getting basic info out of PMO using pmotools-python
---
```{r setup, echo=F}
source("../common.R")
```
# Extract basic info counts from PMO
pmotools-python provides simple counts from a PMO file: the number of targets per sample, the number of samples per target, and the counts of meta fields.
Most of these basic info extractors can be found under `extract_basic_info_from_pmo`
```{bash, eval = F}
pmotools-runner.py
```
```{bash, echo = F}
pmotools-runner.py | perl -pe 's/\e\[[0-9;]*m(?:\e\[K)?//g'
```
Getting files for examples
```{bash, eval = F}
cd example
wget https://plasmogenepi.github.io/PMO_Docs/format/moz2018_PMO.json.gz
wget https://plasmogenepi.github.io/PMO_Docs/format/PathWeaverHeome1_PMO.json.gz
```
## list_specimen_meta_fields
This will list all the meta fields within the `specimen_infos` section of a PMO file. Since not every meta field is present in every specimen, the output reports the number of specimens each field appears in alongside the total number of specimens.
```{bash}
pmotools-runner.py list_specimen_meta_fields -h
```
The python code for `list_specimen_meta_fields` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/scripts/extract_info_from_pmo/list_specimen_meta_fields.py
#| file: ../pmotools-python/scripts/extract_info_from_pmo/list_specimen_meta_fields.py
```
```{bash}
cd example
pmotools-runner.py list_specimen_meta_fields --file ../../format/moz2018_PMO.json.gz
```
```{bash}
cd example
pmotools-runner.py list_specimen_meta_fields --file ../../format/moz2018_PMO.json.gz --output spec_fields_moz2018_PMO.tsv --overwrite
```
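As a rough illustration of what the two count columns mean, the sketch below tallies field presence over a toy `specimen_infos` mapping. The layout (a dict keyed by `specimen_id`, each value a dict of meta fields), the toy records, and the helper name `count_meta_field_presence` are assumptions made for illustration; this is not the pmotools implementation. Only the column names are taken from the command output.

```python
from collections import Counter

def count_meta_field_presence(specimen_infos):
    """Return rows of (field, presentInSpecimensCount, totalSpecimenCount)."""
    total = len(specimen_infos)
    present = Counter()
    for meta in specimen_infos.values():
        # each field is counted once per specimen that carries it
        present.update(meta.keys())
    return [(field, present[field], total) for field in sorted(present)]

# toy data in the assumed layout: specimen_id -> dict of meta fields
toy_specimen_infos = {
    "s1": {"specimen_id": "s1", "collection_country": "Mozambique", "plate_name": "p1"},
    "s2": {"specimen_id": "s2", "collection_country": "Mozambique"},
    "s3": {"specimen_id": "s3", "collection_country": "Mozambique", "plate_name": "p2"},
}

rows = count_meta_field_presence(toy_specimen_infos)
for field, present_count, total_count in rows:
    print(field, present_count, total_count, sep="\t")
```

A field whose first count is lower than the total, like `plate_name` here, is one that some specimens lack.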
## count_specimen_meta
This will count the values (and the combinations of values) of the selected meta fields within the `specimen_infos` section of a PMO file.
```{bash}
pmotools-runner.py count_specimen_meta -h
```
The python code for `count_specimen_meta` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/scripts/extract_info_from_pmo/count_specimen_meta.py
#| file: ../pmotools-python/scripts/extract_info_from_pmo/count_specimen_meta.py
```
```{bash}
cd example
pmotools-runner.py count_specimen_meta --file ../../format/moz2018_PMO.json.gz --meta_fields collection_country
```
```{bash}
cd example
pmotools-runner.py count_specimen_meta --file ../../format/moz2018_PMO.json.gz --meta_fields collection_country --overwrite --output collection_country_count_moz2018_PMO.tsv.gz
```
```{bash}
cd example
pmotools-runner.py count_specimen_meta --file ../../format/moz2018_PMO.json.gz --meta_fields collection_country,geo_admin3
```
```{bash}
cd example
pmotools-runner.py count_specimen_meta --file ../../format/PathWeaverHeome1_PMO.json.gz --meta_fields collection_country,collection_date | head
```
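The combination counts can be pictured with a small standard-library sketch. It assumes the same toy layout for `specimen_infos` (a dict of per-specimen dicts) and assumes that a field absent from a specimen is reported as `NA`; the helper name `count_meta_value_combinations` is hypothetical, and the real logic lives in `PMOExtractor.count_specimen_meta_subfields`.

```python
from collections import Counter

def count_meta_value_combinations(specimen_infos, meta_fields, missing="NA"):
    """Count each combination of values for meta_fields across specimens."""
    total = len(specimen_infos)
    combos = Counter(
        tuple(str(meta.get(field, missing)) for field in meta_fields)
        for meta in specimen_infos.values()
    )
    # one row per combination:
    # values..., specimensCount, specimensFreq, totalSpecimenCount
    return [
        (*combo, count, count / total, total)
        for combo, count in sorted(combos.items())
    ]

# toy data in the assumed layout
toy_specimen_infos = {
    "s1": {"collection_country": "Mozambique", "geo_admin3": "Inhassoro"},
    "s2": {"collection_country": "Mozambique", "geo_admin3": "Namaacha"},
    "s3": {"collection_country": "Mozambique", "geo_admin3": "Inhassoro"},
    "s4": {},  # neither field recorded for this specimen
}

# the comma-separated argument is split the same way the script splits --meta_fields
rows = count_meta_value_combinations(
    toy_specimen_infos, "collection_country,geo_admin3".split(",")
)
for row in rows:
    print(*row, sep="\t")
```

The frequency column is simply the combination count divided by the total number of specimens, so the frequencies of all rows sum to 1.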
## list_tar_amp_bioinformatics_info_ids
This will simply list out all the analyses (all the `tar_amp_bioinformatics_info_id`s) stored within a PMO
```{bash}
pmotools-runner.py list_tar_amp_bioinformatics_info_ids -h
```
The python code for `list_tar_amp_bioinformatics_info_ids` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/scripts/extract_info_from_pmo/list_tar_amp_bioinformatics_info_ids.py
#| file: ../pmotools-python/scripts/extract_info_from_pmo/list_tar_amp_bioinformatics_info_ids.py
```
```{bash}
cd example
pmotools-runner.py list_tar_amp_bioinformatics_info_ids --file ../../format/moz2018_PMO.json.gz
```
```{bash}
cd example
pmotools-runner.py list_tar_amp_bioinformatics_info_ids --file ../../format/PathWeaverHeome1_PMO.json.gz
```
This can be helpful after combining PMOs
```{bash, eval = F}
cd example
pmotools-runner.py combine_pmos --pmo_files ../../format/moz2018_PMO.json.gz,../../format/PathWeaverHeome1_PMO.json.gz --output combined_Heome1_PMO.json.gz --overwrite
pmotools-runner.py list_tar_amp_bioinformatics_info_ids --file combined_Heome1_PMO.json.gz
```
```{bash, echo = F}
cd example
pmotools-runner.py list_tar_amp_bioinformatics_info_ids --file combined_Heome1_PMO.json.gz
```
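For readers who want to check the ids without pmotools, a minimal sketch using only the standard library is shown here. It relies on the `taramp_bioinformatics_infos` key that the script above reads; the helper name `list_bioinformatics_ids` and the toy file are assumptions, written so the sketch runs without the example downloads.

```python
import gzip
import json
import os
import tempfile

def list_bioinformatics_ids(pmo_path):
    """Return the analysis ids stored in a (possibly gzipped) PMO JSON file."""
    opener = gzip.open if pmo_path.endswith(".gz") else open
    with opener(pmo_path, "rt", encoding="utf-8") as handle:
        pmo = json.load(handle)
    return list(pmo["taramp_bioinformatics_infos"].keys())

# toy stand-in for a combined PMO; real files carry many more sections
toy_pmo = {
    "taramp_bioinformatics_infos": {
        "Mozambique2018-SeekDeep": {},
        "PathWeaverHeome1": {},
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_PMO.json.gz")
    with gzip.open(toy_path, "wt", encoding="utf-8") as handle:
        json.dump(toy_pmo, handle)
    ids = list_bioinformatics_ids(toy_path)

print("\n".join(ids))
```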
## count_targets_per_sample
Count the number of targets each `experiment_sample_id` has. A read filter can be applied to see how many targets would be kept if such a filter were applied
```{bash}
pmotools-runner.py count_targets_per_sample -h
```
The python code for `count_targets_per_sample` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/scripts/extract_info_from_pmo/count_targets_per_sample.py
#| file: ../pmotools-python/scripts/extract_info_from_pmo/count_targets_per_sample.py
```
```{bash}
cd example
pmotools-runner.py count_targets_per_sample --file ../../format/moz2018_PMO.json.gz | head
```
Apply a read count minimum filter (this is the total read count summed for a target, not a per-haplotype count)
```{bash}
cd example
pmotools-runner.py count_targets_per_sample --read_count_minimum 3000 --file ../../format/moz2018_PMO.json.gz | head
```
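The difference between a per-target and a per-haplotype threshold is easy to miss, so here is a small sketch of the per-target rule. The nested layout (sample, then target, then a list of haplotype read counts), the toy numbers, and the helper name `targets_passing_filter` are hypothetical; the sketch only mirrors the behaviour described above, with an inclusive minimum applied to the summed reads of each target.

```python
def targets_passing_filter(reads_by_sample, read_count_minimum=0.0):
    """Count, per sample, the targets whose summed reads reach the minimum."""
    counts = {}
    for sample_id, targets in reads_by_sample.items():
        counts[sample_id] = sum(
            1
            for haplotype_reads in targets.values()
            # the filter uses the target total, not each haplotype on its own
            if sum(haplotype_reads) >= read_count_minimum
        )
    return counts

# toy data: sample -> target -> read counts of each haplotype
toy_reads = {
    "sampleA": {"t1": [2500, 800], "t2": [3000], "t3": [1200, 900]},
    "sampleB": {"t1": [100], "t2": [4000, 50]},
}

unfiltered = targets_passing_filter(toy_reads)
filtered = targets_passing_filter(toy_reads, read_count_minimum=3000)
print(unfiltered)
print(filtered)
```

In the toy data, `t1` in `sampleA` passes a minimum of 3000 even though neither of its haplotypes reaches 3000 alone, because the two are summed first.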
## count_samples_per_target
Count the number of `experiment_sample_id`s each target has. A read filter can be applied to see how many samples a target would have if the filter were applied
```{bash}
pmotools-runner.py count_samples_per_target -h
```
The python code for `count_samples_per_target` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/scripts/extract_info_from_pmo/count_samples_per_target.py
#| file: ../pmotools-python/scripts/extract_info_from_pmo/count_samples_per_target.py
```
```{bash}
cd example
pmotools-runner.py count_samples_per_target --file ../../format/moz2018_PMO.json.gz | head
```
Apply a read count minimum filter (this is the total read count summed for a target, not a per-haplotype count)
```{bash}
cd example
pmotools-runner.py count_samples_per_target --read_count_minimum 3000 --file ../../format/moz2018_PMO.json.gz | head
```
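Viewed from the target side, the same rule can be sketched as follows. As before, the nested layout, the toy numbers, and the helper name `samples_per_target` are hypothetical and not part of pmotools. One simplification to note: a target that no sample covers at the chosen minimum is simply absent from this sketch's result, which may differ from how the real command reports it.

```python
from collections import Counter

def samples_per_target(reads_by_sample, read_count_minimum=0.0):
    """Count, per target, the samples whose summed reads reach the minimum."""
    counts = Counter()
    for targets in reads_by_sample.values():
        for target_id, haplotype_reads in targets.items():
            # inclusive threshold on the total reads for this target in this sample
            if sum(haplotype_reads) >= read_count_minimum:
                counts[target_id] += 1
    return dict(counts)

# toy data: sample -> target -> read counts of each haplotype
toy_reads = {
    "sampleA": {"t1": [2500, 800], "t2": [3000], "t3": [1200, 900]},
    "sampleB": {"t1": [100], "t2": [4000, 50]},
}

unfiltered = samples_per_target(toy_reads)
filtered = samples_per_target(toy_reads, read_count_minimum=3000)
print(unfiltered)
print(filtered)
```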