pmotools-python
pmotools-python can report simple counts from a PMO file: the number of samples per target, the number of targets per sample, and counts of specimen meta fields.
Most of these basic info extractors can be found under extract_basic_info_from_pmo.
pmotools-python v0.1.0 - A suite of tools for interacting with Portable Microhaplotype Object (PMO) file format

Available functions organized by groups are
convertors_to_json
    text_meta_to_json_meta - Convert text file meta to JSON Meta
    excel_meta_to_json_meta - Convert Excel file meta to JSON Meta
    microhaplotype_table_to_json_file - Convert microhaplotype table to a JSON file
    terra_amp_output_to_json - Convert Terra output to JSON sequence table
extractors_from_pmo
    extract_pmo_with_selected_meta - Extract samples + haplotypes using selected meta
    extract_pmo_with_select_specimen_names - Extract specific samples from the specimens table
    extract_pmo_with_select_library_sample_names - Extract experiment sample names from experiment_info table
    extract_pmo_with_select_targets - Extract specific targets
    extract_pmo_with_read_filter - Extract with a read filter
    extract_allele_table - Extract allele tables for tools like dcifer or moire
    extract_insert_of_panels - Extract inserts of panels from a PMO
    extract_refseq_of_inserts_of_panels - Extract ref_seq of panel inserts from a PMO
working_with_multiple_pmos
    combine_pmos - Combine multiple PMOs of the same panel
extract_basic_info_from_pmo
    list_library_sample_names_per_specimen_name - List experiment_sample_ids per specimen_id
    list_specimen_meta_fields - List specimen meta fields in the specimen_info section
    list_bioinformatics_run_names - List all tar_amp_bioinformatics_info_ids in a PMO
    count_specimen_meta - Count values of selected specimen meta fields
    count_targets_per_library_sample - Count number of targets per sample
    count_library_samples_per_target - Count number of samples per target
validation
    validate_pmo - Validate a PMO file against a JSON Schema
Getting files for examples
This will list all the meta fields within the specimen_infos section of a PMO file. Since not every meta field is present in every specimen, the output gives, for each field, the number of specimens it appears in alongside the total number of specimens.
usage: pmotools-python list_specimen_meta_fields [-h] --file FILE
[--output OUTPUT]
[--delim DELIM] [--overwrite]
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--delim DELIM the delimiter of the output text file, examples input
tab,comma but can also be the actual delimiter
--overwrite If output file exists, overwrite it
The Python code for the list_specimen_meta_fields script is below
pmotools-python/src/pmotools/scripts/extract_info_from_pmo/list_specimen_meta_fields.py
#!/usr/bin/env python3
import argparse
import sys

from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_list_specimen_meta_fields():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, default="STDOUT", required=False, help="output file"
    )
    parser.add_argument(
        "--delim",
        default="tab",
        type=str,
        required=False,
        help="the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter",
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    return parser.parse_args()


def list_specimen_meta_fields():
    args = parse_args_list_specimen_meta_fields()

    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(
        args.delim, gzip=args.output.endswith(".gz")
    )
    args.output = (
        args.output
        if "STDOUT" == args.output
        else Utils.appendStrAsNeeded(args.output, output_extension)
    )
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # count fields
    counts_df = PMOProcessor.count_specimen_per_meta_fields(pmo)

    # output
    counts_df.to_csv(
        sys.stdout if "STDOUT" == args.output else args.output,
        sep=output_delim,
        index=False,
    )


if __name__ == "__main__":
    list_specimen_meta_fields()
field present_in_specimens_count total_specimen_count
collection_country 124 124
collection_date 124 124
geo_admin3 124 124
host_taxon_id 124 124
lat_lon 124 124
parasite_density_info 123 124
project_id 124 124
specimen_collect_device 124 124
specimen_name 124 124
specimen_store_loc 124 124
specimen_taxon_id 124 124
storage_plate_info 81 124
This will list all the meta values (and their combinations) for the selected meta fields within the specimen_infos section of a PMO file.
usage: pmotools-python count_specimen_meta [-h] --file FILE [--output OUTPUT]
[--delim DELIM] [--overwrite]
--meta_fields META_FIELDS
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--delim DELIM the delimiter of the output text file, examples input
tab,comma but can also be the actual delimiter
--overwrite If output file exists, overwrite it
--meta_fields META_FIELDS
the fields to count the subfields of, can supply
multiple separated by commas, e.g. --meta_fields
collection_country,collection_date
The Python code for the count_specimen_meta script is below
pmotools-python/src/pmotools/scripts/extract_info_from_pmo/count_specimen_meta.py
#!/usr/bin/env python3
import argparse
import sys

from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_count_specimen_meta():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, default="STDOUT", required=False, help="output file"
    )
    parser.add_argument(
        "--delim",
        default="tab",
        type=str,
        required=False,
        help="the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter",
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    parser.add_argument(
        "--meta_fields",
        type=str,
        required=True,
        help="the fields to count the subfields of, can supply multiple separated by commas, e.g. --meta_fields collection_country,collection_date",
    )
    return parser.parse_args()


def count_specimen_meta():
    args = parse_args_count_specimen_meta()

    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(
        args.delim, gzip=args.output.endswith(".gz")
    )
    args.output = (
        args.output
        if "STDOUT" == args.output
        else Utils.appendStrAsNeeded(args.output, output_extension)
    )
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # process the meta_fields argument
    meta_fields_toks = args.meta_fields.split(",")

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # count sub-fields
    counts_df = PMOProcessor.count_specimen_by_field_value(pmo, meta_fields_toks)

    # write out
    counts_df.to_csv(
        sys.stdout if "STDOUT" == args.output else args.output,
        sep=output_delim,
        index=False,
    )


if __name__ == "__main__":
    count_specimen_meta()
collection_country specimens_count specimens_freq total_specimen_count
Mozambique 81 0.6532258064516129 124
NA 43 0.3467741935483871 124
collection_country geo_admin3 specimens_count specimens_freq total_specimen_count
Mozambique Inhassoro 27 0.21774193548387097 124
Mozambique Mandlakazi 28 0.22580645161290322 124
Mozambique Namaacha 26 0.20967741935483872 124
NA NA 43 0.3467741935483871 124
collection_country collection_date specimens_count specimens_freq total_specimen_count
Bangladesh 2008 15 0.0007718828796377296 19433
Bangladesh 2009 16 0.000823341738280245 19433
Bangladesh 2012 8 0.0004116708691401225 19433
Bangladesh 2012-04-19 1 5.145885864251531e-05 19433
Bangladesh 2012-06-05 2 0.00010291771728503062 19433
Bangladesh 2012-06-13 1 5.145885864251531e-05 19433
Bangladesh 2012-06-17 1 5.145885864251531e-05 19433
Bangladesh 2012-07-17 1 5.145885864251531e-05 19433
Bangladesh 2012-07-23 1 5.145885864251531e-05 19433
A specimen can have multiple library samples, so it can be helpful to list all the library sample names for each specimen
usage: pmotools-python list_library_sample_names_per_specimen_name
[-h] --file FILE [--output OUTPUT] [--delim DELIM] [--overwrite]
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--delim DELIM the delimiter of the output text file, examples input
tab,comma but can also be the actual delimiter
--overwrite If output file exists, overwrite it
The Python code for the list_library_sample_names_per_specimen_name script is below
pmotools-python/src/pmotools/scripts/extract_info_from_pmo/list_library_sample_names_per_specimen_name.py
#!/usr/bin/env python3
import argparse
import sys

from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_list_library_sample_names_per_specimen_name():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, default="STDOUT", required=False, help="output file"
    )
    parser.add_argument(
        "--delim",
        default="tab",
        type=str,
        required=False,
        help="the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter",
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    return parser.parse_args()


def list_library_sample_names_per_specimen_name():
    args = parse_args_list_library_sample_names_per_specimen_name()

    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(
        args.delim, gzip=args.output.endswith(".gz")
    )
    args.output = (
        args.output
        if "STDOUT" == args.output
        else Utils.appendStrAsNeeded(args.output, output_extension)
    )
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # count fields
    info_df = PMOProcessor.list_library_sample_names_per_specimen_name(pmo)

    # output
    info_df.to_csv(
        sys.stdout if "STDOUT" == args.output else args.output,
        sep=output_delim,
        index=False,
    )


if __name__ == "__main__":
    list_library_sample_names_per_specimen_name()
specimen_name library_sample_name library_sample_count
8025874217 8025874217 1
8025874237 8025874237 1
8025874250 8025874250 1
8025874300 8025874300 1
8025874321 8025874321 1
8025874349 8025874349 1
8025874366 8025874366 1
8025874377 8025874377 1
8025874411 8025874411 1
8025874421 8025874421 1
8025874463 8025874463 1
8025874482 8025874482 1
8025874507 8025874507 1
8025874578 8025874578 1
8025874627 8025874627 1
8025874637 8025874637 1
8025874665 8025874665 1
8025874669 8025874669 1
8025874714 8025874714 1
8025874729 8025874729 1
8025874809 8025874809 1
8025874865 8025874865 1
8025874899 8025874899 1
8025874940 8025874940 1
8034209589 8034209589 1
8034209790 8034209790 1
8034209228 8034209228 1
8025874253 8025874253 1
8025874261 8025874261 1
8025874382 8025874382 1
8025874457 8025874457 1
8025874502 8025874502 1
8025874526 8025874526 1
8025874537 8025874537 1
8025874586 8025874586 1
8025874589 8025874589 1
8025874636 8025874636 1
8025874829 8025874829 1
8025874849 8025874849 1
8025874875 8025874875 1
8025874975 8025874975 1
8025874988 8025874988 1
8025875052 8025875052 1
8025875059 8025875059 1
8025875140 8025875140 1
8025875144 8025875144 1
8025875145 8025875145 1
8025875146 8025875146 1
8025875166 8025875166 1
8034209115 8034209115 1
8034209281 8034209281 1
8034209465 8034209465 1
8034209834 8034209834 1
8025874494 8025874494 1
8025874536 8025874536 1
8025874266 8025874266 1
8025874271 8025874271 1
8025874316 8025874316 1
8025874330 8025874330 1
8025874340 8025874340 1
8025874357 8025874357 1
8025874376 8025874376 1
8025874380 8025874380 1
8025874387 8025874387 1
8025874419 8025874419 1
8025874435 8025874435 1
8025874447 8025874447 1
8025874672 8025874672 1
8025874706 8025874706 1
8025874738 8025874738 1
8025874928 8025874928 1
8025874933 8025874933 1
8025874973 8025874973 1
8025875029 8025875029 1
8025875042 8025875042 1
8025875121 8025875121 1
8025875170 8025875170 1
8034209773 8034209773 1
8034209803 8034209803 1
8034209818 8034209818 1
8025874297 8025874297 1
8025874231 8025874231 1
8025874234 8025874234 1
8025874286 8025874286 1
8025874343 8025874343 1
8025874348 8025874348 1
8025874352 8025874352 1
8025874396 8025874396 1
8025874405 8025874405 1
8025874437 8025874437 1
8025874452 8025874452 1
8025874454 8025874454 1
8025874484 8025874484 1
8025874491 8025874491 1
8025874499 8025874499 1
8025874568 8025874568 1
8025874585 8025874585 1
8025874591 8025874591 1
8025874613 8025874613 1
8025874662 8025874662 1
8025874675 8025874675 1
8025874676 8025874676 1
8025874701 8025874701 1
8025874705 8025874705 1
8025874720 8025874720 1
8025874872 8025874872 1
8025874877 8025874877 1
8025874878 8025874878 1
8025874879 8025874879 1
8025874886 8025874886 1
8025874888 8025874888 1
8025874903 8025874903 1
8025874931 8025874931 1
8025874948 8025874948 1
8025874956 8025874956 1
8025875065 8025875065 1
8025875112 8025875112 1
8025875165 8025875165 1
SS1-10K-C1 SS1-10K-C1 1
SS2-10K-C1 SS2-10K-C1 1
SS3-10K-C1 SS3-10K-C1 1
SS4-10K-C1 SS4-10K-C1 1
SS5-10K-C1 SS5-10K-C1 1
8025875168 8025875168 1
This will simply list all the analyses (all the bioinformatics_run_names) stored within a PMO
usage: pmotools-python list_bioinformatics_run_names [-h] --file FILE
[--output OUTPUT]
[--overwrite]
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--overwrite If output file exists, overwrite it
The Python code for the list_bioinformatics_run_names script is below
pmotools-python/src/pmotools/scripts/extract_info_from_pmo/list_bioinformatics_run_names.py
#!/usr/bin/env python3
import argparse
import sys

from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_list_bioinformatics_run_names():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, default="STDOUT", required=False, help="output file"
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    return parser.parse_args()


def list_bioinformatics_run_names():
    args = parse_args_list_bioinformatics_run_names()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # extract all bio run names
    bio_run_names = PMOProcessor.get_bioinformatics_run_names(pmo)

    # write
    output_target = sys.stdout if args.output == "STDOUT" else open(args.output, "w")
    with output_target as f:
        f.write("\n".join(bio_run_names) + "\n")


if __name__ == "__main__":
    list_bioinformatics_run_names()
Mozambique2018-SeekDeep
PathWeaver-Heome1
This can be helpful after combining PMOs
Mozambique2018-SeekDeep
PathWeaver-Heome1
Count the number of targets each library_sample_id has. A read filter can be applied to see how many targets would be kept under that filter
usage: pmotools-python count_targets_per_library_sample [-h] --file FILE
[--output OUTPUT]
[--delim DELIM]
[--overwrite]
[--read_count_minimum READ_COUNT_MINIMUM]
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--delim DELIM the delimiter of the output text file, examples input
tab,comma but can also be the actual delimiter
--overwrite If output file exists, overwrite it
--read_count_minimum READ_COUNT_MINIMUM
the minimum read count (inclusive) to be counted as
covered by sample
The Python code for the count_targets_per_library_sample script is below
pmotools-python/src/pmotools/scripts/extract_info_from_pmo/count_targets_per_library_sample.py
#!/usr/bin/env python3
import argparse
import sys

from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_count_targets_per_library_sample():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, default="STDOUT", required=False, help="output file"
    )
    parser.add_argument(
        "--delim",
        default="tab",
        type=str,
        required=False,
        help="the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter",
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    parser.add_argument(
        "--read_count_minimum",
        default=0.0,
        type=float,
        required=False,
        help="the minimum read count (inclusive) to be counted as covered by sample",
    )
    return parser.parse_args()


def count_targets_per_library_sample():
    args = parse_args_count_targets_per_library_sample()

    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(
        args.delim, gzip=args.output.endswith(".gz")
    )
    args.output = (
        args.output
        if "STDOUT" == args.output
        else Utils.appendStrAsNeeded(args.output, output_extension)
    )
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # count
    counts_df = PMOProcessor.count_targets_per_library_sample(
        pmo, args.read_count_minimum
    )

    # write out
    counts_df.to_csv(
        sys.stdout if "STDOUT" == args.output else args.output,
        sep=output_delim,
        index=False,
    )


if __name__ == "__main__":
    count_targets_per_library_sample()
bioinformatics_run_id library_sample_name target_number
0 8025875168 47
0 8025874536 10
0 8025874494 23
0 8025874297 8
0 SS4-10K-C1 99
0 SS3-10K-C1 98
0 SS2-10K-C1 98
0 8034209818 99
0 8034209790 98
Apply a read count minimum filter (this is the total read count summed for a target, not a per-haplotype count)
bioinformatics_run_id library_sample_name target_number
0 8025875168 0
0 8025874536 0
0 8025874494 0
0 8025874297 0
0 SS4-10K-C1 97
0 SS3-10K-C1 96
0 SS2-10K-C1 97
0 8034209818 97
0 8034209790 97
Count the number of library_sample_ids each target has. A read filter can be applied to see how many samples a target would have under that filter
usage: pmotools-python count_library_samples_per_target [-h] --file FILE
[--output OUTPUT]
[--delim DELIM]
[--overwrite]
[--read_count_minimum READ_COUNT_MINIMUM]
options:
-h, --help show this help message and exit
--file FILE PMO file
--output OUTPUT output file
--delim DELIM the delimiter of the output text file, examples input
tab,comma but can also be the actual delimiter
--overwrite If output file exists, overwrite it
--read_count_minimum READ_COUNT_MINIMUM
the minimum read count (inclusive) to be counted as
covered by sample
The Python code for the count_library_samples_per_target script is below
pmotools-python/src/pmotools/scripts/extract_info_from_pmo/count_library_samples_per_target.py
#!/usr/bin/env python3
import argparse
import sys

from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.utils.small_utils import Utils


def parse_args_count_library_samples_per_target():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, default="STDOUT", required=False, help="output file"
    )
    parser.add_argument(
        "--delim",
        default="tab",
        type=str,
        required=False,
        help="the delimiter of the output text file, examples input tab,comma but can also be the actual delimiter",
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    parser.add_argument(
        "--read_count_minimum",
        default=0.0,
        type=float,
        required=False,
        help="the minimum read count (inclusive) to be counted as covered by sample",
    )
    return parser.parse_args()


def count_library_samples_per_target():
    args = parse_args_count_library_samples_per_target()

    # check files
    output_delim, output_extension = Utils.process_delimiter_and_output_extension(
        args.delim, gzip=args.output.endswith(".gz")
    )
    args.output = (
        args.output
        if "STDOUT" == args.output
        else Utils.appendStrAsNeeded(args.output, output_extension)
    )
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in PMO
    pmo = PMOReader.read_in_pmo(args.file)

    # count
    counts_df = PMOProcessor.count_library_samples_per_target(
        pmo, args.read_count_minimum
    )

    # write out
    counts_df.to_csv(
        sys.stdout if "STDOUT" == args.output else args.output,
        sep=output_delim,
        index=False,
    )


if __name__ == "__main__":
    count_library_samples_per_target()
bioinformatics_run_id target_name sample_count
0 t1 119
0 t10 117
0 t100 124
0 t11 120
0 t12 119
0 t13 124
0 t14 118
0 t15 119
0 t16 121
Apply a read count minimum filter (this is the total read count summed for a target, not a per-haplotype count)
---
title: Getting basic info out of PMO using pmotools-python
---
```{r setup, echo=F}
source("../common.R")
```
# Extract basic info counts from PMO
pmotools-python can report simple counts from a PMO file: the number of samples per target, the number of targets per sample, and counts of specimen meta fields.
Most of these basic info extractors can be found under `extract_basic_info_from_pmo`.
```{bash, eval = F}
pmotools-python
```
```{bash, echo = F}
pmotools-python | perl -pe 's/\e\[[0-9;]*m(?:\e\[K)?//g'
```
Getting files for examples
```{bash, eval = F}
cd example
wget https://plasmogenepi.github.io/PMO_Docs/format/moz2018_PMO.json.gz
wget https://plasmogenepi.github.io/PMO_Docs/format/PathWeaverHeome1_PMO.json.gz
```
## list_specimen_meta_fields
This will list all the meta fields within the `specimen_infos` section of a PMO file. Since not every meta field is present in every specimen, the output gives, for each field, the number of specimens it appears in alongside the total number of specimens.
```{bash}
pmotools-python list_specimen_meta_fields -h
```
The Python code for the `list_specimen_meta_fields` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extract_info_from_pmo/list_specimen_meta_fields.py
#| file: ../pmotools-python/src/pmotools/scripts/extract_info_from_pmo/list_specimen_meta_fields.py
```
```{bash}
cd example
pmotools-python list_specimen_meta_fields --file ../../format/moz2018_PMO.json.gz
```
```{bash}
cd example
pmotools-python list_specimen_meta_fields --file ../../format/moz2018_PMO.json.gz --output spec_fields_moz2018_PMO.tsv --overwrite
```
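
As a rough illustration of what this table reports, the sketch below counts, for each meta field, how many specimens carry it. It assumes the specimen records are available as a plain list of dictionaries; that record layout and the helper name `count_field_presence` are hypothetical and are not the pmotools-python API.

```python
from collections import Counter


def count_field_presence(specimens):
    """Count how many specimen records contain each meta field.

    `specimens` is assumed to be a list of dicts, one per specimen
    (a hypothetical layout used only for this illustration).
    """
    present = Counter()
    for specimen in specimens:
        # dict keys are unique, so each field counts at most once per specimen
        present.update(specimen.keys())
    total = len(specimens)
    return [
        {
            "field": field,
            "present_in_specimens_count": count,
            "total_specimen_count": total,
        }
        for field, count in sorted(present.items())
    ]


# toy records, not taken from the example PMO files
toy_specimens = [
    {"specimen_name": "s1", "collection_country": "Mozambique"},
    {"specimen_name": "s2", "collection_country": "Mozambique", "storage_plate_info": "p1"},
    {"specimen_name": "s3"},
]

for row in count_field_presence(toy_specimens):
    print(
        row["field"],
        row["present_in_specimens_count"],
        row["total_specimen_count"],
        sep="\t",
    )
```

A field whose count is lower than the total, like `storage_plate_info` in the toy data, is one that some specimens lack.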
## count_specimen_meta
This will list all the meta values (and their combinations) for the selected meta fields within the `specimen_infos` section of a PMO file.
```{bash}
pmotools-python count_specimen_meta -h
```
The Python code for the `count_specimen_meta` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extract_info_from_pmo/count_specimen_meta.py
#| file: ../pmotools-python/src/pmotools/scripts/extract_info_from_pmo/count_specimen_meta.py
```
```{bash}
cd example
pmotools-python count_specimen_meta --file ../../format/moz2018_PMO.json.gz --meta_fields collection_country
```
```{bash}
cd example
pmotools-python count_specimen_meta --file ../../format/moz2018_PMO.json.gz --meta_fields collection_country --overwrite --output collection_country_count_moz2018_PMO.tsv.gz
```
```{bash}
cd example
pmotools-python count_specimen_meta --file ../../format/moz2018_PMO.json.gz --meta_fields collection_country,geo_admin3
```
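
Counting combinations of values across several meta fields can be sketched as below. This is a minimal illustration, assuming specimens are a list of dictionaries; the helper `count_by_field_values` and the choice to label a missing field as `NA` are assumptions made for this example, not a description of how pmotools-python implements the count.

```python
from collections import Counter


def count_by_field_values(specimens, meta_fields, missing="NA"):
    """Count specimens per combination of values of `meta_fields`.

    `specimens` is assumed to be a list of dicts (hypothetical layout).
    A field absent from a specimen is reported as `missing`.
    """
    combos = Counter(
        tuple(str(specimen.get(field, missing)) for field in meta_fields)
        for specimen in specimens
    )
    total = len(specimens)
    rows = []
    for values, count in sorted(combos.items()):
        row = dict(zip(meta_fields, values))
        row["specimens_count"] = count
        row["specimens_freq"] = count / total
        row["total_specimen_count"] = total
        rows.append(row)
    return rows


# toy records, not taken from the example PMO files
toy_specimens = [
    {"collection_country": "Mozambique", "geo_admin3": "Inhassoro"},
    {"collection_country": "Mozambique", "geo_admin3": "Inhassoro"},
    {"collection_country": "Mozambique", "geo_admin3": "Namaacha"},
    {},
]

for row in count_by_field_values(toy_specimens, ["collection_country", "geo_admin3"]):
    print(row)
```

Each row pairs one observed combination with its count and its share of all specimens, so the frequencies across rows sum to 1.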
```{bash}
cd example
pmotools-python count_specimen_meta --file ../../format/PathWeaverHeome1_PMO.json.gz --meta_fields collection_country,collection_date | head
```
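
Piping a long table into `head` closes the pipe as soon as `head` has read its ten lines, so the writing process may hit a `BrokenPipeError` while it is still printing. The sketch below shows one common way for a Python command-line script to treat that as a normal early exit. `write_lines` and the toy stream class are hypothetical helpers for illustration and are not part of pmotools-python.

```python
import io


def write_lines(lines, stream):
    """Write lines to `stream`; return False if the reader closed the pipe early."""
    try:
        for line in lines:
            stream.write(line + "\n")
        stream.flush()
    except BrokenPipeError:
        # the downstream command (e.g. head) stopped reading; not an error for us
        return False
    return True


class ClosedAfter(io.StringIO):
    """Toy stream that simulates a reader closing the pipe after `limit` writes."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit
        self.writes = 0

    def write(self, text):
        if self.writes >= self.limit:
            raise BrokenPipeError(32, "Broken pipe")
        self.writes += 1
        return super().write(text)


print(write_lines(["a", "b", "c"], io.StringIO()))   # True
print(write_lines(["a", "b", "c"], ClosedAfter(2)))  # False
```

When the stream is the real `sys.stdout`, the Python `signal` documentation additionally suggests redirecting stdout to `os.devnull` after catching the error, so that the final flush at interpreter shutdown does not raise again.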
## list_library_sample_names_per_specimen_name
A specimen can have multiple library samples, so it can be helpful to list all the library sample names for each specimen
```{bash}
pmotools-python list_library_sample_names_per_specimen_name -h
```
The Python code for the `list_library_sample_names_per_specimen_name` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extract_info_from_pmo/list_library_sample_names_per_specimen_name.py
#| file: ../pmotools-python/src/pmotools/scripts/extract_info_from_pmo/list_library_sample_names_per_specimen_name.py
```
```{bash}
cd example
pmotools-python list_library_sample_names_per_specimen_name --file ../../format/moz2018_PMO.json.gz
```
```{bash}
cd example
pmotools-python list_library_sample_names_per_specimen_name --file ../../format/moz2018_PMO.json.gz --overwrite --output library_samples_per_specimen_moz2018_PMO.tsv.gz
```
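
The grouping behind this listing can be sketched as follows. The example assumes each library sample is a dictionary with `library_sample_name` and `specimen_name` keys, and it reads `library_sample_count` as the number of library samples belonging to the specimen; both the layout and that reading are assumptions made for illustration, and `library_samples_per_specimen` is a hypothetical helper.

```python
from collections import defaultdict


def library_samples_per_specimen(library_samples):
    """Group library sample names by the specimen they were made from.

    `library_samples` is assumed to be a list of dicts with the keys
    "library_sample_name" and "specimen_name" (hypothetical layout).
    """
    grouped = defaultdict(list)
    for sample in library_samples:
        grouped[sample["specimen_name"]].append(sample["library_sample_name"])
    rows = []
    for specimen_name, names in grouped.items():
        for name in names:
            rows.append(
                {
                    "specimen_name": specimen_name,
                    "library_sample_name": name,
                    # number of library samples sharing this specimen
                    "library_sample_count": len(names),
                }
            )
    return rows


# toy records, not taken from the example PMO files
toy_library_samples = [
    {"library_sample_name": "s1-rep1", "specimen_name": "s1"},
    {"library_sample_name": "s1-rep2", "specimen_name": "s1"},
    {"library_sample_name": "s2", "specimen_name": "s2"},
]

for row in library_samples_per_specimen(toy_library_samples):
    print(row["specimen_name"], row["library_sample_name"], row["library_sample_count"], sep="\t")
```

Specimens with a count above 1 are the ones that were sequenced in more than one library.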
## list_bioinformatics_run_names
This will simply list all the analyses (all the `bioinformatics_run_names`) stored within a PMO
```{bash}
pmotools-python list_bioinformatics_run_names -h
```
The Python code for the `list_bioinformatics_run_names` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extract_info_from_pmo/list_bioinformatics_run_names.py
#| file: ../pmotools-python/src/pmotools/scripts/extract_info_from_pmo/list_bioinformatics_run_names.py
```
```{bash}
cd example
pmotools-python list_bioinformatics_run_names --file ../../format/moz2018_PMO.json.gz
```
```{bash}
cd example
pmotools-python list_bioinformatics_run_names --file ../../format/PathWeaverHeome1_PMO.json.gz
```
This can be helpful after combining PMOs
```{bash, eval = F}
cd example
pmotools-python combine_pmos --pmo_files ../../format/moz2018_PMO.json.gz,../../format/PathWeaverHeome1_PMO.json.gz --output combined_Heome1_PMO.json.gz --overwrite
pmotools-python list_bioinformatics_run_names --file combined_Heome1_PMO.json.gz
```
```{bash, echo = F}
cd example
pmotools-python list_bioinformatics_run_names --file combined_Heome1_PMO.json.gz
```
## count_targets_per_library_sample
Count the number of targets each library_sample_id has. A read filter can be applied to see how many targets would be kept under that filter
```{bash}
pmotools-python count_targets_per_library_sample -h
```
The Python code for the `count_targets_per_library_sample` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extract_info_from_pmo/count_targets_per_library_sample.py
#| file: ../pmotools-python/src/pmotools/scripts/extract_info_from_pmo/count_targets_per_library_sample.py
```
```{bash}
cd example
pmotools-python count_targets_per_library_sample --file ../../format/moz2018_PMO.json.gz | head
```
Apply a read count minimum filter (this is the total read count summed for a target, not a per-haplotype count)
```{bash}
cd example
pmotools-python count_targets_per_library_sample --read_count_minimum 3000 --file ../../format/moz2018_PMO.json.gz | head
```
## count_library_samples_per_target
Count the number of library_sample_ids each target has. A read filter can be applied to see how many samples a target would have under that filter
```{bash}
pmotools-python count_library_samples_per_target -h
```
The Python code for the `count_library_samples_per_target` script is below
```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extract_info_from_pmo/count_library_samples_per_target.py
#| file: ../pmotools-python/src/pmotools/scripts/extract_info_from_pmo/count_library_samples_per_target.py
```
```{bash}
cd example
pmotools-python count_library_samples_per_target --file ../../format/moz2018_PMO.json.gz | head
```
Apply a read count minimum filter (this is the total read count summed for a target, not a per-haplotype count)
```{bash}
cd example
pmotools-python count_library_samples_per_target --read_count_minimum 3000 --file ../../format/moz2018_PMO.json.gz | head
```
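
To make the read filter used by both counting tools concrete, the sketch below sums haplotype reads per sample and target first and only then applies the inclusive minimum. It assumes the haplotype calls are available as flat rows with `library_sample_name`, `target_name` and `reads` keys; that layout and the three helper functions are hypothetical and are not the pmotools-python implementation.

```python
from collections import defaultdict


def total_reads_per_target(haplotype_rows):
    """Sum haplotype read counts for each (sample, target) pair."""
    totals = defaultdict(float)
    for row in haplotype_rows:
        totals[(row["library_sample_name"], row["target_name"])] += row["reads"]
    return totals


def count_targets_per_sample(haplotype_rows, read_count_minimum=0.0):
    """Count targets per sample whose summed reads reach the minimum (inclusive)."""
    totals = total_reads_per_target(haplotype_rows)
    # start every sample at 0 so samples that lose all targets are still listed
    counts = {sample: 0 for sample, _target in totals}
    for (sample, _target), reads in totals.items():
        if reads >= read_count_minimum:
            counts[sample] += 1
    return counts


def count_samples_per_target(haplotype_rows, read_count_minimum=0.0):
    """Count samples per target whose summed reads reach the minimum (inclusive)."""
    totals = total_reads_per_target(haplotype_rows)
    counts = {target: 0 for _sample, target in totals}
    for (_sample, target), reads in totals.items():
        if reads >= read_count_minimum:
            counts[target] += 1
    return counts


# toy rows, not taken from the example PMO files
toy_rows = [
    {"library_sample_name": "A", "target_name": "t1", "reads": 2000},
    {"library_sample_name": "A", "target_name": "t1", "reads": 1500},
    {"library_sample_name": "A", "target_name": "t2", "reads": 100},
    {"library_sample_name": "B", "target_name": "t1", "reads": 2999},
    {"library_sample_name": "B", "target_name": "t2", "reads": 3000},
    {"library_sample_name": "C", "target_name": "t1", "reads": 10},
]

print(count_targets_per_sample(toy_rows))
print(count_targets_per_sample(toy_rows, read_count_minimum=3000))
print(count_samples_per_target(toy_rows, read_count_minimum=3000))
```

In the toy data, sample A keeps target t1 at a minimum of 3000 even though neither of its two haplotypes reaches 3000 alone, because the filter is applied to their sum (3500). Sample C drops to zero targets but stays in the result.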