Portable Microhaplotype Object (PMO)

Subsetting from a PMO using pmotools-python

Subsetting

There may be instances where you want to subset a much larger PMO file into a smaller PMO file to focus on only a set of samples and/or targets. There are several ways of doing this.

Subsetting by specific targets

You can subset to only specific targets by using pmotools-python extract_pmo_with_select_targets

Code
pmotools-python extract_pmo_with_select_targets -h
usage: pmotools-python extract_pmo_with_select_targets [-h] --file FILE
                                                       --output OUTPUT
                                                       [--overwrite]
                                                       [--verbose] --targets
                                                       TARGETS

options:
  -h, --help         show this help message and exit
  --file FILE        PMO file
  --output OUTPUT    Output json file path
  --overwrite        If output file exists, overwrite it
  --verbose          write out various messages about extraction
  --targets TARGETS  Can either comma separated target_namess, or a plain text
                     file where each line is a target_namess

The python code for extract_pmo_with_select_targets script is below

Code
pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_select_targets.py
#!/usr/bin/env python3
import argparse


from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.pmo_engine.pmo_writer import PMOWriter
from pmotools.utils.small_utils import Utils


def parse_args_extract_pmo_with_select_targets():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, required=True, help="Output json file path"
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="write out various messages about extraction",
    )
    parser.add_argument(
        "--targets",
        type=str,
        required=True,
        help="Can either comma separated target_namess, or a plain text file where each line is a target_namess",
    )
    return parser.parse_args()


def extract_pmo_with_select_targets():
    args = parse_args_extract_pmo_with_select_targets()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # parse target names
    all_target_names = Utils.parse_delimited_input_or_file(args.targets)

    # read in pmo
    pmo = PMOReader.read_in_pmo(args.file)

    # extract
    pmo_out = PMOProcessor.filter_pmo_by_target_names(pmo, all_target_names)

    # write out the extracted
    args.output = PMOWriter.add_pmo_extension_as_needed(
        args.output, args.file.endswith(".gz") or args.output.endswith(".gz")
    )
    PMOWriter.write_out_pmo(pmo_out, args.output, args.overwrite)


if __name__ == "__main__":
    extract_pmo_with_select_targets()

You can extract by supplying the desired targets as comma separated values on the command line

Code
cd example 

pmotools-python extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets t1,t20,t31  --output t1_t20_t31_moz2018_PMO.json.gz --overwrite

You can also provide a single-column file where each line is a desired target, or pass the same list on standard input by giving STDIN as the --targets value

Code
cd example 

echo -e "t1\nt20\nt31" > select_targets.txt 
pmotools-python extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets select_targets.txt  --output t1_t20_t31_moz2018_PMO.json.gz --overwrite

echo -e "t1\nt20\nt31" | pmotools-python extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets STDIN --output t1_t20_t31_moz2018_PMO.json.gz --overwrite
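
The three input styles accepted by --targets (comma separated values, a one-name-per-line file, or STDIN) can be pictured with the minimal Python sketch below. It is an illustration only and not the actual `Utils.parse_delimited_input_or_file` implementation; the function name `parse_names` and its exact rules (for example, skipping blank lines) are assumptions.

```python
import io
import os
import sys


def parse_names(value, delimiter=",", stdin=None):
    """Return a list of names from a delimited string, a file path, or STDIN.

    Hypothetical helper for illustration; the real pmotools-python helper may differ.
    """
    stdin = sys.stdin if stdin is None else stdin
    if value == "STDIN":
        # one name per line, read from standard input
        lines = stdin.read().splitlines()
    elif os.path.isfile(value):
        # one name per line, read from a plain text file
        with open(value) as handle:
            lines = handle.read().splitlines()
    else:
        # otherwise treat the value itself as a delimited list
        lines = value.split(delimiter)
    # drop surrounding whitespace and empty entries
    return [line.strip() for line in lines if line.strip()]


print(parse_names("t1,t20,t31"))
print(parse_names("STDIN", stdin=io.StringIO("t1\nt20\nt31\n")))
```

Both calls yield the same list of three target names, which is why the three command forms above produce the same output file.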

Subsetting by specific specimen_names

You can subset to only selected specimen_name values. Each specimen can have several experiments associated with it; when you supply a specimen_name, all of its associated experiments are pulled as well.

As above, you can supply the specimen_names as comma separated values, in a plain text file where each line is a specimen_name, or from standard input (STDIN)

Code
pmotools-python extract_pmo_with_select_specimen_names -h
usage: pmotools-python extract_pmo_with_select_specimen_names
       [-h] --file FILE --output OUTPUT [--overwrite] [--verbose]
       --specimen_names SPECIMEN_NAMES

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       Output json file path
  --overwrite           If output file exists, overwrite it
  --verbose             write out various messages about extraction
  --specimen_names SPECIMEN_NAMES
                        Can either comma separated specimen_names, or a plain
                        text file where each line is a specimen_name

The python code for extract_pmo_with_select_specimen_names script is below

Code
pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_select_specimen_names.py
#!/usr/bin/env python3
import argparse


from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.pmo_engine.pmo_writer import PMOWriter
from pmotools.utils.small_utils import Utils


def parse_args_extract_pmo_with_select_specimen_names():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, required=True, help="Output json file path"
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="write out various messages about extraction",
    )
    parser.add_argument(
        "--specimen_names",
        type=str,
        required=True,
        help="Can either comma separated specimen_names, or a plain text file where each line is a specimen_name",
    )
    return parser.parse_args()


def extract_pmo_with_select_specimen_names():
    args = parse_args_extract_pmo_with_select_specimen_names()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # parse specimen names
    all_specimen_names = Utils.parse_delimited_input_or_file(args.specimen_names)

    # read in pmo
    pmo = PMOReader.read_in_pmo(args.file)

    # extract
    pmo_out = PMOProcessor.filter_pmo_by_specimen_names(pmo, all_specimen_names)

    # write out the extracted
    args.output = PMOWriter.add_pmo_extension_as_needed(
        args.output, args.file.endswith(".gz") or args.output.endswith(".gz")
    )
    PMOWriter.write_out_pmo(pmo_out, args.output, args.overwrite)


if __name__ == "__main__":
    extract_pmo_with_select_specimen_names()

Example usage with each input style (command line, file, and STDIN)

Code
cd example 

pmotools-python extract_pmo_with_select_specimen_names --specimen_names 8025874217,8025875146,8034209589 --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite

echo -e "8025874217\n8025875146\n8034209589" > select_specimen_names.txt 

pmotools-python extract_pmo_with_select_specimen_names --specimen_names select_specimen_names.txt --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite

echo -e "8025874217\n8025875146\n8034209589" | pmotools-python extract_pmo_with_select_specimen_names --specimen_names STDIN --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite
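
Selecting by specimen_name pulls in every library sample linked to that specimen. The sketch below shows that lookup on a simplified list of records with `specimen_name` and `library_sample_name` keys. The record layout, the function name, and the `_rep2` sample name are assumptions made for illustration; the real PMO structure and `PMOProcessor.filter_pmo_by_specimen_names` may work differently.

```python
def library_samples_for_specimens(library_samples, specimen_names):
    """Return the library_sample_names that belong to the requested specimens."""
    wanted = set(specimen_names)
    return [
        record["library_sample_name"]
        for record in library_samples
        if record["specimen_name"] in wanted
    ]


# simplified, partly made-up records: one specimen sequenced twice, one sequenced once
library_samples = [
    {"library_sample_name": "8025874217", "specimen_name": "8025874217"},
    {"library_sample_name": "8025874217_rep2", "specimen_name": "8025874217"},
    {"library_sample_name": "8034209589", "specimen_name": "8034209589"},
]

selected = library_samples_for_specimens(library_samples, ["8025874217"])
print(selected)
```

Here a single specimen name selects both of its library samples, whereas selecting by library_sample_name would pick them one at a time.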

Subsetting by specific library_sample_names

If you want only specific library_sample_name values, you can supply those instead

As above, you can supply the library_sample_names as comma separated values, in a plain text file where each line is a library_sample_name, or from standard input (STDIN)

Code
pmotools-python extract_pmo_with_select_library_sample_names -h
usage: pmotools-python extract_pmo_with_select_library_sample_names
       [-h] --file FILE --output OUTPUT [--overwrite] [--verbose]
       --library_sample_names LIBRARY_SAMPLE_NAMES

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       Output json file path
  --overwrite           If output file exists, overwrite it
  --verbose             write out various messages about extraction
  --library_sample_names LIBRARY_SAMPLE_NAMES
                        Can either comma separated library_sample_names, or a
                        plain text file where each line is a
                        library_sample_name

The python code for extract_pmo_with_select_library_sample_names script is below

Code
pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_select_library_sample_names.py
#!/usr/bin/env python3
import argparse


from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.pmo_engine.pmo_writer import PMOWriter
from pmotools.utils.small_utils import Utils


def parse_args_extract_pmo_with_select_library_sample_names():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, required=True, help="Output json file path"
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="write out various messages about extraction",
    )
    parser.add_argument(
        "--library_sample_names",
        type=str,
        required=True,
        help="Can either comma separated library_sample_names, or a plain text file where each line is a library_sample_name",
    )
    return parser.parse_args()


def extract_pmo_with_select_library_sample_names():
    args = parse_args_extract_pmo_with_select_library_sample_names()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # parse library sample names
    all_library_sample_names = set(
        Utils.parse_delimited_input_or_file(args.library_sample_names)
    )

    # read in pmo
    pmo = PMOReader.read_in_pmo(args.file)

    # extract
    pmo_out = PMOProcessor.filter_pmo_by_library_sample_names(
        pmo, all_library_sample_names
    )

    # write out the extracted
    args.output = PMOWriter.add_pmo_extension_as_needed(
        args.output, args.file.endswith(".gz") or args.output.endswith(".gz")
    )
    PMOWriter.write_out_pmo(pmo_out, args.output, args.overwrite)


if __name__ == "__main__":
    extract_pmo_with_select_library_sample_names()

Example usage with each input style (command line, file, and STDIN)

Code
cd example 

pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names 8025875029,8034209834,8034209115 --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite

echo -e "8025875029\n8034209834\n8034209115" > select_library_sample_names.txt 
  
pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names select_library_sample_names.txt --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite


echo -e "8025875029\n8034209834\n8034209115" | pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names STDIN --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite

Subsetting by samples within specific metafields

If you want to get specific samples that match certain meta fields, such as a specific collection_country or collection_date, you can use pmotools-python extract_pmo_with_selected_meta

Code
pmotools-python extract_pmo_with_selected_meta -h 
usage: pmotools-python extract_pmo_with_selected_meta [-h] --file FILE
                                                      --output OUTPUT
                                                      [--overwrite]
                                                      [--verbose]
                                                      --metaFieldsValues
                                                      METAFIELDSVALUES

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       Output json file path
  --overwrite           If output file exists, overwrite it
  --verbose             write out various messages about extraction
  --metaFieldsValues METAFIELDSVALUES
                        Meta Fields to include, should either be a table with
                        columns field, values (and optionally group) or
                        supplied command line as
                        field1=value1,value2,value3:field2=value1,value2

The python code for extract_pmo_with_selected_meta script is below

Code
pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_selected_meta.py
#!/usr/bin/env python3
import argparse
import sys


from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.pmo_engine.pmo_writer import PMOWriter
from pmotools.utils.small_utils import Utils


def parse_args_extract_pmo_with_selected_meta():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, required=True, help="Output json file path"
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="write out various messages about extraction",
    )
    parser.add_argument(
        "--metaFieldsValues",
        type=str,
        required=True,
        help="Meta Fields to include, should either be a table with columns field, values (and optionally group) or supplied command line as field1=value1,value2,value3:field2=value1,value2",
    )
    return parser.parse_args()


def extract_pmo_with_selected_meta():
    args = parse_args_extract_pmo_with_selected_meta()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in pmo
    pmo = PMOReader.read_in_pmo(args.file)

    # extract out of PMO
    pmo_out, group_counts = PMOProcessor.extract_from_pmo_samples_with_meta_groupings(
        pmo, args.metaFieldsValues
    )

    # write out the extracted
    args.output = PMOWriter.add_pmo_extension_as_needed(
        args.output, args.file.endswith(".gz") or args.output.endswith(".gz")
    )
    PMOWriter.write_out_pmo(pmo_out, args.output, args.overwrite)

    if args.verbose:
        sys.stdout.write(
            "Extracted the following number of specimens per group:" + "\n"
        )
        group_counts.to_csv(sys.stdout, sep="\t", index=True)


if __name__ == "__main__":
    extract_pmo_with_selected_meta()

pmotools-python extract_pmo_with_selected_meta is written to allow extraction on multiple intersecting meta field requirements, which can be supplied either in a file or as a delimited string on the command line.

You may also want to know which meta fields are present and how many samples fall under each value. This can be done with pmotools-python list_specimen_meta_fields and pmotools-python count_specimen_meta.

Code
cd example 
wget https://plasmogenepi.github.io/PMO_Docs/format/PathWeaverHeome1_PMO.json.gz
Code
cd example 
pmotools-python list_specimen_meta_fields --file ../../format/PathWeaverHeome1_PMO.json.gz
field   present_in_specimens_count  total_specimen_count
collection_country  19433   19433
collection_date 19433   19433
geo_admin3  19433   19433
host_taxon_id   19433   19433
project_id  19433   19433
specimen_collect_device 19433   19433
specimen_name   19433   19433
specimen_store_loc  19433   19433
specimen_taxon_id   19433   19433
Code
cd example 
pmotools-python count_specimen_meta --file ../../format/PathWeaverHeome1_PMO.json.gz  --meta_fields collection_country,collection_date | head -20
collection_country  collection_date specimens_count specimens_freq  total_specimen_count
Bangladesh  2008    15  0.0007718828796377296   19433
Bangladesh  2009    16  0.000823341738280245    19433
Bangladesh  2012    8   0.0004116708691401225   19433
Bangladesh  2012-04-19  1   5.145885864251531e-05   19433
Bangladesh  2012-06-05  2   0.00010291771728503062  19433
Bangladesh  2012-06-13  1   5.145885864251531e-05   19433
Bangladesh  2012-06-17  1   5.145885864251531e-05   19433
Bangladesh  2012-07-17  1   5.145885864251531e-05   19433
Bangladesh  2012-07-23  1   5.145885864251531e-05   19433
Bangladesh  2012-07-25  1   5.145885864251531e-05   19433
Bangladesh  2012-08-11  1   5.145885864251531e-05   19433
Bangladesh  2012-08-27  1   5.145885864251531e-05   19433
Bangladesh  2012-08-28  1   5.145885864251531e-05   19433
Bangladesh  2012-09-10  1   5.145885864251531e-05   19433
Bangladesh  2012-09-12  1   5.145885864251531e-05   19433
Bangladesh  2012-09-17  1   5.145885864251531e-05   19433
Bangladesh  2012-09-18  1   5.145885864251531e-05   19433
Bangladesh  2012-09-19  1   5.145885864251531e-05   19433
Bangladesh  2012-09-22  1   5.145885864251531e-05   19433
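
The specimens_freq column in the count_specimen_meta output above appears to be specimens_count divided by total_specimen_count (for example, 15 / 19433 ≈ 0.00077). A minimal sketch of such a tally is given below. The function name, the "NA" placeholder for missing values, and the four specimen records are made up for illustration and are not taken from pmotools-python.

```python
from collections import Counter


def count_meta_combinations(specimens, fields):
    """Count specimens for each combination of values of the given meta fields."""
    total = len(specimens)
    counts = Counter(
        tuple(specimen.get(field, "NA") for field in fields) for specimen in specimens
    )
    rows = []
    for combo, n in sorted(counts.items()):
        row = dict(zip(fields, combo))
        row["specimens_count"] = n
        row["specimens_freq"] = n / total
        row["total_specimen_count"] = total
        rows.append(row)
    return rows


# made-up specimen meta records
specimens = [
    {"collection_country": "Bangladesh", "collection_date": "2008"},
    {"collection_country": "Bangladesh", "collection_date": "2008"},
    {"collection_country": "Bangladesh", "collection_date": "2009"},
    {"collection_country": "Benin", "collection_date": "2016"},
]

rows = count_meta_combinations(specimens, ["collection_country", "collection_date"])
for row in rows:
    print(row)
```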

To extract on a single matching meta field, supply field=value. The command below extracts just the samples that have collection_country=Bangladesh

Code
cd example 
pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh" --output Bangladesh_moz2018_PMO.json.gz   --overwrite

If you want to see how many samples were extracted, add --verbose

Code
cd example 
pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh" --output Bangladesh_moz2018_PMO.json.gz   --overwrite --verbose
Extracted the following number of specimens per group:
group   collection_country  count
0   Bangladesh  1418

To match more than one value for a field, separate the values with commas, for example to extract both Bangladesh and Benin

Code
cd example 

pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh,Benin" --output Bangladesh_Benin_moz2018_PMO.json.gz   --overwrite --verbose 
Extracted the following number of specimens per group:
group   collection_country  count
0   Bangladesh,Benin    1576

You can add more meta criteria by separating fields with a colon (:), for example to extract samples with a collection_country of Bangladesh or Benin and a collection_date of 2016

Code
cd example 

pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh,Benin:collection_date=2016" --output Bangladesh_Benin_2016_moz2018_PMO.json.gz   --overwrite --verbose 
Extracted the following number of specimens per group:
group   collection_country  collection_date count
0   Bangladesh,Benin    2016    933

To get more specific, you can group meta field extraction criteria by separating the groups with a semicolon (;). For example, to get samples from Bangladesh from 2015 but samples from Benin from 2016

Code
cd example 

pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh:collection_date=2015;collection_country=Benin:collection_date=2016" --output Bangladesh2015_Benin2016_moz2018_PMO.json.gz   --overwrite --verbose 
Extracted the following number of specimens per group:
group   collection_country  collection_date count
0   Bangladesh  2015    508
1   Benin   2016    117
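
The command-line syntax used above (values separated by `,`, fields by `:`, groups by `;`) can be read as "any value within a field, all fields within a group, any group". The sketch below parses such a string and tests a specimen against it. It is an assumption-laden illustration, not the pmotools-python implementation: the function names are hypothetical, and it uses exact string matching, which may differ from how the tool compares values.

```python
def parse_meta_fields_values(spec):
    """Parse 'f1=v1,v2:f2=v3;f1=v4' into a list of {field: [values]} groups."""
    groups = []
    for group_spec in spec.split(";"):
        criteria = {}
        for field_spec in group_spec.split(":"):
            field, _, values = field_spec.partition("=")
            criteria[field] = values.split(",")
        groups.append(criteria)
    return groups


def specimen_matches(specimen, groups):
    """True if the specimen satisfies every field of at least one group."""
    return any(
        all(str(specimen.get(field)) in values for field, values in criteria.items())
        for criteria in groups
    )


groups = parse_meta_fields_values(
    "collection_country=Bangladesh:collection_date=2015;"
    "collection_country=Benin:collection_date=2016"
)
print(groups)
print(specimen_matches({"collection_country": "Benin", "collection_date": "2016"}, groups))
print(specimen_matches({"collection_country": "Benin", "collection_date": "2015"}, groups))
```

Under these assumptions the Benin 2016 specimen matches the second group, while the Benin 2015 specimen matches neither.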

Rather than supplying the criteria on the command line, you can create a tab-separated file with the columns group, field, and values

Code
cd example 

echo -e "group\tfield\tvalues" > Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Bangladesh2015\tcollection_country\tBangladesh" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Bangladesh2015\tcollection_date\t2015" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Benin2016\tcollection_country\tBenin" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Benin2016\tcollection_date\t2016" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 


pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues Bangladesh2015_Benin2016_extractionCriteria.tsv --output Bangladesh2015_Benin2016_moz2018_PMO.json.gz   --overwrite --verbose 
Extracted the following number of specimens per group:
group   collection_country  collection_date count
Bangladesh2015  Bangladesh  2015    508
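Benin2016   Benin   2016    117

Groups can also combine a shared field with several values, for example one group for Mozambique samples from Inhassoro and another for Mozambique samples from Mandlakazi or Namaacha

Code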
cd example 
pmotools-python extract_pmo_with_selected_meta  --file ../../format/moz2018_PMO.json.gz --metaFieldsValues "collection_country=Mozambique:geo_admin3=Inhassoro;collection_country=Mozambique:geo_admin3=Mandlakazi,Namaacha" --output Mozambique_moz2018_PMO.json.gz  --verbose  --overwrite
Extracted the following number of specimens per group:
group   collection_country  geo_admin3  count
0   Mozambique  Inhassoro   27
1   Mozambique  Mandlakazi,Namaacha 54
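
If the criteria are built programmatically, the group/field/values table described earlier can be written with Python's csv module. The sketch below is only an illustration: the group names are made up, the helper name is hypothetical, and it assumes that several values for one field may be joined with commas in the values column, as on the command line.

```python
import csv
import io


def write_criteria_table(groups, handle):
    """Write {group: {field: [values]}} as a tab-separated criteria table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["group", "field", "values"])
    for group_name, criteria in groups.items():
        for field, values in criteria.items():
            writer.writerow([group_name, field, ",".join(values)])


# made-up group names mirroring the Mozambique example
groups = {
    "Inhassoro": {
        "collection_country": ["Mozambique"],
        "geo_admin3": ["Inhassoro"],
    },
    "Mandlakazi_Namaacha": {
        "collection_country": ["Mozambique"],
        "geo_admin3": ["Mandlakazi", "Namaacha"],
    },
}

buffer = io.StringIO()
write_criteria_table(groups, buffer)
print(buffer.getvalue(), end="")
```

To produce a file for --metaFieldsValues, pass an open file handle instead of the in-memory buffer.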

Subsetting by a read filter for detected microhaplotypes

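You can drop detected microhaplotypes that fall below a minimum read count by using pmotools-python extract_pmo_with_read_filter

Code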
pmotools-python extract_pmo_with_read_filter -h
usage: pmotools-python extract_pmo_with_read_filter [-h] --file FILE --output
                                                    OUTPUT [--overwrite]
                                                    --read_count_minimum
                                                    READ_COUNT_MINIMUM

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       Output json file path
  --overwrite           If output file exists, overwrite it
  --read_count_minimum READ_COUNT_MINIMUM
                        the minimum read count (inclusive) for detected
                        haplotypes to be kept

The python code for extract_pmo_with_read_filter script is below

Code
pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_read_filter.py
#!/usr/bin/env python3
import argparse


from pmotools.pmo_engine.pmo_processor import PMOProcessor
from pmotools.pmo_engine.pmo_reader import PMOReader
from pmotools.pmo_engine.pmo_writer import PMOWriter
from pmotools.utils.small_utils import Utils


def parse_args_extract_pmo_with_read_filter():
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, required=True, help="PMO file")
    parser.add_argument(
        "--output", type=str, required=True, help="Output json file path"
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="If output file exists, overwrite it"
    )
    parser.add_argument(
        "--read_count_minimum",
        default=0.0,
        type=float,
        required=True,
        help="the minimum read count (inclusive) for detected haplotypes to be kept",
    )
    return parser.parse_args()


def extract_pmo_with_read_filter():
    args = parse_args_extract_pmo_with_read_filter()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in pmo
    pmo = PMOReader.read_in_pmo(args.file)

    # extract
    pmo_out = PMOProcessor.extract_from_pmo_with_read_filter(
        pmo, args.read_count_minimum
    )

    # write out the extracted
    args.output = PMOWriter.add_pmo_extension_as_needed(
        args.output, args.file.endswith(".gz") or args.output.endswith(".gz")
    )
    PMOWriter.write_out_pmo(pmo_out, args.output, args.overwrite)


if __name__ == "__main__":
    extract_pmo_with_read_filter()

For example, to keep only detected microhaplotypes with at least 1000 reads

Code
cd example 

pmotools-python extract_pmo_with_read_filter --read_count_minimum 1000 --file ../../format/moz2018_PMO.json.gz --output moz2018_PMO_minReadCount1000.json.gz --overwrite
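
The help text describes --read_count_minimum as inclusive, so a microhaplotype with exactly 1000 reads is kept by the command above. A minimal sketch of that rule follows, assuming a flat list of detected microhaplotypes that each carry a `reads` value; the record layout, the function name, and the read counts are made up, and the real `PMOProcessor.extract_from_pmo_with_read_filter` operates on a full PMO.

```python
def filter_by_read_count(detected, read_count_minimum):
    """Keep detected microhaplotypes with reads >= read_count_minimum (inclusive)."""
    return [hap for hap in detected if hap["reads"] >= read_count_minimum]


# made-up detections for illustration
detected = [
    {"target_name": "t1", "mhap_id": 0, "reads": 2500},
    {"target_name": "t1", "mhap_id": 1, "reads": 1000},
    {"target_name": "t20", "mhap_id": 3, "reads": 40},
]

kept = filter_by_read_count(detected, 1000)
print([(hap["target_name"], hap["mhap_id"]) for hap in kept])
```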

Piping together extraction

The extraction methods also support piping through STDOUT and STDIN, for example

Code
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-python extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output t1_t20_t31_8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite

You can also pipe into other pmotools-python functions, such as extracting allele tables

Code
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-python extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output STDOUT | pmotools-python extract_allele_table --file STDIN --output alleles_data_t1_t20_t31_8025875029_8034209834_8034209115_moz2018_PMO.tsv.gz --overwrite

You can pipe the final output to STDOUT as well for further processing

Code
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-python extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output STDOUT | pmotools-python extract_allele_table --file STDIN --output STDOUT  --specimen_info_meta_fields specimen_name,collection_country
bioinformatics_run_name library_sample_name target_name mhap_id specimen_name   collection_country
Mozambique2018-SeekDeep 8034209115  t31 1   8034209115  Mozambique
Mozambique2018-SeekDeep 8034209115  t31 3   8034209115  Mozambique
Mozambique2018-SeekDeep 8034209115  t20 4   8034209115  Mozambique
Mozambique2018-SeekDeep 8034209115  t20 1   8034209115  Mozambique
Mozambique2018-SeekDeep 8034209115  t1  2   8034209115  Mozambique
Mozambique2018-SeekDeep 8025875029  t31 1   8025875029  Mozambique
Mozambique2018-SeekDeep 8025875029  t31 3   8025875029  Mozambique
Mozambique2018-SeekDeep 8025875029  t20 3   8025875029  Mozambique
Mozambique2018-SeekDeep 8025875029  t20 5   8025875029  Mozambique
Mozambique2018-SeekDeep 8025875029  t20 4   8025875029  Mozambique
Mozambique2018-SeekDeep 8025875029  t1  0   8025875029  Mozambique
Mozambique2018-SeekDeep 8025875029  t1  1   8025875029  Mozambique
Mozambique2018-SeekDeep 8034209834  t31 2   8034209834  Mozambique
Mozambique2018-SeekDeep 8034209834  t31 0   8034209834  Mozambique
Mozambique2018-SeekDeep 8034209834  t20 1   8034209834  Mozambique
Mozambique2018-SeekDeep 8034209834  t20 0   8034209834  Mozambique
Mozambique2018-SeekDeep 8034209834  t1  0   8034209834  Mozambique
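
As one example of further processing, the allele table written to STDOUT could be summarised by a small script. The sketch below counts distinct mhap_id values per library sample and target. It assumes the table is tab-separated with the column names shown above; the function name is hypothetical, and the embedded rows are a small excerpt of the output rather than the full table.

```python
import csv
import io
from collections import defaultdict


def count_mhaps_per_sample_target(handle):
    """Count distinct mhap_id values for each (library_sample_name, target_name)."""
    seen = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        seen[(row["library_sample_name"], row["target_name"])].add(row["mhap_id"])
    return {key: len(mhap_ids) for key, mhap_ids in seen.items()}


# excerpt of the allele table above, assumed to be tab-separated
table = (
    "bioinformatics_run_name\tlibrary_sample_name\ttarget_name\tmhap_id\tspecimen_name\tcollection_country\n"
    "Mozambique2018-SeekDeep\t8025875029\tt20\t3\t8025875029\tMozambique\n"
    "Mozambique2018-SeekDeep\t8025875029\tt20\t5\t8025875029\tMozambique\n"
    "Mozambique2018-SeekDeep\t8025875029\tt20\t4\t8025875029\tMozambique\n"
    "Mozambique2018-SeekDeep\t8034209834\tt1\t0\t8034209834\tMozambique\n"
)

counts = count_mhaps_per_sample_target(io.StringIO(table))
print(counts)
```

In a real pipeline the function would be given `sys.stdin` instead of the in-memory excerpt, so it can sit at the end of the piped commands.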

You can also filter to a minimum read count and then write an allele table

Code
cd example 

pmotools-python extract_pmo_with_read_filter --read_count_minimum 1000 --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-python extract_allele_table --file STDIN --output moz2018_PMO_minReadCount1000_allele_table.tsv.gz  --microhap_fields reads --representative_haps_fields seq --default_base_col_names specimen_name,target_id,allele --overwrite
Source Code
---
title: Subsetting from a PMO using pmotools-python
---

```{r setup, echo=F}
source("../common.R")
```

# Subsetting 

There may be some instances were you want to subset a much larger PMO file into a smaller PMO file to focus only one a set of samples and/or targets. There are several ways of doing this. 


## Subsetting by specific targets  

Can subset to only specific targets by using `pmotools-python extract_pmo_with_select_targets`

```{bash}
pmotools-python extract_pmo_with_select_targets -h
```

The python code for `extract_pmo_with_select_targets` script is below

```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_select_targets.py
#| file: ../pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_select_targets.py
```



You can extract by supplies the desired targets with comma separated values on the command line 
```{bash}
cd example 

pmotools-python extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets t1,t20,t31  --output t1_t20_t31_moz2018_PMO.json.gz --overwrite
```

You can also provide a single-column file where each line is a desired target, or pipe the targets in from standard input (STDIN).
```{bash}
cd example 

echo -e "t1\nt20\nt31" > select_targets.txt 
pmotools-python extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets select_targets.txt  --output t1_t20_t31_moz2018_PMO.json.gz --overwrite

echo -e "t1\nt20\nt31" | pmotools-python extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets STDIN --output t1_t20_t31_moz2018_PMO.json.gz --overwrite

```


## Subsetting by specific specimen_names 

You can subset to just select specimen_names. Each specimen can have several experiments associated with it; when you supply a specimen_name, all of its associated experiments are pulled as well.

Similar to above, you can supply the specimen_names as comma-separated values, in a plain text file where each line is a specimen_name, or from standard input (STDIN).

```{bash}
pmotools-python extract_pmo_with_select_specimen_names -h
```

The Python code for the `extract_pmo_with_select_specimen_names` script is below.

```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_select_specimen_names.py
#| file: ../pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_select_specimen_names.py
```

```{bash}
cd example 

pmotools-python extract_pmo_with_select_specimen_names --specimen_names 8025874217,8025875146,8034209589 --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite

echo -e "8025874217\n8025875146\n8034209589" > select_specimen_names.txt 

pmotools-python extract_pmo_with_select_specimen_names --specimen_names select_specimen_names.txt --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite

echo -e "8025874217\n8025875146\n8034209589" | pmotools-python extract_pmo_with_select_specimen_names --specimen_names STDIN --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite
```

## Subsetting by specific library_sample_names 

If you want only specific library_sample_names, you can supply those instead.

Similar to above, you can supply the library_sample_names as comma-separated values, in a plain text file where each line is a library_sample_name, or from standard input (STDIN).

```{bash}
pmotools-python extract_pmo_with_select_library_sample_names -h
```

The Python code for the `extract_pmo_with_select_library_sample_names` script is below.

```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_select_library_sample_names.py
#| file: ../pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_select_library_sample_names.py
```

```{bash}
cd example 

pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names 8025875029,8034209834,8034209115 --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite

echo -e "8025875029\n8034209834\n8034209115" > select_library_sample_names.txt 
  
pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names select_library_sample_names.txt --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite


echo -e "8025875029\n8034209834\n8034209115" | pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names STDIN --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite
```


## Subsetting by samples within specific metafields  

If you want to get specific samples that match certain meta fields, such as a specific collection_country or collection_date, you can use `pmotools-python extract_pmo_with_selected_meta`.

```{bash}
pmotools-python extract_pmo_with_selected_meta -h 
```

The Python code for the `extract_pmo_with_selected_meta` script is below.

```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_selected_meta.py
#| file: ../pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_selected_meta.py
```

`pmotools-python extract_pmo_with_selected_meta` allows extraction on multiple intersecting meta field requirements, which can be supplied either in a file or as a delimited string on the command line.

You may also want to know which meta fields are currently present and how many samples fall under each value. This can be done with [`pmotools-python list_specimen_meta_fields`](getting_basic_info_from_pmo.qmd#list_specimen_meta_fields) and `pmotools-python count_specimen_meta`.

```{bash, eval = F}
cd example 
wget https://plasmogenepi.github.io/PMO_Docs/format/PathWeaverHeome1_PMO.json.gz

```

```{bash}
cd example 
pmotools-python list_specimen_meta_fields --file ../../format/PathWeaverHeome1_PMO.json.gz

```

```{bash}
cd example 
pmotools-python count_specimen_meta --file ../../format/PathWeaverHeome1_PMO.json.gz  --meta_fields collection_country,collection_date | head -20

```


To extract on a single matching meta field, the command below extracts just the samples that have collection_country=Bangladesh.

```{bash, eval = F}
cd example 
pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh" --output Bangladesh_moz2018_PMO.json.gz   --overwrite

```

If you want to see how many samples were extracted, you can use `--verbose`.

```{bash}
cd example 
pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh" --output Bangladesh_moz2018_PMO.json.gz   --overwrite --verbose

```

To match more than one value for a field, separate the values with commas, for example to extract both Bangladesh and Benin.

```{bash}
cd example 

pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh,Benin" --output Bangladesh_Benin_moz2018_PMO.json.gz   --overwrite --verbose 

```

You can add more meta field criteria by separating them with a `:`, for example to extract samples with a collection_country of Bangladesh or Benin and a collection_date of 2016.

```{bash}
cd example 

pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh,Benin:collection_date=2016" --output Bangladesh_Benin_2016_moz2018_PMO.json.gz   --overwrite --verbose 

```

To get more specific, you can group meta field extraction criteria. For example, if you want samples from Bangladesh from 2015 but samples from Benin from 2016, you can separate the groups with a `;`.

```{bash}
cd example 

pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh:collection_date=2015;collection_country=Benin:collection_date=2016" --output Bangladesh2015_Benin2016_moz2018_PMO.json.gz   --overwrite --verbose 

```
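The criteria strings used above can be read as follows: `;` separates groups, `:` separates fields within a group, and `,` separates accepted values for a field. The sketch below illustrates that reading on toy data. It is not the code `pmotools-python` runs; the function names `parse_meta_criteria` and `sample_matches` are hypothetical, and the matching rule (a sample is kept if it satisfies every field of at least one group) is an assumption based on the examples in this section.

```python
def parse_meta_criteria(criteria: str) -> list[dict[str, set[str]]]:
    """Split 'field=v1,v2:field2=v3;field=v4' into a list of groups.

    Each group maps a meta field name to the set of accepted values.
    """
    groups = []
    for group_text in criteria.split(";"):
        group = {}
        for field_text in group_text.split(":"):
            field, _, values = field_text.partition("=")
            group[field.strip()] = {value.strip() for value in values.split(",")}
        groups.append(group)
    return groups


def sample_matches(meta: dict[str, str], groups: list[dict[str, set[str]]]) -> bool:
    """Assumed rule: every field of at least one group must match."""
    return any(
        all(str(meta.get(field)) in values for field, values in group.items())
        for group in groups
    )


# Toy specimen metadata (hypothetical, not taken from the example PMO files).
samples = [
    {"specimen_name": "s1", "collection_country": "Bangladesh", "collection_date": "2015"},
    {"specimen_name": "s2", "collection_country": "Bangladesh", "collection_date": "2016"},
    {"specimen_name": "s3", "collection_country": "Benin", "collection_date": "2016"},
    {"specimen_name": "s4", "collection_country": "Ghana", "collection_date": "2016"},
]

criteria = (
    "collection_country=Bangladesh:collection_date=2015;"
    "collection_country=Benin:collection_date=2016"
)
groups = parse_meta_criteria(criteria)
selected = [s["specimen_name"] for s in samples if sample_matches(s, groups)]
print(selected)
```

Under these assumptions, only the Bangladesh 2015 and Benin 2016 toy samples are kept, which mirrors the intent of the grouped command-line example.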


Rather than supplying the criteria on the command line, you can create a tab-separated file with the columns `group`, `field`, and `values`.

```{bash}
cd example 

echo -e "group\tfield\tvalues" > Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Bangladesh2015\tcollection_country\tBangladesh" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Bangladesh2015\tcollection_date\t2015" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Benin2016\tcollection_country\tBenin" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Benin2016\tcollection_date\t2016" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 


pmotools-python extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues Bangladesh2015_Benin2016_extractionCriteria.tsv --output Bangladesh2015_Benin2016_moz2018_PMO.json.gz   --overwrite --verbose 

```
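If you would rather build the criteria table from a script than with repeated `echo` calls, a minimal sketch using Python's `csv` module is shown below. The column names (`group`, `field`, `values`) come from the example above; the helper name `criteria_to_tsv` is hypothetical.

```python
import csv
import io

# Same rows as the echo-based example: (group, field, values).
rows = [
    ("Bangladesh2015", "collection_country", "Bangladesh"),
    ("Bangladesh2015", "collection_date", "2015"),
    ("Benin2016", "collection_country", "Benin"),
    ("Benin2016", "collection_date", "2016"),
]


def criteria_to_tsv(rows: list[tuple[str, str, str]]) -> str:
    """Render extraction criteria as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["group", "field", "values"])
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = criteria_to_tsv(rows)
print(tsv_text, end="")
```

The resulting text can be saved, for example with `pathlib.Path("Bangladesh2015_Benin2016_extractionCriteria.tsv").write_text(tsv_text)`, and then passed to `--metaFieldsValues` as in the command above.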



```{bash}
cd example 
pmotools-python extract_pmo_with_selected_meta  --file ../../format/moz2018_PMO.json.gz --metaFieldsValues "collection_country=Mozambique:geo_admin3=Inhassoro;collection_country=Mozambique:geo_admin3=Mandlakazi,Namaacha" --output Mozambique_moz2018_PMO.json.gz  --verbose  --overwrite

```


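## Subsetting by a read filter for detected microhaplotypes 

You can filter the detected microhaplotypes by a minimum read count by using `pmotools-python extract_pmo_with_read_filter` with `--read_count_minimum`.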



```{bash}
pmotools-python extract_pmo_with_read_filter -h
```

The Python code for the `extract_pmo_with_read_filter` script is below.

```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_read_filter.py
#| file: ../pmotools-python/src/pmotools/scripts/extractors_from_pmo/extract_pmo_with_read_filter.py
```

```{bash}
cd example 

pmotools-python extract_pmo_with_read_filter --read_count_minimum 1000 --file ../../format/moz2018_PMO.json.gz --output moz2018_PMO_minReadCount1000.json.gz --overwrite
```
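As a rough illustration of what a minimum read count filter does, the sketch below applies one to toy records. It is not the tool's implementation: the record layout and the helper name `filter_by_min_reads` are hypothetical, and treating the threshold as inclusive (reads at or above the minimum are kept) is an assumption.

```python
def filter_by_min_reads(records: list[dict], read_count_minimum: int) -> list[dict]:
    """Keep records whose read count is at or above the minimum (assumed inclusive)."""
    return [record for record in records if record["reads"] >= read_count_minimum]


# Toy detected microhaplotypes (hypothetical values).
detected = [
    {"library_sample_name": "8025875029", "target_id": "t1", "allele": 0, "reads": 2500},
    {"library_sample_name": "8025875029", "target_id": "t1", "allele": 1, "reads": 400},
    {"library_sample_name": "8034209834", "target_id": "t20", "allele": 0, "reads": 1000},
]

kept = filter_by_min_reads(detected, read_count_minimum=1000)
for record in kept:
    print(record["library_sample_name"], record["target_id"], record["allele"], record["reads"])
```

Check `pmotools-python extract_pmo_with_read_filter -h` for how the tool itself defines the threshold.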


## Piping together extraction 

The extraction methods also allow for STDOUT and STDIN piping, for example:
```{bash}
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-python extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output t1_t20_t31_8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite

```

You can also pipe into other `pmotools-python` functions, like extracting allele tables.

```{bash}
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-python extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output STDOUT | pmotools-python extract_allele_table --file STDIN --output alleles_data_t1_t20_t31_8025875029_8034209834_8034209115_moz2018_PMO.tsv.gz --overwrite

```


You can also send the final output to STDOUT for further processing.

```{bash}
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-python extract_pmo_with_select_library_sample_names --library_sample_names STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-python extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output STDOUT | pmotools-python extract_allele_table --file STDIN --output STDOUT  --specimen_info_meta_fields specimen_name,collection_country

```

Filter to a minimum read count and then write an allele table.

```{bash}
cd example 

pmotools-python extract_pmo_with_read_filter --read_count_minimum 1000 --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-python extract_allele_table --file STDIN --output moz2018_PMO_minReadCount1000_allele_table.tsv.gz  --microhap_fields reads --representative_haps_fields seq --default_base_col_names specimen_name,target_id,allele --overwrite

```
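Once an allele table has been written as a gzipped TSV, it can be read back with the Python standard library for further processing. The sketch below counts distinct alleles per target. To stay self-contained it builds a small gzipped table in memory; the column names (`specimen_name`, `target_id`, `allele`, `reads`) are assumed from the `--default_base_col_names` and `--microhap_fields` options used above and may differ in the real output, and `count_alleles_per_target` is a hypothetical helper.

```python
import csv
import gzip
import io
from collections import Counter


def count_alleles_per_target(handle) -> Counter:
    """Count distinct alleles seen for each target in a tab-separated allele table."""
    reader = csv.DictReader(handle, delimiter="\t")
    seen = {(row["target_id"], row["allele"]) for row in reader}
    return Counter(target for target, _ in seen)


# Toy table standing in for a real allele table (hypothetical values).
toy_table = (
    "specimen_name\ttarget_id\tallele\treads\n"
    "s1\tt1\t0\t1500\n"
    "s1\tt1\t1\t1200\n"
    "s2\tt1\t0\t3000\n"
    "s2\tt20\t4\t1100\n"
)
compressed = gzip.compress(toy_table.encode("utf-8"))

# gzip.open accepts a file object as well as a path; "rt" decodes to text.
with gzip.open(io.BytesIO(compressed), "rt", newline="") as handle:
    counts = count_alleles_per_target(handle)

print(dict(counts))
```

For a file on disk, replace the in-memory buffer with the path, for example `gzip.open("moz2018_PMO_minReadCount1000_allele_table.tsv.gz", "rt", newline="")`.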