Portable Microhaplotype Object (PMO)
Contents

  • Subsetting
    • Subsetting by specific targets
    • Subsetting by specific specimen_ids
    • Subsetting by specific experiment_sample_ids
    • Subsetting by samples within specific metafields
    • Subsetting by a read filter for detected microhaplotypes
    • Piping together extraction

Subsetting from a PMO using pmotools-python

Subsetting

In some cases you may want to subset a much larger PMO file into a smaller PMO file that focuses on only a set of samples and/or targets. There are several ways of doing this.

Subsetting by specific targets

You can subset to only specific targets by using pmotools-runner.py extract_pmo_with_select_targets

Code
pmotools-runner.py extract_pmo_with_select_targets -h
usage: pmotools-runner.py extract_pmo_with_select_targets [-h] --file FILE
                                                          --output OUTPUT
                                                          [--overwrite]
                                                          [--verbose]
                                                          --targets TARGETS

options:
  -h, --help         show this help message and exit
  --file FILE        PMO file
  --output OUTPUT    Output json file path
  --overwrite        If output file exists, overwrite it
  --verbose          write out various messages about extraction
  --targets TARGETS  Can either comma separated target_ids, or a plain text
                     file where each line is a target_ids

The Python code for the extract_pmo_with_select_targets script is below

Code
pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_select_targets.py
#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict

import pandas as pd

from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.pmo_utils.PMOWriter import PMOWriter
from pmotools.utils.small_utils import Utils


def parse_args_extract_pmo_with_select_targets():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, required=True, help='Output json file path')
    parser.add_argument('--overwrite', action = 'store_true', help='If output file exists, overwrite it')
    parser.add_argument('--verbose', action = 'store_true', help='write out various messages about extraction')
    parser.add_argument('--targets', type=str, required=True, help='Can either comma separated target_ids, or a plain text file where each line is a target_ids')
    return parser.parse_args()

def extract_pmo_with_select_targets():
    args = parse_args_extract_pmo_with_select_targets()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # parse target ids
    all_target_ids = Utils.parse_delimited_input_or_file(args.targets)

    # read in pmo
    pmo = PMOReader.read_in_pmo(args.file)

    # extract
    pmo_out = PMOExtractor.extract_from_pmo_select_targets(pmo, all_target_ids)

    # write out the extracted
    args.output = PMOWriter.add_pmo_extension_as_needed(args.output, args.file.endswith('.gz') or args.output.endswith(".gz"))
    PMOWriter.write_out_pmo(pmo_out, args.output, args.overwrite)



if __name__ == "__main__":
    extract_pmo_with_select_targets()

You can extract by supplying the desired targets as comma-separated values on the command line

Code
cd example 

pmotools-runner.py extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets t1,t20,t31  --output t1_t20_t31_moz2018_PMO.json.gz --overwrite

You can also provide a single-column file where each line is a desired target, or pass STDIN as the --targets value to read the same one-per-line list from standard input

Code
cd example 

echo -e "t1\nt20\nt31" > select_targets.txt 
pmotools-runner.py extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets select_targets.txt  --output t1_t20_t31_moz2018_PMO.json.gz --overwrite

echo -e "t1\nt20\nt31" | pmotools-runner.py extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets STDIN --output t1_t20_t31_moz2018_PMO.json.gz --overwrite
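
The three ways of supplying IDs shown above (comma-separated values, a one-per-line file, or STDIN) can be pictured with a small sketch. The function `parse_ids` below is hypothetical and written only for illustration; it is not the pmotools-python implementation of `Utils.parse_delimited_input_or_file`, whose exact behavior may differ.

```python
import io
import sys
from pathlib import Path


def parse_ids(spec, stdin=None):
    """Return IDs from a comma-separated string, a one-per-line file, or STDIN.

    Hypothetical helper for illustration only, not the pmotools-python code.
    """
    if spec == "STDIN":
        # read one ID per line from standard input (or a supplied stream)
        stream = stdin if stdin is not None else sys.stdin
        entries = stream.read().splitlines()
    elif Path(spec).is_file():
        # read one ID per line from a plain text file
        entries = Path(spec).read_text().splitlines()
    else:
        # treat the argument itself as comma-separated values
        entries = spec.split(",")
    # drop surrounding whitespace and empty entries
    return [entry.strip() for entry in entries if entry.strip()]


from_cli = parse_ids("t1,t20,t31")
from_stdin = parse_ids("STDIN", stdin=io.StringIO("t1\nt20\nt31\n"))
print(from_cli)
print(from_cli == from_stdin)
```

All three input styles are meant to end up as the same list of IDs, which is why the commands above produce the same output file.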

Subsetting by specific specimen_ids

You can subset to only selected specimen_ids. Each specimen can have several experiments associated with it, so supplying a specimen_id also pulls all of its associated experiments.

As above, you can supply the specimen_ids as comma-separated values, in a plain text file where each line is a specimen_id, or from standard input (STDIN).

Code
pmotools-runner.py extract_pmo_with_select_specimen_ids -h
usage: pmotools-runner.py extract_pmo_with_select_specimen_ids
       [-h] --file FILE --output OUTPUT [--overwrite] [--verbose]
       --specimen_ids SPECIMEN_IDS

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       Output json file path
  --overwrite           If output file exists, overwrite it
  --verbose             write out various messages about extraction
  --specimen_ids SPECIMEN_IDS
                        Can either comma separated specimen_ids, or a plain
                        text file where each line is a specimen_id

The Python code for the extract_pmo_with_select_specimen_ids script is below

Code
pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_select_specimen_ids.py
#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict

import pandas as pd

from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.pmo_utils.PMOWriter import PMOWriter
from pmotools.utils.small_utils import Utils


def parse_args_extract_pmo_with_select_specimen_ids():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, required=True, help='Output json file path')
    parser.add_argument('--overwrite', action = 'store_true', help='If output file exists, overwrite it')
    parser.add_argument('--verbose', action = 'store_true', help='write out various messages about extraction')
    parser.add_argument('--specimen_ids', type=str, required=True, help='Can either comma separated specimen_ids, or a plain text file where each line is a specimen_id')
    return parser.parse_args()

def extract_pmo_with_select_specimen_ids():
    args = parse_args_extract_pmo_with_select_specimen_ids()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # parse specimen ids
    all_specimen_ids = Utils.parse_delimited_input_or_file(args.specimen_ids)

    # read in pmo
    pmo = PMOReader.read_in_pmo(args.file)

    # extract
    pmo_out = PMOExtractor.extract_from_pmo_select_specimen_ids(pmo, all_specimen_ids)

    # write out the extracted
    args.output = PMOWriter.add_pmo_extension_as_needed(args.output, args.file.endswith('.gz') or args.output.endswith(".gz"))
    PMOWriter.write_out_pmo(pmo_out, args.output, args.overwrite)



if __name__ == "__main__":
    extract_pmo_with_select_specimen_ids()
Code
cd example 

pmotools-runner.py extract_pmo_with_select_specimen_ids --specimen_ids 8025874217,8025875146,8034209589 --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite

echo -e "8025874217\n8025875146\n8034209589" > select_specimen_ids.txt 

pmotools-runner.py extract_pmo_with_select_specimen_ids --specimen_ids select_specimen_ids.txt --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite

echo -e "8025874217\n8025875146\n8034209589" | pmotools-runner.py extract_pmo_with_select_specimen_ids --specimen_ids STDIN --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite
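
As a rough illustration of why selecting a specimen_id also pulls its experiments, the sketch below uses a toy mapping from experiment_sample_id to specimen_id. The IDs, the flat dictionary layout, and the function name are all made up for this example and do not reflect the real PMO structure or the pmotools-python internals.

```python
# Toy mapping of experiment_sample_id -> specimen_id (hypothetical data,
# not the real PMO layout).
experiment_to_specimen = {
    "exp_A1": "spec_A",
    "exp_A2": "spec_A",
    "exp_B1": "spec_B",
    "exp_C1": "spec_C",
}


def experiments_for_specimens(mapping, specimen_ids):
    """Return every experiment_sample_id whose specimen is in specimen_ids."""
    wanted = set(specimen_ids)
    return sorted(exp for exp, spec in mapping.items() if spec in wanted)


selected = experiments_for_specimens(experiment_to_specimen, ["spec_A", "spec_C"])
print(selected)
```

Here spec_A contributes both of its experiments, which mirrors the behavior described for extract_pmo_with_select_specimen_ids.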

Subsetting by specific experiment_sample_ids

If you want only specific experiment_sample_ids, you can supply those instead.

As above, you can supply the experiment_sample_ids as comma-separated values, in a plain text file where each line is an experiment_sample_id, or from standard input (STDIN).

Code
pmotools-runner.py extract_pmo_with_select_experiment_sample_ids -h
usage: pmotools-runner.py extract_pmo_with_select_experiment_sample_ids
       [-h] --file FILE --output OUTPUT [--overwrite] [--verbose]
       --experiment_sample_ids EXPERIMENT_SAMPLE_IDS

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       Output json file path
  --overwrite           If output file exists, overwrite it
  --verbose             write out various messages about extraction
  --experiment_sample_ids EXPERIMENT_SAMPLE_IDS
                        Can either comma separated experiment_sample_ids, or a
                        plain text file where each line is a
                        experiment_sample_id

The Python code for the extract_pmo_with_select_experiment_sample_ids script is below

Code
pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_select_experiment_sample_ids.py
#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict

import pandas as pd

from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.pmo_utils.PMOWriter import PMOWriter
from pmotools.utils.small_utils import Utils


def parse_args_extract_pmo_with_select_experiment_sample_ids():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, required=True, help='Output json file path')
    parser.add_argument('--overwrite', action = 'store_true', help='If output file exists, overwrite it')
    parser.add_argument('--verbose', action = 'store_true', help='write out various messages about extraction')
    parser.add_argument('--experiment_sample_ids', type=str, required=True, help='Can either comma separated experiment_sample_ids, or a plain text file where each line is a experiment_sample_id')
    return parser.parse_args()

def extract_pmo_with_select_experiment_sample_ids():
    args = parse_args_extract_pmo_with_select_experiment_sample_ids()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # parse experiment sample ids
    all_experiment_sample_ids = Utils.parse_delimited_input_or_file(args.experiment_sample_ids)

    # read in pmo
    pmo = PMOReader.read_in_pmo(args.file)

    # extract
    pmo_out = PMOExtractor.extract_from_pmo_select_experiment_sample_ids(pmo, all_experiment_sample_ids)

    # write out the extracted
    args.output = PMOWriter.add_pmo_extension_as_needed(args.output, args.file.endswith('.gz') or args.output.endswith(".gz"))
    PMOWriter.write_out_pmo(pmo_out, args.output, args.overwrite)



if __name__ == "__main__":
    extract_pmo_with_select_experiment_sample_ids()
Code
cd example 

pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids 8025875029,8034209834,8034209115 --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite

echo -e "8025875029\n8034209834\n8034209115" > select_experiment_sample_ids.txt 
  
pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids select_experiment_sample_ids.txt --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite


echo -e "8025875029\n8034209834\n8034209115" | pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids STDIN --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite

Subsetting by samples within specific metafields

If you want to get specific samples that match certain meta fields, such as a specific collection_country or collection_date, you can use pmotools-runner.py extract_pmo_with_selected_meta

Code
pmotools-runner.py extract_pmo_with_selected_meta -h 
usage: pmotools-runner.py extract_pmo_with_selected_meta [-h] --file FILE
                                                         --output OUTPUT
                                                         [--overwrite]
                                                         [--verbose]
                                                         --metaFieldsValues
                                                         METAFIELDSVALUES

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       Output json file path
  --overwrite           If output file exists, overwrite it
  --verbose             write out various messages about extraction
  --metaFieldsValues METAFIELDSVALUES
                        Meta Fields to include, should either be a table with
                        columns field, values (and optionally group) or
                        supplied command line as
                        field1=value1,value2,value3:field2=value1,value2

The Python code for the extract_pmo_with_selected_meta script is below

Code
pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_selected_meta.py
#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict

import pandas as pd

from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.pmo_utils.PMOWriter import PMOWriter
from pmotools.utils.small_utils import Utils


def parse_args_extract_pmo_with_selected_meta():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, required=True, help='Output json file path')
    parser.add_argument('--overwrite', action = 'store_true', help='If output file exists, overwrite it')
    parser.add_argument('--verbose', action = 'store_true', help='write out various messages about extraction')
    parser.add_argument('--metaFieldsValues', type=str, required=True, help='Meta Fields to include, should either be a table with columns field, values (and optionally group) or supplied command line as field1=value1,value2,value3:field2=value1,value2')
    return parser.parse_args()

def extract_pmo_with_selected_meta():
    args = parse_args_extract_pmo_with_selected_meta()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in pmo
    pmo = PMOReader.read_in_pmo(args.file)

    # extract out of PMO
    pmo_out, group_counts = PMOExtractor.extract_from_pmo_samples_with_meta_groupings(pmo, args.metaFieldsValues)

    # write out the extracted
    args.output = PMOWriter.add_pmo_extension_as_needed(args.output, args.file.endswith('.gz') or args.output.endswith(".gz"))
    PMOWriter.write_out_pmo(pmo_out, args.output, args.overwrite)

    if args.verbose:
        sys.stdout.write("Extracted the following number of specimens per group:" + "\n")
        group_counts.to_csv(sys.stdout, sep = "\t", index = True)

if __name__ == "__main__":
    extract_pmo_with_selected_meta()

pmotools-runner.py extract_pmo_with_selected_meta allows extraction on multiple intersecting meta field requirements, which can be supplied either in a file or as a delimited string on the command line.

You may also want to know which meta fields are currently present and how many samples fall under each value. This can be done with pmotools-runner.py list_specimen_meta_fields and pmotools-runner.py count_specimen_meta

Code
cd example 
wget https://plasmogenepi.github.io/PMO_Docs/format/PathWeaverHeome1_PMO.json.gz
Code
cd example 
pmotools-runner.py list_specimen_meta_fields --file ../../format/PathWeaverHeome1_PMO.json.gz
field   presentInSpecimensCount totalSpecimenCount
collection_country  19433   19433
collection_date 19433   19433
collector   19433   19433
geo_admin3  19433   19433
geo_continent   19433   19433
geo_region  19433   19433
geo_subRegion   19433   19433
host_taxon_id   19433   19433
project_name    19433   19433
samp_collect_device 19433   19433
samp_store_loc  19433   19433
samp_taxon_id   19433   19433
specimen_id 19433   19433
Code
cd example 
pmotools-runner.py count_specimen_meta --file ../../format/PathWeaverHeome1_PMO.json.gz  --meta_fields collection_country,collection_date | head -20
collection_country  collection_date specimensCount  specimensFreq   totalSpecimenCount
Bangladesh  2008    15  0.0007718828796377296   19433
Bangladesh  2009    16  0.000823341738280245    19433
Bangladesh  2012    51  0.002624401790768281    19433
Bangladesh  2015    508 0.026141100190397778    19433
Bangladesh  2016    816 0.041990428652292494    19433
Bangladesh  2017    12  0.0006175063037101837   19433
Benin   2014    41  0.002109813204343128    19433
Benin   2016    117 0.006020686461174291    19433
Brazil  1980    1   5.145885864251531e-05   19433
Brazil  2016    13  0.000668965162352699    19433
Brazil  2017    5   0.00025729429321257654  19433
Brazil  NA  1   5.145885864251531e-05   19433
Burkina Faso    2008    58  0.002984613801265888    19433
Cambodia    1993    6   0.00030875315185509186  19433
Cambodia    2007    26  0.001337930324705398    19433
Cambodia    2008    50  0.0025729429321257654   19433
Cambodia    2009    66  0.0033962846704060105   19433
Cambodia    2010    182 0.009365512272937786    19433
Cambodia    2011    441 0.02269335666134925 19433
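
In the table above, specimensFreq is specimensCount divided by totalSpecimenCount (for example, 15 / 19433 for Bangladesh 2008). The sketch below reproduces that kind of tally on a few made-up records. The record layout, the `count_meta` function, and the use of NA for a missing field are assumptions for illustration, not the pmotools-python implementation.

```python
from collections import Counter

# Made-up specimen records; a real PMO stores far more per specimen.
specimens = [
    {"collection_country": "Bangladesh", "collection_date": "2015"},
    {"collection_country": "Bangladesh", "collection_date": "2015"},
    {"collection_country": "Bangladesh", "collection_date": "2016"},
    {"collection_country": "Benin", "collection_date": "2016"},
]


def count_meta(records, fields):
    """Count records per combination of the requested meta fields."""
    total = len(records)
    # a missing field is tallied as "NA" (an assumption of this sketch)
    counts = Counter(tuple(rec.get(f, "NA") for f in fields) for rec in records)
    return [
        {
            "values": combo,
            "specimensCount": n,
            "specimensFreq": n / total,
            "totalSpecimenCount": total,
        }
        for combo, n in sorted(counts.items())
    ]


rows = count_meta(specimens, ["collection_country", "collection_date"])
for row in rows:
    print(row)
```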

To extract on a single matching meta field, the command below extracts only the samples that have collection_country=Bangladesh

Code
cd example 
pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh" --output Bangladesh_moz2018_PMO.json.gz   --overwrite

If you want to see how many samples were extracted, you can add --verbose

Code
cd example 
pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh" --output Bangladesh_moz2018_PMO.json.gz   --overwrite --verbose
Extracted the following number of specimens per group:
group   collection_country  count
0   Bangladesh  1418

To match more than one value for a field, separate the values with commas; for example, to extract both Bangladesh and Benin

Code
cd example 

pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh,Benin" --output Bangladesh_Benin_moz2018_PMO.json.gz   --overwrite --verbose 
Extracted the following number of specimens per group:
group   collection_country  count
0   Bangladesh,Benin    1576

You can add more meta extraction criteria by separating fields with a colon (:); for example, to extract samples with a collection_country of Bangladesh or Benin and a collection_date of 2016

Code
cd example 

pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh,Benin:collection_date=2016" --output Bangladesh_Benin_2016_moz2018_PMO.json.gz   --overwrite --verbose 
Extracted the following number of specimens per group:
group   collection_country  collection_date count
0   Bangladesh,Benin    2016    933

To be more specific, you can group meta field extraction criteria. For example, if you want samples from Bangladesh from 2015 but samples from Benin from 2016, separate the groups with a semicolon (;)

Code
cd example 

pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh:collection_date=2015;collection_country=Benin:collection_date=2016" --output Bangladesh2015_Benin2016_moz2018_PMO.json.gz   --overwrite --verbose 
Extracted the following number of specimens per group:
group   collection_country  collection_date count
0   Bangladesh  2015    508
1   Benin   2016    117
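
The command-line syntax used in these examples combines three delimiters: commas between alternative values of one field, colons between fields that must all match, and semicolons between independent groups. The sketch below parses such a string and applies it to a few made-up records. The functions `parse_meta_criteria` and `matches` are hypothetical and only mirror the behavior implied by the counts above; they are not the pmotools-python code.

```python
def parse_meta_criteria(spec):
    """Parse 'f1=a,b:f2=c;f1=d' into a list of {field: set(values)} groups."""
    groups = []
    for group_spec in spec.split(";"):          # ';' separates groups
        criteria = {}
        for field_spec in group_spec.split(":"):  # ':' separates fields
            field, values = field_spec.split("=", 1)
            criteria[field] = set(values.split(","))  # ',' separates values
        groups.append(criteria)
    return groups


def matches(record, criteria):
    """A record matches a group when every field holds one of the allowed values."""
    return all(record.get(field) in allowed for field, allowed in criteria.items())


# Made-up specimen records for illustration.
specimens = [
    {"specimen_id": "s1", "collection_country": "Bangladesh", "collection_date": "2015"},
    {"specimen_id": "s2", "collection_country": "Bangladesh", "collection_date": "2016"},
    {"specimen_id": "s3", "collection_country": "Benin", "collection_date": "2016"},
    {"specimen_id": "s4", "collection_country": "Brazil", "collection_date": "2016"},
]

groups = parse_meta_criteria(
    "collection_country=Bangladesh:collection_date=2015;"
    "collection_country=Benin:collection_date=2016"
)
# a specimen is kept when it matches at least one group
selected = [s["specimen_id"] for s in specimens if any(matches(s, g) for g in groups)]
print(selected)
```

In this toy data only s1 (Bangladesh, 2015) and s3 (Benin, 2016) are kept, matching the two groups reported in the verbose output.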

Rather than supplying the criteria on the command line, you can create a tab-separated file with the columns group, field, and values

Code
cd example 

echo -e "group\tfield\tvalues" > Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Bangladesh2015\tcollection_country\tBangladesh" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Bangladesh2015\tcollection_date\t2015" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Benin2016\tcollection_country\tBenin" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Benin2016\tcollection_date\t2016" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 


pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues Bangladesh2015_Benin2016_extractionCriteria.tsv --output Bangladesh2015_Benin2016_moz2018_PMO.json.gz   --overwrite --verbose 
Extracted the following number of specimens per group:
group   collection_country  collection_date count
Bangladesh2015  Bangladesh  2015    508
Benin2016   Benin   2016    117
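
If you prefer to build the criteria table from Python rather than with echo, a minimal sketch using the standard csv module is shown below. It produces the same three columns (group, field, values) used in the example above; the helper name `criteria_table` is made up for this illustration.

```python
import csv
import io

# One row per (group, field, values) requirement, as in the echo example.
rows = [
    ("Bangladesh2015", "collection_country", "Bangladesh"),
    ("Bangladesh2015", "collection_date", "2015"),
    ("Benin2016", "collection_country", "Benin"),
    ("Benin2016", "collection_date", "2016"),
]


def criteria_table(rows):
    """Return the criteria as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["group", "field", "values"])
    writer.writerows(rows)
    return buffer.getvalue()


table = criteria_table(rows)
print(table, end="")
# To save it for --metaFieldsValues, write `table` to a .tsv file, e.g.
# open("Bangladesh2015_Benin2016_extractionCriteria.tsv", "w").write(table)
```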
Code
cd example 
pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/moz2018_PMO.json.gz --metaFieldsValues "collection_country=Mozambique:geo_admin3=Inhassoro;collection_country=Mozambique:geo_admin3=Mandlakazi,Namaacha" --output Mozambique_moz2018_PMO.json.gz  --verbose  --overwrite
Extracted the following number of specimens per group:
group   collection_country  geo_admin3  count
0   Mozambique  Inhassoro   27
1   Mozambique  Mandlakazi,Namaacha 54

Subsetting by a read filter for detected microhaplotypes

To keep only the detected microhaplotypes whose read count is at least a given minimum (inclusive), use pmotools-runner.py extract_pmo_with_read_filter with --read_count_minimum

Code
pmotools-runner.py extract_pmo_with_read_filter -h
usage: pmotools-runner.py extract_pmo_with_read_filter [-h] --file FILE
                                                       --output OUTPUT
                                                       [--overwrite]
                                                       --read_count_minimum
                                                       READ_COUNT_MINIMUM

options:
  -h, --help            show this help message and exit
  --file FILE           PMO file
  --output OUTPUT       Output json file path
  --overwrite           If output file exists, overwrite it
  --read_count_minimum READ_COUNT_MINIMUM
                        the minimum read count (inclusive) for detected
                        haplotypes to be kept

The Python code for the extract_pmo_with_read_filter script is below

Code
pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_read_filter.py
#!/usr/bin/env python3
import os, argparse, json
import sys
from collections import defaultdict

import pandas as pd

from pmotools.pmo_utils.PMOExtractor import PMOExtractor
from pmotools.pmo_utils.PMOReader import PMOReader
from pmotools.pmo_utils.PMOWriter import PMOWriter
from pmotools.utils.small_utils import Utils


def parse_args_extract_pmo_with_read_filter():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', type=str, required=True, help='PMO file')
    parser.add_argument('--output', type=str, required=True, help='Output json file path')
    parser.add_argument('--overwrite', action = 'store_true', help='If output file exists, overwrite it')
    parser.add_argument('--read_count_minimum', default=0.0, type=float, required=True, help='the minimum read count (inclusive) for detected haplotypes to be kept')
    return parser.parse_args()

def extract_pmo_with_read_filter():
    args = parse_args_extract_pmo_with_read_filter()

    # check files
    Utils.inputOutputFileCheck(args.file, args.output, args.overwrite)

    # read in pmo
    pmo = PMOReader.read_in_pmo(args.file)

    # extract
    pmo_out = PMOExtractor.extract_from_pmo_with_read_filter(pmo, args.read_count_minimum)

    # write out the extracted
    args.output = PMOWriter.add_pmo_extension_as_needed(args.output, args.file.endswith('.gz') or args.output.endswith(".gz"))
    PMOWriter.write_out_pmo(pmo_out, args.output, args.overwrite)

if __name__ == "__main__":
    extract_pmo_with_read_filter()
Code
cd example 

pmotools-runner.py extract_pmo_with_read_filter --read_count_minimum 1000 --file ../../format/moz2018_PMO.json.gz --output moz2018_PMO_minReadCount1000.json.gz --overwrite
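
The help text describes the minimum read count as inclusive, so a haplotype with exactly 1000 reads is kept when --read_count_minimum is 1000. The sketch below shows that comparison on made-up data. The nested dictionary layout, the `filter_by_read_count` function, and the choice to drop targets and samples left with no haplotypes are assumptions of this sketch, not a description of the real PMO structure or of what pmotools-python does with emptied entries.

```python
# Made-up detected haplotypes: sample -> target -> list of haplotypes.
detected = {
    "sample1": {
        "t1": [
            {"hap": "t1.0", "read_count": 1500},
            {"hap": "t1.1", "read_count": 200},
        ],
    },
    "sample2": {
        "t1": [{"hap": "t1.2", "read_count": 1000}],
        "t20": [{"hap": "t20.4", "read_count": 50}],
    },
}


def filter_by_read_count(detected, read_count_minimum):
    """Keep haplotypes with read_count >= read_count_minimum (inclusive)."""
    filtered = {}
    for sample, targets in detected.items():
        kept_targets = {}
        for target, haps in targets.items():
            kept = [h for h in haps if h["read_count"] >= read_count_minimum]
            if kept:  # assumption: targets with nothing left are dropped
                kept_targets[target] = kept
        if kept_targets:  # assumption: samples with nothing left are dropped
            filtered[sample] = kept_targets
    return filtered


filtered = filter_by_read_count(detected, 1000)
print(filtered)
```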

Piping together extraction

The extraction methods also support piping through STDIN and STDOUT, for example

Code
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-runner.py extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output t1_t20_t31_8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite

You can also pipe into other pmotools-runner.py functions, such as extracting allele tables

Code
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-runner.py extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output STDOUT | pmotools-runner.py extract_allele_table --file STDIN --bioid Mozambique2018-SeekDeep --output alleles_data_t1_t20_t31_8025875029_8034209834_8034209115_moz2018_PMO.tsv.gz --overwrite

You can pipe the final output to STDOUT as well for further processing

Code
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-runner.py extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output STDOUT | pmotools-runner.py extract_allele_table --file STDIN --bioid Mozambique2018-SeekDeep --output STDOUT  --specimen_info_meta_fields specimen_id,collection_country
sampleID    locus   allele  specimen_id collection_country
8025875029  t1  t1.0    8025875029  Mozambique
8025875029  t1  t1.1    8025875029  Mozambique
8025875029  t20 t20.3   8025875029  Mozambique
8025875029  t20 t20.5   8025875029  Mozambique
8025875029  t20 t20.4   8025875029  Mozambique
8025875029  t31 t31.1   8025875029  Mozambique
8025875029  t31 t31.3   8025875029  Mozambique
8034209115  t1  t1.2    8034209115  Mozambique
8034209115  t20 t20.4   8034209115  Mozambique
8034209115  t20 t20.1   8034209115  Mozambique
8034209115  t31 t31.1   8034209115  Mozambique
8034209115  t31 t31.3   8034209115  Mozambique
8034209834  t1  t1.0    8034209834  Mozambique
8034209834  t20 t20.1   8034209834  Mozambique
8034209834  t20 t20.0   8034209834  Mozambique
8034209834  t31 t31.2   8034209834  Mozambique
8034209834  t31 t31.0   8034209834  Mozambique

Filter to a minimum read count and then write an allele table

Code
cd example 

pmotools-runner.py extract_pmo_with_read_filter --read_count_minimum 1000 --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-runner.py extract_allele_table --file STDIN --bioid Mozambique2018-SeekDeep --output moz2018_PMO_minReadCount1000_allele_table.tsv.gz  --microhap_fields read_count --representative_haps_fields seq --default_base_col_names specimen_id,target_id,allele --overwrite
Source Code
---
title: Subsetting from a PMO using pmotools-python
---

```{r setup, echo=F}
source("../common.R")
```

# Subsetting 

There may be some instances were you want to subset a much larger PMO file into a smaller PMO file to focus only one a set of samples and/or targets. There are several ways of doing this. 


## Subsetting by specific targets  

Can subset to only specific targets by using `pmotools-runner.py extract_pmo_with_select_targets`

```{bash}
pmotools-runner.py extract_pmo_with_select_targets -h
```

The python code for `extract_pmo_with_select_targets` script is below

```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_select_targets.py
#| file: ../pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_select_targets.py
```



You can extract by supplies the desired targets with comma separated values on the command line 
```{bash}
cd example 

pmotools-runner.py extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets t1,t20,t31  --output t1_t20_t31_moz2018_PMO.json.gz --overwrite
```

You can also provide a single column file where each line is a desired target 
```{bash}
cd example 

echo -e "t1\nt20\nt31" > select_targets.txt 
pmotools-runner.py extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets select_targets.txt  --output t1_t20_t31_moz2018_PMO.json.gz --overwrite

echo -e "t1\nt20\nt31" | pmotools-runner.py extract_pmo_with_select_targets --file ../../format/moz2018_PMO.json.gz --targets STDIN --output t1_t20_t31_moz2018_PMO.json.gz --overwrite

```


## Subsetting by specific specimen_ids 

You can subset to just to just select specimen_id, each specimen can have several experiments associated with it, by supplying the specimen_id all associated experiments will also be pulled 

Similar to above you can supply the specimen_ids either as comma separated values or in a plain text file where each line is a specimen_id

```{bash}
pmotools-runner.py extract_pmo_with_select_specimen_ids -h
```

The python code for `extract_pmo_with_select_specimen_ids` script is below

```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_select_specimen_ids.py
#| file: ../pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_select_specimen_ids.py
```

```{bash}
cd example 

pmotools-runner.py extract_pmo_with_select_specimen_ids --specimen_ids 8025874217,8025875146,8034209589 --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite

echo -e "8025874217\n8025875146\n8034209589" > select_specimen_ids.txt 

pmotools-runner.py extract_pmo_with_select_specimen_ids --specimen_ids select_specimen_ids.txt --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite

echo -e "8025874217\n8025875146\n8034209589" | pmotools-runner.py extract_pmo_with_select_specimen_ids --specimen_ids STDIN --file ../../format/moz2018_PMO.json.gz --output 8025874217_8025875146_8034209589_moz2018_PMO.json.gz --overwrite
```

## Subsetting by specific experiment_sample_ids 

If you want just specific experiment_sample_id you can supply those instead too  

Similar to above you can supply the experiment_sample_ids either as comma separated values or in a plain text file where each line is a experiment_sample_id or from standard in (STDIN)

```{bash}
pmotools-runner.py extract_pmo_with_select_experiment_sample_ids -h
```

The python code for `extract_pmo_with_select_experiment_sample_ids` script is below

```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_select_experiment_sample_ids.py
#| file: ../pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_select_experiment_sample_ids.py
```

```{bash}
cd example 

pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids 8025875029,8034209834,8034209115 --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite

echo -e "8025875029\n8034209834\n8034209115" > select_experiment_sample_ids.txt 
  
pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids select_experiment_sample_ids.txt --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite


echo -e "8025875029\n8034209834\n8034209115" | pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids STDIN --file ../../format/moz2018_PMO.json.gz --output 8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite
```


## Subsetting by samples within specific metafields  

If you want only the samples that match certain meta fields, such as a specific collection_country or collection_date, you can use `pmotools-runner.py extract_pmo_with_selected_meta`

```{bash}
pmotools-runner.py extract_pmo_with_selected_meta -h 
```

The Python code for the `extract_pmo_with_selected_meta` script is below

```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_selected_meta.py
#| file: ../pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_selected_meta.py
```

`pmotools-runner.py extract_pmo_with_selected_meta` allows extraction on multiple intersecting meta field requirements, which can be supplied either in a file or as a delimited string on the command line.

You may also want to know which meta fields are present and how many samples fall under each value. This can be done with [`pmotools-runner.py list_specimen_meta_fields`](getting_basic_info_from_pmo.qmd#list_specimen_meta_fields) and `pmotools-runner.py count_specimen_meta`

```{bash, eval = F}
cd example 
wget https://plasmogenepi.github.io/PMO_Docs/format/PathWeaverHeome1_PMO.json.gz

```

```{bash}
cd example 
pmotools-runner.py list_specimen_meta_fields --file ../../format/PathWeaverHeome1_PMO.json.gz

```

```{bash}
cd example 
pmotools-runner.py count_specimen_meta --file ../../format/PathWeaverHeome1_PMO.json.gz  --meta_fields collection_country,collection_date | head -20

```
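Conceptually, counting samples per combination of meta field values is a grouped tally. The sketch below shows the idea on a few made-up specimen records using `collections.Counter`. The records and the helper `count_meta_combinations` are hypothetical; this is not how `count_specimen_meta` reads a PMO or formats its output.

```python
from collections import Counter

# Made-up specimen records for illustration only.
specimens = [
    {"specimen_id": "s1", "collection_country": "Bangladesh", "collection_date": "2015"},
    {"specimen_id": "s2", "collection_country": "Bangladesh", "collection_date": "2015"},
    {"specimen_id": "s3", "collection_country": "Benin", "collection_date": "2016"},
]


def count_meta_combinations(records, meta_fields):
    """Tally records by their combination of values for the given meta fields."""
    # records missing a field are grouped under the placeholder "NA"
    return Counter(
        tuple(rec.get(field, "NA") for field in meta_fields) for rec in records
    )


counts = count_meta_combinations(specimens, ["collection_country", "collection_date"])
for combo, n in sorted(counts.items()):
    print("\t".join(combo), n, sep="\t")
```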


To extract on a single meta field, supply `field=value`. The command below extracts only the samples that have collection_country=Bangladesh

```{bash, eval = F}
cd example 
pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh" --output Bangladesh_moz2018_PMO.json.gz   --overwrite

```

If you want to see how many samples were extracted, add `--verbose`

```{bash}
cd example 
pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh" --output Bangladesh_moz2018_PMO.json.gz   --overwrite --verbose

```

To match more than one value for a field, separate the values with commas. For example, to extract samples from both Bangladesh and Benin

```{bash}
cd example 

pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh,Benin" --output Bangladesh_Benin_moz2018_PMO.json.gz   --overwrite --verbose 

```

You can add criteria on additional meta fields by separating the fields with a `:`. For example, to extract samples with a collection_country of Bangladesh or Benin and a collection_date of 2016

```{bash}
cd example 

pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh,Benin:collection_date=2016" --output Bangladesh_Benin_2016_moz2018_PMO.json.gz   --overwrite --verbose 

```

To be more specific, you can group meta field extraction criteria by separating the groups with a `;`. For example, to extract samples from Bangladesh collected in 2015 together with samples from Benin collected in 2016

```{bash}
cd example 

pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues "collection_country=Bangladesh:collection_date=2015;collection_country=Benin:collection_date=2016" --output Bangladesh2015_Benin2016_moz2018_PMO.json.gz   --overwrite --verbose 

```
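In the examples above, `,` separates alternative values for a field, `:` separates fields that must all match, and `;` separates independent groups. The sketch below shows one way that syntax could be interpreted. It is an illustration only: `parse_criteria` and `matches` are hypothetical helpers, exact string equality is an assumption, and the records are made up. The real script may parse and match differently.

```python
def parse_criteria(spec):
    """Parse 'f1=a,b:f2=c;f1=d' into a list of {field: set(values)} groups."""
    groups = []
    for group in spec.split(";"):
        fields = {}
        for field_spec in group.split(":"):
            name, values = field_spec.split("=", 1)
            fields[name] = set(values.split(","))
        groups.append(fields)
    return groups


def matches(record, groups):
    """A record matches if, for at least one group, every field has an allowed value."""
    return any(
        all(record.get(name) in allowed for name, allowed in fields.items())
        for fields in groups
    )


criteria = parse_criteria(
    "collection_country=Bangladesh:collection_date=2015;"
    "collection_country=Benin:collection_date=2016"
)

# Made-up records for illustration only.
records = [
    {"specimen_id": "s1", "collection_country": "Bangladesh", "collection_date": "2015"},
    {"specimen_id": "s2", "collection_country": "Bangladesh", "collection_date": "2016"},
    {"specimen_id": "s3", "collection_country": "Benin", "collection_date": "2016"},
]
selected = [rec["specimen_id"] for rec in records if matches(rec, criteria)]
print(selected)
```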


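Rather than supplying the criteria on the command line, you can write them to a tab-delimited file with the columns `group`, `field`, and `values`, and pass that file to `--metaFieldsValues`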

```{bash}
cd example 

echo -e "group\tfield\tvalues" > Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Bangladesh2015\tcollection_country\tBangladesh" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Bangladesh2015\tcollection_date\t2015" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Benin2016\tcollection_country\tBenin" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 
echo -e "Benin2016\tcollection_date\t2016" >> Bangladesh2015_Benin2016_extractionCriteria.tsv 


pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/PathWeaverHeome1_PMO.json.gz --metaFieldsValues Bangladesh2015_Benin2016_extractionCriteria.tsv --output Bangladesh2015_Benin2016_moz2018_PMO.json.gz   --overwrite --verbose 

```
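If the criteria are assembled in Python rather than in the shell, the same tab-delimited file can be written with the standard `csv` module. This is a minimal sketch that reproduces the file built by the `echo` commands above; `write_criteria_tsv` is a hypothetical helper name, not part of pmotools-python.

```python
import csv

# (group, field, values) rows matching the echo commands above
rows = [
    ("Bangladesh2015", "collection_country", "Bangladesh"),
    ("Bangladesh2015", "collection_date", "2015"),
    ("Benin2016", "collection_country", "Benin"),
    ("Benin2016", "collection_date", "2016"),
]


def write_criteria_tsv(path, rows):
    """Write the header and criteria rows as a tab-delimited file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["group", "field", "values"])
        writer.writerows(rows)


write_criteria_tsv("Bangladesh2015_Benin2016_extractionCriteria.tsv", rows)
```

The resulting file can then be passed to `--metaFieldsValues` exactly as in the command above.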



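The same grouping syntax works with other PMO files and meta fields. The example below extracts Mozambique samples with a geo_admin3 of Inhassoro, Mandlakazi, or Namaacha

```{bash}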
cd example 
pmotools-runner.py extract_pmo_with_selected_meta  --file ../../format/moz2018_PMO.json.gz --metaFieldsValues "collection_country=Mozambique:geo_admin3=Inhassoro;collection_country=Mozambique:geo_admin3=Mandlakazi,Namaacha" --output Mozambique_moz2018_PMO.json.gz  --verbose  --overwrite

```


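## Subsetting by a read filter for detected microhaplotypes

To keep only the detected microhaplotypes that meet a minimum read count, use `pmotools-runner.py extract_pmo_with_read_filter` with `--read_count_minimum`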



```{bash}
pmotools-runner.py extract_pmo_with_read_filter -h
```

The Python code for the `extract_pmo_with_read_filter` script is below

```{python}
#| echo: true
#| eval: false
#| code-fold: true
#| code-line-numbers: true
#| filename: pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_read_filter.py
#| file: ../pmotools-python/scripts/extractors_from_pmo/extract_pmo_with_read_filter.py
```

```{bash}
cd example 

pmotools-runner.py extract_pmo_with_read_filter --read_count_minimum 1000 --file ../../format/moz2018_PMO.json.gz --output moz2018_PMO_minReadCount1000.json.gz --overwrite
```
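The idea behind a read count filter can be sketched on a toy structure. Everything here is an assumption made for illustration: the nested sample/target layout, the helper name `filter_by_read_count`, the inclusive threshold, and the choice to drop targets and samples left with no microhaplotypes. The actual PMO layout and the script's behavior may differ.

```python
def filter_by_read_count(detected, read_count_minimum):
    """Keep microhaplotypes with read_count >= read_count_minimum (assumed inclusive)."""
    filtered = {}
    for sample, targets in detected.items():
        kept_targets = {}
        for target, haps in targets.items():
            kept = [hap for hap in haps if hap["read_count"] >= read_count_minimum]
            if kept:
                kept_targets[target] = kept
        # samples left with no targets are dropped in this sketch
        if kept_targets:
            filtered[sample] = kept_targets
    return filtered


# Toy data: sample -> target -> list of detected microhaplotypes (hypothetical layout).
detected = {
    "sampleA": {
        "t1": [{"hap": "t1.0", "read_count": 2500}, {"hap": "t1.1", "read_count": 300}],
        "t20": [{"hap": "t20.0", "read_count": 120}],
    },
    "sampleB": {"t1": [{"hap": "t1.0", "read_count": 999}]},
}
filtered = filter_by_read_count(detected, 1000)
print(filtered)
```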


## Piping together extraction 

The extraction commands can also read from STDIN and write to STDOUT, so they can be piped together. For example:

```{bash}
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-runner.py extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output t1_t20_t31_8025875029_8034209834_8034209115_moz2018_PMO.json.gz --overwrite

```
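A similar chain can be driven from Python with the standard `subprocess` module by passing each command's output as the next command's input. The sketch below is a generic illustration, not part of pmotools-python: `run_chain` is a hypothetical helper, it buffers each intermediate result in memory instead of streaming, and the demo uses small stand-in Python commands so that it runs without pmotools installed.

```python
import subprocess
import sys


def run_chain(commands, input_bytes=b""):
    """Run commands in order, feeding each one's stdout to the next one's stdin."""
    data = input_bytes
    for cmd in commands:
        # check=True raises CalledProcessError if any step exits non-zero
        result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data


# Stand-in commands: sort the input lines, then keep only the first line.
sort_lines = [sys.executable, "-c", "import sys; sys.stdout.writelines(sorted(sys.stdin))"]
first_line = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.readline())"]

out = run_chain([sort_lines, first_line], b"8034209834\n8025875029\n")
print(out.decode().strip())
```

In place of the stand-ins, each entry in `commands` would be an argument list such as the `pmotools-runner.py` invocations above, with `--file STDIN` and `--output STDOUT` on the intermediate steps.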

You can also pipe into other `pmotools-runner.py` commands, such as extracting allele tables

```{bash}
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-runner.py extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output STDOUT | pmotools-runner.py extract_allele_table --file STDIN --bioid Mozambique2018-SeekDeep --output alleles_data_t1_t20_t31_8025875029_8034209834_8034209115_moz2018_PMO.tsv.gz --overwrite

```


You can write the final output to STDOUT as well for further processing

```{bash}
cd example 

echo -e "8025875029\n8034209834\n8034209115" | pmotools-runner.py extract_pmo_with_select_experiment_sample_ids --experiment_sample_ids STDIN --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-runner.py extract_pmo_with_select_targets --file STDIN --targets t1,t20,t31  --output STDOUT | pmotools-runner.py extract_allele_table --file STDIN --bioid Mozambique2018-SeekDeep --output STDOUT  --specimen_info_meta_fields specimen_id,collection_country

```

You can also filter by a minimum read count and then write an allele table

```{bash}
cd example 

pmotools-runner.py extract_pmo_with_read_filter --read_count_minimum 1000 --file ../../format/moz2018_PMO.json.gz --output STDOUT | pmotools-runner.py extract_allele_table --file STDIN --bioid Mozambique2018-SeekDeep --output moz2018_PMO_minReadCount1000_allele_table.tsv.gz  --microhap_fields read_count --representative_haps_fields seq --default_base_col_names specimen_id,target_id,allele --overwrite

```