pmotools.pmo_builder.mhap_table_to_pmo module

pmotools.pmo_builder.mhap_table_to_pmo.build_detected_mhap_dict(df, bioinformatics_run_name, mhap_cols, always_include=None)[source]

pmotools.pmo_builder.mhap_table_to_pmo.create_detected_microhaplotype_dict(microhaplotype_table: DataFrame, representative_microhaplotype_dict: dict, bioinformatics_run_name: str | None = None, library_sample_name_col: str = 'library_sample_name', target_name_col: str = 'target_name', seq_col: str = 'seq', reads_col: str = 'reads', umis_col: str | None = None, additional_mhap_detected_cols: list | None = None)[source]

Convert the read-in microhaplotype calls table into the detected microhaplotype dictionary.

Parameters:

microhaplotype_table – Parsed microhaplotype calls table.
representative_microhaplotype_dict – Dictionary of representative microhaplotypes.
bioinformatics_run_name – Optional. Unique name for the bioinformatics run that generated the data.
library_sample_name_col – Column containing the sample IDs.
target_name_col – Column containing the locus IDs.
seq_col – Column containing the microhaplotype sequences.
reads_col – Column containing the read counts.
umis_col – Optional. Column containing the unique molecular identifier (UMI) count associated with each microhaplotype.
additional_mhap_detected_cols – Optional. Additional columns to add to the detected microhaplotypes; the key is the pandas column and the value is the name to give it in the output.

Returns:

A dictionary of detected microhaplotype results.

pmotools.pmo_builder.mhap_table_to_pmo.create_minimum_library_specimen_dict_from_mhap_table(detected_microhaps: list[dict], panel_name: str, library_sample_field_name: str = 'library_sample_name', library_sample_specimen_key: dict[str, str] | DataFrame | None = None, library_sample_name_col: str = 'library_sample_name', specimen_name_col: str = 'specimen_name', missing_library_sample_becomes_specimen_name: bool = False)[source]

Create minimal library_sample_info and specimen_info dicts from the detected microhaplotypes.

Parameters:

detected_microhaps – the detected microhaps object created by create_detected_microhaplotype_dict
panel_name – the panel_name for the library_sample
library_sample_field_name – the field name used to extract the library_sample_name from the detected microhaplotypes
library_sample_specimen_key – a dict mapping library_sample_name -> specimen_name, or a pandas DataFrame with two columns, named by library_sample_name_col and specimen_name_col, that supplies the same mapping; if None, specimen_name == library_sample_name
library_sample_name_col – the column name in library_sample_specimen_key that contains the library_sample_name
specimen_name_col – the column name in library_sample_specimen_key that contains the specimen_name
missing_library_sample_becomes_specimen_name – if True and a library_sample_name is missing from library_sample_specimen_key, fall back to using the library_sample_name as the specimen_name; if False, raise an error

Returns:

dict with keys ‘library_sample_info’ and ‘specimen_info’
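
The mapping rules described for library_sample_specimen_key and missing_library_sample_becomes_specimen_name can be illustrated with a small standalone sketch. This is not the pmotools implementation: the helper name resolve_specimen_names is hypothetical, only the dict form of the key is handled, and the use of ValueError is an assumption, since the documentation says only that an error is raised.

```python
def resolve_specimen_names(library_sample_names, key=None, fallback=False):
    """Hypothetical helper mirroring the documented rules (not pmotools code)."""
    resolved = {}
    for name in library_sample_names:
        if key is None:
            # No key supplied: specimen_name == library_sample_name.
            resolved[name] = name
        elif name in key:
            # The key provides an explicit specimen_name.
            resolved[name] = key[name]
        elif fallback:
            # Mirrors missing_library_sample_becomes_specimen_name=True.
            resolved[name] = name
        else:
            # Assumed error type; the documentation does not name one.
            raise ValueError(f"no specimen_name for library sample {name!r}")
    return resolved


key = {"lib_A": "specimen_1"}
print(resolve_specimen_names(["lib_A", "lib_B"], key, fallback=True))
```

With fallback left at False, the same call would raise, because lib_B has no entry in the key.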

pmotools.pmo_builder.mhap_table_to_pmo.create_representative_microhaplotype_dict(microhaplotype_table: DataFrame, target_name_col: str = 'target_name', seq_col: str = 'seq', genome_id: int = 0, chrom_col: str | None = None, start_col: str | None = None, end_col: str | None = None, ref_seq_col: str | None = None, strand_col: str | None = None, alt_annotations_col: str | None = None, masking_seq_start_col: str | None = None, masking_seq_segment_size_col: str | None = None, masking_replacement_size_col: str | None = None, masking_delim: str = ',', microhaplotype_name_col: str | None = None, pseudocigar_col: str | None = None, quality_col: str | None = None, additional_representative_mhap_cols: list[str] | None = None)[source]

Convert the read-in microhaplotype calls table into a representative microhaplotype JSON-like dictionary.

Parameters:

microhaplotype_table (pd.DataFrame) – the dataframe containing microhaplotype calls
target_name_col (str) – the name of the column containing the targets. Default: target_name
seq_col (str) – the name of the column containing the microhaplotype sequences. Default: seq
genome_id (int) – the ID of the genome used as reference. Default: 0
chrom_col (str, optional) – the name of the column containing the chromosome name of the microhaplotype
start_col (str, optional) – the name of the column containing the start of the microhaplotype
end_col (str, optional) – the name of the column containing the end of the microhaplotype
ref_seq_col (str, optional) – the name of the column containing the reference sequence for the microhaplotype
strand_col (str, optional) – the name of the column containing the strand of the microhaplotype
alt_annotations_col (str, optional) – the name of the column containing any alternative annotations
masking_seq_start_col (str, optional) – the name of the column containing a list of start positions for masking
masking_seq_segment_size_col (str, optional) – the name of the column containing a list of lengths of the segments in seq being masked
masking_replacement_size_col (str, optional) – the name of the column containing a list of lengths of the masking replacements
masking_delim (str, optional) – delimiter of the masking information. Default: ‘,’
microhaplotype_name_col (str, optional) – the name of the column containing an optional name for this microhaplotype
pseudocigar_col (str, optional) – the name of the column containing a pseudocigar for the microhaplotype
quality_col (str, optional) – the name of the column containing the ASCII-encoded FASTQ per-base quality scores for this sequence
additional_representative_mhap_cols (list of str, optional) – additional columns to add to the representative microhaplotypes table

Returns:

a dictionary formatted for JSON output with representative microhaplotype sequences

Return type:

dict
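
The three masking columns each hold a delimited list, split on masking_delim. The sketch below shows one way such a row could be read into integer triples. It is an illustration only: parse_masking is a hypothetical helper, and the assumption that the three lists line up position by position is not stated in the documentation above.

```python
def parse_masking(starts, segment_sizes, replacement_sizes, delim=","):
    """Hypothetical helper: split three delimited strings into aligned int triples."""
    columns = [
        [int(value) for value in text.split(delim)]
        for text in (starts, segment_sizes, replacement_sizes)
    ]
    # Assumption: every masked region has one entry in each of the three lists.
    if len({len(column) for column in columns}) != 1:
        raise ValueError("masking columns must list the same number of entries")
    # Each tuple is (start, segment size in seq, replacement size).
    return list(zip(*columns))


print(parse_masking("10,42", "5,3", "5,1"))
```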

pmotools.pmo_builder.mhap_table_to_pmo.get_mhap_index_in_representative_mhaps(df, representative_dict)[source]

pmotools.pmo_builder.mhap_table_to_pmo.get_target_id_in_representative_mhaps(df, representative_dict)[source]

pmotools.pmo_builder.mhap_table_to_pmo.mhap_table_to_pmo(microhaplotype_table: DataFrame, bioinformatics_run_name: str | None = None, library_sample_name_col: str = 'library_sample_name', target_name_col: str = 'target_name', seq_col: str = 'seq', reads_col: str = 'reads', genome_id: int = 0, umis_col: str | None = None, chrom_col: str | None = None, start_col: str | None = None, end_col: str | None = None, ref_seq_col: str | None = None, strand_col: str | None = None, alt_annotations_col: str | None = None, masking_seq_start_col: str | None = None, masking_seq_segment_size_col: str | None = None, masking_replacement_size_col: str | None = None, masking_delim: str = ',', microhaplotype_name_col: str | None = None, pseudocigar_col: str | None = None, quality_col: str | None = None, additional_representative_mhap_cols: list | None = None, additional_mhap_detected_cols: list | None = None)[source]

Convert a dataframe of microhaplotype calls into a dictionary that holds both the haplotypes_detected dictionary and the representative_haplotype_sequences dictionary.

Parameters:

microhaplotype_table (pd.DataFrame) – the dataframe containing microhaplotype calls
bioinformatics_run_name (str, optional) – unique name for the bioinformatics run that generated the data (column name or individual run name). Default: None
library_sample_name_col (str) – the name of the column containing the library sample names. Default: library_sample_name
target_name_col (str) – the name of the column containing the targets. Default: target_name
seq_col (str) – the name of the column containing the microhaplotype sequences. Default: seq
reads_col (str) – the name of the column containing the read counts. Default: reads
genome_id (int, optional) – the ID of the genome used as reference. Default: 0
umis_col (str, optional) – the name of the column with the unique molecular identifier count associated with this microhaplotype
chrom_col (str, optional) – the name of the column containing the chromosome name of the microhaplotype
start_col (str, optional) – the name of the column containing the start of the microhaplotype
end_col (str, optional) – the name of the column containing the end of the microhaplotype
ref_seq_col (str, optional) – the name of the column containing the reference sequence for the microhaplotype
strand_col (str, optional) – the name of the column containing the strand of the microhaplotype
alt_annotations_col (str, optional) – the name of the column containing any alternative annotations
masking_seq_start_col (str, optional) – the name of the column containing a list of start positions for masking
masking_seq_segment_size_col (str, optional) – the name of the column containing a list of lengths of the segments in seq being masked
masking_replacement_size_col (str, optional) – the name of the column containing a list of lengths of the masking replacements
masking_delim (str, optional) – delimiter of the masking information. Default: ‘,’
microhaplotype_name_col (str, optional) – the name of the column containing an optional name for this microhaplotype
pseudocigar_col (str, optional) – the name of the column containing a pseudocigar for the microhaplotype
quality_col (str, optional) – the name of the column containing the ASCII-encoded FASTQ per-base quality scores for this sequence
additional_representative_mhap_cols (list of str, optional) – additional columns to add to the representative microhaplotypes table
additional_mhap_detected_cols (list of str, optional) – additional columns to add to the detected microhaplotypes table

Returns:

a dict of both the haplotypes_detected and representative_haplotype_sequences

Return type:

dict
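
As a rough usage sketch, a table whose columns do not use the default names can be passed by pointing the *_col parameters at the actual column names. The column names, sample values, run name, and the wrapper build_pmo_tables below are all hypothetical; only the keyword names come from the signature above. The wrapper imports pandas and pmotools lazily, so both must be installed before it is called.

```python
# Rows of a hypothetical microhaplotype calls table with non-default column names.
records = [
    {"sample": "lib_A", "locus": "target_1", "asv": "ACGTACGT", "read_count": 120},
    {"sample": "lib_A", "locus": "target_2", "asv": "TTGACCAA", "read_count": 45},
    {"sample": "lib_B", "locus": "target_1", "asv": "ACGTACGA", "read_count": 98},
]

# Map the documented column parameters onto the names used in `records`.
column_kwargs = {
    "library_sample_name_col": "sample",
    "target_name_col": "locus",
    "seq_col": "asv",
    "reads_col": "read_count",
}


def build_pmo_tables(rows, run_name):
    """Hypothetical wrapper around mhap_table_to_pmo (needs pandas and pmotools)."""
    import pandas as pd
    from pmotools.pmo_builder.mhap_table_to_pmo import mhap_table_to_pmo

    table = pd.DataFrame.from_records(rows)
    return mhap_table_to_pmo(
        table,
        bioinformatics_run_name=run_name,
        **column_kwargs,
    )


# Example call (not executed here): build_pmo_tables(records, "example_run")
```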