pytximport.utils

Utility functions for converting data, creating maps and filtering data.

Most functions in this module are intended primarily for internal use but are exposed for advanced users who want to call them directly.

Functions

`convert_abundance_to_counts`(counts, abundance, length, ...)	Convert transcript-level abundance to counts, either as TPM or TPM scaled by the length.
`convert_counts_to_tpm`(counts, length)	Convert transcript-level counts to TPM.
`convert_transcripts_to_genes`(transcript_data, ...[, ...])	Convert transcript-level expression to gene-level expression.
`create_transcript_gene_map`([species, host, ...])	Create a mapping from transcript ids to gene ids using the Ensembl Biomart.
`create_transcript_gene_map_from_annotation`(file_path)	Create a mapping from transcript ids to gene ids using a GTF annotation file.
`filter_by_biotype`(transcript_data[, ...])	Filter the transcripts by biotype.
`get_median_length_over_isoform`(transcript_data, ...)	Get the median length of the gene over all isoforms.
`remove_transcript_version`(transcript_data[, ...])	Remove the transcript version from the transcript data and the transcript target map.
`replace_missing_average_transcript_length`(length, ...)	Replace missing mean transcript length at the sample level with the gene mean across samples.
`replace_transcript_ids_with_names`(transcript_data, ...)	Replace transcript IDs with transcript names.
`summarize_rsem_gene_data`(file_paths, importer, ...[, ...])	Summarize gene-level RSEM quantification files.

Package Contents

pytximport.utils.convert_abundance_to_counts(counts, abundance, length, counts_from_abundance)

Convert transcript-level abundance to counts, either as TPM or TPM scaled by the length.

Parameters:

counts (DataArray) – The original counts.
abundance (DataArray) – The transcript-level abundance.
length (DataArray) – The length of the transcripts.
counts_from_abundance (Literal["scaled_tpm", "length_scaled_tpm"]) – The type of counts to convert to.

Returns:

The transcript-level expression data with the counts.

Return type:

DataArray
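
The two modes can be illustrated with a minimal sketch. It follows the approach of the R tximport package (scale the abundance, optionally weighted by the mean transcript length across samples, back to each sample's original library size); that pytximport does exactly the same is an assumption here. The helper names and the nested-list matrices are hypothetical stand-ins for the DataArray inputs.

```python
from typing import List

Matrix = List[List[float]]  # rows are transcripts, columns are samples


def column_sums(matrix: Matrix) -> List[float]:
    """Sum each sample (column) of a transcript-by-sample matrix."""
    return [sum(column) for column in zip(*matrix)]


def counts_from_abundance_sketch(
    counts: Matrix, abundance: Matrix, length: Matrix, mode: str
) -> Matrix:
    """Hypothetical re-implementation, not the library's code."""
    if mode == "length_scaled_tpm":
        # Weight each transcript's abundance by its mean length across samples.
        mean_length = [sum(row) / len(row) for row in length]
        new_counts = [[a * m for a in row] for row, m in zip(abundance, mean_length)]
    elif mode == "scaled_tpm":
        new_counts = [list(row) for row in abundance]
    else:
        raise ValueError(f"Unknown mode: {mode}")

    # Rescale every sample so its total matches the original library size.
    scale = [o / n for o, n in zip(column_sums(counts), column_sums(new_counts))]
    return [[value * s for value, s in zip(row, scale)] for row in new_counts]


counts = [[10.0, 20.0], [30.0, 60.0]]
abundance = [[400000.0, 500000.0], [600000.0, 500000.0]]
length = [[1000.0, 1000.0], [2000.0, 2000.0]]

scaled = counts_from_abundance_sketch(counts, abundance, length, "scaled_tpm")
length_scaled = counts_from_abundance_sketch(counts, abundance, length, "length_scaled_tpm")
```

In both modes the per-sample totals of the result equal the per-sample totals of the original counts; only the distribution across transcripts changes.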

pytximport.utils.convert_counts_to_tpm(counts, length)

Convert transcript-level counts to TPM.

Parameters:

counts (NDArray) – The transcript-level counts.
length (NDArray) – The length of the transcripts.

Returns:

The transcript-level expression data with the TPM.

Return type:

NDArray
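
As a rough illustration, the conversion can be sketched with the textbook TPM definition: divide counts by length, then rescale so the sample sums to one million. The function name is hypothetical and plain lists stand in for the NDArray inputs; the library's implementation may differ in detail.

```python
from typing import List


def counts_to_tpm_sketch(counts: List[float], length: List[float]) -> List[float]:
    """Convert one sample's counts to TPM (hypothetical helper)."""
    # Reads per base: normalise each count by its transcript length.
    rate = [c / l for c, l in zip(counts, length)]
    total = sum(rate)
    # Rescale so that the values of the sample sum to one million.
    return [r / total * 1e6 for r in rate]


tpm = counts_to_tpm_sketch(
    counts=[10.0, 20.0, 30.0],
    length=[1000.0, 2000.0, 1500.0],
)
```

The first two transcripts have the same count-to-length ratio and therefore the same TPM, even though their raw counts differ.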

pytximport.utils.convert_transcripts_to_genes(transcript_data, transcript_gene_map, counts_from_abundance=None)

Convert transcript-level expression to gene-level expression.

Parameters:

transcript_data (xr.Dataset) – The transcript-level expression data from multiple samples.
transcript_gene_map (pd.DataFrame) – The mapping from transcripts to genes. Contains two columns: transcript_id and gene_id.
counts_from_abundance (Optional[Literal["scaled_tpm", "length_scaled_tpm"]], optional) – The type of counts to convert to. Defaults to None.

Returns:

The gene-level expression data from multiple samples.

Return type:

xr.Dataset
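
The core idea of the aggregation, summing the values of all transcripts that map to the same gene, can be sketched as follows. This is a simplified illustration for counts only, using plain dictionaries instead of an xr.Dataset and a pd.DataFrame; the helper name and the identifiers are hypothetical, and skipping unmapped transcripts is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict


def sum_by_gene(
    transcript_counts: Dict[str, float],
    transcript_to_gene: Dict[str, str],
) -> Dict[str, float]:
    """Sum transcript-level values per gene (hypothetical helper)."""
    gene_counts: Dict[str, float] = defaultdict(float)
    for transcript_id, value in transcript_counts.items():
        gene_id = transcript_to_gene.get(transcript_id)
        if gene_id is None:
            # Transcripts without a mapping are skipped in this sketch.
            continue
        gene_counts[gene_id] += value
    return dict(gene_counts)


gene_counts = sum_by_gene(
    {"tx1": 5.0, "tx2": 7.0, "tx3": 1.0, "tx4": 2.0},
    {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"},
)
```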

pytximport.utils.create_transcript_gene_map(species='human', host='http://www.ensembl.org', source_field='ensembl_transcript_id', target_field='ensembl_gene_id', rename_columns=True, **kwargs)

Create a mapping from transcript ids to gene ids using the Ensembl Biomart.

Warning

Choosing any target_field value other than ensembl_gene_id may not result in a full transcript to gene map since not all transcripts may have the respective variable. While this does not typically affect well defined transcripts, be aware of this possible source of bias.

Basic example:

from pytximport.utils import create_transcript_gene_map

transcript_gene_map = create_transcript_gene_map(
    species="human",
    host="https://may2024.archive.ensembl.org/",  # Use a specific Ensembl release
    target_field="external_gene_name",
)

# or get multiple fields
transcript_gene_map = create_transcript_gene_map(
    species="mouse",
    target_field=["external_gene_name", "gene_biotype"],
)

Parameters:

species (Literal["human", "mouse"], optional) – The species to use. Defaults to “human”.
host (str, optional) – The host to use. Defaults to “http://www.ensembl.org”.
source_field (Literal["ensembl_transcript_id", "external_transcript_name"], optional) – The identifier to get for each transcript id. Defaults to “ensembl_transcript_id”.
target_field (Union[Literal["ensembl_gene_id", "external_gene_name", "external_transcript_name", "gene_biotype"], List[Literal["ensembl_gene_id", "external_gene_name", "external_transcript_name", "gene_biotype"]]], optional) – The corresponding identifier to get for each transcript. Defaults to “ensembl_gene_id”.
rename_columns (bool, optional) – Whether to rename ensembl_transcript_id to transcript_id, ensembl_gene_id to gene_id, external_gene_name to gene_name if the gene id is also present or gene_id if no other gene id is present, and external_transcript_name to transcript_name. Defaults to True.
**kwargs – Additional arguments to pass to the function.

Keyword Arguments:

field (str, optional) – The field to use for the mapping. Deprecated. Use source_field and target_field instead.

Returns:

The mapping from transcript ids to gene ids.

Return type:

pd.DataFrame

pytximport.utils.create_transcript_gene_map_from_annotation(file_path, source_field='transcript_id', target_field='gene_id', use_transcript_name_as_replacement_id=True, use_gene_name_as_replacement_id=True, chunk_size=100000, **kwargs)

Create a mapping from transcript ids to gene ids using a GTF annotation file.

Basic example:

from pytximport.utils import create_transcript_gene_map_from_annotation

# Create a mapping from transcript ids to gene names
transcript_gene_map = create_transcript_gene_map_from_annotation(
    "path/to/annotation.gtf",
    target_field="gene_name",
)

# Create a mapping from transcript ids to transcript names and include the gene biotype
transcript_gene_map = create_transcript_gene_map_from_annotation(
    "path/to/annotation.gtf",
    target_field=["transcript_name", "gene_biotype"],
)

Parameters:

file_path (Union[str, Path]) – The path to the GTF annotation file.
source_field (Literal["transcript_id", "transcript_name"], optional) – The identifier to get for each transcript id. Defaults to “transcript_id”.
target_field (Union[Literal["gene_id", "gene_name", "gene_biotype", "transcript_name"], List[Literal["gene_id", "gene_name", "gene_biotype", "transcript_name"]]], optional) – The corresponding identifier(s) to get for each transcript. Defaults to “gene_id”.
use_transcript_name_as_replacement_id (bool, optional) – Whether to use the transcript name as the transcript id if the transcript id is missing. Defaults to True.
use_gene_name_as_replacement_id (bool, optional) – Whether to use the gene name as the gene id if the gene id is missing. Defaults to True.
chunk_size (int, optional) – The number of lines to read at a time. Defaults to 100000.
**kwargs – Additional arguments to pass to the function.

Keyword Arguments:

field (str, optional) – The field to use for the mapping. Deprecated. Use source_field and target_field instead.
keep_biotype (bool, optional) – Whether to keep the gene_biotype column. Deprecated. Use target_field instead.

Returns:

The mapping from transcript ids to gene ids.

Return type:

pd.DataFrame
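
To make the mapping step concrete, the sketch below extracts a transcript-to-gene mapping from GTF text by parsing the attribute column. It is an illustration only: the helper name, the restriction to "transcript" feature rows, and the identifiers are assumptions, and the real function additionally handles chunked reading, replacement ids, and multiple target fields.

```python
import re
from typing import Dict

# GTF attributes look like: gene_id "X"; transcript_id "Y";
ATTRIBUTE_PATTERN = re.compile(r'(\w+) "([^"]*)"')


def parse_gtf_mapping(
    gtf_text: str,
    source_field: str = "transcript_id",
    target_field: str = "gene_id",
) -> Dict[str, str]:
    """Map source_field to target_field for each transcript row (hypothetical helper)."""
    mapping: Dict[str, str] = {}
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        columns = line.split("\t")
        if len(columns) < 9 or columns[2] != "transcript":
            continue  # only transcript features are used in this sketch
        attributes = dict(ATTRIBUTE_PATTERN.findall(columns[8]))
        if source_field in attributes and target_field in attributes:
            mapping[attributes[source_field]] = attributes[target_field]
    return mapping


# Hypothetical two-transcript annotation.
example_gtf = "\n".join(
    [
        "#!genome-build example",
        "\t".join(["1", "src", "gene", "1", "900", ".", "+", ".", 'gene_id "ENSG0001"; gene_name "ABC";']),
        "\t".join(["1", "src", "transcript", "1", "900", ".", "+", ".", 'gene_id "ENSG0001"; transcript_id "ENST0001"; gene_name "ABC";']),
        "\t".join(["1", "src", "transcript", "1", "500", ".", "+", ".", 'gene_id "ENSG0001"; transcript_id "ENST0002"; gene_name "ABC";']),
    ]
)

id_map = parse_gtf_mapping(example_gtf)
name_map = parse_gtf_mapping(example_gtf, target_field="gene_name")
```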

pytximport.utils.filter_by_biotype(transcript_data, transcript_gene_map=None, biotype_filter=None, id_column='transcript_id', recalculate_abundance=False)

Filter the transcripts by biotype.

This function filters the transcripts by biotype. The biotype is assumed to be present in the transcript_id separated by a bar. The biotype is checked against the biotype_filter and the transcripts that match the biotype are kept. This function is provided mainly for internal use if biotype_filter is provided to the main function.

Parameters:

transcript_data (Union[xr.Dataset, ad.AnnData]) – The expression data.
transcript_gene_map (Union[pd.DataFrame, Path, str], optional) – The mapping from transcript to gene with the gene_biotype column. If None, the biotype is assumed to be present in the id_column. Defaults to None.
biotype_filter (Optional[List[str]], optional) – The biotypes to keep. Defaults to None.
id_column (str, optional) – The column name for the transcript/gene ID. Defaults to “transcript_id”.
recalculate_abundance (bool, optional) – Whether to recalculate the abundance after filtering. This converts the abundance to TPM of the remaining transcripts but has implications for how the abundance can be used statistically. Defaults to False.

Returns:

The expression data filtered by biotype.

Return type:

Union[xr.Dataset, ad.AnnData]
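
When no transcript_gene_map is given, the biotype is read from the bar-separated id. The sketch below shows one way such a check could work, keeping an id if any of its bar-separated fields matches a requested biotype. The helper name, the matching rule, and the identifiers are assumptions for illustration, not the library's implementation.

```python
from typing import List


def keep_by_biotype(transcript_ids: List[str], biotype_filter: List[str]) -> List[str]:
    """Keep ids whose bar-separated fields contain a wanted biotype (hypothetical helper)."""
    wanted = set(biotype_filter)
    return [
        transcript_id
        for transcript_id in transcript_ids
        if wanted.intersection(transcript_id.split("|"))
    ]


# Hypothetical ids in a bar-separated style.
ids = [
    "ENST0001|ENSG0001|ABC-201|protein_coding|",
    "ENST0002|ENSG0002|XYZ-201|lncRNA|",
    "ENST0003|ENSG0003|DEF-201|protein_coding|",
]

kept = keep_by_biotype(ids, ["protein_coding"])
```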

pytximport.utils.get_median_length_over_isoform(transcript_data, transcript_gene_map)

Get the median length of the gene over all isoforms.

Parameters:

transcript_data (xr.Dataset) – The transcript data containing the length of the transcripts.
transcript_gene_map (pd.DataFrame) – The mapping of transcripts to genes.

Returns:

The updated transcript data with the median gene length contained in the median_isoform_length variable.

Return type:

xr.Dataset
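
The underlying computation, the median of the isoform lengths per gene, can be sketched with plain dictionaries. The helper name and the identifiers are hypothetical; the actual function operates on an xr.Dataset and stores the result in the dataset.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List


def median_length_by_gene(
    transcript_length: Dict[str, float],
    transcript_to_gene: Dict[str, str],
) -> Dict[str, float]:
    """Median transcript length per gene over all its isoforms (hypothetical helper)."""
    lengths: Dict[str, List[float]] = defaultdict(list)
    for transcript_id, length in transcript_length.items():
        lengths[transcript_to_gene[transcript_id]].append(length)
    return {gene_id: median(values) for gene_id, values in lengths.items()}


median_lengths = median_length_by_gene(
    {"tx1": 1000.0, "tx2": 2000.0, "tx3": 4000.0, "tx4": 500.0},
    {"tx1": "geneA", "tx2": "geneA", "tx3": "geneA", "tx4": "geneB"},
)
```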

pytximport.utils.remove_transcript_version(transcript_data, transcript_target_map=None, transcript_ids=None, id_column='transcript_id')

Remove the transcript version from the transcript data and the transcript target map.

Parameters:

transcript_data (xr.Dataset) – The transcript data.
transcript_target_map (Optional[pd.DataFrame], optional) – The transcript target map. Defaults to None.
transcript_ids (Optional[List[str]], optional) – The transcript ids. Defaults to None.
id_column (str, optional) – The column name for the transcript ID. Defaults to “transcript_id”.

Returns:

The transcript data, the transcript target map, and the transcript ids.

Return type:

Tuple[xr.Dataset, pd.DataFrame, List[str]]
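
For orientation, removing a version usually means dropping an Ensembl-style suffix such as ".2" from an id. The sketch below assumes that convention (everything after the first dot is the version); the helper name is hypothetical and the library may treat edge cases differently.

```python
def strip_version(transcript_id: str) -> str:
    """Drop everything after the first dot (hypothetical helper)."""
    return transcript_id.split(".", 1)[0]


versioned_ids = ["ENST00000456328.2", "ENST00000450305.2", "ENST00000488147"]
unversioned_ids = [strip_version(transcript_id) for transcript_id in versioned_ids]
```

Ids without a version suffix pass through unchanged, so the operation is safe to apply to mixed input.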

pytximport.utils.replace_missing_average_transcript_length(length, length_gene_mean)

Replace missing mean transcript length at the sample level with the gene mean across samples.

Parameters:

length (xr.DataArray) – The average length of transcripts at the gene level with a sample dimension.
length_gene_mean (xr.DataArray) – The mean length of the transcripts of the genes across samples.

Returns:

The average length of transcripts at the gene level with a sample dimension.

Return type:

xr.DataArray
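
The replacement can be pictured as a per-gene fill: wherever a sample has no average length for a gene, the gene's mean across samples is used instead. The sketch below assumes that missing values are encoded as NaN and uses nested lists in place of xr.DataArray; the helper name is hypothetical.

```python
import math
from typing import List


def fill_missing_lengths(
    length: List[List[float]],
    length_gene_mean: List[float],
) -> List[List[float]]:
    """Replace NaN entries in each gene row with that gene's mean (hypothetical helper)."""
    return [
        [gene_mean if math.isnan(value) else value for value in row]
        for row, gene_mean in zip(length, length_gene_mean)
    ]


# Rows are genes, columns are samples.
gene_length = [[1000.0, float("nan")], [float("nan"), 800.0]]
filled = fill_missing_lengths(gene_length, length_gene_mean=[1000.0, 800.0])
```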

pytximport.utils.replace_transcript_ids_with_names(transcript_data, transcript_name_map)

Replace transcript IDs with transcript names.

Parameters:

transcript_data (Union[ad.AnnData, xr.Dataset]) – The transcript-level expression data.
transcript_name_map (Union[pd.DataFrame, Union[str, Path]]) – The mapping from transcripts to names. Contains two columns: transcript_id and transcript_name.

Returns:

The transcript-level expression data with the transcript names.

Return type:

Union[ad.AnnData, xr.Dataset]

pytximport.utils.summarize_rsem_gene_data(file_paths, importer, importer_kwargs, existence_optional=False)

Summarize gene-level RSEM quantification files.

Parameters:

file_paths (Union[List[str], List[Path]]) – The paths to the quantification files.
importer (Callable) – The importer function to read the quantification files.
importer_kwargs (Dict[str, Any]) – The keyword arguments for the importer function.
existence_optional (bool, optional) – Whether the files are optional. Defaults to False.

Returns:

The gene-level expression data.

Return type:

xr.Dataset