pytximport.utils
Utility functions for converting data, creating maps and filtering data.
Most functions in this module are intended primarily for internal use but are exposed for advanced users who may want to call them directly.
Functions
convert_abundance_to_counts | Convert transcript-level abundance to counts, either as TPM or TPM scaled by the length.
convert_counts_to_tpm | Convert transcript-level counts to TPM.
convert_transcripts_to_genes | Convert transcript-level expression to gene-level expression.
create_transcript_gene_map | Create a mapping from transcript ids to gene ids using the Ensembl Biomart.
create_transcript_gene_map_from_annotation | Create a mapping from transcript ids to gene ids using a GTF annotation file.
filter_by_biotype | Filter the transcripts by biotype.
get_median_length_over_isoform | Get the median length of the gene over all isoforms.
remove_transcript_version | Remove the transcript version from the transcript data and the transcript target map.
replace_missing_average_transcript_length | Replace missing mean transcript length at the sample level with the gene mean across samples.
replace_transcript_ids_with_names | Replace transcript IDs with transcript names.
summarize_rsem_gene_data | Summarize gene-level RSEM quantification files.
Package Contents
- pytximport.utils.convert_abundance_to_counts(counts, abundance, length, counts_from_abundance)
Convert transcript-level abundance to counts, either as TPM or TPM scaled by the length.
- Parameters:
counts (DataArray) – The original counts.
abundance (DataArray) – The transcript-level abundance.
length (DataArray) – The length of the transcripts.
counts_from_abundance (Literal["scaled_tpm", "length_scaled_tpm"]) – The type of counts to convert to.
- Returns:
The transcript-level expression data with the counts.
- Return type:
DataArray
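The two options can be sketched with plain NumPy. This is a minimal illustration that assumes the conventions of the R tximport package (scaled TPM and length-scaled TPM); it is not pytximport's actual implementation, and the function name `abundance_to_counts_sketch` and the toy arrays are hypothetical.

```python
import numpy as np


def abundance_to_counts_sketch(counts, abundance, length, counts_from_abundance):
    """Illustrative only: derive counts from abundance (transcripts x samples)."""
    if counts_from_abundance == "scaled_tpm":
        new_counts = abundance.copy()
    elif counts_from_abundance == "length_scaled_tpm":
        # Weight each transcript by its mean length across samples.
        new_counts = abundance * length.mean(axis=1, keepdims=True)
    else:
        raise ValueError(f"Unknown option: {counts_from_abundance!r}")
    # Rescale every sample (column) so its total matches the original library size.
    scale = counts.sum(axis=0) / new_counts.sum(axis=0)
    return new_counts * scale


# Hypothetical toy data: two transcripts (rows) in two samples (columns).
counts = np.array([[10.0, 20.0], [30.0, 40.0]])
abundance = np.array([[250_000.0, 400_000.0], [750_000.0, 600_000.0]])
length = np.array([[1000.0, 1000.0], [2000.0, 2000.0]])

scaled = abundance_to_counts_sketch(counts, abundance, length, "length_scaled_tpm")
```

In this sketch, both options preserve each sample's total count; they differ only in whether transcript length enters the per-transcript weights.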
- pytximport.utils.convert_counts_to_tpm(counts, length)
Convert transcript-level counts to TPM.
- Parameters:
counts (np.ndarray) – The transcript-level counts.
length (np.ndarray) – The length of the transcripts.
- Returns:
The transcript-level expression data with the TPM.
- Return type:
np.ndarray
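For reference, the standard TPM calculation divides counts by transcript length and then normalizes each sample to one million. The sketch below shows that formula with NumPy; it assumes arrays shaped transcripts x samples and is not necessarily how the library computes it. The name `counts_to_tpm_sketch` is hypothetical.

```python
import numpy as np


def counts_to_tpm_sketch(counts, length):
    """Illustrative only: TPM from counts and lengths (transcripts x samples)."""
    # Reads per base: longer transcripts collect more reads at equal expression.
    rate = counts / length
    # Normalize each sample (column) so that it sums to one million.
    return rate / rate.sum(axis=0) * 1e6


# Hypothetical toy data: two transcripts (rows) in two samples (columns).
counts = np.array([[100.0, 0.0], [300.0, 50.0]])
length = np.array([[1000.0, 1000.0], [3000.0, 2500.0]])

tpm = counts_to_tpm_sketch(counts, length)
```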
- pytximport.utils.convert_transcripts_to_genes(transcript_data, transcript_gene_map, counts_from_abundance=None)
Convert transcript-level expression to gene-level expression.
- Parameters:
transcript_data (xr.Dataset) – The transcript-level expression data from multiple samples.
transcript_gene_map (pd.DataFrame) – The mapping from transcripts to genes. Contains two columns: transcript_id and gene_id.
counts_from_abundance (Optional[Literal["scaled_tpm", "length_scaled_tpm"]], optional) – The type of counts to convert to. Defaults to None.
- Returns:
The gene-level expression data from multiple samples.
- Return type:
xr.Dataset
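The core idea of gene-level summarization is to sum the counts of all transcripts that map to the same gene. The plain-Python sketch below shows only that step for a single sample; the dictionaries are hypothetical toy data, and dropping unmapped transcripts is an assumption rather than documented behavior.

```python
from collections import defaultdict

# Hypothetical toy data: counts per transcript for a single sample.
transcript_counts = {"tx1": 10.0, "tx2": 5.0, "tx3": 7.0, "tx4": 2.0}
# Hypothetical transcript-to-gene mapping (transcript_id -> gene_id).
transcript_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

summed = defaultdict(float)
for transcript_id, count in transcript_counts.items():
    gene_id = transcript_to_gene.get(transcript_id)
    if gene_id is None:
        # Assumption: transcripts absent from the map are dropped.
        continue
    summed[gene_id] += count

gene_counts = dict(summed)
```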
- pytximport.utils.create_transcript_gene_map(species='human', host='http://www.ensembl.org', source_field='ensembl_transcript_id', target_field='ensembl_gene_id', rename_columns=True, **kwargs)
Create a mapping from transcript ids to gene ids using the Ensembl Biomart.
Warning
Choosing any target_field value other than ensembl_gene_id may not result in a full transcript-to-gene map, since not all transcripts may have the respective variable. While this does not typically affect well-defined transcripts, be aware of this possible source of bias.
Basic example:
    from pytximport.utils import create_transcript_gene_map

    transcript_gene_map = create_transcript_gene_map(
        species="human",
        host="https://may2024.archive.ensembl.org/",  # Use a specific Ensembl release
        target_field="external_gene_name",
    )

    # or get multiple fields
    transcript_gene_map = create_transcript_gene_map(
        species="mouse",
        target_field=["external_gene_name", "gene_biotype"],
    )
- Parameters:
species (Literal["human", "mouse"], optional) – The species to use. Defaults to “human”.
host (str, optional) – The host to use. Defaults to “http://www.ensembl.org”.
source_field (Literal["ensembl_transcript_id", "external_transcript_name"], optional) – The identifier to get for each transcript id. Defaults to “ensembl_transcript_id”.
target_field (Union[Literal["ensembl_gene_id", "external_gene_name", "external_transcript_name", "gene_biotype"], List[Literal["ensembl_gene_id", "external_gene_name", "external_transcript_name", "gene_biotype"]]], optional) – The corresponding identifier(s) to get for each transcript. Defaults to “ensembl_gene_id”.
rename_columns (bool, optional) – Whether to rename ensembl_transcript_id to transcript_id, ensembl_gene_id to gene_id, external_gene_name to gene_name (or to gene_id if no other gene id is present), and external_transcript_name to transcript_name. Defaults to True.
**kwargs – Additional arguments to pass to the function.
- Keyword Arguments:
field (str, optional) – The field to use for the mapping. Deprecated. Use source_field and target_field instead.
- Returns:
The mapping from transcript ids to gene ids.
- Return type:
pd.DataFrame
- pytximport.utils.create_transcript_gene_map_from_annotation(file_path, source_field='transcript_id', target_field='gene_id', use_transcript_name_as_replacement_id=True, use_gene_name_as_replacement_id=True, chunk_size=100000, **kwargs)
Create a mapping from transcript ids to gene ids using a GTF annotation file.
Basic example:
    from pytximport.utils import create_transcript_gene_map_from_annotation

    # Create a mapping from transcript ids to gene names
    transcript_gene_map = create_transcript_gene_map_from_annotation(
        "path/to/annotation.gtf",
        target_field="gene_name",
    )

    # Create a mapping from transcript ids to transcript names and include the gene biotype
    transcript_gene_map = create_transcript_gene_map_from_annotation(
        "path/to/annotation.gtf",
        target_field=["transcript_name", "gene_biotype"],
    )
- Parameters:
file_path (Union[str, Path]) – The path to the GTF annotation file.
source_field (Literal["transcript_id", "transcript_name"], optional) – The identifier to get for each transcript id. Defaults to “transcript_id”.
target_field (Union[Literal["gene_id", "gene_name", "gene_biotype", "transcript_name"], List[Literal["gene_id", "gene_name", "gene_biotype", "transcript_name"]]], optional) – The corresponding identifier(s) to get for each transcript. Defaults to “gene_id”.
use_transcript_name_as_replacement_id (bool, optional) – Whether to use the transcript name as the transcript id if the transcript id is missing. Defaults to True.
use_gene_name_as_replacement_id (bool, optional) – Whether to use the gene name as the gene id if the gene id is missing. Defaults to True.
chunk_size (int, optional) – The number of lines to read at a time. Defaults to 100000.
**kwargs – Additional arguments to pass to the function.
- Returns:
The mapping from transcript ids to gene ids.
- Return type:
pd.DataFrame
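To make the source of the mapping concrete: in a GTF file, the ninth tab-separated column holds key/value attributes such as transcript_id and gene_id. The sketch below extracts those pairs with the standard library. It is an illustration of the file format, not the library's parser; the two GTF lines and the helper `parse_attributes` are hypothetical.

```python
import re

# Two hypothetical GTF records (tab-separated, attributes in the ninth column).
gtf_lines = [
    "chr1\tsource\ttranscript\t100\t900\t.\t+\t.\t"
    'gene_id "GENE1"; transcript_id "TX1"; gene_name "Abc";',
    "chr1\tsource\tgene\t100\t900\t.\t+\t.\t"
    'gene_id "GENE1"; gene_name "Abc";',
]

# Matches attribute pairs of the form: key "value";
attribute_pattern = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(line):
    """Return the key/value pairs from the attribute column of a GTF line."""
    attribute_column = line.rstrip("\n").split("\t")[8]
    return dict(attribute_pattern.findall(attribute_column))


mapping = {}
for line in gtf_lines:
    if line.startswith("#"):
        continue  # skip header comments
    attributes = parse_attributes(line)
    # Only records carrying both fields contribute to the map.
    if "transcript_id" in attributes and "gene_id" in attributes:
        mapping[attributes["transcript_id"]] = attributes["gene_id"]
```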
- pytximport.utils.filter_by_biotype(transcript_data, transcript_gene_map=None, biotype_filter=None, id_column='transcript_id', recalculate_abundance=False)
Filter the transcripts by biotype.
This function filters the transcripts by biotype. If no transcript_gene_map is provided, the biotype is assumed to be present in the transcript ID, separated by a bar (|). Each biotype is checked against biotype_filter, and only the matching transcripts are kept. This function is provided mainly for internal use when biotype_filter is passed to the main function.
- Parameters:
transcript_data (Union[xr.Dataset, ad.AnnData]) – The expression data.
transcript_gene_map (Union[pd.DataFrame, Path, str], optional) – The mapping from transcript to gene with the gene_biotype column. If None, the biotype is assumed to be present in the id_column. Defaults to None.
biotype_filter (List[str]) – The biotypes to keep. Defaults to None.
id_column (str, optional) – The column name for the transcript/gene ID. Defaults to “transcript_id”.
recalculate_abundance (bool, optional) – Whether to recalculate the abundance after filtering. This converts the abundance to TPM of the remaining transcripts but has implications for how the abundance can be used statistically. Defaults to False.
- Returns:
The expression data filtered by biotype.
- Return type:
Union[xr.Dataset, ad.AnnData]
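The bar-separated case can be pictured as follows. This is a rough sketch that keeps an ID when any of its bar-separated fields matches a wanted biotype; the exact position of the biotype within the ID is an assumption here, and the IDs are hypothetical.

```python
# Hypothetical bar-separated IDs that embed the biotype as one of their fields.
transcript_ids = [
    "TX1|GENE1|protein_coding|",
    "TX2|GENE2|lncRNA|",
    "TX3|GENE3|processed_pseudogene|",
]
biotype_filter = ["protein_coding", "lncRNA"]

kept_ids = [
    transcript_id
    for transcript_id in transcript_ids
    # Keep the ID if any of its bar-separated fields is a wanted biotype.
    if any(field in biotype_filter for field in transcript_id.split("|"))
]
```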
- pytximport.utils.get_median_length_over_isoform(transcript_data, transcript_gene_map)
Get the median length of the gene over all isoforms.
- Parameters:
transcript_data (xr.Dataset) – The transcript data containing the length of the transcripts.
transcript_gene_map (pd.DataFrame) – The mapping of transcripts to genes.
- Returns:
The updated transcript data with the median gene length contained in the median_isoform_length variable.
- Return type:
xr.Dataset
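Conceptually, the median is taken over the lengths of all isoforms that belong to the same gene. A minimal standard-library sketch of that grouping, using hypothetical toy dictionaries in place of the xr.Dataset and pd.DataFrame inputs:

```python
from collections import defaultdict
from statistics import median

# Hypothetical transcript lengths and transcript-to-gene mapping.
transcript_lengths = {"tx1": 1000, "tx2": 1500, "tx3": 3000, "tx4": 800}
transcript_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneA", "tx4": "geneB"}

# Collect the isoform lengths of each gene.
lengths_by_gene = defaultdict(list)
for transcript_id, length in transcript_lengths.items():
    lengths_by_gene[transcript_to_gene[transcript_id]].append(length)

# One median per gene, taken over all of its isoforms.
median_isoform_length = {
    gene_id: median(lengths) for gene_id, lengths in lengths_by_gene.items()
}
```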
- pytximport.utils.remove_transcript_version(transcript_data, transcript_target_map=None, transcript_ids=None, id_column='transcript_id')
Remove the transcript version from the transcript data and the transcript target map.
- Parameters:
transcript_data (xr.Dataset) – The transcript data.
transcript_target_map (Optional[pd.DataFrame], optional) – The transcript target map. Defaults to None.
transcript_ids (Optional[List[str]], optional) – The transcript ids. Defaults to None.
id_column (str, optional) – The column name for the transcript ID. Defaults to “transcript_id”.
- Returns:
The transcript data, the transcript target map, and the transcript ids.
- Return type:
Tuple[xr.Dataset, pd.DataFrame, List[str]]
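A transcript version is typically a numeric suffix after a dot, as in Ensembl-style IDs. The sketch below strips such a suffix with a regular expression. The pattern is an assumption about what counts as a version, not the library's own rule, and `strip_version` and the IDs are hypothetical.

```python
import re

# Assumption: the version is a trailing ".<digits>" suffix.
version_suffix = re.compile(r"\.\d+$")


def strip_version(transcript_id):
    """Return the ID without its trailing version suffix, if it has one."""
    return version_suffix.sub("", transcript_id)


ids = ["TX0001.2", "TX0002.11", "TX0003"]
stripped = [strip_version(transcript_id) for transcript_id in ids]
```

Applying the same stripping to both the expression data and the mapping keeps their IDs aligned, which is why the function accepts both.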
- pytximport.utils.replace_missing_average_transcript_length(length, length_gene_mean)
Replace missing mean transcript length at the sample level with the gene mean across samples.
- Parameters:
length (xr.DataArray) – The average length of transcripts at the gene level with a sample dimension.
length_gene_mean (xr.DataArray) – The mean length of the transcripts of the genes across samples.
- Returns:
The average length of transcripts at the gene level with a sample dimension.
- Return type:
xr.DataArray
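The replacement described above can be sketched with NumPy: wherever a gene has no length in a given sample, the gene's mean across samples is substituted. This follows the one-line description only and may be simpler than the actual implementation; the arrays are hypothetical and use plain ndarrays instead of xr.DataArray.

```python
import numpy as np

# Hypothetical average transcript lengths: genes (rows) x samples (columns).
length = np.array(
    [
        [1200.0, np.nan],
        [np.nan, np.nan],
        [900.0, 950.0],
    ]
)
# Hypothetical per-gene mean length across samples.
length_gene_mean = np.array([1150.0, 2000.0, 925.0])

# Broadcast the per-gene mean over the sample axis and fill only the gaps.
filled = np.where(np.isnan(length), length_gene_mean[:, np.newaxis], length)
```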
- pytximport.utils.replace_transcript_ids_with_names(transcript_data, transcript_name_map)
Replace transcript IDs with transcript names.
- Parameters:
transcript_data (Union[ad.AnnData, xr.Dataset]) – The transcript-level expression data.
transcript_name_map (Union[pd.DataFrame, Union[str, Path]]) – The mapping from transcripts to names. Contains two columns: transcript_id and transcript_name.
- Returns:
The transcript-level expression data with the transcript names.
- Return type:
Union[ad.AnnData, xr.Dataset]
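In essence this is a lookup from transcript_id to transcript_name. A minimal plain-Python sketch with hypothetical IDs follows; keeping IDs that have no known name unchanged is an assumption, not documented behavior.

```python
# Hypothetical mapping with the two columns flattened into a dictionary.
transcript_name_map = {"TX1": "GENE1-201", "TX2": "GENE1-202"}
transcript_ids = ["TX1", "TX2", "TX3"]

# Assumption: IDs without a known name are kept unchanged.
renamed = [
    transcript_name_map.get(transcript_id, transcript_id)
    for transcript_id in transcript_ids
]
```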
- pytximport.utils.summarize_rsem_gene_data(file_paths, importer, importer_kwargs, existence_optional=False)
Summarize gene-level RSEM quantification files.
- Parameters:
file_paths (Union[List[str], List[Path]]) – The paths to the quantification files.
importer (Callable) – The importer function to read the quantification files.
importer_kwargs (Dict[str, Any]) – The keyword arguments for the importer function.
existence_optional (bool, optional) – Whether the files are optional. Defaults to False.
- Returns:
The gene-level expression data.
- Return type:
xr.Dataset