pytximport.utils
Utility functions for converting data, creating maps and filtering data.
Most functions in this module are intended primarily for internal use but are exposed for advanced users who may want to call them directly.
Functions
| convert_abundance_to_counts | Convert transcript-level abundance to counts, either as TPM or TPM scaled by the length. |
| convert_counts_to_tpm | Convert transcript-level counts to TPM. |
| convert_transcripts_to_genes | Convert transcript-level expression to gene-level expression. |
| create_transcript_gene_map | Create a mapping from transcript ids to gene ids using the Ensembl Biomart. |
| create_transcript_gene_map_from_annotation | Create a mapping from transcript ids to gene ids using a GTF annotation file. |
| filter_by_biotype | Filter the transcripts by biotype. |
| get_median_length_over_isoform | Get the median length of the gene over all isoforms. |
| remove_transcript_version | Remove the transcript version from the transcript data and the transcript target map. |
| replace_missing_average_transcript_length | Replace missing mean transcript length at the sample level with the gene mean across samples. |
| replace_transcript_ids_with_names | Replace transcript IDs with transcript names. |
| summarize_rsem_gene_data | Summarize gene-level RSEM quantification files. |
Package Contents
- pytximport.utils.convert_abundance_to_counts(counts, abundance, length, counts_from_abundance)
Convert transcript-level abundance to counts, either as TPM or TPM scaled by the length.
- Parameters:
counts (DataArray) – The original counts.
abundance (DataArray) – The transcript-level abundance.
length (DataArray) – The length of the transcripts.
counts_from_abundance (Literal["scaled_tpm", "length_scaled_tpm"]) – The type of counts to convert to.
- Returns:
The transcript-level expression data with the counts.
- Return type:
DataArray
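To make the two `counts_from_abundance` options concrete, the NumPy sketch below follows the scaling scheme used by the original tximport package: abundances (optionally weighted by the mean transcript length across samples) are rescaled so that each sample's total matches its original count total. The function name `abundance_to_counts_sketch` and the toy arrays are illustrative assumptions; this is not pytximport's implementation, which operates on `DataArray` objects.

```python
import numpy as np


def abundance_to_counts_sketch(counts, abundance, length, counts_from_abundance):
    """Illustrative only: derive counts from abundance (transcripts x samples)."""
    if counts_from_abundance == "scaled_tpm":
        new_counts = abundance.copy()
    elif counts_from_abundance == "length_scaled_tpm":
        # Weight each transcript by its mean length across samples.
        mean_length = length.mean(axis=1, keepdims=True)
        new_counts = abundance * mean_length
    else:
        raise ValueError(f"Unknown option: {counts_from_abundance}")

    # Rescale every sample so its total equals the original library size.
    library_size = counts.sum(axis=0)
    return new_counts * (library_size / new_counts.sum(axis=0))


# Toy data (assumed): 2 transcripts x 2 samples.
counts = np.array([[10.0, 20.0], [30.0, 40.0]])
abundance = np.array([[200_000.0, 300_000.0], [800_000.0, 700_000.0]])
length = np.array([[1000.0, 1000.0], [2000.0, 2000.0]])

scaled = abundance_to_counts_sketch(counts, abundance, length, "scaled_tpm")
length_scaled = abundance_to_counts_sketch(counts, abundance, length, "length_scaled_tpm")
```

In both modes the per-sample totals are preserved; the length-scaled variant shifts counts toward longer transcripts.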
- pytximport.utils.convert_counts_to_tpm(counts, length)
Convert transcript-level counts to TPM.
- Parameters:
counts (np.ndarray) – The transcript-level counts.
length (np.ndarray) – The length of the transcripts.
- Returns:
The transcript-level expression data with the TPM.
- Return type:
np.ndarray
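As a reminder of what the conversion involves, here is a minimal NumPy sketch of the standard TPM definition: counts are divided by transcript length, and each sample is then normalized to sum to one million. The helper name `counts_to_tpm_sketch` and the toy matrices are assumptions for illustration and may differ from the library's internals.

```python
import numpy as np


def counts_to_tpm_sketch(counts, length):
    """Illustrative only: TPM for a transcripts x samples count matrix."""
    # Reads per base for each transcript.
    rate = counts / length
    # Normalize each sample (column) so it sums to one million.
    return rate / rate.sum(axis=0) * 1e6


# Toy data (assumed): 2 transcripts x 2 samples.
counts = np.array([[100.0, 0.0], [300.0, 50.0]])
length = np.array([[1000.0, 1000.0], [3000.0, 2500.0]])
tpm = counts_to_tpm_sketch(counts, length)
```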
- pytximport.utils.convert_transcripts_to_genes(transcript_data, transcript_gene_map, counts_from_abundance=None)
Convert transcript-level expression to gene-level expression.
- Parameters:
transcript_data (xr.Dataset) – The transcript-level expression data from multiple samples.
transcript_gene_map (pd.DataFrame) – The mapping from transcripts to genes. Contains two columns: transcript_id and gene_id.
counts_from_abundance (Optional[Literal["scaled_tpm", "length_scaled_tpm"]], optional) – The type of counts to convert to. Defaults to None.
- Returns:
The gene-level expression data from multiple samples.
- Return type:
xr.Dataset
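The core idea of gene-level summarization can be sketched with pandas: each transcript is assigned to its gene via the two-column map, and counts are summed within each gene. The toy identifiers and the use of plain DataFrames are assumptions; the actual function works on an `xr.Dataset` and also handles abundance and length, which this sketch omits.

```python
import pandas as pd

# Two-column map with the column names described above.
transcript_gene_map = pd.DataFrame(
    {
        "transcript_id": ["tx1", "tx2", "tx3"],
        "gene_id": ["geneA", "geneA", "geneB"],
    }
)

# Toy transcript-level counts (assumed): rows are transcripts, columns are samples.
transcript_counts = pd.DataFrame(
    {"sample1": [5.0, 15.0, 7.0], "sample2": [1.0, 2.0, 3.0]},
    index=["tx1", "tx2", "tx3"],
)

# Look up the gene for every transcript, then sum counts within each gene.
gene_for_transcript = transcript_gene_map.set_index("transcript_id")["gene_id"]
gene_labels = gene_for_transcript.reindex(transcript_counts.index)
gene_counts = transcript_counts.groupby(gene_labels).sum()
```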
- pytximport.utils.create_transcript_gene_map(species='human', host='http://www.ensembl.org', source_field='ensembl_transcript_id', target_field='ensembl_gene_id', **kwargs)
Create a mapping from transcript ids to gene ids using the Ensembl Biomart.
Warning
Choosing any target_field value other than ensembl_gene_id may not result in a full transcript-to-gene map, since not all transcripts may have the respective variable. While this does not typically affect well-defined transcripts, be aware of this possible source of bias.
Basic example:
    from pytximport.utils import create_transcript_gene_map

    transcript_gene_map = create_transcript_gene_map(
        species="human",
        host="https://may2024.archive.ensembl.org/",  # Use a specific Ensembl release
        target_field="external_gene_name",
    )
- Parameters:
species (Literal["human", "mouse"], optional) – The species to use. Defaults to “human”.
host (str, optional) – The host to use. Defaults to “http://www.ensembl.org”.
source_field (Literal["ensembl_transcript_id", "external_transcript_name"], optional) – The identifier to get for each transcript id. Defaults to “ensembl_transcript_id”.
target_field (Literal["ensembl_gene_id", "external_gene_name", "external_transcript_name"], optional) – The corresponding identifier to get for each transcript. Defaults to “ensembl_gene_id”.
kwargs (Dict[str, Any])
- Returns:
The mapping from transcript ids to gene ids.
- Return type:
pd.DataFrame
- pytximport.utils.create_transcript_gene_map_from_annotation(file_path, source_field='transcript_id', target_field='gene_id', chunk_size=100000, keep_biotype=False, **kwargs)
Create a mapping from transcript ids to gene ids using a GTF annotation file.
- Parameters:
file_path (Union[str, Path]) – The path to the GTF annotation file.
source_field (Literal["transcript_id", "transcript_name"], optional) – The identifier to use for each transcript. Defaults to “transcript_id”.
target_field (Literal["gene_id", "gene_name"], optional) – The identifier to get for each transcript id. Defaults to “gene_id”.
chunk_size (int, optional) – The number of lines to read at a time. Defaults to 100000.
keep_biotype (bool, optional) – Whether to keep the gene_biotype column. Defaults to False.
kwargs (Dict[str, Any])
- Returns:
The mapping from transcript ids to gene ids.
- Return type:
pd.DataFrame
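For readers unfamiliar with the GTF layout, the following standard-library sketch shows how a source/target pair could be pulled from the attribute column of a single transcript line. The helper `parse_gtf_line_sketch`, the regular expression, and the example line are assumptions for illustration; the library's own parser reads the file in chunks and may differ in detail.

```python
import re

# GTF attribute entries look like: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line_sketch(line, source_field="transcript_id", target_field="gene_id"):
    """Illustrative only: return (source, target) for a transcript line, else None."""
    if line.startswith("#"):
        return None  # Header or comment line.
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9 or columns[2] != "transcript":
        return None  # Keep only well-formed transcript features.
    attributes = dict(ATTRIBUTE_PATTERN.findall(columns[8]))
    if source_field not in attributes or target_field not in attributes:
        return None
    return attributes[source_field], attributes[target_field]


example_line = (
    "1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t"
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1";'
)
pair = parse_gtf_line_sketch(example_line)
named_pair = parse_gtf_line_sketch(example_line, target_field="gene_name")
```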
- pytximport.utils.filter_by_biotype(transcript_data, biotype_filter, id_column='transcript_id')
Filter the transcripts by biotype.
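This function filters the transcripts by biotype. The biotype is assumed to be present in the transcript_id, separated by a bar. It is checked against biotype_filter, and only transcripts with a matching biotype are kept. This function is provided mainly for internal use when biotype_filter is passed to the main function.
- Parameters:
transcript_data (xr.Dataset) – The transcript-level expression data from multiple samples.
biotype_filter – The biotypes to keep.
id_column (str, optional) – The column name for the transcript ID. Defaults to “transcript_id”.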
- Returns:
The transcript-level expression data from multiple samples with the transcripts filtered by biotype.
- Return type:
xr.Dataset
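The bar-separated convention mentioned above can be illustrated with a short pure-Python sketch. It assumes GENCODE-style identifiers in which the biotype appears as one of the `|`-delimited fields; the helper name, the made-up identifiers, and the choice to match against any field (rather than a fixed position) are assumptions, not a description of the library's logic.

```python
def filter_ids_by_biotype_sketch(transcript_ids, biotype_filter):
    """Illustrative only: keep IDs with a bar-separated field in biotype_filter."""
    allowed = set(biotype_filter)
    return [
        transcript_id
        for transcript_id in transcript_ids
        # An ID is kept if any of its bar-separated fields is an allowed biotype.
        if allowed.intersection(transcript_id.split("|"))
    ]


# Made-up identifiers in a GENCODE-like layout (assumed).
transcript_ids = [
    "ENST0001.1|ENSG0001.1|GENE1-201|GENE1|1500|protein_coding|",
    "ENST0002.1|ENSG0002.1|GENE2-201|GENE2|900|lncRNA|",
]
kept = filter_ids_by_biotype_sketch(transcript_ids, ["protein_coding"])
```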
- pytximport.utils.get_median_length_over_isoform(transcript_data, transcript_gene_map)
Get the median length of the gene over all isoforms.
- Parameters:
transcript_data (xr.Dataset) – The transcript data containing the length of the transcripts.
transcript_gene_map (pd.DataFrame) – The mapping of transcripts to genes.
- Returns:
The updated transcript data with the median gene length contained in the median_isoform_length variable.
- Return type:
xr.Dataset
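A pandas sketch of the underlying computation: transcript lengths are grouped by gene, the median is taken over the isoforms, and the result is broadcast back to every transcript as `median_isoform_length`. The toy lengths and the DataFrame-based layout are assumptions; the real function stores the result in an `xr.Dataset`.

```python
import pandas as pd

transcript_gene_map = pd.DataFrame(
    {
        "transcript_id": ["tx1", "tx2", "tx3", "tx4"],
        "gene_id": ["geneA", "geneA", "geneA", "geneB"],
    }
)

# Toy transcript lengths (assumed), indexed by transcript ID.
transcript_length = pd.Series(
    [1000.0, 1400.0, 3000.0, 500.0],
    index=["tx1", "tx2", "tx3", "tx4"],
)

# Attach a length to every transcript in the map.
lengths = transcript_gene_map.assign(
    length=transcript_gene_map["transcript_id"].map(transcript_length)
)

# Median over all isoforms of each gene, then broadcast back per transcript.
median_gene_length = lengths.groupby("gene_id")["length"].median()
lengths["median_isoform_length"] = lengths["gene_id"].map(median_gene_length)
```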
- pytximport.utils.remove_transcript_version(transcript_data, transcript_target_map=None, transcript_ids=None, id_column='transcript_id')
Remove the transcript version from the transcript data and the transcript target map.
- Parameters:
transcript_data (xr.Dataset) – The transcript data.
transcript_target_map (Optional[pd.DataFrame], optional) – The transcript target map. Defaults to None.
transcript_ids (Optional[List[str]], optional) – The transcript ids. Defaults to None.
id_column (str, optional) – The column name for the transcript ID. Defaults to “transcript_id”.
- Returns:
The transcript data, the transcript target map, and the transcript ids.
- Return type:
Tuple[xr.Dataset, pd.DataFrame, List[str]]
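What "removing the version" means for a single identifier can be sketched as follows. The sketch assumes Ensembl-style IDs where the version is the part after the first dot (for example `ENST00000456328.2`); the helper name is hypothetical, and the library applies the equivalent operation to the dataset, the map, and the ID list together.

```python
def strip_version_sketch(transcript_id):
    """Illustrative only: drop everything from the first '.' onward."""
    return transcript_id.split(".", 1)[0]


versioned_ids = ["ENST00000456328.2", "ENST00000450305.2", "ENST00000488147"]
unversioned_ids = [strip_version_sketch(tid) for tid in versioned_ids]
```

Identifiers without a version suffix pass through unchanged, so the operation is safe to apply to mixed inputs.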
- pytximport.utils.replace_missing_average_transcript_length(length, length_gene_mean)
Replace missing mean transcript length at the sample level with the gene mean across samples.
- Parameters:
length (xr.DataArray) – The average length of transcripts at the gene level with a sample dimension.
length_gene_mean (xr.DataArray) – The mean length of the transcripts of the genes across samples.
- Returns:
The average length of transcripts at the gene level with a sample dimension.
- Return type:
xr.DataArray
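The replacement step can be pictured with a small NumPy sketch: wherever a gene has no average length in a given sample (represented here as NaN, which is an assumption about the encoding), the gene's mean across samples is used instead. Array names mirror the parameters above, but the plain-array layout is illustrative only.

```python
import numpy as np

# Genes x samples; NaN marks a missing per-sample average length (assumed encoding).
length = np.array([[1500.0, np.nan], [np.nan, 800.0]])

# One mean length per gene, computed across samples.
length_gene_mean = np.array([1450.0, 820.0])

# Fill each missing entry with the corresponding gene's mean.
filled = np.where(np.isnan(length), length_gene_mean[:, np.newaxis], length)
```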
- pytximport.utils.replace_transcript_ids_with_names(transcript_data, transcript_name_map)
Replace transcript IDs with transcript names.
- Parameters:
transcript_data (Union[ad.AnnData, xr.Dataset]) – The transcript-level expression data.
transcript_name_map (Union[pd.DataFrame, Union[str, Path]]) – The mapping from transcripts to names. Contains two columns: transcript_id and transcript_name.
- Returns:
The transcript-level expression data with the transcript names.
- Return type:
Union[ad.AnnData, xr.Dataset]
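The renaming itself amounts to a lookup from ID to name, sketched below in plain Python. The dictionary form of the map, the made-up identifiers, and the choice to keep IDs that have no entry in the map are all assumptions; how the library treats unmapped transcripts is not stated here.

```python
def rename_transcripts_sketch(transcript_ids, transcript_name_map):
    """Illustrative only: look up a name for each ID, keeping unmapped IDs."""
    return [transcript_name_map.get(tid, tid) for tid in transcript_ids]


# Hypothetical ID-to-name map, equivalent to the two-column DataFrame.
transcript_name_map = {"ENST0001": "GENE1-201", "ENST0002": "GENE2-201"}
renamed = rename_transcripts_sketch(["ENST0001", "ENST0003"], transcript_name_map)
```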
- pytximport.utils.summarize_rsem_gene_data(file_paths, importer, importer_kwargs, existence_optional=False)
Summarize gene-level RSEM quantification files.
- Parameters:
file_paths (Union[List[str], List[Path]]) – The paths to the quantification files.
importer (Callable) – The importer function to read the quantification files.
importer_kwargs (Dict[str, Any]) – The keyword arguments for the importer function.
existence_optional (bool, optional) – Whether the files are optional. Defaults to False.
- Returns:
The gene-level expression data.
- Return type:
xr.Dataset