pytximport
pytximport: Gene-level count estimation from transcript quantification files.
The pytximport package is a Python implementation of the tximport R package. It provides an easy-to-use interface for importing transcript quantification files from various tools (e.g., salmon, kallisto, RSEM) into Python. The package is designed to work with the output of these tools and to provide isoform bias-corrected gene-level counts for downstream analysis.
The package provides a single function, tximport, as the main entry point, as well as a command-line interface (pytximport). Further utility functions (e.g., to create a transcript-to-gene map) are provided through the pytximport.utils module.
pytximport can output data as AnnData objects and xarray objects or save the data as a CSV file, enabling seamless integration with other Python packages such as PyDESeq2.
Note
Please consider citing both pytximport and the original R implementation of tximport when using pytximport. For pytximport, please refer to the README or CITATION.cff file for the appropriate citation information. The original tximport publication can be found at: https://doi.org/10.12688/f1000research.7563.1
Warning
Though pytximport aims to provide the same functionality as tximport, there are some differences between the two packages. While the same configuration will result in identical output, the configuration options and defaults may differ between the two packages. Please refer to the documentation for more information.
Package Contents
- pytximport.__version__ = '0.12.0'
- pytximport.cli(ctx)
Welcome to the pytximport command-line interface for importing transcript-level quantification files.
- Parameters:
ctx (click.Context)
- pytximport.tximport(file_paths, data_type='salmon', transcript_gene_map=None, counts_from_abundance=None, gene_level=False, return_transcript_data=False, inferential_replicates=False, inferential_replicate_transformer=None, inferential_replicate_variance=False, ignore_transcript_version=True, ignore_after_bar=True, id_column=None, counts_column=None, length_column=None, abundance_column=None, custom_importer=None, existence_optional=False, read_length=None, output_type='anndata', output_format='csv', output_path=None, output_path_overwrite=False, return_data=True, biotype_filter=None)
Import transcript-level quantification files and convert them to gene-level expression estimates.
Basic usage:
    from pytximport import tximport

    txi = tximport(
        ["quant_1.sf", "quant_2.sf"],
        data_type="salmon",
        transcript_gene_map="transcript_to_gene_map.tsv",
        counts_from_abundance="length_scaled_tpm",
    )
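The transcript_gene_map file named in the call above needs a header with the two columns transcript_id and gene_id. As a minimal sketch, such a tab-separated file could be written with the Python standard library as follows. The identifiers and the helper name write_transcript_gene_map are hypothetical and are not part of pytximport.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical transcript-to-gene pairs, for illustration only.
pairs = [
    ("TX0001", "GENE_A"),
    ("TX0002", "GENE_A"),
    ("TX0003", "GENE_B"),
]


def write_transcript_gene_map(pairs, path):
    """Write a tab-separated map with a transcript_id/gene_id header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["transcript_id", "gene_id"])
        writer.writerows(pairs)
    return Path(path)


# Write the map into a temporary directory and read it back to check it.
map_path = write_transcript_gene_map(
    pairs, Path(tempfile.mkdtemp()) / "transcript_to_gene_map.tsv"
)
with open(map_path, newline="") as handle:
    rows = list(csv.reader(handle, delimiter="\t"))
```

The resulting path can then be passed as the transcript_gene_map argument.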
- Parameters:
file_paths (List[Union[str, Path]]) – The paths to the quantification files.
data_type (Literal["kallisto", "salmon", "sailfish", "oarfish", "piscem", "stringtie", "rsem", "tsv"]) – The type of quantification files. Defaults to “salmon”.
transcript_gene_map (Optional[Union[pd.DataFrame, Union[str, Path]]], optional) – The mapping from transcripts to genes. Has to contain two columns: transcript_id and gene_id. If you provide a path to a file, it has to be either a tab-separated (.tsv) or comma-separated (.csv) file with a header. Defaults to None.
counts_from_abundance (Optional[Literal["scaled_tpm", "length_scaled_tpm", "dtu_scaled_tpm"]], optional) – Whether to calculate count estimates based on the abundance. When using scaled_tpm or length_scaled_tpm, the counts no longer correlate with the average transcript length per sample. In those cases, the length offset matrix should not be used for downstream analysis. Note that this corrects only for differences in transcript length; it does not normalize the sequencing depth. When using the gene-summarized counts and not count estimates based on the abundance, the length offset matrix included in the output from this function should be used for downstream analysis. If your downstream analysis tool does not support the length offset matrix, you should probably use length_scaled_tpm for gene-level analysis. For transcript-level analysis, we recommend that you use scaled_tpm or dtu_scaled_tpm. For further guidance on transcript-level analysis, please refer to: https://doi.org/10.12688/f1000research.15398.3. Defaults to None.
gene_level (bool, optional) – Whether the input files are at the gene level. This is only the case for some RSEM quantification files. Defaults to False.
return_transcript_data (bool, optional) – Whether to return the transcript-level expression. Defaults to False.
inferential_replicates (bool, optional) – Whether to parse and include inferential replicates in the output. If you want to recalculate the counts from inferential replicates, please set this option to True and provide an inferential_replicate_transformer. Defaults to False.
inferential_replicate_transformer (Optional[Callable], optional) – A custom function to transform the inferential replicates. Defaults to None.
inferential_replicate_variance (bool, optional) – Whether to return the variance of the inferential replicates. Defaults to False.
ignore_transcript_version (bool, optional) – Whether to ignore the transcript version. Defaults to True.
ignore_after_bar (bool, optional) – Whether to split the transcript id after the bar character (|). Defaults to True.
id_column (Optional[str], optional) – The column name for the transcript id. Defaults to None.
counts_column (Optional[str], optional) – The column name for the counts. Defaults to None.
length_column (Optional[str], optional) – The column name for the length. Defaults to None.
abundance_column (Optional[str], optional) – The column name for the abundance. Defaults to None.
custom_importer (Optional[Callable], optional) – A custom importer function. Defaults to None.
existence_optional (bool, optional) – Whether the existence of the files is optional. Defaults to False.
read_length (Optional[int], optional) – The read length for the stringtie quantification. Defaults to None.
output_type (Literal["xarray", "anndata"], optional) – The type of output. Defaults to “anndata”.
output_format (Literal["csv", "h5ad"], optional) – The type of output file. Defaults to “csv”.
output_path (Optional[Union[str, Path]], optional) – The path to save the gene-level expression. Defaults to None.
output_path_overwrite (bool, optional) – Whether to overwrite the save path if it already exists. Defaults to False.
return_data (bool, optional) – Whether to return the gene-level expression. Defaults to True.
biotype_filter (List[str], optional) – Filter the transcripts by biotype, including only those provided. Enables post-hoc filtering of the data based on the biotype of the transcripts. Assumes that the biotype is present in the transcript_id of the data, bar-separated. If this is not the case, please use the filter_by_biotype function from the pytximport.utils module instead. Please note that the abundance will NOT be recalculated after filtering to avoid introducing bias. If you wish to recalculate the abundance, please use the filter_by_biotype function from the pytximport.utils module instead. Defaults to None.
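The ignore_after_bar, ignore_transcript_version and biotype_filter parameters all act on the transcript identifier. The sketch below illustrates the idea in plain Python. It is not pytximport's implementation: it assumes that the version is the part after the first period, that the identifier proper precedes the first bar, and that the biotype is one of the remaining bar-separated fields. The helper names and the example identifier are hypothetical.

```python
def normalize_transcript_id(
    transcript_id, ignore_after_bar=True, ignore_transcript_version=True
):
    """Strip bar-separated metadata and a dot-separated version suffix."""
    if ignore_after_bar:
        # Keep only the text before the first "|".
        transcript_id = transcript_id.split("|", 1)[0]
    if ignore_transcript_version:
        # Keep only the text before the first ".".
        transcript_id = transcript_id.split(".", 1)[0]
    return transcript_id


def has_biotype(transcript_id, biotypes):
    """Return True if any bar-separated field after the id is in biotypes."""
    fields = transcript_id.split("|")
    return any(field in biotypes for field in fields[1:])


# Hypothetical identifier with a version and bar-separated metadata.
raw_id = "TX0001.2|GENE0001.5|protein_coding|"

plain_id = normalize_transcript_id(raw_id)
versioned_id = normalize_transcript_id(raw_id, ignore_transcript_version=False)
keep = has_biotype(raw_id, {"protein_coding"})
drop = has_biotype(raw_id, {"lncRNA"})
```

With both options enabled, the identifier is reduced to a form that can be matched against a transcript-to-gene map without versions.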
- Returns:
The estimated gene-level or transcript-level expression data if return_data is True, else None.
- Return type:
Union[xr.Dataset, ad.AnnData, SummarizedExperiment, None]
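The length_scaled_tpm option of counts_from_abundance can be illustrated with a small pure-Python sketch. It follows the rescaling performed by the R tximport package: abundances are multiplied by the mean length across samples, then each sample is scaled back to its original total count. Whether pytximport implements the step in exactly this way is an assumption; the function below and its toy matrices are hypothetical and for illustration only.

```python
def length_scaled_tpm(counts, abundance, length):
    """Sketch of length-scaled TPM counts.

    All inputs are nested lists with rows as features and columns as samples.
    """
    n_rows, n_cols = len(counts), len(counts[0])

    # Mean length of each feature across samples.
    mean_length = [sum(row) / n_cols for row in length]

    # Scale the abundance of each feature by its mean length.
    scaled = [
        [abundance[i][j] * mean_length[i] for j in range(n_cols)]
        for i in range(n_rows)
    ]

    # Per-sample totals before and after scaling.
    original_total = [sum(counts[i][j] for i in range(n_rows)) for j in range(n_cols)]
    scaled_total = [sum(scaled[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Rescale each sample so that its total matches the original counts.
    return [
        [scaled[i][j] * original_total[j] / scaled_total[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy data: two features (rows) in two samples (columns).
counts = [[10.0, 20.0], [30.0, 40.0]]
abundance = [[1.0, 2.0], [3.0, 2.0]]
length = [[100.0, 200.0], [300.0, 300.0]]

new_counts = length_scaled_tpm(counts, abundance, length)
column_totals = [sum(row[j] for row in new_counts) for j in range(2)]
```

Because each column is rescaled to its original total, the sequencing depth of every sample is left unchanged; only the length dependence is altered.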