pytximport

pytximport: Gene-level count estimation from transcript quantification files.

The pytximport package is a Python implementation of the tximport R package. It offers an easy-to-use interface for importing transcript quantification files from various tools (e.g., salmon, kallisto, RSEM) into Python. It works directly with the output of these tools and produces isoform bias-corrected gene-level counts for downstream analysis.

The package provides a single function, tximport, as the main entry point, as well as a command-line interface (pytximport). Further utility functions (e.g., to create a transcript-to-gene map) are provided through the pytximport.utils module.
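As a rough illustration of what a transcript-to-gene map looks like on disk, the sketch below writes one with the Python standard library only. It relies on the format documented for the transcript_gene_map parameter of tximport (a tab-separated file with a header and the two columns transcript_id and gene_id). The helper name write_transcript_gene_map and the identifiers are hypothetical, and this is not the helper from pytximport.utils.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical transcript/gene pairs, for illustration only.
pairs = [
    ("ENST00000000001", "ENSG00000000001"),
    ("ENST00000000002", "ENSG00000000001"),
    ("ENST00000000003", "ENSG00000000002"),
]


def write_transcript_gene_map(path, pairs):
    """Write a tab-separated map with the header transcript_id, gene_id."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["transcript_id", "gene_id"])
        writer.writerows(pairs)


# Write the map to a temporary directory so the sketch has no side effects
# on the working directory.
map_path = Path(tempfile.mkdtemp()) / "transcript_to_gene_map.tsv"
write_transcript_gene_map(map_path, pairs)
```

The resulting path could then be passed as transcript_gene_map when calling tximport.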

pytximport can output data as AnnData objects and xarray objects or save the data as a CSV file, enabling seamless integration with other Python packages such as PyDESeq2.

Note

Please consider citing both pytximport and the original R implementation of tximport when using pytximport. For pytximport, please refer to the README or CITATION.cff file for the appropriate citation information. The original tximport publication can be found at: https://doi.org/10.12688/f1000research.7563.1

Warning

Though pytximport aims to provide the same functionality as tximport, there are some differences between the two packages. While the same configuration will result in identical output, the configuration options and defaults may differ between the two packages. Please refer to the documentation for more information.

Submodules

Attributes

Functions

cli(ctx)

Welcome to the pytximport command-line interface for importing transcript-level quantification files.

tximport(file_paths[, data_type, transcript_gene_map, ...])

Import transcript-level quantification files and convert them to gene-level expression estimates.

Package Contents

pytximport.__version__ = '0.12.0'
pytximport.cli(ctx)[source]

Welcome to the pytximport command-line interface for importing transcript-level quantification files.

Parameters:

ctx (click.Context)

pytximport.pytximport[source]
pytximport.tximport(file_paths, data_type='salmon', transcript_gene_map=None, counts_from_abundance=None, gene_level=False, return_transcript_data=False, inferential_replicates=False, inferential_replicate_transformer=None, inferential_replicate_variance=False, ignore_transcript_version=True, ignore_after_bar=True, id_column=None, counts_column=None, length_column=None, abundance_column=None, custom_importer=None, existence_optional=False, read_length=None, output_type='anndata', output_format='csv', output_path=None, output_path_overwrite=False, return_data=True, biotype_filter=None)[source]

Import transcript-level quantification files and convert them to gene-level expression estimates.

Basic usage:

from pytximport import tximport

txi = tximport(
    ["quant_1.sf", "quant_2.sf"],
    data_type="salmon",
    transcript_gene_map="transcript_to_gene_map.tsv",
    counts_from_abundance="length_scaled_tpm",
)
Parameters:
  • file_paths (List[Union[str, Path]]) – The paths to the quantification files.

  • data_type (Literal["kallisto", "salmon", "sailfish", "oarfish", "piscem", "stringtie", "rsem", "tsv"]) – The type of quantification files. Defaults to “salmon”.

  • transcript_gene_map (Optional[Union[pd.DataFrame, Union[str, Path]]], optional) – The mapping from transcripts to genes. It must contain two columns: transcript_id and gene_id. If you provide a path to a file, it has to be either a tab-separated (.tsv) or comma-separated (.csv) file with a header. Defaults to None.

  • counts_from_abundance (Optional[Literal["scaled_tpm", "length_scaled_tpm", "dtu_scaled_tpm"]], optional) – Whether to calculate count estimates based on the abundance. When using scaled_tpm or length_scaled_tpm, the counts no longer correlate with the average transcript length per sample. In those cases, the length offset matrix should not be used for downstream analysis. Note that this does not normalize for sequencing depth, only for the difference in transcript length. When using the gene-summarized counts and not count estimates based on the abundance, the length offset matrix included in the output from this function should be used for downstream analysis. If your downstream analysis tool does not support the length offset matrix, you should probably use length_scaled_tpm for gene-level analysis. For transcript-level analysis, we recommend that you use scaled_tpm or dtu_scaled_tpm. For further guidance on transcript-level analysis, please refer to: https://doi.org/10.12688/f1000research.15398.3. Defaults to None.

  • gene_level (bool, optional) – Whether the input files are at the gene level. This is only the case for some RSEM quantification files. Defaults to False.

  • return_transcript_data (bool, optional) – Whether to return the transcript-level expression. Defaults to False.

  • inferential_replicates (bool, optional) – Whether to parse and include inferential replicates in the output. If you want to recalculate the counts from inferential replicates, please set this option to True and provide an inferential_replicate_transformer. Defaults to False.

  • inferential_replicate_transformer (Optional[Callable], optional) – A custom function to transform the inferential replicates. Defaults to None.

  • inferential_replicate_variance (bool, optional) – Whether to return the variance of the inferential replicates. Defaults to False.

  • ignore_transcript_version (bool, optional) – Whether to ignore the transcript version. Defaults to True.

  • ignore_after_bar (bool, optional) – Whether to split the transcript ID at the bar character (|) and ignore everything after it. Defaults to True.

  • id_column (Optional[str], optional) – The column name for the transcript id. Defaults to None.

  • counts_column (Optional[str], optional) – The column name for the counts. Defaults to None.

  • length_column (Optional[str], optional) – The column name for the length. Defaults to None.

  • abundance_column (Optional[str], optional) – The column name for the abundance. Defaults to None.

  • custom_importer (Optional[Callable], optional) – A custom importer function. Defaults to None.

  • existence_optional (bool, optional) – Whether the existence of the files is optional. Defaults to False.

  • read_length (Optional[int], optional) – The read length for the stringtie quantification. Defaults to None.

  • output_type (Literal["xarray", "anndata"], optional) – The type of output. Defaults to “anndata”.

  • output_format (Literal["csv", "h5ad"], optional) – The type of output file. Defaults to “csv”.

  • output_path (Optional[Union[str, Path]], optional) – The path to save the gene-level expression. Defaults to None.

  • output_path_overwrite (bool, optional) – Whether to overwrite the save path if it already exists. Defaults to False.

  • return_data (bool, optional) – Whether to return the gene-level expression. Defaults to True.

  • biotype_filter (List[str], optional) – Filter the transcripts by biotype, including only those provided. Enables post-hoc filtering of the data based on the biotype of the transcripts. Assumes that the biotype is present in the transcript_id of the data, bar-separated. If this is not the case, please use the filter_by_biotype function from the pytximport.utils module instead. Please note that the abundance will NOT be recalculated after filtering to avoid introducing bias. If you wish to recalculate the abundance, please use the filter_by_biotype function from the pytximport.utils module instead. Defaults to None.
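The effect of counts_from_abundance can be illustrated with a small pure-Python sketch. It assumes the scheme used by the original R tximport: start from the abundance, optionally multiply each transcript by its mean length across samples, then rescale every sample so that its total matches the original count total. pytximport's internals are not shown on this page and may differ. The function name scale_counts_from_abundance and the toy numbers are hypothetical, and dtu_scaled_tpm is omitted.

```python
def scale_counts_from_abundance(counts, abundance, lengths, method):
    """Derive count estimates from abundance (illustrative sketch only).

    All inputs are lists of rows (transcripts), each row holding one value
    per sample.
    """
    if method == "scaled_tpm":
        scaled = [row[:] for row in abundance]
    elif method == "length_scaled_tpm":
        # Weight each transcript by its mean length across samples.
        scaled = [
            [a * (sum(length_row) / len(length_row)) for a in abundance_row]
            for abundance_row, length_row in zip(abundance, lengths)
        ]
    else:
        raise ValueError(f"unsupported method in this sketch: {method}")

    n_samples = len(counts[0])
    result = [[0.0] * n_samples for _ in scaled]
    for j in range(n_samples):
        # Rescale the sample so its total equals the original count total.
        original_total = sum(row[j] for row in counts)
        scaled_total = sum(row[j] for row in scaled)
        for i, row in enumerate(scaled):
            result[i][j] = row[j] * original_total / scaled_total
    return result


# Toy data: 2 transcripts x 2 samples (hypothetical values).
counts = [[10.0, 20.0], [30.0, 40.0]]
abundance = [[100.0, 200.0], [300.0, 100.0]]
lengths = [[1000.0, 1200.0], [500.0, 500.0]]

scaled_tpm = scale_counts_from_abundance(counts, abundance, lengths, "scaled_tpm")
length_scaled_tpm = scale_counts_from_abundance(
    counts, abundance, lengths, "length_scaled_tpm"
)
```

In both variants the per-sample totals are preserved, which is consistent with the note above that sequencing depth is not normalized.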

Returns:

The estimated gene-level or transcript-level expression data if return_data is True, else None.

Return type:

Union[xr.Dataset, ad.AnnData, SummarizedExperiment, None]
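To make the ignore_after_bar and ignore_transcript_version parameters more concrete, here is a minimal sketch of the kind of identifier clean-up they describe. It is not pytximport's implementation: the function normalize_transcript_id is hypothetical, the example identifier is made up, and the sketch assumes that the version is whatever follows the first period in the identifier.

```python
def normalize_transcript_id(
    transcript_id,
    ignore_transcript_version=True,
    ignore_after_bar=True,
):
    """Return a simplified transcript ID (illustrative sketch only)."""
    if ignore_after_bar:
        # Keep only the part before the first bar character.
        transcript_id = transcript_id.split("|", 1)[0]
    if ignore_transcript_version:
        # Assumption: the version is the suffix after the first period.
        transcript_id = transcript_id.split(".", 1)[0]
    return transcript_id


# Hypothetical bar-separated identifier with a version suffix.
raw_id = "ENST00000000001.2|ENSG00000000001.5|lncRNA"

print(normalize_transcript_id(raw_id))
print(normalize_transcript_id(raw_id, ignore_transcript_version=False))
```

Normalizing identifiers this way is what allows them to match the transcript_id column of a transcript-to-gene map that lists unversioned IDs.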