pytximport.core

Expose the functions in the core module.

Attributes

Functions

tximport(file_paths[, data_type, transcript_gene_map, ...])

Import transcript-level quantification files and convert them to gene-level expression estimates.

Package Contents

pytximport.core.pytximport
pytximport.core.tximport(file_paths, data_type='salmon', transcript_gene_map=None, counts_from_abundance=None, gene_level=False, return_transcript_data=False, inferential_replicates=False, inferential_replicate_transformer=None, inferential_replicate_variance=False, ignore_transcript_version=True, ignore_after_bar=True, id_column=None, counts_column=None, length_column=None, abundance_column=None, custom_importer=None, existence_optional=False, read_length=None, output_type='anndata', output_format='csv', output_path=None, output_path_overwrite=False, return_data=True, biotype_filter=None)

Import transcript-level quantification files and convert them to gene-level expression estimates.

Basic usage:

from pytximport import tximport

txi = tximport(
    ["quant_1.sf", "quant_2.sf"],
    data_type="salmon",
    transcript_gene_map="transcript_to_gene_map.tsv",
    counts_from_abundance="length_scaled_tpm",
)
Parameters:
  • file_paths (List[Union[str, Path]]) – The paths to the quantification files.

  • data_type (Literal["kallisto", "salmon", "sailfish", "oarfish", "piscem", "stringtie", "rsem", "tsv"]) – The type of quantification files. Defaults to “salmon”.

  • transcript_gene_map (Optional[Union[pd.DataFrame, Union[str, Path]]], optional) – The mapping from transcripts to genes. It must contain two columns: transcript_id and gene_id. If you provide a path to a file, the file must be either tab-separated (.tsv) or comma-separated (.csv) and must have a header. Defaults to None.

  • counts_from_abundance (Optional[Literal["scaled_tpm", "length_scaled_tpm", "dtu_scaled_tpm"]], optional) – Whether to calculate count estimates from the abundance, and which method to use. With scaled_tpm or length_scaled_tpm, the counts no longer correlate with the average transcript length per sample, so the length offset matrix should not be used in downstream analysis. Note that this corrects only for differences in transcript length, not for sequencing depth. When using the gene-summarized counts rather than count estimates based on the abundance, the length offset matrix included in the output of this function should be used in downstream analysis. If your downstream analysis tool does not support the length offset matrix, you should probably use length_scaled_tpm for gene-level analysis. For transcript-level analysis, we recommend scaled_tpm or dtu_scaled_tpm. For further guidance on transcript-level analysis, see https://doi.org/10.12688/f1000research.15398.3. Defaults to None.

  • gene_level (bool, optional) – Whether the input files are at the gene level. This is only the case for some RSEM quantification files. Defaults to False.

  • return_transcript_data (bool, optional) – Whether to return the transcript-level expression. Defaults to False.

  • inferential_replicates (bool, optional) – Whether to parse and include inferential replicates in the output. To recalculate the counts from the inferential replicates, set this option to True and provide an inferential_replicate_transformer. Defaults to False.

  • inferential_replicate_transformer (Optional[Callable], optional) – A custom function to transform the inferential replicates. Defaults to None.

  • inferential_replicate_variance (bool, optional) – Whether to return the variance of the inferential replicates. Defaults to False.

  • ignore_transcript_version (bool, optional) – Whether to ignore the transcript version. Defaults to True.

  • ignore_after_bar (bool, optional) – Whether to split the transcript id at the bar character (|) and ignore everything after it. Defaults to True.

  • id_column (Optional[str], optional) – The column name for the transcript id. Defaults to None.

  • counts_column (Optional[str], optional) – The column name for the counts. Defaults to None.

  • length_column (Optional[str], optional) – The column name for the length. Defaults to None.

  • abundance_column (Optional[str], optional) – The column name for the abundance. Defaults to None.

  • custom_importer (Optional[Callable], optional) – A custom importer function. Defaults to None.

  • existence_optional (bool, optional) – Whether the existence of the files is optional. Defaults to False.

  • read_length (Optional[int], optional) – The read length for the stringtie quantification. Defaults to None.

  • output_type (Literal["xarray", "anndata"], optional) – The type of output. Defaults to “anndata”.

  • output_format (Literal["csv", "h5ad"], optional) – The type of output file. Defaults to “csv”.

  • output_path (Optional[Union[str, Path]], optional) – The path to save the gene-level expression. Defaults to None.

  • output_path_overwrite (bool, optional) – Whether to overwrite the save path if it already exists. Defaults to False.

  • return_data (bool, optional) – Whether to return the gene-level expression. Defaults to True.

  • biotype_filter (List[str], optional) – Filter the transcripts by biotype, including only those provided. Enables post-hoc filtering of the data based on the biotype of the transcripts. Assumes that the biotype is present in the transcript_id of the data, bar-separated. If this is not the case, please use the filter_by_biotype function from the pytximport.utils module instead. Please note that the abundance will NOT be recalculated after filtering to avoid introducing bias. If you wish to recalculate the abundance, please use the filter_by_biotype function from the pytximport.utils module instead. Defaults to None.
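The interaction between transcript_gene_map, ignore_transcript_version and ignore_after_bar can be illustrated with a small standalone sketch. It does not call pytximport: normalize_transcript_id is a hypothetical helper written for this example, the identifiers are made up, and the assumption that the version is everything after the first dot may not match the library's exact rules.

```python
import csv
import io


def normalize_transcript_id(transcript_id, ignore_after_bar=True,
                            ignore_transcript_version=True):
    """Hypothetical helper (not part of pytximport) that mimics the two flags."""
    if ignore_after_bar:
        # Keep only the part before the first bar character.
        transcript_id = transcript_id.split("|", 1)[0]
    if ignore_transcript_version:
        # Assumption: the version is everything after the first dot.
        transcript_id = transcript_id.split(".", 1)[0]
    return transcript_id


# A tab-separated map with the two required header columns (made-up IDs).
map_text = (
    "transcript_id\tgene_id\n"
    "TX0001\tGENE01\n"
    "TX0002\tGENE02\n"
)
reader = csv.DictReader(io.StringIO(map_text), delimiter="\t")
transcript_to_gene = {row["transcript_id"]: row["gene_id"] for row in reader}

# IDs as they might appear in a quantification file: versioned, and one of
# them carrying bar-separated annotation fields.
raw_ids = ["TX0001.2|GENE01.1|protein_coding|", "TX0002.5"]

# Without normalization the raw IDs would not be found in the map.
genes = [transcript_to_gene[normalize_transcript_id(i)] for i in raw_ids]
print(genes)
```

If the map itself contains versioned identifiers, the same consideration applies in reverse: both sides have to agree after normalization, otherwise transcripts cannot be assigned to genes.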

Returns:

The estimated gene-level or transcript-level expression data if return_data is True, else None.

Return type:

Union[xr.Dataset, ad.AnnData, SummarizedExperiment, None]
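
To make the counts_from_abundance options more concrete, the following is a minimal sketch of the arithmetic behind scaled_tpm and length_scaled_tpm, following the approach described for the original R tximport package. It is an illustration under that assumption, not pytximport's implementation; the function name counts_from_abundance_sketch and the toy numbers are invented for this example, and dtu_scaled_tpm is not covered.

```python
def counts_from_abundance_sketch(abundance, counts, length, method):
    """Illustrative only. Inputs are lists of rows (features) x columns (samples)."""
    if method == "scaled_tpm":
        # Start from the abundance itself.
        new_counts = [row[:] for row in abundance]
    elif method == "length_scaled_tpm":
        # Weight the abundance by the feature's average length across samples.
        new_counts = []
        for abundance_row, length_row in zip(abundance, length):
            mean_length = sum(length_row) / len(length_row)
            new_counts.append([a * mean_length for a in abundance_row])
    else:
        raise ValueError(f"Unsupported method in this sketch: {method}")

    # Rescale every sample (column) so its total matches the original counts.
    n_samples = len(counts[0])
    for j in range(n_samples):
        original_total = sum(row[j] for row in counts)
        new_total = sum(row[j] for row in new_counts)
        factor = original_total / new_total
        for row in new_counts:
            row[j] *= factor
    return new_counts


# Toy data: two features, two samples.
abundance = [[10.0, 20.0], [30.0, 20.0]]
counts = [[100.0, 300.0], [300.0, 100.0]]
length = [[1000.0, 1000.0], [2000.0, 2000.0]]

scaled = counts_from_abundance_sketch(abundance, counts, length, "scaled_tpm")
length_scaled = counts_from_abundance_sketch(
    abundance, counts, length, "length_scaled_tpm"
)
print(scaled)
print(length_scaled)
```

In both cases the per-sample totals stay equal to those of the original counts, which is why these estimates correct for transcript length but leave sequencing depth untouched.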