Vignette

This vignette extends the vignette for the R-version of tximport. If you are unfamiliar with tximport or curious about the motivation behind it, please check it out.

If you are looking for a full-featured end-to-end workflow for Pythonic bulk RNA-sequencing analysis, check out our Snakemake workflow based on pytximport.

Creating your transcript to gene map

Here, we will show you how to generate a transcript-to-gene mapping based on the Ensembl reference or a gene transfer format file.

Build it from Ensembl

This example requires pybiomart which is installed together with pytximport. Providing a host is optional, for a list of available archives that correspond to Ensembl releases, please consult https://www.ensembl.org/info/website/archives/index.html. By default, the transcript ids will be mapped to the ensembl_gene_id field. If you prefer to use gene names, choose external_gene_name. Be aware that not all proposed transcripts have been assigned a name yet and thus will not be included if you use gene names. The first time you run this function, it may take a few seconds to download the data.

[1]:
from pytximport.utils import create_transcript_gene_map

transcript_gene_map_human = create_transcript_gene_map(
    species="human",
    target_field="external_gene_name",
)
transcript_gene_map_human.head(5)
[1]:
transcript_id gene_id
0 ENST00000387314 MT-TF
1 ENST00000389680 MT-RNR1
2 ENST00000387342 MT-TV
3 ENST00000387347 MT-RNR2
4 ENST00000386347 MT-TL1
[2]:
transcript_gene_map_mouse = create_transcript_gene_map(species="mouse")
transcript_gene_map_mouse.head(5)
[2]:
transcript_id gene_id
0 ENSMUST00000082387 ENSMUSG00000064336
1 ENSMUST00000082388 ENSMUSG00000064337
2 ENSMUST00000082389 ENSMUSG00000064338
3 ENSMUST00000082390 ENSMUSG00000064339
4 ENSMUST00000082391 ENSMUSG00000064340

You can also provide a list of mapping targets as the target_field argument. For example, to map transcripts to both gene names and gene biotypes, you can use the following code:

[3]:
transcript_gene_map_human_biotype = create_transcript_gene_map(
    species="human",
    target_field=["external_gene_name", "gene_biotype"],
)
transcript_gene_map_human_biotype = transcript_gene_map_human_biotype[
    transcript_gene_map_human_biotype["gene_biotype"] == "lncRNA"
]
transcript_gene_map_human_biotype.head(5)
[3]:
transcript_id gene_id gene_biotype
58 ENST00000673100 LINC03015 lncRNA
59 ENST00000673009 LINC03015 lncRNA
60 ENST00000671859 LINC03015 lncRNA
61 ENST00000673474 LINC03015 lncRNA
62 ENST00000671974 LINC03015 lncRNA

Use a gene transfer format file

If you already have an annotation file in .gtf format (e.g. from the GENCODE or Ensembl references), you can use the create_transcript_gene_map_from_annotation function contained in pytximport.utils to generate a transcript to gene map. Using an annotation file is considered best practice, as it ensures that your annotation and alignment reference are the same if you download them from the same source, e.g., an Ensembl release.

Set target_field to gene_name to map transcript ids to gene names instead of gene ids. If the transcript does not have a corresponding gene name in the annotation file, it will be mapped to the gene id.

[4]:
from pytximport.utils import create_transcript_gene_map_from_annotation

transcript_gene_map_from_gtf = create_transcript_gene_map_from_annotation(
    "../../test/data/annotation.gtf",
    target_field="gene_name",
)
transcript_gene_map_from_gtf.head(5)
[4]:
transcript_id gene_id
0 ENST00000424215 ENSG00000228037
1 ENST00000511072 PRDM16
2 ENST00000607632 PRDM16
3 ENST00000378391 PRDM16
4 ENST00000514189 PRDM16

Map transcript names to gene names

You can also optionally map transcript names to either gene names or gene ids if your data requires it.

[5]:
transcript_name_gene_map_human = create_transcript_gene_map(
    "human",
    source_field="external_transcript_name",
    target_field="external_gene_name",
)
transcript_name_gene_map_human.head(5)
[5]:
transcript_name gene_id
0 MT-TF-201 MT-TF
1 MT-RNR1-201 MT-RNR1
2 MT-TV-201 MT-TV
3 MT-RNR2-201 MT-RNR2
4 MT-TL1-201 MT-TL1

Importing transcript quantification files

You can easily import quantification files from tools like salmon with pytximport.

[ ]:
import numpy as np
import pandas as pd

from pytximport import tximport
[7]:
txi = tximport(
    [
        "../../test/data/salmon/multiple/Sample_1.sf",
        "../../test/data/salmon/multiple/Sample_2.sf",
    ],
    "salmon",
    transcript_gene_map_mouse,
    counts_from_abundance="length_scaled_tpm",
    output_type="xarray",  # or "anndata"
)
txi
Reading quantification files: 2it [00:00, 180.15it/s]
[7]:
<xarray.Dataset> Size: 28kB
Dimensions:    (gene_id: 496, file: 2, file_path: 2)
Coordinates:
  * gene_id    (gene_id) object 4kB 'ENSMUSG00000083355' ... 'ENSMUSG00000067...
  * file_path  (file_path) <U43 344B '../../test/data/salmon/multiple/Sample_...
Dimensions without coordinates: file
Data variables:
    abundance  (gene_id, file) float64 8kB 0.08291 0.0 0.09854 ... 0.4618 0.0
    counts     (gene_id, file) float64 8kB 1.005 0.0 1.086 ... 1.957 6.208 0.0
    length     (gene_id, file) float64 8kB 509.1 509.1 445.8 ... 564.6 564.6

Exporting transcript-level count estimates

You can also export the transformed transcript counts directly for transcript-level analysis.

[8]:
txi = tximport(
    ["../../test/data/salmon/quant.sf"],
    "salmon",
    counts_from_abundance="scaled_tpm",
    return_transcript_data=True,
)
txi
Reading quantification files: 1it [00:00, 293.78it/s]
[8]:
AnnData object with n_obs × n_vars = 1 × 14
    uns: 'counts_from_abundance'
    obsm: 'length', 'abundance'

Note that the example above works without a transcript to gene map. If you want to use the transcript names instead of the transcript ids, you can optionally use the replace_transcript_ids_with_names function together with a transcript id to transcript name map. We can use the create_transcript_gene_map function to create a map between transcript ids and transcript names, too.

[9]:
from pytximport.utils import replace_transcript_ids_with_names
[10]:
transcript_name_map_human = create_transcript_gene_map("human", target_field="external_transcript_name")
transcript_name_map_human.head(5)
[10]:
transcript_id transcript_name
0 ENST00000387314 MT-TF-201
1 ENST00000389680 MT-RNR1-201
2 ENST00000387342 MT-TV-201
3 ENST00000387347 MT-RNR2-201
4 ENST00000386347 MT-TL1-201
[11]:
txi = replace_transcript_ids_with_names(txi, transcript_name_map_human)
pd.DataFrame(txi.X.T, index=txi.var.index, columns=txi.obs.index).sort_values(
    by=txi.obs.index[0],
    ascending=False,
).head(5)
[11]:
../../test/data/salmon/quant.sf
HOXC8-201 4486.940412
UGT3A2-201 1307.314695
HOXC9-201 886.909534
HOXC4-202 749.069369
HOXC12-201 544.817685

The same data summarized to genes:

[12]:
txi = tximport(
    ["../../test/data/salmon/quant.sf"],
    "salmon",
    transcript_gene_map_human,
    counts_from_abundance="scaled_tpm",
    output_type="xarray",
    return_transcript_data=False,
)
pd.DataFrame(txi["counts"], index=txi.coords["gene_id"], columns=txi.coords["file_path"]).sort_values(
    by=txi.coords["file_path"].data[0],
    ascending=False,
).head(5)
Reading quantification files: 1it [00:00, 105.05it/s]
[12]:
../../test/data/salmon/quant.sf
HOXC8 4486.940412
UGT3A2 1506.597257
HOXC4 1152.964133
HOXC9 886.909534
HOXC12 544.817685

Note that the top transcript corresponds to the top expressed gene in this case.

Exporting AnnData files

pytximport integrates well with other packages from the scverse through its AnnData export option.

[13]:
txi_ad = tximport(
    ["../../test/data/salmon/quant.sf"],
    "salmon",
    transcript_gene_map_human,
    output_type="anndata",
    # the output can optionally be saved to a file by uncommenting the following lines
    # output_format="h5ad",
    # output_path="txi_ad.h5ad",
)
txi_ad
Reading quantification files: 1it [00:00, 366.57it/s]
[13]:
AnnData object with n_obs × n_vars = 1 × 10
    uns: 'counts_from_abundance'
    obsm: 'length', 'abundance'

Exporting SummarizedExperiment files

Experiment support for SummarizedExperiment files is available through the BiocPy ecosystem (unaffiliated). While not part of the core functionality of pytximport, this output type/format may be useful for interacting with other R software packages. The optional dependencies necessary to support SummarizedExperiment files can be installed via pip install pytximport[biocpy].

[14]:
txi_se = tximport(
    ["../../test/data/salmon/quant.sf"],
    "salmon",
    transcript_gene_map_human,
    output_type="summarizedexperiment",
    # the output can optionally be saved to disk by uncommenting the following lines
    # output_format="summarizedexperiment",
    # output_path="txi_se",
)
txi_se.assay_names, txi_se.get_row_names(), txi_se.get_column_names()
Reading quantification files: 1it [00:00, 357.27it/s]
WARNING:root:Support for the SummarizedExperiment output type is experimental.
[14]:
(['counts'],
 Names(['UGT3A2', 'HOXC6', 'HOXC9', 'HOXC11', 'HOXC4', 'HOXC10', 'HOXC13', 'HOXC5', 'HOXC8', 'HOXC12']),
 Names(['../../test/data/salmon/quant.sf']))

Use from the command line

You can run pytximport from the command line, too. Available options can be viewed via the pytximport --help command.

[15]:
!pytximport -i ../../test/data/salmon/quant.sf -t "salmon" -m ../../test/data/gencode.v46.metadata.HGNC.tsv -of "h5ad" -ow -o ../../test/data/salmon/quant.h5ad
2024-11-26 15:53:23,385: Starting the import.
Reading quantification files: 1it [00:00, 94.06it/s]
2024-11-26 15:53:23,666: Converting transcript-level expression to gene-level expression.
2024-11-26 15:53:23,722: Matching gene_ids.
2024-11-26 15:53:23,747: Creating gene abundance.
2024-11-26 15:53:23,927: Creating gene counts.
2024-11-26 15:53:23,927: Creating lengths.
2024-11-26 15:53:23,927: Replacing missing lengths.
2024-11-26 15:53:23,932: Creating gene expression dataset.
2024-11-26 15:53:23,934: Saving the gene-level expression to: ../../test/data/salmon/quant.h5ad.
2024-11-26 15:53:23,939: Finished the import in 0.55 seconds.

You can also create a transcript-to-gene mapping via the pytximport create-map command.

[16]:
!pytximport create-map -i ../../test/data/annotation.gtf -o ./transcript_gene_map.tsv -ow
2024-11-26 15:53:25,049: Creating a transcript-to-gene mapping file.
2024-11-26 15:53:25,057: Created the transcript-to-gene mapping file. Saving the file...
2024-11-26 15:53:25,059: Saved the transcript-to-gene mapping file to ./transcript_gene_map.tsv.

Inferential replicates

pytximport can handle bootstraping replicates provided by salmon and kallisto. When inferential_replicate_transformer is set, the provided function is used to recalculate the counts and abundances for each sample based on the bootstraps.

[17]:
result = tximport(
    [
        "../../test/data/fabry_disease/SRR16504309_wt/",
        "../../test/data/fabry_disease/SRR16504310_wt/",
        "../../test/data/fabry_disease/SRR16504311_ko/",
        "../../test/data/fabry_disease/SRR16504312_ko/",
    ],
    "salmon",
    transcript_gene_map_human,
    inferential_replicates=True,
    inferential_replicate_variance=True,  # whether to calculate the variance of the inferential replicates
    inferential_replicate_transformer=lambda x: np.median(x, axis=1),
    counts_from_abundance="length_scaled_tpm",
)
result
Reading quantification files: 4it [00:00,  5.80it/s]
WARNING:root:Not all transcripts are present in the mapping. 31380 out of 253181 missing. Removing the missing transcripts.
[17]:
AnnData object with n_obs × n_vars = 4 × 40768
    uns: 'counts_from_abundance', 'inferential_replicates'
    obsm: 'length', 'abundance', 'variance'

Biotype filtering

For some use cases, it may be of interest to restrict the results to certain gene_biotypes, e.g., protein-coding only. First, generate a transcript-to-gene mapping from an annotation GTF file that includes the gene biotype.

[18]:
transcript_gene_map_from_gtf_with_biotype = create_transcript_gene_map_from_annotation(
    "../../test/data/annotation.gtf",
    target_field=["gene_name", "gene_biotype"],
)
transcript_gene_map_from_gtf_with_biotype.head(5)
[18]:
transcript_id gene_id gene_biotype
0 ENST00000424215 ENSG00000228037 lncRNA
1 ENST00000511072 PRDM16 protein_coding
2 ENST00000607632 PRDM16 protein_coding
3 ENST00000378391 PRDM16 protein_coding
4 ENST00000514189 PRDM16 protein_coding

Then, import quantification files using the transcript-to-gene mapping and filter the counts by biotype.

[19]:
from pytximport.utils import filter_by_biotype

result_small = tximport(
    ["../../test/data/fabry_disease/SRR16504309_wt/"],
    "salmon",
    transcript_gene_map_from_gtf_with_biotype,
    counts_from_abundance="length_scaled_tpm",
)

result_small_filtered = filter_by_biotype(
    result_small,
    transcript_gene_map_from_gtf_with_biotype,
    biotype_filter=["protein_coding"],
    # Since the data is already at the gene level, we have to use the gene_id from the transcript_gene_map
    id_column="gene_id",
)

len(result_small.var_names), len(result_small_filtered.var_names)
Reading quantification files: 1it [00:00, 19.53it/s]
WARNING:root:Not all transcripts are present in the mapping. 253161 out of 253181 missing. Removing the missing transcripts.
[19]:
(12, 4)

Downstream analysis with PyDESeq2

The output from pytximport can easily be used for downstream analysis with PyDESeq2. For more information on PyDESeq2, please consult its documentation.

[20]:
%pip install pydeseq2 numba==0.60.0 decoupler adjustText omnipath tqdm ipywidgets seaborn -q
Note: you may need to restart the kernel to use updated packages.
[ ]:
import decoupler as dc
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats

Load the .csv file generated by pytximport via the output_path argument or create it directly from the output of pytximport. In this case, we are working with the salmon quantification files from a public bulk RNA sequencing dataset: Podocyte injury in Fabry nephropathy

Round count estimates (required by PyDESeq2) and add the corresponding metadata.

[22]:
result.X = result.X.round().astype(int)
result.obs["condition"] = ["Control", "Control", "Disease", "Disease"]
result
[22]:
AnnData object with n_obs × n_vars = 4 × 40768
    obs: 'condition'
    uns: 'counts_from_abundance', 'inferential_replicates'
    obsm: 'length', 'abundance', 'variance'

Filter genes with low counts out.

[23]:
result = result[:, result.X.max(axis=0) > 10].copy()
result
[23]:
AnnData object with n_obs × n_vars = 4 × 14714
    obs: 'condition'
    uns: 'counts_from_abundance', 'inferential_replicates'
    obsm: 'length', 'abundance', 'variance'

Now perform your PyDESeq2 analysis.

[24]:
dds = DeseqDataSet(
    adata=result,
    design_factors="condition",
    refit_cooks=True,
    inference=DefaultInference(n_cpus=8),
    quiet=True,
)
[25]:
dds.deseq2()
/Users/au734063/Documents/code/pytximport-publish/pytximport/.venv/lib/python3.12/site-packages/pydeseq2/dds.py:490: UserWarning: As the residual degrees of freedom is less than 3, the distribution of log dispersions is especially asymmetric and likely to be poorly estimated by the MAD.
  self.fit_dispersion_prior()
[26]:
stat_result = DeseqStats(dds, contrast=("condition", "Disease", "Control"), quiet=True)
[27]:
%%capture
stat_result.summary()
stat_result.lfc_shrink()
[28]:
dc.plot_volcano_df(
    stat_result.results_df,
    x="log2FoldChange",
    y="padj",
    sign_thr=0.01,
    top=20,
    figsize=(10, 5),
)
_images/example_60_0.png
[29]:
collectri = dc.get_collectri(organism="human", split_complexes=False)
mat = stat_result.results_df[["stat"]].T.rename(index={"stat": "disease.vs.control"})
tf_acts, tf_pvals = dc.run_ulm(mat=mat, net=collectri, verbose=True)
dc.plot_barplot(acts=tf_acts, contrast="disease.vs.control", top=10, vertical=False, figsize=(5, 3))
Running ulm on mat with 1 samples and 14714 targets for 660 sources.
_images/example_61_1.png

We can also evaluate known pathways.

[30]:
progeny = dc.get_progeny(top=500)
pathway_acts, pathway_pvals = dc.run_mlm(mat=mat, net=progeny, min_n=5)
[31]:
dc.plot_barplot(
    acts=pathway_acts,
    contrast="disease.vs.control",
    top=40,
    vertical=False,
    figsize=(5, 3),
)
_images/example_64_0.png

Please refer to the PyDESeq2 and the decoupler documentation for additional analyses.