Vignette¶
This vignette extends the vignette for the R-version of tximport. If you are unfamiliar with tximport or curious about the motivation behind it, please check it out.
If you are looking for a full-featured end-to-end workflow for Pythonic bulk RNA-sequencing analysis, check out our Snakemake workflow based on pytximport.
Creating your transcript to gene map¶
Here, we will show you how to generate a transcript-to-gene mapping based on the Ensembl reference or a gene transfer format file.
Build it from Ensembl¶
This example requires pybiomart which is installed together with pytximport. Providing a host is optional, for a list of available archives that correspond to Ensembl releases, please consult https://www.ensembl.org/info/website/archives/index.html. By default, the transcript ids will be mapped to the ensembl_gene_id field. If you prefer to use gene names, choose external_gene_name. Be aware that not all proposed transcripts have been assigned a name yet and thus will not be
included if you use gene names. The first time you run this function, it may take a few seconds to download the data.
[1]:
from pytximport.utils import create_transcript_gene_map
transcript_gene_map_human = create_transcript_gene_map(
species="human",
target_field="external_gene_name",
)
transcript_gene_map_human.head(5)
[1]:
| transcript_id | gene_id | |
|---|---|---|
| 0 | ENST00000387314 | MT-TF |
| 1 | ENST00000389680 | MT-RNR1 |
| 2 | ENST00000387342 | MT-TV |
| 3 | ENST00000387347 | MT-RNR2 |
| 4 | ENST00000386347 | MT-TL1 |
[2]:
transcript_gene_map_mouse = create_transcript_gene_map(species="mouse")
transcript_gene_map_mouse.head(5)
[2]:
| transcript_id | gene_id | |
|---|---|---|
| 0 | ENSMUST00000082387 | ENSMUSG00000064336 |
| 1 | ENSMUST00000082388 | ENSMUSG00000064337 |
| 2 | ENSMUST00000082389 | ENSMUSG00000064338 |
| 3 | ENSMUST00000082390 | ENSMUSG00000064339 |
| 4 | ENSMUST00000082391 | ENSMUSG00000064340 |
Use a gene transfer format file¶
If you already have an annotation file in .gtf format (e.g. from the GENCODE or Ensembl references), you can use the create_transcript_gene_map_from_annotation function contained in pytximport.utils to generate a transcript to gene map. Using an annotation file is considered best practice, as it ensures that your annotation and alignment reference are the same if you download them from the same source, e.g., an Ensembl release.
Set target_field to gene_name to map transcript ids to gene names instead of gene ids. If the transcript does not have a corresponding gene name in the annotation file, it will be mapped to the gene id.
[3]:
from pytximport.utils import create_transcript_gene_map_from_annotation
transcript_gene_map_from_gtf = create_transcript_gene_map_from_annotation(
"../../test/data/annotation.gtf",
target_field="gene_name",
)
transcript_gene_map_from_gtf.head(5)
[3]:
| transcript_id | gene_id | |
|---|---|---|
| 0 | ENST00000424215 | ENSG00000228037 |
| 1 | ENST00000511072 | PRDM16 |
| 2 | ENST00000607632 | PRDM16 |
| 3 | ENST00000378391 | PRDM16 |
| 4 | ENST00000514189 | PRDM16 |
Map transcript names to gene names¶
You can also optionally map transcript names to either gene names or gene ids if your data requires it.
[4]:
transcript_name_gene_map_human = create_transcript_gene_map(
"human",
source_field="external_transcript_name",
target_field="external_gene_name",
)
transcript_name_gene_map_human.head(5)
[4]:
| transcript_id | gene_id | |
|---|---|---|
| 0 | MT-TF-201 | MT-TF |
| 1 | MT-RNR1-201 | MT-RNR1 |
| 2 | MT-TV-201 | MT-TV |
| 3 | MT-RNR2-201 | MT-RNR2 |
| 4 | MT-TL1-201 | MT-TL1 |
Importing transcript quantification files¶
You can easily import quantification files from tools like salmon with pytximport.
[5]:
import numpy as np
import pandas as pd
from pytximport import tximport
[6]:
txi = tximport(
[
"../../test/data/salmon/multiple/Sample_1.sf",
"../../test/data/salmon/multiple/Sample_2.sf",
],
"salmon",
transcript_gene_map_mouse,
counts_from_abundance="length_scaled_tpm",
output_type="xarray", # or "anndata"
)
txi
Reading quantification files: 2it [00:00, 170.88it/s]
[6]:
<xarray.Dataset> Size: 28kB
Dimensions: (gene_id: 496, file: 2, file_path: 2)
Coordinates:
* gene_id (gene_id) object 4kB 'ENSMUSG00000083355' ... 'ENSMUSG00000067...
* file_path (file_path) <U43 344B '../../test/data/salmon/multiple/Sample_...
Dimensions without coordinates: file
Data variables:
abundance (gene_id, file) float64 8kB 0.08291 0.0 0.09854 ... 0.4618 0.0
counts (gene_id, file) float64 8kB 1.005 0.0 1.086 ... 1.957 6.208 0.0
length (gene_id, file) float64 8kB 509.1 509.1 445.8 ... 564.6 564.6Exporting transcript-level count estimates¶
You can also export the transformed transcript counts directly for transcript-level analysis.
[7]:
txi = tximport(
["../../test/data/salmon/quant.sf"],
"salmon",
counts_from_abundance="scaled_tpm",
return_transcript_data=True,
)
txi
Reading quantification files: 1it [00:00, 323.11it/s]
[7]:
AnnData object with n_obs × n_vars = 1 × 14
uns: 'counts_from_abundance'
obsm: 'length', 'abundance'
Note that the example above works without a transcript to gene map. If you want to use the transcript names instead of the transcript ids, you can optionally use the replace_transcript_ids_with_names function together with a transcript id to transcript name map. We can use the create_transcript_gene_map function to create a map between transcript ids and transcript names, too.
[8]:
from pytximport.utils import replace_transcript_ids_with_names
[9]:
transcript_name_map_human = create_transcript_gene_map("human", target_field="external_transcript_name")
transcript_name_map_human.head(5)
[9]:
| transcript_id | transcript_name | |
|---|---|---|
| 0 | ENST00000387314 | MT-TF-201 |
| 1 | ENST00000389680 | MT-RNR1-201 |
| 2 | ENST00000387342 | MT-TV-201 |
| 3 | ENST00000387347 | MT-RNR2-201 |
| 4 | ENST00000386347 | MT-TL1-201 |
[10]:
txi = replace_transcript_ids_with_names(txi, transcript_name_map_human)
pd.DataFrame(txi.X.T, index=txi.var.index, columns=txi.obs.index).sort_values(
by=txi.obs.index[0],
ascending=False,
).head(5)
[10]:
| ../../test/data/salmon/quant.sf | |
|---|---|
| HOXC8-201 | 4486.940412 |
| UGT3A2-201 | 1307.314695 |
| HOXC9-201 | 886.909534 |
| HOXC4-202 | 749.069369 |
| HOXC12-201 | 544.817685 |
The same data summarized to genes:
[11]:
txi = tximport(
["../../test/data/salmon/quant.sf"],
"salmon",
transcript_gene_map_human,
counts_from_abundance="scaled_tpm",
output_type="xarray",
return_transcript_data=False,
)
pd.DataFrame(txi["counts"], index=txi.coords["gene_id"], columns=txi.coords["file_path"]).sort_values(
by=txi.coords["file_path"].data[0],
ascending=False,
).head(5)
Reading quantification files: 1it [00:00, 114.91it/s]
[11]:
| ../../test/data/salmon/quant.sf | |
|---|---|
| HOXC8 | 4486.940412 |
| UGT3A2 | 1506.597257 |
| HOXC4 | 1152.964133 |
| HOXC9 | 886.909534 |
| HOXC12 | 544.817685 |
Note that the top transcript corresponds to the top expressed gene in this case.
Exporting AnnData files¶
pytximport integrates well with other packages from the scverse through its AnnData export option.
[12]:
txi_ad = tximport(
["../../test/data/salmon/quant.sf"],
"salmon",
transcript_gene_map_human,
output_type="anndata",
# the output can optionally be saved to a file by uncommenting the following lines
# output_format="h5ad",
# output_path="txi_ad.h5ad",
)
txi_ad
Reading quantification files: 1it [00:00, 332.99it/s]
[12]:
AnnData object with n_obs × n_vars = 1 × 10
uns: 'counts_from_abundance'
obsm: 'length', 'abundance'
Exporting SummarizedExperiment files¶
Experiment support for SummarizedExperiment files is available through the BiocPy ecosystem (unaffiliated). While not part of the core functionality of pytximport, this output type/format may be useful for interacting with other R software packages. The optional dependencies necessary to support SummarizedExperiment files can be installed via pip install pytximport[biocpy].
[13]:
txi_se = tximport(
["../../test/data/salmon/quant.sf"],
"salmon",
transcript_gene_map_human,
output_type="summarizedexperiment",
# the output can optionally be saved to disk by uncommenting the following lines
# output_format="summarizedexperiment",
# output_path="txi_se",
)
txi_se.assay_names, txi_se.get_row_names(), txi_se.get_column_names()
Reading quantification files: 1it [00:00, 372.03it/s]
WARNING:root:Support for the SummarizedExperiment output type is experimental.
[13]:
(['counts'],
Names(['UGT3A2', 'HOXC6', 'HOXC9', 'HOXC11', 'HOXC4', 'HOXC10', 'HOXC13', 'HOXC5', 'HOXC8', 'HOXC12']),
Names(['../../test/data/salmon/quant.sf']))
Use from the command line¶
You can run pytximport from the command line, too. Available options can be viewed via the pytximport --help command.
[14]:
!pytximport -i ../../test/data/salmon/quant.sf -t "salmon" -m ../../test/data/gencode.v46.metadata.HGNC.tsv -of "h5ad" -ow -o ../../test/data/salmon/quant.h5ad
2024-10-01 12:36:11,849: Starting the import.
Reading quantification files: 1it [00:00, 114.33it/s]
2024-10-01 12:36:12,137: Converting transcript-level expression to gene-level expression.
2024-10-01 12:36:12,239: Matching gene_ids.
2024-10-01 12:36:12,259: Creating gene abundance.
2024-10-01 12:36:12,409: Creating gene counts.
2024-10-01 12:36:12,409: Creating lengths.
2024-10-01 12:36:12,409: Replacing missing lengths.
2024-10-01 12:36:12,414: Creating gene expression dataset.
2024-10-01 12:36:12,415: Saving the gene-level expression to: ../../test/data/salmon/quant.h5ad.
2024-10-01 12:36:12,419: Finished the import in 0.57 seconds.
Inferential replicates¶
pytximport can handle bootstraping replicates provided by salmon and kallisto. When inferential_replicate_transformer is set, the provided function is used to recalculate the counts and abundances for each sample based on the bootstraps.
[15]:
result = tximport(
[
"../../test/data/fabry_disease/SRR16504309_wt/",
"../../test/data/fabry_disease/SRR16504310_wt/",
"../../test/data/fabry_disease/SRR16504311_ko/",
"../../test/data/fabry_disease/SRR16504312_ko/",
],
"salmon",
transcript_gene_map_human,
inferential_replicates=True,
inferential_replicate_variance=True, # whether to calculate the variance of the inferential replicates
inferential_replicate_transformer=lambda x: np.median(x, axis=1),
counts_from_abundance="length_scaled_tpm",
)
result
Reading quantification files: 4it [00:00, 9.55it/s]
WARNING:root:Not all transcripts are present in the mapping. 31380 out of 253181 missing. Removing the missing transcripts.
[15]:
AnnData object with n_obs × n_vars = 4 × 40768
uns: 'counts_from_abundance', 'inferential_replicates'
obsm: 'length', 'abundance', 'variance'
Downstream analysis with PyDESeq2¶
The output from pytximport can easily be used for downstream analysis with PyDESeq2. For more information on PyDESeq2, please consult its documentation.
[16]:
%pip install pydeseq2 numba==0.60.0 decoupler adjustText omnipath tqdm ipywidgets seaborn -q
Note: you may need to restart the kernel to use updated packages.
[17]:
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
import decoupler as dc
Load the .csv file generated by pytximport via the output_path argument or create it directly from the output of pytximport. In this case, we are working with the salmon quantification files from a public bulk RNA sequencing dataset: Podocyte injury in Fabry nephropathy
Round count estimates (required by PyDESeq2) and add the corresponding metadata.
[18]:
result.X = result.X.round().astype(int)
result.obs["condition"] = ["Control", "Control", "Disease", "Disease"]
result
[18]:
AnnData object with n_obs × n_vars = 4 × 40768
obs: 'condition'
uns: 'counts_from_abundance', 'inferential_replicates'
obsm: 'length', 'abundance', 'variance'
Filter genes with low counts out.
[19]:
result = result[:, result.X.max(axis=0) > 10].copy()
result
[19]:
AnnData object with n_obs × n_vars = 4 × 14714
obs: 'condition'
uns: 'counts_from_abundance', 'inferential_replicates'
obsm: 'length', 'abundance', 'variance'
Now perform your PyDESeq2 analysis.
[20]:
dds = DeseqDataSet(
adata=result,
design_factors="condition",
refit_cooks=True,
inference=DefaultInference(n_cpus=8),
quiet=True,
)
[21]:
dds.deseq2()
/Users/au734063/Documents/code/pytximport-publish/pytximport/.venv/lib/python3.12/site-packages/pydeseq2/dds.py:490: UserWarning: As the residual degrees of freedom is less than 3, the distribution of log dispersions is especially asymmetric and likely to be poorly estimated by the MAD.
self.fit_dispersion_prior()
[22]:
stat_result = DeseqStats(dds, contrast=("condition", "Disease", "Control"), quiet=True)
[23]:
%%capture
stat_result.summary()
stat_result.lfc_shrink()
[24]:
dc.plot_volcano_df(
stat_result.results_df,
x="log2FoldChange",
y="padj",
sign_thr=0.01,
top=20,
figsize=(10, 5),
)
[25]:
collectri = dc.get_collectri(organism="human", split_complexes=False)
mat = stat_result.results_df[["stat"]].T.rename(index={"stat": "disease.vs.control"})
tf_acts, tf_pvals = dc.run_ulm(mat=mat, net=collectri, verbose=True)
dc.plot_barplot(acts=tf_acts, contrast="disease.vs.control", top=10, vertical=False, figsize=(5, 3))
Running ulm on mat with 1 samples and 14714 targets for 660 sources.
We can also evaluate known pathways.
[26]:
progeny = dc.get_progeny(top=500)
pathway_acts, pathway_pvals = dc.run_mlm(mat=mat, net=progeny, min_n=5)
[27]:
dc.plot_barplot(
acts=pathway_acts,
contrast="disease.vs.control",
top=40,
vertical=False,
figsize=(5, 3),
)
Please refer to the PyDESeq2 and the decoupler documentation for additional analyses.