{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Vignette" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This vignette extends the [vignette for the R version of tximport](https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html). If you are unfamiliar with `tximport` or curious about the motivation behind it, please read that vignette first." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are looking for a full-featured end-to-end workflow for Pythonic bulk RNA-sequencing analysis, check out our [Snakemake workflow](https://github.com/complextissue/snakemake-bulk-rna-seq-workflow/) based on pytximport." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating your transcript-to-gene map\n", "\n", "Here, we will show you how to generate a transcript-to-gene mapping based on the Ensembl reference or a Gene Transfer Format (GTF) file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build it from Ensembl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example requires `pybiomart`, which is installed together with `pytximport`. Providing a host is optional. For a list of available archives that correspond to Ensembl releases, please consult [https://www.ensembl.org/info/website/archives/index.html](https://www.ensembl.org/info/website/archives/index.html). By default, the transcript IDs will be mapped to the `ensembl_gene_id` field. If you prefer to use gene names, choose `external_gene_name`. Be aware that not all proposed transcripts have been assigned a name yet; those without one are excluded if you use gene names. The first time you run this function, it may take a few seconds to download the data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

|   | transcript_id | gene_id |
|---|---|---|
| 0 | ENST00000387314 | MT-TF |
| 1 | ENST00000389680 | MT-RNR1 |
| 2 | ENST00000387342 | MT-TV |
| 3 | ENST00000387347 | MT-RNR2 |
| 4 | ENST00000386347 | MT-TL1 |

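
The note above about unnamed transcripts can be illustrated with plain pandas. This is a minimal sketch of the effect only, not pytximport's own implementation: the `gene_name` column is a hypothetical stand-in for Ensembl's `external_gene_name`, and treating `ENST00000424215` as unnamed is an assumption made for illustration.

```python
import pandas as pd

# Illustrative rows only. `gene_name` is a hypothetical stand-in for the
# `external_gene_name` field; None marks a transcript assumed to have no name.
transcript_gene_map = pd.DataFrame(
    {
        "transcript_id": ["ENST00000424215", "ENST00000511072", "ENST00000607632"],
        "gene_name": [None, "PRDM16", "PRDM16"],
    }
)

# Mapping to gene names keeps only transcripts that have a name,
# so the unnamed transcript drops out of the resulting map.
named_only = (
    transcript_gene_map.dropna(subset=["gene_name"])
    .rename(columns={"gene_name": "gene_id"})
    .reset_index(drop=True)
)

print(named_only)
```

If losing such transcripts matters for your analysis, mapping to `ensembl_gene_id` (the default) avoids the problem, since that field does not depend on a gene name having been assigned.
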

|   | transcript_id | gene_id |
|---|---|---|
| 0 | ENSMUST00000082387 | ENSMUSG00000064336 |
| 1 | ENSMUST00000082388 | ENSMUSG00000064337 |
| 2 | ENSMUST00000082389 | ENSMUSG00000064338 |
| 3 | ENSMUST00000082390 | ENSMUSG00000064339 |
| 4 | ENSMUST00000082391 | ENSMUSG00000064340 |


|   | transcript_id | gene_id | gene_biotype |
|---|---|---|---|
| 58 | ENST00000673100 | LINC03015 | lncRNA |
| 59 | ENST00000673009 | LINC03015 | lncRNA |
| 60 | ENST00000671859 | LINC03015 | lncRNA |
| 61 | ENST00000673474 | LINC03015 | lncRNA |
| 62 | ENST00000671974 | LINC03015 | lncRNA |


|   | transcript_id | gene_id |
|---|---|---|
| 0 | ENST00000424215 | ENSG00000228037 |
| 1 | ENST00000511072 | PRDM16 |
| 2 | ENST00000607632 | PRDM16 |
| 3 | ENST00000378391 | PRDM16 |
| 4 | ENST00000514189 | PRDM16 |


|   | transcript_name | gene_id |
|---|---|---|
| 0 | MT-TF-201 | MT-TF |
| 1 | MT-RNR1-201 | MT-RNR1 |
| 2 | MT-TV-201 | MT-TV |
| 3 | MT-RNR2-201 | MT-RNR2 |
| 4 | MT-TL1-201 | MT-TL1 |

<xarray.Dataset> Size: 28kB
Dimensions:    (gene_id: 496, file: 2, file_path: 2)
Coordinates:
  * gene_id    (gene_id) object 4kB 'ENSMUSG00000083355' ... 'ENSMUSG00000067...
  * file_path  (file_path) <U43 344B '../../test/data/salmon/multiple/Sample_...
Dimensions without coordinates: file
Data variables:
    abundance  (gene_id, file) float64 8kB 0.08291 0.0 0.09854 ... 0.4618 0.0
    counts     (gene_id, file) float64 8kB 1.005 0.0 1.086 ... 1.957 6.208 0.0
    length     (gene_id, file) float64 8kB 509.1 509.1 445.8 ... 564.6 564.6


|   | transcript_id | transcript_name |
|---|---|---|
| 0 | ENST00000387314 | MT-TF-201 |
| 1 | ENST00000389680 | MT-RNR1-201 |
| 2 | ENST00000387342 | MT-TV-201 |
| 3 | ENST00000387347 | MT-RNR2-201 |
| 4 | ENST00000386347 | MT-TL1-201 |


|   | ../../test/data/salmon/quant.sf |
|---|---|
| HOXC8-201 | 4486.940412 |
| UGT3A2-201 | 1307.314695 |
| HOXC9-201 | 886.909534 |
| HOXC4-202 | 749.069369 |
| HOXC12-201 | 544.817685 |


|   | ../../test/data/salmon/quant.sf |
|---|---|
| HOXC8 | 4486.940412 |
| UGT3A2 | 1506.597257 |
| HOXC4 | 1152.964133 |
| HOXC9 | 886.909534 |
| HOXC12 | 544.817685 |


|   | transcript_id | gene_id | gene_biotype |
|---|---|---|---|
| 0 | ENST00000424215 | ENSG00000228037 | lncRNA |
| 1 | ENST00000511072 | PRDM16 | protein_coding |
| 2 | ENST00000607632 | PRDM16 | protein_coding |
| 3 | ENST00000378391 | PRDM16 | protein_coding |
| 4 | ENST00000514189 | PRDM16 | protein_coding |