pytximport.utils
================

.. py:module:: pytximport.utils

.. autoapi-nested-parse::

   Utility functions for converting data, creating maps and filtering data.

   Most functions in this module are primarily intended for internal use but are exposed for advanced users
   who may want to call them directly.


Functions
---------

.. autoapisummary::

   pytximport.utils.convert_abundance_to_counts
   pytximport.utils.convert_counts_to_tpm
   pytximport.utils.convert_transcripts_to_genes
   pytximport.utils.create_transcript_gene_map
   pytximport.utils.create_transcript_gene_map_from_annotation
   pytximport.utils.filter_by_biotype
   pytximport.utils.get_median_length_over_isoform
   pytximport.utils.remove_transcript_version
   pytximport.utils.replace_missing_average_transcript_length
   pytximport.utils.replace_transcript_ids_with_names
   pytximport.utils.summarize_rsem_gene_data


Package Contents
----------------

.. py:function:: convert_abundance_to_counts(counts, abundance, length, counts_from_abundance)

   Convert transcript-level abundance to counts, using either scaled TPM or length-scaled TPM.
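
   The rescaling can be sketched as follows. This is a minimal illustration that assumes the approach of the R
   package tximport (scale the abundance so that each sample's column sum matches the original counts). It uses
   plain nested lists instead of ``DataArray`` objects, and the helper ``rescale_to_counts`` is hypothetical,
   not part of pytximport:

   .. code-block:: python

       def rescale_to_counts(counts, abundance, length, counts_from_abundance):
           """Rescale abundance so each sample (column) sums to its original count total."""
           n_rows, n_cols = len(abundance), len(abundance[0])
           if counts_from_abundance == "length_scaled_tpm":
               # Weight each transcript by its mean length across samples.
               weights = [sum(row) / n_cols for row in length]
           else:  # "scaled_tpm": use the abundance as is
               weights = [1.0] * n_rows
           scaled = [[abundance[i][j] * weights[i] for j in range(n_cols)] for i in range(n_rows)]
           result = [[0.0] * n_cols for _ in range(n_rows)]
           for j in range(n_cols):
               count_total = sum(counts[i][j] for i in range(n_rows))
               scaled_total = sum(scaled[i][j] for i in range(n_rows))
               for i in range(n_rows):
                   result[i][j] = scaled[i][j] * count_total / scaled_total
           return result

       # Two transcripts (rows), one sample (column); values are made up.
       counts = [[10.0], [30.0]]
       abundance = [[1.0], [1.0]]
       length = [[100.0], [300.0]]
       scaled_tpm = rescale_to_counts(counts, abundance, length, "scaled_tpm")
       length_scaled_tpm = rescale_to_counts(counts, abundance, length, "length_scaled_tpm")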

   :param counts: The original counts.
   :type counts: DataArray
   :param abundance: The transcript-level abundance.
   :type abundance: DataArray
   :param length: The length of the transcripts.
   :type length: DataArray
   :param counts_from_abundance: The type of counts to convert to.
   :type counts_from_abundance: Literal["scaled_tpm", "length_scaled_tpm"]

   :returns: The transcript-level expression data with the counts.
   :rtype: DataArray


.. py:function:: convert_counts_to_tpm(counts, length)

   Convert transcript-level counts to TPM.
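
   The conversion follows the usual TPM definition: divide each count by the transcript length, then rescale the
   resulting rates so that they sum to one million. A minimal sketch with plain lists instead of NumPy arrays
   (the helper ``counts_to_tpm`` is hypothetical, and any special handling of zero lengths in pytximport is not
   shown):

   .. code-block:: python

       def counts_to_tpm(counts, length):
           """Return transcripts per million for one sample."""
           # Reads per base for each transcript.
           rates = [count / transcript_length for count, transcript_length in zip(counts, length)]
           total_rate = sum(rates)
           # Rescale so that all values sum to one million.
           return [rate / total_rate * 1e6 for rate in rates]

       # Made-up counts and lengths for three transcripts.
       tpm = counts_to_tpm([10, 20, 30], [1000, 1000, 3000])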

   :param counts: The transcript-level counts.
   :type counts: np.ndarray
   :param length: The length of the transcripts.
   :type length: np.ndarray

   :returns: The transcript-level expression data with the TPM.
   :rtype: np.ndarray


.. py:function:: convert_transcripts_to_genes(transcript_data, transcript_gene_map, counts_from_abundance = None)

   Convert transcript-level expression to gene-level expression.
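
   Conceptually, the counts of all transcripts that map to the same gene are summed. The sketch below shows only
   that aggregation step with plain dictionaries and made-up identifiers; the actual function operates on an
   ``xr.Dataset`` and a ``pd.DataFrame``, and its handling of abundance and length is not reproduced here:

   .. code-block:: python

       from collections import defaultdict

       # Made-up transcript counts for a single sample.
       transcript_counts = {"tx1": 5.0, "tx2": 7.0, "tx3": 2.0}
       # Stand-in for the `transcript_id` and `gene_id` columns of the map.
       transcript_gene_map = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

       gene_counts = defaultdict(float)
       for transcript_id, count in transcript_counts.items():
           gene_counts[transcript_gene_map[transcript_id]] += count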

   :param transcript_data: The transcript-level expression data from multiple samples.
   :type transcript_data: xr.Dataset
   :param transcript_gene_map: The mapping from transcripts to genes. Contains two columns: `transcript_id`
                               and `gene_id`.
   :type transcript_gene_map: pd.DataFrame
   :param counts_from_abundance: The type of counts to convert to. Defaults to None.
   :type counts_from_abundance: Optional[Literal["scaled_tpm", "length_scaled_tpm"]], optional

   :returns: The gene-level expression data from multiple samples.
   :rtype: xr.Dataset


.. py:function:: create_transcript_gene_map(species = 'human', host = 'http://www.ensembl.org', source_field = 'ensembl_transcript_id', target_field = 'ensembl_gene_id', **kwargs)

   Create a mapping from transcript ids to gene ids using Ensembl BioMart.

   .. warning::
       Choosing any `target_field` value other than `ensembl_gene_id` may yield an incomplete transcript-to-gene
       map, since not every transcript has a value for the chosen field. While this does not typically affect
       well-defined transcripts, be aware of this possible source of bias.

   Basic example:

   .. code-block:: python

       from pytximport.utils import create_transcript_gene_map

       transcript_gene_map = create_transcript_gene_map(
           species="human",
           host="https://may2024.archive.ensembl.org/",  # Use a specific Ensembl release
           target_field="external_gene_name",
       )

   :param species: The species to use. Defaults to "human".
   :type species: Literal["human", "mouse"], optional
   :param host: The host to use. Defaults to "http://www.ensembl.org".
   :type host: str, optional
   :param source_field: The identifier to get for
                        each transcript id. Defaults to "ensembl_transcript_id".
   :type source_field: Literal["ensembl_transcript_id", "external_transcript_name"], optional
   :param target_field: The
                        corresponding identifier to get for each transcript. Defaults to "ensembl_gene_id".
   :type target_field: Literal["ensembl_gene_id", "external_gene_name", "external_transcript_name"], optional

   :returns: The mapping from transcript ids to gene ids.
   :rtype: pd.DataFrame


.. py:function:: create_transcript_gene_map_from_annotation(file_path, source_field = 'transcript_id', target_field = 'gene_id', chunk_size = 100000, keep_biotype = False, **kwargs)

   Create a mapping from transcript ids to gene ids using a GTF annotation file.
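
   The kind of parsing involved can be sketched with the standard library. The snippet assumes the usual GTF
   layout (nine tab-separated columns, with ``key "value";`` pairs in the last one); the helper
   ``parse_gtf_line`` and the identifiers are hypothetical, and the actual implementation reads the file in
   chunks and returns a ``pd.DataFrame``:

   .. code-block:: python

       import re

       # Matches attribute pairs such as: gene_id "gene1";
       ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')

       def parse_gtf_line(line, source_field="transcript_id", target_field="gene_id"):
           """Return (source, target) for one GTF line, or None if it cannot be mapped."""
           fields = line.rstrip("\n").split("\t")
           if line.startswith("#") or len(fields) < 9:
               return None  # Comment or malformed line.
           attributes = dict(ATTRIBUTE_PATTERN.findall(fields[8]))
           if source_field not in attributes or target_field not in attributes:
               return None  # For example, gene-level rows without a transcript_id.
           return attributes[source_field], attributes[target_field]

       # Made-up annotation line.
       gtf_line = (
           "chr1\tsource\ttranscript\t100\t900\t.\t+\t.\t"
           'gene_id "gene1"; transcript_id "tx1"; gene_name "GENE1";'
       )
       pair = parse_gtf_line(gtf_line)
       named_pair = parse_gtf_line(gtf_line, target_field="gene_name")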

   :param file_path: The path to the GTF annotation file.
   :type file_path: Union[str, Path]
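   :param source_field: The transcript identifier to map from. Defaults to "transcript_id".
   :param target_field: The identifier to get for each transcript id.
                        Defaults to "gene_id".
   :type target_field: Literal["gene_id", "gene_name"], optional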
   :param chunk_size: The number of lines to read at a time. Defaults to 100000.
   :type chunk_size: int, optional
   :param keep_biotype: Whether to keep the gene_biotype column. Defaults to False.
   :type keep_biotype: bool, optional

   :returns: The mapping from transcript ids to gene ids.
   :rtype: pd.DataFrame


.. py:function:: filter_by_biotype(transcript_data, biotype_filter, id_column = 'transcript_id')

   Filter the transcripts by biotype.

   The biotype is assumed to be present in the transcript ID, separated by a bar. Transcripts whose biotype
   matches an entry in `biotype_filter` are kept. This function is mainly intended for internal use when
   `biotype_filter` is passed to the main function.
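
   The matching rule can be sketched as follows. Both the bar-separated IDs and the exact-match rule are
   illustrative assumptions; the actual function works on the coordinate of an ``xr.Dataset``:

   .. code-block:: python

       # Made-up IDs in a bar-separated style, with the biotype as one of the fields.
       transcript_ids = [
           "tx1.1|gene1.1|GENE1-201|lncRNA|",
           "tx2.1|gene2.1|GENE2-201|protein_coding|",
       ]
       biotype_filter = ["protein_coding"]

       # Keep a transcript if any bar-separated field equals one of the requested biotypes.
       kept = [
           transcript_id
           for transcript_id in transcript_ids
           if any(field in biotype_filter for field in transcript_id.split("|"))
       ]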

   :param transcript_data: The transcript-level expression data from multiple samples.
   :type transcript_data: xr.Dataset
   :param biotype_filter: The biotypes to filter the transcripts by.
   :type biotype_filter: List[str]
   :param id_column: The column name for the transcript ID. Defaults to "transcript_id".
   :type id_column: str, optional

   :returns: The transcript-level expression data from multiple samples with the transcripts filtered by biotype.
   :rtype: xr.Dataset


.. py:function:: get_median_length_over_isoform(transcript_data, transcript_gene_map)

   Get the median length of the gene over all isoforms.
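
   A minimal sketch of the idea with plain dictionaries and made-up values: group the transcript lengths by
   gene, take the median of each group, and assign it back to every isoform of that gene. The variable names
   are illustrative; the actual function stores the result in an ``xr.Dataset``:

   .. code-block:: python

       from collections import defaultdict
       from statistics import median

       transcript_lengths = {"tx1": 1000, "tx2": 3000, "tx3": 2000, "tx4": 500}
       transcript_gene_map = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneA", "tx4": "geneB"}

       # Collect the isoform lengths of each gene.
       lengths_by_gene = defaultdict(list)
       for transcript_id, transcript_length in transcript_lengths.items():
           lengths_by_gene[transcript_gene_map[transcript_id]].append(transcript_length)

       # Every isoform receives the median length of its gene.
       median_isoform_length = {
           transcript_id: median(lengths_by_gene[transcript_gene_map[transcript_id]])
           for transcript_id in transcript_lengths
       }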

   :param transcript_data: The transcript data containing the length of the transcripts.
   :type transcript_data: xr.Dataset
   :param transcript_gene_map: The mapping of transcripts to genes.
   :type transcript_gene_map: pd.DataFrame

   :returns: The updated transcript data with the median gene length contained in the `median_isoform_length`
             variable.
   :rtype: xr.Dataset


.. py:function:: remove_transcript_version(transcript_data, transcript_target_map = None, transcript_ids = None, id_column = 'transcript_id')

   Remove the transcript version from the transcript data and the transcript target map.
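
   Versioned identifiers carry a numeric suffix after a dot, for example ``ENST00000456328.2``. A minimal
   sketch of stripping that suffix, assuming everything after the first dot is the version (the helper
   ``strip_version`` is hypothetical):

   .. code-block:: python

       def strip_version(transcript_id):
           """Drop the version suffix, leaving unversioned IDs unchanged."""
           return transcript_id.split(".", 1)[0]

       versioned_ids = ["ENST00000456328.2", "ENST00000450305"]
       unversioned_ids = [strip_version(transcript_id) for transcript_id in versioned_ids]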

   :param transcript_data: The transcript data.
   :type transcript_data: xr.Dataset
   :param transcript_target_map: The transcript target map. Defaults to None.
   :type transcript_target_map: Optional[pd.DataFrame], optional
   :param transcript_ids: The transcript ids. Defaults to None.
   :type transcript_ids: Optional[List[str]], optional
   :param id_column: The column name for the transcript ID. Defaults to "transcript_id".
   :type id_column: str, optional

   :returns: The transcript data, the transcript target map, and the transcript
             ids.
   :rtype: Tuple[xr.Dataset, pd.DataFrame, List[str]]


.. py:function:: replace_missing_average_transcript_length(length, length_gene_mean)

   Replace missing mean transcript length at the sample level with the gene mean across samples.
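
   The replacement can be sketched with nested lists, where rows are genes and columns are samples. The values
   are made up, and the actual function works on ``xr.DataArray`` objects:

   .. code-block:: python

       import math

       # Gene-level average transcript length per sample; NaN marks a missing value.
       length = [[1500.0, float("nan")], [800.0, 820.0]]
       # Mean length of each gene across samples.
       length_gene_mean = [1500.0, 810.0]

       filled = [
           [gene_mean if math.isnan(value) else value for value in row]
           for row, gene_mean in zip(length, length_gene_mean)
       ]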

   :param length: The average length of transcripts at the gene level with a sample dimension.
   :type length: xr.DataArray
   :param length_gene_mean: The mean length of the transcripts of the genes across samples.
   :type length_gene_mean: xr.DataArray

   :returns: The average length of transcripts at the gene level with a sample dimension.
   :rtype: xr.DataArray


.. py:function:: replace_transcript_ids_with_names(transcript_data, transcript_name_map)

   Replace transcript IDs with transcript names.
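
   As a rough sketch, the replacement is a lookup of each ID in the map. Keeping the original ID when no name is
   available is an assumption made for this illustration, not confirmed behavior of pytximport, and the
   identifiers are made up:

   .. code-block:: python

       transcript_ids = ["tx1", "tx2", "tx3"]
       # Stand-in for the `transcript_id` and `transcript_name` columns of the map.
       transcript_name_map = {"tx1": "GENE1-201", "tx2": "GENE1-202"}

       # Fall back to the ID itself when the map has no entry (assumption).
       transcript_names = [
           transcript_name_map.get(transcript_id, transcript_id) for transcript_id in transcript_ids
       ]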

   :param transcript_data: The transcript-level expression data.
   :type transcript_data: Union[ad.AnnData, xr.Dataset]
   :param transcript_name_map: The mapping from transcripts to
                               names. Contains two columns: `transcript_id` and `transcript_name`.
   :type transcript_name_map: Union[pd.DataFrame, Union[str, Path]]

   :returns: The transcript-level expression data with the transcript names.
   :rtype: Union[ad.AnnData, xr.Dataset]


.. py:function:: summarize_rsem_gene_data(file_paths, importer, importer_kwargs, existence_optional = False)

   Summarize gene-level RSEM quantification files.

   :param file_paths: The paths to the quantification files.
   :type file_paths: Union[List[str], List[Path]]
   :param importer: The importer function to read the quantification files.
   :type importer: Callable
   :param importer_kwargs: The keyword arguments for the importer function.
   :type importer_kwargs: Dict[str, Any]
   :param existence_optional: Whether the files are optional. Defaults to False.
   :type existence_optional: bool, optional

   :returns: The gene-level expression data.
   :rtype: xr.Dataset