Infer GRN

Compute ATAC peak-to-gene links using TSS annotation. It uses the CellOracle motif_analysis module to get TSS information for the provided ATAC peaks. It returns a DataFrame with the peak-to-gene links.

Parameters:
  • atac (anndata.AnnData) – AnnData object containing the ATAC-seq data. The peak names should be in the format ‘chr_start_end’.

  • rna (anndata.AnnData) – AnnData object containing the RNA-seq data. The gene names should match those in the ATAC-seq data.

  • ref_genome (str) – Reference genome to use for TSS annotation. E.g., ‘hg38’, ‘mm10’, etc.

Returns:

DataFrame with columns [‘source’, ‘target’] representing the peak-to-gene links.

Return type:

pd.DataFrame
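
To make the ‘chr_start_end’ naming convention and the [‘source’, ‘target’] output concrete, here is a minimal sketch that uses only the standard library. It is not the CellOracle implementation: the helpers parse_peak and link_peaks_to_genes, the tss_table layout, and the rule "a peak is linked to a gene when the gene's TSS falls inside the peak" are all assumptions made for illustration, and the links are plain dicts rather than DataFrame rows.

```python
def parse_peak(name):
    """Split a 'chr_start_end' peak name into (chrom, start, end).

    rsplit keeps contig names that themselves contain underscores intact.
    """
    chrom, start, end = name.rsplit("_", 2)
    return chrom, int(start), int(end)


def link_peaks_to_genes(peak_names, tss_table):
    """Return peak-to-gene links as dicts with 'source' (peak) and 'target' (gene).

    tss_table is a hypothetical list of (chrom, position, gene) tuples; a real
    annotation would come from the reference genome given by ref_genome.
    """
    links = []
    for peak in peak_names:
        chrom, start, end = parse_peak(peak)
        for tss_chrom, pos, gene in tss_table:
            # Assumed rule: link when the TSS position lies within the peak.
            if tss_chrom == chrom and start <= pos <= end:
                links.append({"source": peak, "target": gene})
    return links


# Toy inputs (invented for this example).
peaks = ["chr1_1000_1500", "chr1_9000_9400", "chr2_200_700"]
tss = [("chr1", 1200, "GENE_A"), ("chr2", 650, "GENE_B"), ("chr2", 5000, "GENE_C")]
links = link_peaks_to_genes(peaks, tss)
```

Peaks that contain no annotated TSS (here chr1_9000_9400) simply produce no link in this sketch.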

recon.infer_grn.layers.compute_rna_network(df_exp_mtx: DataFrame | AnnData, tf_names: List[str], temp_dir: Path | None = None, method: Literal['GBM', 'RF'] = 'GBM', n_cpu: int = 1, seed: int = 666) DataFrame

Inspired by SCENICPLUS: https://github.com/aertslab/scenicplus/blob/main/src/scenicplus/TF_to_gene.py

Calculate TF-to-gene relationships using either Gradient Boosting Machine (GBM) or Random Forest (RF) regression.

It is a wrapper around the infer_partial_network function from the arboreto package, similar to GRNBoost2. It uses joblib to parallelize inference across target genes and returns a DataFrame with the TF-to-gene relationships and their importance scores.

Parameters:
  • df_exp_mtx (pd.DataFrame or ad.AnnData) – Gene expression matrix with genes as columns and cells as rows. If an AnnData object is provided, the expression matrix is extracted using the to_df() method.

  • tf_names (List[str]) – List of transcription factor names to consider as potential regulators.

  • temp_dir (pathlib.Path, optional) – Path to a temporary directory to store intermediate files during parallel processing. If None, a temporary directory will be created and deleted after use. Default is None.

  • method (Literal['GBM', 'RF'], optional) – Method to use for regression. Either ‘GBM’ for Gradient Boosting Machine or ‘RF’ for Random Forest. Default is ‘GBM’.

  • n_cpu (int, optional) – Number of CPU cores to use for parallel processing. Default is 1.

  • seed (int, optional) – Random seed for reproducibility. Default is 666.

Returns:

DataFrame with columns [‘tf’, ‘target’, ‘importance’] representing the TF-to-gene relationships and their importance scores.

Return type:

pd.DataFrame
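
The per-target pattern described above (score every candidate TF against one target gene, run the targets in parallel, then concatenate) can be sketched as follows. This is only an illustration of the control flow under stated substitutions: absolute Pearson correlation stands in for the GBM/RF importances computed by arboreto, the standard library's ThreadPoolExecutor stands in for joblib, and the names pearson, score_target, and toy_rna_network are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def score_target(target, expr, tf_names):
    """Score every candidate TF against one target gene, skipping self-links."""
    return [
        {"tf": tf, "target": target,
         # Stand-in importance: |correlation|, NOT a GBM/RF feature importance.
         "importance": abs(pearson(expr[tf], expr[target]))}
        for tf in tf_names
        if tf != target
    ]


def toy_rna_network(expr, tf_names, n_cpu=1):
    """Run score_target for each gene in parallel and merge the results."""
    targets = list(expr)
    with ThreadPoolExecutor(max_workers=n_cpu) as pool:
        per_target = pool.map(lambda g: score_target(g, expr, tf_names), targets)
        rows = [row for chunk in per_target for row in chunk]
    # Strongest links first.
    return sorted(rows, key=lambda r: r["importance"], reverse=True)


# Toy expression data: one list of per-cell values per gene (genes as columns).
expr = {
    "TF1": [1.0, 2.0, 3.0, 4.0],
    "TF2": [4.0, 1.0, 3.0, 2.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.0],
}
network = toy_rna_network(expr, tf_names=["TF1", "TF2"], n_cpu=2)
```

Each row mirrors the documented [‘tf’, ‘target’, ‘importance’] columns; a TF is never scored against itself, but TFs can still appear as targets of other TFs.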

recon.infer_grn.layers.compute_tf_network(rna, tfs_list, method=None)

Compute TF-to-ATAC peak links using motif scanning. It uses the CellOracle motif_analysis module to scan for motifs in the provided ATAC peaks. It returns a DataFrame with the TF-to-peak links.

Parameters:
  • atac (anndata.AnnData) – AnnData object containing the ATAC-seq data. The peak names should be in the format ‘chr_start_end’.

  • ref_genome (str) – Reference genome to use for motif scanning. E.g., ‘hg38’, ‘mm10’, etc.

  • genomes_dir (str, optional) – Directory containing the reference genomes. If None, the default CellOracle genomes directory will be used.

  • motifs (list, optional) – List of motifs to use for scanning. If None, the default CellOracle motifs will be used.

  • fpr (float, optional) – False positive rate for motif scanning. Default is 0.02.

  • verbose (bool, optional) – Whether to print progress messages. Default is True.

  • indirect (bool, optional) – Whether to include TF-to-peak links supported only by indirect evidence. Default is True.

  • n_cpus (int, optional) – Number of CPUs to use for parallel processing. Default is -1 (use all available CPUs).

Returns:

DataFrame with columns [‘source’, ‘target’] representing the TF-to-peak links.

Return type:

pd.DataFrame
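
One common way to read a false positive rate in motif scanning is as a score threshold: the cutoff is chosen so that at most a fraction fpr of background sequences would score as hits. The sketch below illustrates that idea and the [‘source’, ‘target’] output shape with the standard library only. It is not how CellOracle scans motifs: the toy motif, the uniform background over all k-mers, the forward-strand-only scan, the peak_seqs dict (sequences would normally be fetched from the reference genome), and every function name here are assumptions.

```python
from itertools import product
from math import log2

BASES = "ACGT"


def make_log_odds(consensus, p_match=0.7):
    """Build a toy log-odds matrix favouring `consensus` over a uniform background."""
    p_other = (1.0 - p_match) / 3
    return [
        {b: log2((p_match if b == c else p_other) / 0.25) for b in BASES}
        for c in consensus
    ]


def score_kmer(pwm, kmer):
    """Sum per-position log-odds; rounded so equal-by-construction scores tie."""
    return round(sum(col[b] for col, b in zip(pwm, kmer)), 6)


def threshold_for_fpr(pwm, fpr):
    """Lowest score whose hit fraction over all possible k-mers stays <= fpr."""
    background = [score_kmer(pwm, "".join(k)) for k in product(BASES, repeat=len(pwm))]
    threshold = float("inf")
    for candidate in sorted(set(background), reverse=True):
        hits = sum(s >= candidate for s in background)
        if hits / len(background) > fpr:
            break
        threshold = candidate
    return threshold


def scan_peaks(motifs, peak_seqs, fpr=0.02):
    """Return TF-to-peak links as dicts with 'source' (TF) and 'target' (peak)."""
    links = []
    for tf, pwm in motifs.items():
        thr = threshold_for_fpr(pwm, fpr)
        k = len(pwm)
        for peak, seq in peak_seqs.items():
            windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
            if any(score_kmer(pwm, w) >= thr for w in windows):
                links.append({"source": tf, "target": peak})
    return links


# Hypothetical motif and peak sequences, invented for this example.
motifs = {"TF_X": make_log_odds("ACGT")}
peak_seqs = {
    "chr1_1000_1012": "TTGACGTCCATG",  # contains an exact ACGT match
    "chr1_5000_5012": "TTGACCTCCATG",  # best window (ACCT) has one mismatch
}
strict_links = scan_peaks(motifs, peak_seqs, fpr=0.02)
loose_links = scan_peaks(motifs, peak_seqs, fpr=0.06)
```

In this toy setting a stricter fpr keeps only the exact match, while a looser one also admits the one-mismatch site, which is the trade-off the fpr parameter controls.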

recon.infer_grn.layers.generate_grn(rna_network, atac_network, tf_network, tf_to_atac_links, atac_to_rna_links, n_jobs=1)

Generate a Gene Regulatory Network (GRN) by integrating TF-to-gene, peak-to-gene, and TF-to-peak relationships. It uses the HuMMuS package to create a multiplex network and perform random walks to rank the nodes. It returns a DataFrame with the ranked nodes.

Parameters:
  • rna_network (pd.DataFrame) – DataFrame with columns [‘source’, ‘target’, ‘weight’] representing the TF-to-gene relationships.

  • atac_network (pd.DataFrame) – DataFrame with columns [‘source’, ‘target’, ‘weight’] representing the peak-to-gene relationships.

  • tf_network (pd.DataFrame) – DataFrame with columns [‘source’, ‘target’, ‘weight’] representing the TF-to-TF relationships.

  • tf_to_atac_links (pd.DataFrame) – DataFrame with columns [‘source’, ‘target’] representing the TF-to-peak relationships.

  • atac_to_rna_links (pd.DataFrame) – DataFrame with columns [‘source’, ‘target’] representing the peak-to-gene relationships.

  • n_jobs (int, optional) – Number of jobs to use for parallel processing. Default is 1.

Returns:

DataFrame with ranked nodes from the GRN.

Return type:

pd.DataFrame
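
To give an intuition for how random walks can rank nodes, here is a minimal single-layer random walk with restart written with the standard library. It is a toy under stated assumptions, not the HuMMuS multiplex implementation: all edges are merged into one undirected weighted graph, and the function name random_walk_with_restart, the restart probability of 0.7, the fixed iteration count, and the example edges are all hypothetical.

```python
def random_walk_with_restart(edges, seeds, restart=0.7, n_iter=100):
    """Rank nodes by their visiting probability for a walker restarting at `seeds`.

    edges: list of (source, target, weight) tuples, treated as undirected here.
    Returns a list of (node, score) pairs sorted by decreasing score.
    """
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    neighbours = {n: {} for n in nodes}
    for s, t, w in edges:
        neighbours[s][t] = neighbours[s].get(t, 0.0) + w
        neighbours[t][s] = neighbours[t].get(s, 0.0) + w

    # Restart distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(n_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            total = sum(neighbours[n].values())
            for m, w in neighbours[n].items():
                # Move along an edge with probability proportional to its weight.
                nxt[m] += (1.0 - restart) * p[n] * w / total
        p = nxt
    return sorted(p.items(), key=lambda kv: kv[1], reverse=True)


# Toy network mixing TF-to-peak, peak-to-gene and TF-to-gene edges (invented).
edges = [
    ("TF1", "chr1_1000_1500", 1.0),
    ("chr1_1000_1500", "GENE_A", 1.0),
    ("TF1", "GENE_A", 0.5),
    ("TF2", "GENE_B", 1.0),
]
ranking = random_walk_with_restart(edges, seeds={"TF1"})
scores = dict(ranking)
```

Seeding the walk at TF1 ranks the nodes reachable from it (its peak and GENE_A) above TF2 and GENE_B, which sit in a disconnected component and receive a score of zero.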