Installation troubleshooting & FAQ

Installation

Basic installation

For basic ReCoN functionality (without GRN inference from ATAC-seq):

pip install recon

Or install from source for development:

git clone https://github.com/cantinilab/recon.git
cd recon
pip install -e .

Installation with GRN inference (optional)

You’ll need gimmemotifs<=0.17.2, celloracle (lite branch), and llvmlite to install ReCoN with full GRN inference capabilities.

# Create environment with required dependencies
conda create -n recon  python=3.10
conda activate recon

# Install ReCoN with GRN extras
pip install recon[grn-lite]

Installation with macOS

Some packages may be tricky to install with pip on macOS due to system library dependencies. We recommend using conda to manage these dependencies: gimmemotifs and llvmlite

# Create environment with required dependencies
conda create -n recon -c bioconda -c conda-forge python=3.10 gimmemotifs llvmlite cmake

conda activate recon

# Install ReCoN with GRN extras
pip install recon[grn-lite]

You’ll need cmake, gimmemotifs, and llvmlite to install ReCoN with full GRN inference capabilities including ATAC-seq motif scanning.

# Create environment with required dependencies
conda create -n recon -c bioconda -c conda-forge python=3.10 gimmemotifs llvmlite cmake
conda activate recon

# Install ReCoN with GRN extras
pip install recon[grn-lite]

Why these dependencies?

  • gimmemotifs: TF motif scanning for ATAC-seq peaks (requires pre-compiled binaries from conda)

  • llvmlite: Required by numba for JIT compilation (system LLVM libraries needed on macOS)

  • cmake: Build tool for compiling C/C++ extensions

Installing reference genomes for ATAC-seq

If you plan to use ATAC-seq data for TF-to-peak motif scanning, you need to install reference genomes:

# Install genomepy (included with celloracle)
pip install genomepy

# Install mouse genome (mm10)
genomepy install mm10 -p UCSC -a

# Install human genome (hg38)
genomepy install hg38 -p UCSC -a

# Check installed genomes
genomepy genomes

# List available genomes
genomepy search mouse

Where are genomes stored?

Genomes are downloaded to ~/.local/share/genomes/ by default. There are usually large files: Genomes are typically 1-3 GB each.

You can customize the location:

# Install to custom directory
genomepy install mm10 -p UCSC -a -g /path/to/genomes

# Or set environment variable
export GENOMES_DIR=/path/to/genomes
genomepy install mm10 -p UCSC -a

Available genome providers:

  • UCSC: University of California Santa Cruz (recommended)

  • Ensembl: European Bioinformatics Institute

  • NCBI: National Center for Biotechnology Information

Common genomes:

  • mm10: Mouse (GRCm38/mm10)

  • mm39: Mouse (GRCm39, latest)

  • hg38: Human (GRCh38)

  • hg19: Human (GRCh37, older)

Common installation issues

Problem: llvmlite build fails on macOS

# Install via conda instead of pip
conda install -c conda-forge llvmlite numba

Problem: gimmemotifs installation fails

Gimmemotifs has C dependencies that may not compile on all systems. Install from conda-forge and bioconda channels:

conda install -c conda-forge -c bioconda gimmemotifs

Problem: genomepy install command fails with “Got unexpected extra argument”

The correct syntax uses -p flag for provider:

# WRONG: genomepy install mm10 UCSC --annotation
# RIGHT:
genomepy install mm10 -p UCSC -a

Problem: “Genomes_dir does not exist” error

Create the directory first or specify a custom location:

# Option 1: Create default directory
mkdir -p ~/.local/share/genomes

# Option 2: Use custom directory
genomepy install mm10 -p UCSC -a -g /path/to/genomes

Problem: ATAC tests are skipped with “mm10 genome not installed”

This is expected when the genome isn’t downloaded. To run ATAC-seq tests:

# Install celloracle
pip install 'git+https://github.com/cantinilab/celloracle@lite'

# Install genome
genomepy install mm10 -p UCSC -a

# Verify installation
ls ~/.local/share/genomes/mm10/

# Run tests
pytest tests/test_infer_grn.py -v

Problem: I cannot compute GRNs

CellOracle is required for GRN inference with ATAC-seq data. Options:

  1. Install our ‘lite’ branch direclty with recon: pip install recon[grn-lite]

  2. Install it separately: pip install 'git+https://github.com/cantinilab/celloracle@lite'

  3. Compute your GRN externally and provide it to ReCoN.

Problem: Tests are skipped for celloracle functions

This is expected behavior when celloracle is not installed. The tests use @pytest.mark.skipif to gracefully skip ATAC-seq tests when celloracle is unavailable. Install with [grn] extras to run all tests.

Python version compatibility

  • Minimum: Python 3.8

  • Recommended: Python 3.10+

  • Note: Some dependencies (like circe-py) use Python 3.10+ type syntax (Type | None). If you encounter TypeError: unsupported operand type(s) for |, upgrade to Python 3.10+.

GRN inference

What GRN inference methods are available?

ReCoN supports multiple approaches:

  1. TF-to-gene (RNA-seq only): Uses GRNBoost2-style gradient boosting (GBM) or Random Forest (RF)

  2. TF-to-gene with ATAC-seq: Adds TF-to-peak motif scanning via CellOracle

  3. Receptor-to-gene: Custom connections for cell surface receptors

When should I use ATAC-seq integration?

Use ATAC-seq when:

  • You want more accurate TF-gene regulatory links

  • You have matched scRNA-seq + scATAC-seq data

  • You’re interested in chromatin accessibility effects

Skip ATAC-seq when:

  • You only have scRNA-seq data

  • Computational resources are limited (motif scanning is slow)

  • You’re doing quick exploratory analysis

How accurate is GRN inference?

GRN inference is probabilistic and noisy. ReCoN combines:

  • Expression correlation (GRNBoost2 importance scores)

  • Motif evidence (TF binding site predictions from ATAC peaks)

  • Network propagation (RWR to capture indirect effects)

Best practices:

  • Use biological validation (ChIP-seq, literature)

  • Focus on highly-ranked edges (top 10-20%)

  • Combine with perturbation data when available

Can I use my own GRN?

Yes! Provide a custom DataFrame with columns ['source', 'target', 'weight']:

import pandas as pd
from recon.explore import Celltype

# Custom GRN from literature or ChIP-seq
custom_grn = pd.DataFrame({
    'source': ['TF1', 'TF1', 'TF2'],
    'target': ['GENE1', 'GENE2', 'GENE3'],
    'weight': [0.8, 0.6, 0.9]
})

celltype = Celltype(
    grn=custom_grn,
    receptor_grn=receptor_grn,
    name="MyCell"
)

ReCoN results

What do the scores mean?

ReCoN outputs Random Walk with Restart (RWR) scores representing:

  • Treatment propagation: How molecular perturbations flow through the network

  • Values 0-1: Higher = more affected by treatment

  • Relative ranking: Compare scores across genes/cells, not absolute magnitudes

The alpha parameter (default 0.8) controls:

  • High alpha (0.8-0.9): Treatment stays local to seeds

  • Low alpha (0.3-0.5): Treatment diffuses widely across network

What seeds should I use?

Seeds are your treatment entry points. Common choices:

  1. Differentially expressed genes from treated vs control

  2. Drug targets (e.g., receptor targeted by therapy)

  3. Pathway genes (e.g., all genes in immune response pathway)

# Dictionary format: {gene: score}
seeds = {'RECEPTOR1': 1.0, 'TF1': 0.8}

# Or list format (all seeds weighted equally)
seeds = ['RECEPTOR1', 'TF1', 'GENE1']

How do I interpret multicellular results?

In Multicell objects:

  • Node names: Suffixed with ::celltype (e.g., CD8::T_cell)

  • Ligand-receptor connections: LIGAND-celltypeRECEPTOR_receptor::celltype

  • Cell communication layer: Bipartite graphs between cell types

  • Lamb matrix: Controls transition probabilities between layers

Higher scores in receiving cells indicate:

  • Strong cell-cell communication effects

  • Potential for coordinated responses

  • Targets for combination therapies

Why are some scores zero?

Possible reasons:

  1. Disconnected components: Gene not reachable from seeds in network

  2. Low edge weights: Weak connections filtered out

  3. High restart probability: Treatment didn’t diffuse far enough (try lower restart probability)

  4. Missing edges: Incomplete GRN (add more regulatory links)

ReCoN interpretation

How to validate ReCoN predictions?

  1. Literature search: Check if predicted genes are known treatment targets

  2. Pathway analysis: Are high-scoring genes in expected pathways?

  3. Perturbation data: Compare with experimental knockdown/overexpression

  4. Cross-validation: Split cells into train/test, validate predictions

  5. Temporal data: Do predictions match time-series gene expression?

What biological insights can ReCoN provide?

ReCoN helps answer:

  • Which genes are affected by a treatment beyond direct targets?

  • How do cell types coordinate their responses?

  • What off-target effects might occur?

  • Why do some cells respond differently than others?

  • And maybe your own biological question! :)

How to compare conditions?

Compare RWR scores between conditions:

# Run ReCoN on both conditions
results_control = celltype.Multixrank(seeds=control_seeds, alpha=0.8)
results_treated = celltype.Multixrank(seeds=treated_seeds, alpha=0.8)

# Compare scores
import pandas as pd
comparison = pd.DataFrame({
    'control': results_control['GRN'],
    'treated': results_treated['GRN']
})
comparison['delta'] = comparison['treated'] - comparison['control']

# Genes with largest changes
top_changes = comparison.nlargest(20, 'delta')

ReCoN Visualization

What visualization tools are available?

  1. Sankey diagrams: Trace treatment flow from seeds through network

  2. Network plots: Visualize multicellular architecture

  3. Heatmaps: Compare scores across cell types or conditions

  4. Custom plotting: Export scores to pandas DataFrames for ggplot/matplotlib

Example Sankey diagram:

from recon.plot import plot_sankey

plot_sankey(
    multicell=multicell,
    results=results,
    source_celltype='Tumor',
    target_celltype='T_cell',
    top_n=10
)

How to export results?

Results are pandas DataFrames - use standard methods:

# Save to CSV
results['GRN'].to_csv('recon_results.csv')

# Save to Excel with multiple sheets
with pd.ExcelWriter('recon_results.xlsx') as writer:
    results['GRN'].to_excel(writer, sheet_name='GRN')
    results['Receptor'].to_excel(writer, sheet_name='Receptors')

Reproducibility

How to ensure reproducible results?

  1. Set random seeds: ReCoN uses deterministic algorithms, but upstream GRN inference may not

  2. Version dependencies: Document versions of recon, scanpy, etc.

  3. Save parameters: Record alpha, seeds, network sizes

  4. Archive networks: Save GRN/receptor_grn DataFrames

# Save complete configuration
import json

config = {
    'recon_version': '0.1.0',
    'alpha': 0.8,
    'seeds': seeds,
    'graph_types': {'GRN': '01', 'Receptor': '01'},
    'n_genes': len(celltype.multiplexes['GRN'])
}

with open('recon_config.json', 'w') as f:
    json.dump(config, f, indent=2)

Performance & Scalability

How long does ReCoN take?

GRN inference (slowest):

  • GRNBoost2: 10-60 minutes for 5000 genes

  • ATAC motif scanning: 1-10 hours depending on peak count

RWR computation (fast):

  • Single celltype: Seconds to minutes

  • Multicellular (3-5 celltypes): 1-5 minutes

Tips for speed:

  • Pre-compute GRN once, reuse for multiple seed sets

  • Limit GRN to top expressed genes (2000-5000)

  • Use sparse matrices (automatically handled)

Memory requirements?

  • Minimal: ~2 GB for small networks (<1000 genes)

  • Typical: 4-8 GB for realistic scRNA-seq data

  • Large: 16+ GB for 10+ cell types with full GRNs

Reduce memory by:

  • Filtering low-expressed genes before GRN inference

  • Using fewer celltypes in multicellular models

  • Clearing intermediate results: del results

Can I parallelize ReCoN?

  • GRN inference: Parallel by default (set n_cpu parameter)

  • RWR computation: Single-threaded (already very fast)

  • Multiple conditions: Run in parallel with multiprocessing

from multiprocessing import Pool

def run_recon(seed_set):
    return celltype.Multixrank(seeds=seed_set, alpha=0.8)

with Pool(4) as p:
    results = p.map(run_recon, [seeds1, seeds2, seeds3, seeds4])

Getting Help

Where to find more information?

How to report bugs?

Open a GitHub issue with:

  1. Python/package versions: pip list | grep recon

  2. Minimal example: Code that reproduces the error

  3. Error message: Full traceback

  4. Expected behavior: What should happen instead

# Get version info for bug report
python -c "import recon; print(recon.__version__)"
python --version
pip list | grep -E "recon|scanpy|numpy|pandas"

Contributing

ReCoN is open source (GPL-3.0 license). Contributions welcome:

  • Code: Submit pull requests on GitHub

  • Documentation: Fix typos, add examples

  • Testing: Report issues, suggest improvements

  • Citations: Cite ReCoN in your publications

License: GPL-3.0 allows free use, modification, and distribution, but derivative works must also be open source.