Installation troubleshooting & FAQ
====================================

Installation
--------------

Basic installation
~~~~~~~~~~~~~~~~~~

For basic ReCoN functionality (without GRN inference from ATAC-seq):

..  code-block:: bash

    pip install recon

Or install from source for development:

..  code-block:: bash

    git clone https://github.com/cantinilab/recon.git
    cd recon
    pip install -e .

Installation with GRN inference (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You'll need gimmemotifs<=0.17.2, celloracle (lite branch), and llvmlite to install ReCoN with full GRN inference capabilities.

..  code-block:: bash

    # Create environment with required dependencies
    conda create -n recon  python=3.10
    conda activate recon
    
    # Install ReCoN with GRN extras
    pip install recon[grn-lite]


Installation with macOS
~~~~~~~~~~~~~~~~~~~~~~~

Some packages may be tricky to install with pip on macOS due to system library dependencies. We recommend using conda to manage these dependencies: gimmemotifs and llvmlite

..  code-block:: bash

    # Create environment with required dependencies
    conda create -n recon -c bioconda -c conda-forge python=3.10 gimmemotifs llvmlite cmake

    conda activate recon
    
    # Install ReCoN with GRN extras
    pip install recon[grn-lite]


You'll need cmake, gimmemotifs, and llvmlite to install ReCoN with full GRN inference capabilities including ATAC-seq motif scanning.

..  code-block:: bash

    # Create environment with required dependencies
    conda create -n recon -c bioconda -c conda-forge python=3.10 gimmemotifs llvmlite cmake
    conda activate recon
    
    # Install ReCoN with GRN extras
    pip install recon[grn-lite]

**Why these dependencies?**

- ``gimmemotifs``: TF motif scanning for ATAC-seq peaks (requires pre-compiled binaries from conda)
- ``llvmlite``: Required by numba for JIT compilation (system LLVM libraries needed on macOS)
- ``cmake``: Build tool for compiling C/C++ extensions

Installing reference genomes for ATAC-seq
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you plan to use ATAC-seq data for TF-to-peak motif scanning, you need to install reference genomes:

..  code-block:: bash

    # Install genomepy (included with celloracle)
    pip install genomepy
    
    # Install mouse genome (mm10)
    genomepy install mm10 -p UCSC -a
    
    # Install human genome (hg38)
    genomepy install hg38 -p UCSC -a
    
    # Check installed genomes
    genomepy genomes
    
    # List available genomes
    genomepy search mouse

**Where are genomes stored?**

Genomes are downloaded to ``~/.local/share/genomes/`` by default. 
There are usually **large files**: Genomes are typically 1-3 GB each.

You can customize the location:

..  code-block:: bash

    # Install to custom directory
    genomepy install mm10 -p UCSC -a -g /path/to/genomes
    
    # Or set environment variable
    export GENOMES_DIR=/path/to/genomes
    genomepy install mm10 -p UCSC -a

**Available genome providers:**

- ``UCSC``: University of California Santa Cruz (recommended)
- ``Ensembl``: European Bioinformatics Institute
- ``NCBI``: National Center for Biotechnology Information

**Common genomes:**

- ``mm10``: Mouse (GRCm38/mm10)
- ``mm39``: Mouse (GRCm39, latest)
- ``hg38``: Human (GRCh38)
- ``hg19``: Human (GRCh37, older)

Common installation issues
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem: llvmlite build fails on macOS**

..  code-block:: bash

    # Install via conda instead of pip
    conda install -c conda-forge llvmlite numba

**Problem: gimmemotifs installation fails**

Gimmemotifs has C dependencies that may not compile on all systems. Install from conda-forge and bioconda channels:

..  code-block:: bash

    conda install -c conda-forge -c bioconda gimmemotifs

**Problem: genomepy install command fails with "Got unexpected extra argument"**

The correct syntax uses ``-p`` flag for provider:

..  code-block:: bash

    # WRONG: genomepy install mm10 UCSC --annotation
    # RIGHT:
    genomepy install mm10 -p UCSC -a

**Problem: "Genomes_dir does not exist" error**

Create the directory first or specify a custom location:

..  code-block:: bash

    # Option 1: Create default directory
    mkdir -p ~/.local/share/genomes
    
    # Option 2: Use custom directory
    genomepy install mm10 -p UCSC -a -g /path/to/genomes

**Problem: ATAC tests are skipped with "mm10 genome not installed"**

This is expected when the genome isn't downloaded. To run ATAC-seq tests:

..  code-block:: bash

    # Install celloracle
    pip install 'git+https://github.com/cantinilab/celloracle@lite'
    
    # Install genome
    genomepy install mm10 -p UCSC -a
    
    # Verify installation
    ls ~/.local/share/genomes/mm10/
    
    # Run tests
    pytest tests/test_infer_grn.py -v

**Problem: I cannot compute GRNs**

CellOracle is required for GRN inference with ATAC-seq data. Options:

1. Install our 'lite' branch direclty with recon: ``pip install recon[grn-lite]``
2. Install it separately: ``pip install 'git+https://github.com/cantinilab/celloracle@lite'``
3. Compute your GRN externally and provide it to ReCoN.

**Problem: Tests are skipped for celloracle functions**

This is expected behavior when celloracle is not installed. The tests use ``@pytest.mark.skipif`` to gracefully skip ATAC-seq tests when celloracle is unavailable. Install with ``[grn]`` extras to run all tests.

Python version compatibility
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Minimum**: Python 3.8
- **Recommended**: Python 3.10+
- **Note**: Some dependencies (like circe-py) use Python 3.10+ type syntax (``Type | None``). If you encounter ``TypeError: unsupported operand type(s) for |``, upgrade to Python 3.10+.

GRN inference
--------------

What GRN inference methods are available?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ReCoN supports multiple approaches:

1. **TF-to-gene (RNA-seq only)**: Uses GRNBoost2-style gradient boosting (GBM) or Random Forest (RF)
2. **TF-to-gene with ATAC-seq**: Adds TF-to-peak motif scanning via CellOracle
3. **Receptor-to-gene**: Custom connections for cell surface receptors

When should I use ATAC-seq integration?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use ATAC-seq when:

- You want more accurate TF-gene regulatory links
- You have matched scRNA-seq + scATAC-seq data
- You're interested in chromatin accessibility effects

Skip ATAC-seq when:

- You only have scRNA-seq data
- Computational resources are limited (motif scanning is slow)
- You're doing quick exploratory analysis

How accurate is GRN inference?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

GRN inference is probabilistic and noisy. ReCoN combines:

- **Expression correlation** (GRNBoost2 importance scores)
- **Motif evidence** (TF binding site predictions from ATAC peaks)
- **Network propagation** (RWR to capture indirect effects)

**Best practices:**

- Use biological validation (ChIP-seq, literature)
- Focus on highly-ranked edges (top 10-20%)
- Combine with perturbation data when available

Can I use my own GRN?
~~~~~~~~~~~~~~~~~~~~~~

Yes! Provide a custom DataFrame with columns ``['source', 'target', 'weight']``:

..  code-block:: python

    import pandas as pd
    from recon.explore import Celltype
    
    # Custom GRN from literature or ChIP-seq
    custom_grn = pd.DataFrame({
        'source': ['TF1', 'TF1', 'TF2'],
        'target': ['GENE1', 'GENE2', 'GENE3'],
        'weight': [0.8, 0.6, 0.9]
    })
    
    celltype = Celltype(
        grn=custom_grn,
        receptor_grn=receptor_grn,
        name="MyCell"
    )

ReCoN results
--------------

What do the scores mean?
~~~~~~~~~~~~~~~~~~~~~~~~~

ReCoN outputs **Random Walk with Restart (RWR) scores** representing:

- **Treatment propagation**: How molecular perturbations flow through the network
- **Values 0-1**: Higher = more affected by treatment
- **Relative ranking**: Compare scores across genes/cells, not absolute magnitudes

The ``alpha`` parameter (default 0.8) controls:

- **High alpha (0.8-0.9)**: Treatment stays local to seeds
- **Low alpha (0.3-0.5)**: Treatment diffuses widely across network

What seeds should I use?
~~~~~~~~~~~~~~~~~~~~~~~~~

**Seeds** are your treatment entry points. Common choices:

1. **Differentially expressed genes** from treated vs control
2. **Drug targets** (e.g., receptor targeted by therapy)
3. **Pathway genes** (e.g., all genes in immune response pathway)

..  code-block:: python

    # Dictionary format: {gene: score}
    seeds = {'RECEPTOR1': 1.0, 'TF1': 0.8}
    
    # Or list format (all seeds weighted equally)
    seeds = ['RECEPTOR1', 'TF1', 'GENE1']

How do I interpret multicellular results?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In ``Multicell`` objects:

- **Node names**: Suffixed with ``::celltype`` (e.g., ``CD8::T_cell``)
- **Ligand-receptor connections**: ``LIGAND-celltype`` → ``RECEPTOR_receptor::celltype``
- **Cell communication layer**: Bipartite graphs between cell types
- **Lamb matrix**: Controls transition probabilities between layers

Higher scores in receiving cells indicate:

- Strong cell-cell communication effects
- Potential for coordinated responses
- Targets for combination therapies

Why are some scores zero?
~~~~~~~~~~~~~~~~~~~~~~~~~~

Possible reasons:

1. **Disconnected components**: Gene not reachable from seeds in network
2. **Low edge weights**: Weak connections filtered out
3. **High restart probability**: Treatment didn't diffuse far enough (try lower restart probability)
4. **Missing edges**: Incomplete GRN (add more regulatory links)

ReCoN interpretation
----------------------

How to validate ReCoN predictions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. **Literature search**: Check if predicted genes are known treatment targets
2. **Pathway analysis**: Are high-scoring genes in expected pathways?
3. **Perturbation data**: Compare with experimental knockdown/overexpression
4. **Cross-validation**: Split cells into train/test, validate predictions
5. **Temporal data**: Do predictions match time-series gene expression?

What biological insights can ReCoN provide?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ReCoN helps answer:

- **Which genes are affected** by a treatment beyond direct targets?
- **How do cell types coordinate** their responses?
- **What off-target effects** might occur?
- **Why do some cells respond** differently than others?
- **And maybe your own biological question!** :)

How to compare conditions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compare RWR scores between conditions:

..  code-block:: python

    # Run ReCoN on both conditions
    results_control = celltype.Multixrank(seeds=control_seeds, alpha=0.8)
    results_treated = celltype.Multixrank(seeds=treated_seeds, alpha=0.8)
    
    # Compare scores
    import pandas as pd
    comparison = pd.DataFrame({
        'control': results_control['GRN'],
        'treated': results_treated['GRN']
    })
    comparison['delta'] = comparison['treated'] - comparison['control']
    
    # Genes with largest changes
    top_changes = comparison.nlargest(20, 'delta')

ReCoN Visualization
----------------------

What visualization tools are available?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. **Sankey diagrams**: Trace treatment flow from seeds through network
2. **Network plots**: Visualize multicellular architecture
3. **Heatmaps**: Compare scores across cell types or conditions
4. **Custom plotting**: Export scores to pandas DataFrames for ggplot/matplotlib

Example Sankey diagram:

..  code-block:: python

    from recon.plot import plot_sankey
    
    plot_sankey(
        multicell=multicell,
        results=results,
        source_celltype='Tumor',
        target_celltype='T_cell',
        top_n=10
    )

How to export results?
~~~~~~~~~~~~~~~~~~~~~~

Results are pandas DataFrames - use standard methods:

..  code-block:: python

    # Save to CSV
    results['GRN'].to_csv('recon_results.csv')
    
    # Save to Excel with multiple sheets
    with pd.ExcelWriter('recon_results.xlsx') as writer:
        results['GRN'].to_excel(writer, sheet_name='GRN')
        results['Receptor'].to_excel(writer, sheet_name='Receptors')

Reproducibility
---------------

How to ensure reproducible results?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. **Set random seeds**: ReCoN uses deterministic algorithms, but upstream GRN inference may not
2. **Version dependencies**: Document versions of recon, scanpy, etc.
3. **Save parameters**: Record alpha, seeds, network sizes
4. **Archive networks**: Save GRN/receptor_grn DataFrames

..  code-block:: python

    # Save complete configuration
    import json
    
    config = {
        'recon_version': '0.1.0',
        'alpha': 0.8,
        'seeds': seeds,
        'graph_types': {'GRN': '01', 'Receptor': '01'},
        'n_genes': len(celltype.multiplexes['GRN'])
    }
    
    with open('recon_config.json', 'w') as f:
        json.dump(config, f, indent=2)

Performance & Scalability
-------------------------

How long does ReCoN take?
~~~~~~~~~~~~~~~~~~~~~~~~~~

**GRN inference** (slowest):

- GRNBoost2: 10-60 minutes for 5000 genes
- ATAC motif scanning: 1-10 hours depending on peak count

**RWR computation** (fast):

- Single celltype: Seconds to minutes
- Multicellular (3-5 celltypes): 1-5 minutes

**Tips for speed:**

- Pre-compute GRN once, reuse for multiple seed sets
- Limit GRN to top expressed genes (2000-5000)
- Use sparse matrices (automatically handled)

Memory requirements?
~~~~~~~~~~~~~~~~~~~~

- **Minimal**: ~2 GB for small networks (<1000 genes)
- **Typical**: 4-8 GB for realistic scRNA-seq data
- **Large**: 16+ GB for 10+ cell types with full GRNs

Reduce memory by:

- Filtering low-expressed genes before GRN inference
- Using fewer celltypes in multicellular models
- Clearing intermediate results: ``del results``

Can I parallelize ReCoN?
~~~~~~~~~~~~~~~~~~~~~~~~~

- **GRN inference**: Parallel by default (set ``n_cpu`` parameter)
- **RWR computation**: Single-threaded (already very fast)
- **Multiple conditions**: Run in parallel with multiprocessing

..  code-block:: python

    from multiprocessing import Pool
    
    def run_recon(seed_set):
        return celltype.Multixrank(seeds=seed_set, alpha=0.8)
    
    with Pool(4) as p:
        results = p.map(run_recon, [seeds1, seeds2, seeds3, seeds4])

Getting Help
------------

Where to find more information?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Documentation**: https://recon.readthedocs.io
- **GitHub Issues**: https://github.com/cantinilab/recon/issues
- **Examples**: See notebooks in ``docs/source/recon_examples/``
- **Paper**: [Add citation when published]

How to report bugs?
~~~~~~~~~~~~~~~~~~~

Open a GitHub issue with:

1. **Python/package versions**: ``pip list | grep recon``
2. **Minimal example**: Code that reproduces the error
3. **Error message**: Full traceback
4. **Expected behavior**: What should happen instead

..  code-block:: bash

    # Get version info for bug report
    python -c "import recon; print(recon.__version__)"
    python --version
    pip list | grep -E "recon|scanpy|numpy|pandas"

Contributing
~~~~~~~~~~~~

ReCoN is open source (GPL-3.0 license). Contributions welcome:

- **Code**: Submit pull requests on GitHub
- **Documentation**: Fix typos, add examples
- **Testing**: Report issues, suggest improvements
- **Citations**: Cite ReCoN in your publications

License: GPL-3.0 allows free use, modification, and distribution, but derivative works must also be open source.