| 1 | +# CopyKAT-Py — Singularity Container Guide |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This guide covers building and running the CopyKAT-Py Singularity container. |
| 6 | +The image bundles Python 3.11 and all required libraries in an isolated conda environment, so no local Python installation is needed on the compute node. |
| 7 | + |
| 8 | +--- |
| 9 | + |
| 10 | +## Prerequisites |
| 11 | + |
| 12 | +| Requirement | Notes | |
| 13 | +|-------------|-------| |
| 14 | +| Singularity ≥ 3.8 (or Apptainer ≥ 1.0) | Available on most HPC clusters; check with `singularity --version` | |
| 15 | +| Root / `fakeroot` or a build node | Required only for **building**; running needs no special privilege | |
| 16 | +| ~4 GB free disk space | For the `.sif` image | |
| 17 | + |
| 18 | +--- |
| 19 | + |
| 20 | +## 1. Build the Container |
| 21 | + |
| 22 | +Run from the `copykat_py/` directory (where `copykat_py.def` lives): |
| 23 | + |
| 24 | +```bash |
| 25 | +cd /path/to/copykat_py |
| 26 | + |
| 27 | +# with root (local workstation) |
| 28 | +sudo singularity build copykat_py.sif copykat_py.def |
| 29 | + |
| 30 | +# without root — fakeroot (many HPC clusters) |
| 31 | +singularity build --fakeroot copykat_py.sif copykat_py.def |
| 32 | +``` |
| 33 | + |
| 34 | +The build copies the local package source into the image (`%files` section) and installs it via `pip`, so **no internet access is needed at run time**. |
| 35 | + |
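| 35 | +> **Tip:** If your login node allows neither root nor `--fakeroot` builds, copy the source to a machine where you can build (a dedicated build node or a local workstation) and transfer the resulting `.sif`, or ask your sysadmin to pre-build it. |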
| 37 | + |
| 38 | +--- |
| 39 | + |
| 40 | +## 2. Verify the Image |
| 41 | + |
| 42 | +```bash |
| 43 | +singularity exec copykat_py.sif copykat-py --help |
| 44 | +``` |
| 45 | + |
| 46 | +--- |
| 47 | + |
| 48 | +## 3. Running CopyKAT-Py |
| 49 | + |
| 50 | +### Basic syntax |
| 51 | + |
| 52 | +```bash |
| 53 | +singularity run [singularity-flags] copykat_py.sif [copykat-py-flags] |
| 54 | +# equivalent to: |
| 55 | +singularity exec copykat_py.sif copykat-py [copykat-py-flags] |
| 56 | +``` |
| 57 | + |
| 58 | +### Input formats |
| 59 | + |
| 60 | +| Format | Description | |
| 61 | +|--------|-------------| |
| 62 | +| `.mtx` / `.mtx.gz` | 10x Genomics sparse matrix; `genes.tsv` and `barcodes.tsv` are auto-detected from the same directory | |
| 63 | +| `.csv` / `.tsv` / `.txt` | Dense count matrix, genes × cells, row names = gene symbols | |
| 64 | + |
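Because auto-detection for `.mtx` input relies on `genes.tsv` and `barcodes.tsv` sitting next to the matrix, it can help to confirm they are there before submitting a job. The function below is a minimal sketch; `check_mtx_sidecars` is a hypothetical name, and it looks only for the two file names listed in the table, not for compressed or renamed variants.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether genes.tsv and barcodes.tsv exist in
# the same directory as the given matrix file. Returns 1 if either is missing.
check_mtx_sidecars() {
    local mtx="$1"
    local dir name missing=0
    dir="$(dirname "$mtx")"
    for name in genes.tsv barcodes.tsv; do
        if [[ -f "$dir/$name" ]]; then
            echo "found: $name"
        else
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

If the function reports a missing file, pass the paths explicitly with `--genes` and `--barcodes`.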
| 65 | +--- |
| 66 | + |
| 67 | +## 4. Usage Examples |
| 68 | + |
| 69 | +### 4a. 10x MTX input (auto-detect gene/barcode files) |
| 70 | + |
| 71 | +```bash |
| 72 | +singularity run copykat_py.sif \ |
| 73 | + -i /data/sample1/matrix.mtx \ |
| 74 | + -o /results/sample1/ \ |
| 75 | + --n-cores 8 |
| 76 | +``` |
| 77 | + |
| 78 | +### 4b. CSV/TSV count matrix |
| 79 | + |
| 80 | +```bash |
| 81 | +singularity run copykat_py.sif \ |
| 82 | + -i /data/counts.csv \ |
| 83 | + -o /results/sample1/ \ |
| 84 | + --sample-name sample1 |
| 85 | +``` |
| 86 | + |
| 87 | +### 4c. With explicit gene / barcode files (MTX) |
| 88 | + |
| 89 | +```bash |
| 90 | +singularity run copykat_py.sif \ |
| 91 | + -i /data/matrix.mtx \ |
| 92 | + --genes /data/genes.tsv \ |
| 93 | + --barcodes /data/barcodes.tsv \ |
| 94 | + -o /results/sample1/ |
| 95 | +``` |
| 96 | + |
| 97 | +### 4d. Provide known normal-cell barcodes |
| 98 | + |
| 99 | +```bash |
| 100 | +singularity run copykat_py.sif \ |
| 101 | + -i /data/matrix.mtx \ |
| 102 | + --norm-cells /data/normal_barcodes.txt \ |
| 103 | + -o /results/sample1/ |
| 104 | +``` |
| 105 | + |
| 106 | +### 4e. Mouse genome + IGV .seg output |
| 107 | + |
| 108 | +```bash |
| 109 | +singularity run copykat_py.sif \ |
| 110 | + -i /data/matrix.mtx \ |
| 111 | + --genome mm10 \ |
| 112 | + --output-seg \ |
| 113 | + -o /results/sample1/ |
| 114 | +``` |
| 115 | + |
| 116 | +### 4f. Full parameter set |
| 117 | + |
| 118 | +```bash |
| 119 | +singularity run copykat_py.sif \ |
| 120 | + -i /data/matrix.mtx \ |
| 121 | + -o /results/sample1/ \ |
| 122 | + --sample-name sample1 \ |
| 123 | + --genome hg20 \ |
| 124 | + --id-type S \ |
| 125 | + --cell-line no \ |
| 126 | + --ngene-chr 5 \ |
| 127 | + --min-genes 200 \ |
| 128 | + --low-dr 0.05 \ |
| 129 | + --up-dr 0.1 \ |
| 130 | + --win-size 25 \ |
| 131 | + --ks-cut 0.1 \ |
| 132 | + --distance euclidean \ |
| 133 | + --n-cores 16 \ |
| 134 | + --output-seg |
| 135 | +``` |
| 136 | + |
| 137 | +--- |
| 138 | + |
| 139 | +## 5. Binding Host Paths |
| 140 | + |
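| 141 | +By default Singularity bind-mounts only a few host paths, typically `$HOME`, `/tmp`, and the current working directory (`$PWD`), plus anything the site administrator has configured. For data elsewhere, bind explicitly: |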
| 142 | + |
| 143 | +```bash |
| 144 | +singularity run \ |
| 145 | + --bind /scratch/mydata:/data \ |
| 146 | + --bind /scratch/results:/results \ |
| 147 | + copykat_py.sif \ |
| 148 | + -i /data/matrix.mtx \ |
| 149 | + -o /results/sample1/ |
| 150 | +``` |
| 151 | + |
| 152 | +--- |
| 153 | + |
| 154 | +## 6. SLURM Job Script Template |
| 155 | + |
| 156 | +```bash |
| 157 | +#!/usr/bin/env bash |
| 158 | +#SBATCH --job-name=copykat_py |
| 159 | +#SBATCH --cpus-per-task=16 |
| 160 | +#SBATCH --mem=64G |
| 161 | +#SBATCH --time=04:00:00 |
| 162 | +#SBATCH --output=logs/%x_%j.out |
| 163 | + |
| 164 | +SIF=/path/to/copykat_py.sif |
| 165 | +INPUT=/scratch/$USER/data/matrix.mtx |
| 166 | +OUTDIR=/scratch/$USER/results/sample1 |
| 167 | + |
| 168 | +singularity run \ |
| 169 | + --bind /scratch/$USER:/scratch/$USER \ |
| 170 | + "$SIF" \ |
| 171 | + -i "$INPUT" \ |
| 172 | + -o "$OUTDIR" \ |
| 173 | + --n-cores "$SLURM_CPUS_PER_TASK" \ |
| 174 | + --sample-name sample1 |
| 175 | +``` |
| 176 | + |
| 177 | +--- |
| 178 | + |
| 179 | +## 7. All CLI Options |
| 180 | + |
| 181 | +| Flag | Default | Description | |
| 182 | +|------|---------|-------------| |
| 183 | +| `-i / --input` | *(required)* | Input matrix file (`.mtx`, `.csv`, `.tsv`, `.txt`) | |
| 184 | +| `-o / --output-dir` | `.` | Output directory (created if absent) | |
| 185 | +| `--genes` | auto-detect | Gene names file (for `.mtx` input) | |
| 186 | +| `--barcodes` | auto-detect | Barcode names file (for `.mtx` input) | |
| 187 | +| `--sample-name` | `""` | Prefix for all output files | |
| 188 | +| `--genome` | `hg20` | Reference genome: `hg20` or `mm10` | |
| 189 | +| `--id-type` | `S` | Gene ID type: `S` = symbol, `E` = Ensembl | |
| 190 | +| `--cell-line` | `no` | Pure cell-line mode: `yes` / `no` | |
| 191 | +| `--ngene-chr` | `5` | Minimum genes per chromosome to keep | |
| 192 | +| `--min-genes` | `200` | Minimum genes expressed per cell | |
| 193 | +| `--low-dr` | `0.05` | Min detection rate for smoothing window | |
| 194 | +| `--up-dr` | `0.1` | Min detection rate for segmentation | |
| 195 | +| `--win-size` | `25` | Window size for CBS segmentation | |
| 196 | +| `--ks-cut` | `0.1` | KS-test p-value cutoff for breakpoints | |
| 197 | +| `--distance` | `euclidean` | Distance metric: `euclidean`, `pearson`, `spearman` | |
| 198 | +| `--norm-cells` | `""` | File with known normal-cell barcodes (one per line) | |
| 199 | +| `--output-seg` | off | Emit `.seg` file compatible with IGV | |
| 200 | +| `--n-cores` | `1` | CPU cores for parallel steps | |
| 201 | + |
| 202 | +--- |
| 203 | + |
| 204 | +## 8. Output Files |
| 205 | + |
| 206 | +| File | Description | |
| 207 | +|------|-------------| |
| 208 | +| `<sample>_copykat_CNA_results.csv` | Per-cell CNA matrix (genes × cells) | |
| 209 | +| `<sample>_copykat_prediction.txt` | Aneuploid / diploid prediction per cell | |
| 210 | +| `<sample>_copykat_heatmap.png` | CNA heatmap with dendrogram | |
| 211 | +| `<sample>_copykat_CNA_raw_results.csv` | Raw (un-binned) CNA values | |
| 212 | +| `<sample>.seg` | IGV-compatible segment file *(if `--output-seg`)* | |
| 213 | + |
| 214 | +--- |
| 215 | + |
| 216 | +## Troubleshooting |
| 217 | + |
| 218 | +**`FATAL: container creation failed`** — ensure Singularity ≥ 3.8 and that `--fakeroot` is supported on your cluster, or build with sudo on a workstation. |
| 219 | + |
| 220 | +**`ModuleNotFoundError: numba`** — numba is installed via conda (not pip) in the definition file to ensure LLVM compatibility; rebuilding the image should resolve it. |
| 221 | + |
| 222 | +**`ModuleNotFoundError: fastcluster`** — fastcluster is installed via conda-forge; if it fails to resolve during build, replace the conda line with `pip install "fastcluster>=1.3.0"` after the conda block. Keep the quotes: an unquoted `>=` is treated by the shell as an output redirection. |
| 223 | + |
| 224 | +**Blank / missing plots** — `MPLBACKEND=Agg` is set in the container so matplotlib writes files without a display. If you still see Qt/Tk errors, add `--env MPLBACKEND=Agg` to your `singularity run` call. |
| 225 | + |
| 226 | +**Out of memory** — reduce `--n-cores` or request more RAM in your scheduler job; the DLM smoothing step scales with `n_cells × n_genes`. |
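To get a feel for the `n_cells × n_genes` scaling, a back-of-the-envelope estimate is often enough. The sketch below assumes a single dense matrix of 8-byte floats, which is an assumption for illustration and not a statement about how CopyKAT-Py stores data; real peak usage will be higher because of intermediate copies and per-worker overhead. `estimate_mib` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Rough size in MiB of one dense n_cells x n_genes matrix,
# ASSUMING 8 bytes per value (float64). Integer arithmetic, rounds down.
estimate_mib() {
    local n_cells="$1" n_genes="$2"
    echo $(( n_cells * n_genes * 8 / 1024 / 1024 ))
}

estimate_mib 10000 20000   # prints 1525
```

Treat the result as a lower bound per matrix copy when choosing `--mem` in the scheduler request.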