Commit 12004d8

author
Chenling Tang
committed
Initial commit
0 parents  commit 12004d8

58 files changed

Lines changed: 1550319 additions & 0 deletions

README.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# CopyKAT-Py: Python Implementation of CopyKAT

A robust, high-efficiency Python implementation of CopyKAT (Copynumber Karyotyping of Aneuploid Tumors)
for inferring genomic copy number profiles from single-cell RNA-seq data.

## Key Improvements over R version

- **No cell limit**: Overcomes the R `hclust` 65,536 cell barrier using mini-batch and approximate methods
- **Faster execution**: Leverages NumPy vectorization, sparse matrices, and multiprocessing
- **Same algorithm**: Faithfully reimplements the Bayesian segmentation approach (FTT → DLM smoothing → GMM baseline → MCMC/KS segmentation)
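
For orientation, the first stage of that pipeline, the Freeman-Tukey transformation (FTT), is a variance-stabilising transform for count data. The sketch below assumes the log-scaled form `log(sqrt(x) + sqrt(x + 1))`; the function name `freeman_tukey` is illustrative and is not part of the `copykat_py` API.

```python
import math


def freeman_tukey(count):
    """Log-scaled Freeman-Tukey transform of a single non-negative count.

    Assumption: the pipeline applies log(sqrt(x) + sqrt(x + 1)).
    This helper is illustrative and not part of the copykat_py API.
    """
    if count < 0:
        raise ValueError("counts must be non-negative")
    return math.log(math.sqrt(count) + math.sqrt(count + 1))


# Transform one gene's UMI counts across five cells.
counts = [0, 1, 4, 10, 100]
transformed = [freeman_tukey(c) for c in counts]
```

A zero count maps to exactly 0, and the transformed values grow roughly logarithmically with the raw count.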

## Installation

```bash
conda env create -f environment.yml
conda activate copykat_py
pip install -e .
```

## Quick Start

```python
import copykat_py

results = copykat_py.copykat(
    rawmat="path/to/matrix.mtx",  # or a pandas DataFrame / scipy sparse matrix
    id_type="S",                  # "S" for Symbol, "E" for Ensembl
    genome="hg20",
    n_cores=8,
    sam_name="test",
)

# Access results
predictions = results["prediction"]  # DataFrame: cell.names, copykat.pred
cna_mat = results["CNAmat"]          # DataFrame: chrom, chrompos, abspos, cell1, cell2, ...
```

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| rawmat | required | UMI count matrix (genes×cells), DataFrame, sparse matrix, or path |
| id_type | "S" | Gene ID type: "S" (Symbol) or "E" (Ensembl) |
| cell_line | "no" | "yes" for pure cell line data |
| ngene_chr | 5 | Min genes per chromosome for cell filtering |
| LOW_DR | 0.05 | Min gene detection rate for smoothing |
| UP_DR | 0.1 | Min gene detection rate for segmentation |
| win_size | 25 | Window size for segmentation |
| norm_cell_names | "" | Known normal cell barcodes (list or "") |
| KS_cut | 0.1 | KS test cutoff for breakpoint calling |
| sam_name | "" | Sample name prefix for output files |
| distance | "euclidean" | Distance metric: "euclidean", "pearson", "spearman" |
| n_cores | 1 | Number of CPU cores |
| genome | "hg20" | Genome: "hg20" or "mm10" |
| output_seg | False | Output .seg file for IGV |
| plot_genes | True | Plot gene-level heatmap |
| min_gene_per_cell | 200 | Minimum genes per cell |

## Reference

Gao, R., et al. "Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes."
Nature Biotechnology 39, 599–608 (2021).

SINGULARITY_GUIDE.md

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
# CopyKAT-Py — Singularity Container Guide

## Overview

This guide covers building and running the CopyKAT-Py Singularity container.
The image bundles Python 3.11 and all required libraries in an isolated conda environment, so no local Python installation is needed on the compute node.

---

## Prerequisites

| Requirement | Notes |
|-------------|-------|
| Singularity ≥ 3.8 (or Apptainer ≥ 1.0) | Available on most HPC clusters; check with `singularity --version` |
| Root / `fakeroot` or a build node | Required only for **building**; running needs no special privilege |
| ~4 GB free disk space | For the `.sif` image |

---

## 1. Build the Container

Run from the `copykat_py/` directory, where `copykat_py.def` lives. The definition file's `%files` section also expects the packed conda environment `copykat_py_env.tar.gz` in that directory:

```bash
cd /path/to/copykat_py

# with root (local workstation)
sudo singularity build copykat_py.sif copykat_py.def

# without root — fakeroot (many HPC clusters)
singularity build --fakeroot copykat_py.sif copykat_py.def
```

The build copies the packed environment and the local package source into the image (`%files` section) and installs the package via `pip`, so **no internet access is needed at run time**.

> **Tip:** If your login node does not allow root or `--fakeroot` builds, transfer the source to a build node first, or ask your sysadmin to pre-build the image.

---

## 2. Verify the Image

```bash
singularity exec copykat_py.sif copykat-py --help
```

---

## 3. Running CopyKAT-Py

### Basic syntax

```bash
singularity run [singularity-flags] copykat_py.sif [copykat-py-flags]
# equivalent to:
singularity exec copykat_py.sif copykat-py [copykat-py-flags]
```

### Input formats

| Format | Description |
|--------|-------------|
| `.mtx` / `.mtx.gz` | 10x Genomics sparse matrix; `genes.tsv` and `barcodes.tsv` are auto-detected from the same directory |
| `.csv` / `.tsv` / `.txt` | Dense count matrix, genes × cells, row names = gene symbols |
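
To check in advance whether auto-detection is likely to succeed for an `.mtx` input, the directory lookup can be approximated as below. This is a sketch under stated assumptions: `find_sidecar_files` is a hypothetical helper, not part of the CLI, and it checks only the plain `genes.tsv` and `barcodes.tsv` names listed in the table.

```python
from pathlib import Path


def find_sidecar_files(matrix_path):
    """Return (genes, barcodes) paths found next to a matrix file.

    Hypothetical helper: each element is None when the file is absent.
    Only the names `genes.tsv` and `barcodes.tsv` are checked; the real
    CLI may accept other variants.
    """
    directory = Path(matrix_path).parent
    genes = directory / "genes.tsv"
    barcodes = directory / "barcodes.tsv"
    return (
        genes if genes.is_file() else None,
        barcodes if barcodes.is_file() else None,
    )
```

If either element is `None`, pass the file explicitly with `--genes` or `--barcodes`.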

---

## 4. Usage Examples

### 4a. 10x MTX input (auto-detect gene/barcode files)

```bash
singularity run copykat_py.sif \
    -i /data/sample1/matrix.mtx \
    -o /results/sample1/ \
    --n-cores 8
```

### 4b. CSV/TSV count matrix

```bash
singularity run copykat_py.sif \
    -i /data/counts.csv \
    -o /results/sample1/ \
    --sample-name sample1
```

### 4c. With explicit gene / barcode files (MTX)

```bash
singularity run copykat_py.sif \
    -i /data/matrix.mtx \
    --genes /data/genes.tsv \
    --barcodes /data/barcodes.tsv \
    -o /results/sample1/
```

### 4d. Provide known normal-cell barcodes

```bash
singularity run copykat_py.sif \
    -i /data/matrix.mtx \
    --norm-cells /data/normal_barcodes.txt \
    -o /results/sample1/
```
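
The same barcode file can also feed the Python API, whose `norm_cell_names` parameter takes a list. A minimal sketch, assuming the file holds one barcode per line; `read_barcodes` is an illustrative helper, not a `copykat_py` function:

```python
from pathlib import Path


def read_barcodes(path):
    """Read one barcode per line, dropping blank lines and surrounding whitespace."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]
```

The returned list could then be passed as `norm_cell_names=read_barcodes("/data/normal_barcodes.txt")` in the `copykat_py.copykat(...)` call.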

### 4e. Mouse genome + IGV .seg output

```bash
singularity run copykat_py.sif \
    -i /data/matrix.mtx \
    --genome mm10 \
    --output-seg \
    -o /results/sample1/
```

### 4f. Full parameter set

```bash
singularity run copykat_py.sif \
    -i /data/matrix.mtx \
    -o /results/sample1/ \
    --sample-name sample1 \
    --genome hg20 \
    --id-type S \
    --cell-line no \
    --ngene-chr 5 \
    --min-genes 200 \
    --low-dr 0.05 \
    --up-dr 0.1 \
    --win-size 25 \
    --ks-cut 0.1 \
    --distance euclidean \
    --n-cores 16 \
    --output-seg
```

---

## 5. Binding Host Paths

By default, Singularity bind-mounts only a few host paths, typically `$HOME`, the current working directory, and `/tmp` (the exact set depends on the site configuration). For data elsewhere, bind explicitly:

```bash
singularity run \
    --bind /scratch/mydata:/data \
    --bind /scratch/results:/results \
    copykat_py.sif \
    -i /data/matrix.mtx \
    -o /results/sample1/
```
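
When the container is launched from a Python driver script rather than a shell, the same command can be assembled as an argument list. The sketch below is illustrative only: `build_singularity_command` is a hypothetical helper, and it uses just the `--bind`, `-i`, and `-o` flags shown above.

```python
import shlex


def build_singularity_command(sif, binds, tool_args):
    """Assemble a `singularity run` argv list with one --bind flag per mapping.

    `binds` maps host paths to container paths (hypothetical helper).
    """
    command = ["singularity", "run"]
    for host_path, container_path in binds.items():
        command += ["--bind", f"{host_path}:{container_path}"]
    command.append(sif)
    command += list(tool_args)
    return command


command = build_singularity_command(
    "copykat_py.sif",
    {"/scratch/mydata": "/data", "/scratch/results": "/results"},
    ["-i", "/data/matrix.mtx", "-o", "/results/sample1/"],
)
# Show the shell-quoted form for logging or copy-paste.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` avoids shell quoting problems with paths that contain spaces.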

---

## 6. SLURM Job Script Template

```bash
#!/usr/bin/env bash
#SBATCH --job-name=copykat_py
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00
# SLURM does not create the logs/ directory; create it before submitting
#SBATCH --output=logs/%x_%j.out

SIF=/path/to/copykat_py.sif
INPUT=/scratch/$USER/data/matrix.mtx
OUTDIR=/scratch/$USER/results/sample1

singularity run \
    --bind "/scratch/$USER:/scratch/$USER" \
    "$SIF" \
    -i "$INPUT" \
    -o "$OUTDIR" \
    --n-cores "$SLURM_CPUS_PER_TASK" \
    --sample-name sample1
```

---

## 7. All CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `-i / --input` | *(required)* | Input matrix file (`.mtx`, `.csv`, `.tsv`, `.txt`) |
| `-o / --output-dir` | `.` | Output directory (created if absent) |
| `--genes` | auto-detect | Gene names file (for `.mtx` input) |
| `--barcodes` | auto-detect | Barcode names file (for `.mtx` input) |
| `--sample-name` | `""` | Prefix for all output files |
| `--genome` | `hg20` | Reference genome: `hg20` or `mm10` |
| `--id-type` | `S` | Gene ID type: `S` = symbol, `E` = Ensembl |
| `--cell-line` | `no` | Pure cell-line mode: `yes` / `no` |
| `--ngene-chr` | `5` | Minimum genes per chromosome to keep |
| `--min-genes` | `200` | Minimum genes expressed per cell |
| `--low-dr` | `0.05` | Min detection rate for smoothing window |
| `--up-dr` | `0.1` | Min detection rate for segmentation |
| `--win-size` | `25` | Window size for segmentation |
| `--ks-cut` | `0.1` | KS-test p-value cutoff for breakpoints |
| `--distance` | `euclidean` | Distance metric: `euclidean`, `pearson`, `spearman` |
| `--norm-cells` | `""` | File with known normal-cell barcodes (one per line) |
| `--output-seg` | off | Emit `.seg` file compatible with IGV |
| `--n-cores` | `1` | CPU cores for parallel steps |

---

## 8. Output Files

| File | Description |
|------|-------------|
| `<sample>_copykat_CNA_results.csv` | Per-cell CNA matrix (genes × cells) |
| `<sample>_copykat_prediction.txt` | Aneuploid / diploid prediction per cell |
| `<sample>_copykat_heatmap.png` | CNA heatmap with dendrogram |
| `<sample>_copykat_CNA_raw_results.csv` | Raw (un-binned) CNA values |
| `<sample>.seg` | IGV-compatible segment file *(if `--output-seg`)* |
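
As a quick sanity check after a run, the prediction file can be tallied with the standard library. The sketch below makes two assumptions that the table does not confirm: the file is tab-delimited with a header row, and the prediction column is named `copykat.pred`, as in the README's description of the prediction table. `count_predictions` is an illustrative helper.

```python
import csv
from collections import Counter


def count_predictions(path):
    """Tally the values in the `copykat.pred` column of a prediction file.

    Assumptions: tab-delimited text with a header row, and a column
    named `copykat.pred` (taken from the README's prediction table).
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["copykat.pred"] for row in reader)
```

Calling `count_predictions("sample1_copykat_prediction.txt")` returns a `Counter` keyed by the prediction labels found in the file.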

---

## Troubleshooting

**`FATAL: container creation failed`** — ensure Singularity ≥ 3.8 and that `--fakeroot` is supported on your cluster, or build with `sudo` on a workstation.

**`ModuleNotFoundError: numba`** — numba is installed via conda (not pip) in the definition file to ensure LLVM compatibility; rebuilding the image should resolve it.

**`ModuleNotFoundError: fastcluster`** — fastcluster is installed via conda-forge; if it fails to resolve during build, replace the conda line with `pip install "fastcluster>=1.3.0"` after the conda block (the quotes stop the shell from treating `>` as a redirect).

**Blank / missing plots** — `MPLBACKEND=Agg` is set in the container so matplotlib writes files without a display. If you still see Qt/Tk errors, add `--env MPLBACKEND=Agg` to your `singularity run` call.

**Out of memory** — reduce `--n-cores` or request more RAM in your scheduler job; the DLM smoothing step scales with `n_cells × n_genes`.
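
To size a memory request, a back-of-the-envelope estimate of one dense `n_cells × n_genes` matrix is a useful lower bound. The sketch below assumes dense float64 storage (8 bytes per value), which is an assumption about the implementation; the smoothing step may hold several such copies at once.

```python
def dense_matrix_gib(n_cells, n_genes, bytes_per_value=8):
    """Size in GiB of one dense n_genes x n_cells matrix (float64 by default)."""
    return n_cells * n_genes * bytes_per_value / 2**30


# Example: 100,000 cells x 10,000 retained genes.
estimate = dense_matrix_gib(100_000, 10_000)
print(f"{estimate:.2f} GiB per dense copy")  # prints: 7.45 GiB per dense copy
```

Multiply the result by the number of copies you expect to be resident at once, and by the number of worker processes if each one holds its own copy.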

copykat_py.def

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
Bootstrap: docker
From: ubuntu:22.04

%labels
    Maintainer CopyKAT-Py
    Version 1.0.0
    Description Python implementation of CopyKAT — copy-number karyotyping of aneuploid tumors from scRNA-seq

%files
    # packed conda env (generated with: conda-pack -n copykit_py -o copykat_py_env.tar.gz)
    copykat_py_env.tar.gz /opt/copykat_py_env.tar.gz
    # package source
    . /opt/copykat_py_src

%post
    set -eu

    # ── minimal system libs needed at runtime ───────────────────────────────
    apt-get update -qq && apt-get install -y --no-install-recommends \
        libgomp1 \
        && rm -rf /var/lib/apt/lists/*

    # ── unpack the pre-built conda env ───────────────────────────────────────
    mkdir -p /opt/conda/envs/copykit_py
    tar -xzf /opt/copykat_py_env.tar.gz -C /opt/conda/envs/copykit_py --no-same-owner
    rm /opt/copykat_py_env.tar.gz

    # fix hard-coded paths left by conda-pack
    /opt/conda/envs/copykit_py/bin/python /opt/conda/envs/copykit_py/bin/conda-unpack

    # ── install the copykat-py package into the unpacked env ─────────────────
    /opt/conda/envs/copykit_py/bin/python -m pip install --no-deps /opt/copykat_py_src

%environment
    export PATH="/opt/conda/envs/copykit_py/bin:${PATH}"
    export MPLBACKEND=Agg

%runscript
    exec copykat-py "$@"

%help
    CopyKAT-Py Singularity container (no internet access needed at run time).

    Quick-start examples:
      singularity run copykat_py.sif -i matrix.mtx -o results/ --n-cores 8
      singularity run copykat_py.sif -i counts.csv -o results/ --sample-name s1
      singularity exec copykat_py.sif copykat-py --help
