Commit d60bd22

Author: Chenling Tang
Merge remote initial files
2 parents 12004d8 + 038ad65 commit d60bd22

2 files changed

Lines changed: 93 additions & 54 deletions

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
MIT License
2+
3+
Copyright (c) 2026 navinlabcode
4+
5+
Permission is hereby granted, free of charge, to any person obtaining a copy
6+
of this software and associated documentation files (the "Software"), to deal
7+
in the Software without restriction, including without limitation the rights
8+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+
copies of the Software, and to permit persons to whom the Software is
10+
furnished to do so, subject to the following conditions:
11+
12+
The above copyright notice and this permission notice shall be included in all
13+
copies or substantial portions of the Software.
14+
15+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+
SOFTWARE.

README.md

Lines changed: 72 additions & 54 deletions
@@ -1,62 +1,80 @@
1-
# CopyKAT-Py: Python Implementation of CopyKAT
1+
# CopyKAT-Python
22

3-
A robust, high-efficiency Python implementation of CopyKAT (Copynumber Karyotyping of Aneuploid Tumors)
4-
for inferring genomic copy number profiles from single-cell RNA-seq data.
3+
CopyKAT-Python is a Python implementation of the CopyKAT workflow for inferring large-scale copy number alterations from single-cell RNA-seq data. It is designed for Python-based single-cell analysis pipelines and aims to improve usability, scalability, and integration with modern `AnnData`/`Scanpy` workflows.
54

6-
## Key Improvements over R version
5+
## Why CopyKAT-Python?
76

8-
- **No cell limit**: Overcomes the R `hclust` 65,536 cell barrier using mini-batch and approximate methods
9-
- **Faster execution**: Leverages NumPy vectorization, sparse matrices, and multiprocessing
10-
- **Same algorithm**: Faithfully reimplements the Bayesian segmentation approach (FTT → DLM smoothing → GMM baseline → MCMC/KS segmentation)
7+
The original CopyKAT-R package is widely used for distinguishing aneuploid tumor cells from diploid normal cells using scRNA-seq data. However, users have reported practical limitations when applying CopyKAT-R to large modern datasets.
8+
9+
Open GitHub issues in the original CopyKAT repository highlight several recurring needs:
10+
11+
- Long runtime, including reports of >1 hour for ~8,000 cells.
12+
- Difficulty running very large datasets, including hundreds of thousands to millions of cells.
13+
14+
CopyKAT-Python was developed to address these practical issues while preserving the main biological idea of CopyKAT: large-scale chromosomal expression patterns can be used to infer copy number profiles and separate malignant from non-malignant cells.
15+
16+
## What Is Improved?
17+
18+
Compared with CopyKAT-R, CopyKAT-Python focuses on:
19+
20+
- Native Python workflow support.
21+
- Easier integration with `AnnData`, `Scanpy`, and Python pipelines.
22+
- Improved handling of large datasets.
23+
- More transparent intermediate outputs.
24+
- Clearer confidence reporting for uncertain cells.
25+
- More flexible downstream use of CNV matrices and cell-level annotations.
26+
- Better reproducibility through Python package management and scripted workflows.
27+
28+
CopyKAT-Python is not intended to be a line-by-line clone of CopyKAT-R. It is a Python reimplementation designed to reproduce the core CopyKAT strategy while improving scalability and usability.
29+
30+
## Confidence of Results
31+
32+
CopyKAT-Python reports tumor/normal predictions together with confidence-related outputs. High-confidence results usually show clear chromosome-arm or whole-chromosome CNV patterns, consistent CNV profiles within clusters, and strong separation between inferred diploid and aneuploid cells.
33+
34+
Lower-confidence results may occur in samples with weak CNV signal, low sequencing depth, few normal reference cells, strong batch effects, or tumors with near-diploid genomes.
35+
36+
## Why Results May Differ from CopyKAT-R
37+
38+
CopyKAT-Python results may not be identical to CopyKAT-R because of differences in:
39+
40+
- Gene annotation versions.
41+
- Filtering and preprocessing.
42+
- Numerical implementation.
43+
- Smoothing and segmentation details.
44+
- Clustering behavior and random seeds.
45+
- Handling of uncertain or `not.defined` cells.
46+
47+
These differences are expected for an independent Python implementation.
48+
49+
## Figures and Tables to Add
50+
51+
### Figure 1. Workflow overview
52+
53+
Input expression matrix → gene genomic ordering → smoothing → CNV inference → clustering → tumor/normal prediction.
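The first steps of the workflow line above (genomic ordering, then smoothing relative to a diploid baseline) can be illustrated with a toy sketch. This is not the package's implementation: the function names, the `(chromosome, start)` gene coordinates, and the plain centered moving average are all assumptions made for illustration, and the average here stands in for whatever smoothing the package actually applies. It also ignores chromosome boundaries, which a real pipeline would need to respect.

```python
from statistics import mean


def order_genes(gene_pos):
    """Sort gene names by (chromosome index, start position)."""
    return sorted(gene_pos, key=lambda gene: gene_pos[gene])


def moving_average(values, window):
    """Centered moving average; the window shrinks at both ends."""
    half = window // 2
    return [
        mean(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def relative_profile(cell_expr, baseline_expr, ordered_genes, window=3):
    """Smoothed expression of one cell minus a diploid baseline, in genomic order."""
    diffs = [cell_expr[g] - baseline_expr[g] for g in ordered_genes]
    return moving_average(diffs, window)


# Hypothetical toy input: gene -> (chromosome index, start position)
gene_pos = {"G3": (1, 300), "G1": (1, 100), "G2": (1, 200), "G4": (2, 50)}
ordered = order_genes(gene_pos)

# Hypothetical log-scale expression for one cell and for the normal baseline
cell = {"G1": 2.0, "G2": 2.0, "G3": 2.0, "G4": 1.0}
baseline = {"G1": 1.0, "G2": 1.0, "G3": 1.0, "G4": 1.0}
profile = relative_profile(cell, baseline, ordered, window=3)
```

Positive values in `profile` mark regions where the cell is expressed above the baseline, which is the kind of signal the later CNV inference and clustering steps would work from.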
54+
55+
### Figure 2. Example CNV heatmap
56+
57+
Show inferred CNV profiles across chromosomes with cells grouped by predicted tumor/normal status.
58+
59+
### Figure 3. CopyKAT-R vs CopyKAT-Python comparison
60+
61+
Show side-by-side CNV heatmaps or classification agreement on the same dataset.
62+
63+
### Table 1. Runtime and scalability benchmark
64+
65+
| Dataset | Cells | Genes | CopyKAT-R runtime | CopyKAT-Python runtime | Notes |
66+
|---|---:|---:|---:|---:|---|
67+
| TODO | TODO | TODO | TODO | TODO | TODO |
68+
69+
### Table 2. Classification concordance
70+
71+
| Dataset | Tumor/normal agreement | Aneuploid agreement | Diploid agreement | Uncertain cells | Notes |
72+
|---|---:|---:|---:|---:|---|
73+
| TODO | TODO | TODO | TODO | TODO | TODO |
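One way the columns of Table 2 could be filled in is sketched below. It assumes each tool's predictions are available as a mapping from cell barcode to one of the labels `aneuploid`, `diploid`, or `not.defined`; the function name, the dictionary layout, and the choice to measure per-class agreement relative to the CopyKAT-R calls are assumptions for illustration, not something the README specifies.

```python
from collections import Counter

UNCERTAIN = "not.defined"


def concordance(pred_r, pred_py):
    """Compare two {cell_barcode: label} mappings on the cells they share."""
    shared = sorted(set(pred_r) & set(pred_py))
    # Cells that both tools assigned to a definite class
    called = [c for c in shared if UNCERTAIN not in (pred_r[c], pred_py[c])]

    def class_agreement(label):
        # Fraction of cells with this label in pred_r that pred_py labels the same
        in_r = [c for c in called if pred_r[c] == label]
        if not in_r:
            return None
        return sum(pred_py[c] == label for c in in_r) / len(in_r)

    agree = sum(pred_r[c] == pred_py[c] for c in called)
    return {
        "n_shared": len(shared),
        "tumor_normal_agreement": agree / len(called) if called else None,
        "aneuploid_agreement": class_agreement("aneuploid"),
        "diploid_agreement": class_agreement("diploid"),
        "uncertain_cells": len(shared) - len(called),
        "confusion": dict(Counter((pred_r[c], pred_py[c]) for c in shared)),
    }


# Hypothetical toy predictions; real barcodes and labels would come from each tool's output
r_calls = {"c1": "aneuploid", "c2": "aneuploid", "c3": "diploid",
           "c4": "diploid", "c5": "not.defined"}
py_calls = {"c1": "aneuploid", "c2": "diploid", "c3": "diploid",
            "c4": "diploid", "c5": "diploid", "c6": "aneuploid"}
summary = concordance(r_calls, py_calls)
```

Cells that either tool leaves as `not.defined` are counted separately instead of being treated as disagreements, so the agreement columns reflect only cells with a definite call from both tools.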
1174

1275
## Installation
1376

1477
```bash
15-
conda env create -f environment.yml
16-
conda activate copykat_py
78+
git clone https://github.com/NavinLab/copykat-python.git
79+
cd copykat-python
1780
pip install -e .
18-
```
19-
20-
## Quick Start
21-
22-
```python
23-
import copykat_py
24-
25-
results = copykat_py.copykat(
26-
rawmat="path/to/matrix.mtx", # or a pandas DataFrame / scipy sparse matrix
27-
id_type="S", # "S" for Symbol, "E" for Ensembl
28-
genome="hg20",
29-
n_cores=8,
30-
sam_name="test",
31-
)
32-
33-
# Access results
34-
predictions = results["prediction"] # DataFrame: cell.names, copykat.pred
35-
cna_mat = results["CNAmat"] # DataFrame: chrom, chrompos, abspos, cell1, cell2, ...
36-
```
37-
38-
## Parameters
39-
40-
| Parameter | Default | Description |
41-
|-----------|---------|-------------|
42-
| rawmat | required | UMI count matrix (genes×cells), DataFrame, sparse matrix, or path |
43-
| id_type | "S" | Gene ID type: "S" (Symbol) or "E" (Ensembl) |
44-
| cell_line | "no" | "yes" for pure cell line data |
45-
| ngene_chr | 5 | Min genes per chromosome for cell filtering |
46-
| LOW_DR | 0.05 | Min gene detection rate for smoothing |
47-
| UP_DR | 0.1 | Min gene detection rate for segmentation |
48-
| win_size | 25 | Window size for segmentation |
49-
| norm_cell_names | "" | Known normal cell barcodes (list or "") |
50-
| KS_cut | 0.1 | KS test cutoff for breakpoint calling |
51-
| sam_name | "" | Sample name prefix for output files |
52-
| distance | "euclidean" | Distance metric: "euclidean", "pearson", "spearman" |
53-
| n_cores | 1 | Number of CPU cores |
54-
| genome | "hg20" | Genome: "hg20" or "mm10" |
55-
| output_seg | False | Output .seg file for IGV |
56-
| plot_genes | True | Plot gene-level heatmap |
57-
| min_gene_per_cell | 200 | Minimum genes per cell |
58-
59-
## Reference
60-
61-
Gao, R., et al. "Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes."
62-
Nature Biotechnology 39, 599–608 (2021).
