|
1 | | -# CopyKAT-Py: Python Implementation of CopyKAT |
| 1 | +# CopyKAT-Python |
2 | 2 |
|
3 | | -A robust, high-efficiency Python implementation of CopyKAT (Copynumber Karyotyping of Aneuploid Tumors) |
4 | | -for inferring genomic copy number profiles from single-cell RNA-seq data. |
| 3 | +CopyKAT-Python is a Python implementation of the CopyKAT workflow for inferring large-scale copy number alterations from single-cell RNA-seq data. It is designed for Python-based single-cell analysis pipelines and aims to improve usability, scalability, and integration with modern `AnnData`/`Scanpy` workflows. |
5 | 4 |
|
6 | | -## Key Improvements over R version |
| 5 | +## Why CopyKAT-Python? |
7 | 6 |
|
8 | | -- **No cell limit**: Overcomes the R `hclust` 65,536 cell barrier using mini-batch and approximate methods |
9 | | -- **Faster execution**: Leverages NumPy vectorization, sparse matrices, and multiprocessing |
10 | | -- **Same algorithm**: Faithfully reimplements the Bayesian segmentation approach (FTT → DLM smoothing → GMM baseline → MCMC/KS segmentation) |
| 7 | +The original CopyKAT-R package is widely used for distinguishing aneuploid tumor cells from diploid normal cells using scRNA-seq data. However, users have reported practical limitations when applying CopyKAT-R to large modern datasets. |
| 8 | + |
| 9 | +Open GitHub issues in the original CopyKAT repository report two recurring problems:
| 10 | + |
| 11 | +- Long runtime, including reports of >1 hour for ~8,000 cells. |
| 12 | +- Difficulty running very large datasets, including hundreds of thousands to millions of cells. |
| 13 | + |
| 14 | +CopyKAT-Python was developed to address these practical issues while preserving the main biological idea of CopyKAT: large-scale chromosomal expression patterns can be used to infer copy number profiles and separate malignant from non-malignant cells. |
| 15 | + |
| 16 | +## What Is Improved? |
| 17 | + |
| 18 | +Compared with CopyKAT-R, CopyKAT-Python focuses on: |
| 19 | + |
| 20 | +- Native Python workflow support. |
| 21 | +- Easier integration with `AnnData`, `Scanpy`, and Python pipelines. |
| 22 | +- Improved handling of large datasets. |
| 23 | +- More transparent intermediate outputs. |
| 24 | +- Clearer confidence reporting for uncertain cells. |
| 25 | +- More flexible downstream use of CNV matrices and cell-level annotations. |
| 26 | +- Better reproducibility through Python package management and scripted workflows. |
| 27 | + |
| 28 | +CopyKAT-Python is not intended to be a line-by-line clone of CopyKAT-R. It is a Python reimplementation designed to reproduce the core CopyKAT strategy while improving scalability and usability. |
| 29 | + |
| 30 | +## Confidence of Results |
| 31 | + |
| 32 | +CopyKAT-Python reports tumor/normal predictions together with confidence-related outputs. High-confidence results usually show clear chromosome-arm or whole-chromosome CNV patterns, consistent CNV profiles within clusters, and strong separation between inferred diploid and aneuploid cells. |
| 33 | + |
| 34 | +Lower-confidence results may occur in samples with weak CNV signal, low sequencing depth, few normal reference cells, or strong batch effects, and in tumors with near-diploid genomes.
| 35 | + |
| 36 | +## Why Results May Differ from CopyKAT-R |
| 37 | + |
| 38 | +CopyKAT-Python results may not be identical to CopyKAT-R because of differences in: |
| 39 | + |
| 40 | +- Gene annotation versions. |
| 41 | +- Filtering and preprocessing. |
| 42 | +- Numerical implementation. |
| 43 | +- Smoothing and segmentation details. |
| 44 | +- Clustering behavior and random seeds. |
| 45 | +- Handling of uncertain or `not.defined` cells. |
| 46 | + |
| 47 | +These differences are expected for an independent Python implementation. |
| 48 | + |
| 49 | +## Figures and Tables to Add |
| 50 | + |
| 51 | +### Figure 1. Workflow overview |
| 52 | + |
| 53 | +Input expression matrix → gene genomic ordering → smoothing → CNV inference → clustering → tumor/normal prediction. |
| 54 | + |
| 55 | +### Figure 2. Example CNV heatmap |
| 56 | + |
| 57 | +Show inferred CNV profiles across chromosomes with cells grouped by predicted tumor/normal status. |
| 58 | + |
| 59 | +### Figure 3. CopyKAT-R vs CopyKAT-Python comparison |
| 60 | + |
| 61 | +Show side-by-side CNV heatmaps or classification agreement on the same dataset. |
| 62 | + |
| 63 | +### Table 1. Runtime and scalability benchmark |
| 64 | + |
| 65 | +| Dataset | Cells | Genes | CopyKAT-R runtime | CopyKAT-Python runtime | Notes | |
| 66 | +|---|---:|---:|---:|---:|---| |
| 67 | +| TODO | TODO | TODO | TODO | TODO | TODO | |
| 68 | + |
| 69 | +### Table 2. Classification concordance |
| 70 | + |
| 71 | +| Dataset | Tumor/normal agreement | Aneuploid agreement | Diploid agreement | Uncertain cells | Notes | |
| 72 | +|---|---:|---:|---:|---:|---| |
| 73 | +| TODO | TODO | TODO | TODO | TODO | TODO | |
11 | 74 |
|
12 | 75 | ## Installation |
13 | 76 |
|
14 | 77 | ```bash |
15 | | -conda env create -f environment.yml |
16 | | -conda activate copykat_py |
| 78 | +git clone https://github.com/NavinLab/copykat-python.git |
| 79 | +cd copykat-python |
17 | 80 | pip install -e . |
18 | | -``` |
19 | | - |
20 | | -## Quick Start |
21 | | - |
22 | | -```python |
23 | | -import copykat_py |
24 | | - |
25 | | -results = copykat_py.copykat( |
26 | | - rawmat="path/to/matrix.mtx", # or a pandas DataFrame / scipy sparse matrix |
27 | | - id_type="S", # "S" for Symbol, "E" for Ensembl |
28 | | - genome="hg20", |
29 | | - n_cores=8, |
30 | | - sam_name="test", |
31 | | -) |
32 | | - |
33 | | -# Access results |
34 | | -predictions = results["prediction"] # DataFrame: cell.names, copykat.pred |
35 | | -cna_mat = results["CNAmat"] # DataFrame: chrom, chrompos, abspos, cell1, cell2, ... |
36 | | -``` |
37 | | - |
38 | | -## Parameters |
39 | | - |
40 | | -| Parameter | Default | Description | |
41 | | -|-----------|---------|-------------| |
42 | | -| rawmat | required | UMI count matrix (genes×cells), DataFrame, sparse matrix, or path | |
43 | | -| id_type | "S" | Gene ID type: "S" (Symbol) or "E" (Ensembl) | |
44 | | -| cell_line | "no" | "yes" for pure cell line data | |
45 | | -| ngene_chr | 5 | Min genes per chromosome for cell filtering | |
46 | | -| LOW_DR | 0.05 | Min gene detection rate for smoothing | |
47 | | -| UP_DR | 0.1 | Min gene detection rate for segmentation | |
48 | | -| win_size | 25 | Window size for segmentation | |
49 | | -| norm_cell_names | "" | Known normal cell barcodes (list or "") | |
50 | | -| KS_cut | 0.1 | KS test cutoff for breakpoint calling | |
51 | | -| sam_name | "" | Sample name prefix for output files | |
52 | | -| distance | "euclidean" | Distance metric: "euclidean", "pearson", "spearman" | |
53 | | -| n_cores | 1 | Number of CPU cores | |
54 | | -| genome | "hg20" | Genome: "hg20" or "mm10" | |
55 | | -| output_seg | False | Output .seg file for IGV | |
56 | | -| plot_genes | True | Plot gene-level heatmap | |
57 | | -| min_gene_per_cell | 200 | Minimum genes per cell | |
58 | | - |
59 | | -## Reference |
60 | | - |
61 | | -Gao, R., et al. "Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes." |
62 | | -Nature Biotechnology 39, 599–608 (2021). |
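The workflow listed under "Figure 1. Workflow overview" in the added README text (genomic ordering, smoothing, CNV inference relative to normal cells) can be illustrated with a toy sketch. This is not the CopyKAT-Python implementation: the function names `running_mean` and `relative_profiles`, the input layout, and the plain running mean are all assumptions made for illustration. The earlier README describes DLM smoothing and Bayesian segmentation, which this sketch does not attempt.

```python
from statistics import mean


def running_mean(values, window):
    """Centered running mean; the window shrinks at both ends of the list."""
    half = window // 2
    return [
        mean(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def relative_profiles(expr, gene_pos, reference_cells, window=3):
    """Toy relative-expression profiles (hypothetical helper, not a package API).

    expr: {cell: {gene: log-scale expression}}
    gene_pos: {gene: (chromosome_index, start_position)}
    reference_cells: cells assumed to be diploid
    """
    # 1. Order genes by chromosome, then by position within the chromosome.
    genes = sorted(gene_pos, key=gene_pos.get)
    # 2. Per-gene baseline: mean expression across the assumed-diploid cells.
    baseline = {g: mean(expr[c][g] for c in reference_cells) for g in genes}
    # 3. Subtract the baseline and smooth along the genome.
    #    Simplification: the window here runs across chromosome boundaries.
    profiles = {}
    for cell, values in expr.items():
        relative = [values[g] - baseline[g] for g in genes]
        profiles[cell] = running_mean(relative, window)
    return genes, profiles


expr = {
    "normal1": {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0},
    "normal2": {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0},
    "tumor1": {"A": 2.0, "B": 2.0, "C": 2.0, "D": 2.0},
}
gene_pos = {"B": (1, 200), "A": (1, 100), "D": (2, 50), "C": (1, 300)}
genes, profiles = relative_profiles(expr, gene_pos, ["normal1", "normal2"])
print(genes)
print(profiles["tumor1"])
```

In this toy input the tumor cell sits one log unit above the reference baseline at every gene, so its smoothed profile is uniformly elevated while the reference cells stay at zero. A real pipeline would smooth per chromosome and would need a way to pick reference cells when none are known in advance.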
|
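The columns of "Table 2. Classification concordance" in the added README text could be filled in with a small helper along the following lines. This is a minimal sketch under stated assumptions: the function name `concordance` is hypothetical, the inputs are assumed to be plain dictionaries mapping cell barcodes to the labels `aneuploid`, `diploid`, or `not.defined`, and per-class agreement is defined here with the CopyKAT-R call as the reference. The README itself does not define these metrics.

```python
from collections import Counter


def concordance(r_calls, py_calls, uncertain_label="not.defined"):
    """Compare per-cell calls from two runs, keyed by cell barcode.

    Hypothetical helper: agreement is measured on cells present in both
    runs, and cells that either run left uncertain are counted separately.
    """
    shared = sorted(set(r_calls) & set(py_calls))
    pairs = [(r_calls[cell], py_calls[cell]) for cell in shared]
    confident = [(r, p) for r, p in pairs if uncertain_label not in (r, p)]
    counts = Counter(confident)

    def agreement(label):
        # Fraction of cells called `label` by the R run that the Python
        # run also calls `label`.
        total = sum(n for (r, _), n in counts.items() if r == label)
        return counts[(label, label)] / total if total else float("nan")

    matching = sum(n for (r, p), n in counts.items() if r == p)
    return {
        "n_shared": len(shared),
        "n_uncertain": len(pairs) - len(confident),
        "overall_agreement": matching / len(confident) if confident else float("nan"),
        "aneuploid_agreement": agreement("aneuploid"),
        "diploid_agreement": agreement("diploid"),
    }


r_calls = {"c1": "aneuploid", "c2": "aneuploid", "c3": "diploid",
           "c4": "not.defined", "c5": "diploid"}
py_calls = {"c1": "aneuploid", "c2": "diploid", "c3": "diploid",
            "c4": "diploid", "c6": "diploid"}
print(concordance(r_calls, py_calls))
```

Reporting the number of uncertain cells next to the agreement fractions matters because excluding `not.defined` cells can make two runs look more concordant than they are when one of them declines to call many cells.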