Genome-Anchored Foundation Model Embeddings Improve Molecular Prediction from Histology Images

Cheng Jin, Fengtao Zhou, Yunfang Yu, Jiabo Ma, Yihui Wang, Yingxue Xu, Huajun Zhou, Hao Jiang, Luyang Luo, Luhui Mao, Zifan He, Xiuming Zhang, Jing Zhang, Ronald Cheong Kin Chan, Herui Yao, and Hao Chen

Abstract

Precision oncology requires accurate molecular insights, yet obtaining them directly from genomic assays is costly and time-consuming, which limits broad clinical use. Predicting complex molecular features and patient prognosis directly from routine whole-slide images (WSIs) remains a major challenge for current deep learning methods. Here we introduce PathLUPI, which uses transcriptomic privileged information during training to extract genome-anchored histological embeddings, enabling effective molecular prediction using only WSIs at inference. In an extensive evaluation across 49 molecular oncology tasks using 11,257 cases from 20 cohorts, PathLUPI outperforms conventional methods trained solely on WSIs. Crucially, it achieves AUC ≥ 0.80 in 14 of the biomarker prediction and molecular subtyping tasks and C-index ≥ 0.70 in survival cohorts of 5 major cancer types. Moreover, PathLUPI embeddings reveal distinct cellular morphological signatures associated with specific genotypes and related biological pathways within WSIs. By effectively encoding molecular context to refine WSI representations, PathLUPI overcomes a key limitation of existing models and offers a novel strategy to bridge molecular insights with routine pathology workflows for wider clinical application.

Quick Start

Installation

# Clone repository
git clone https://github.com/ChengJin-git/PathLUPI.git
cd PathLUPI

# Create conda environment
conda create -n pathlupi python=3.8
conda activate pathlupi

# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install numpy pandas scikit-learn scipy matplotlib seaborn h5py openpyxl tqdm
pip install openslide-python

Basic Usage

Train PathLUPI on breast cancer ER prediction:

python main_code/main.py \
    --model PathLUPI \
    --task BRCA_ERSub \
    --excel_file main_code/labels/biomarker/BRCA_ER.csv \
    --root_path /path/to/wsi/features \
    --root_omic /path/to/transcriptomic/data \
    --signatures main_code/labels/signatures/hallmark_pathways.csv \
    --fold 0,1,2,3,4

Run batch training:

# All molecular subtyping tasks
bash main_code/scripts/subtyping_conch.sh

# All survival prediction tasks
bash main_code/scripts/survival_conch.sh

Project Structure

PathLUPI/
├── main_code/
│   ├── main.py                 # Main training script
│   ├── models/PathLUPI/        # PathLUPI model implementation
│   │   ├── network.py          # Model architecture
│   │   ├── engine.py           # Training engine
│   │   └── ...
│   ├── datasets/               # Dataset loaders
│   ├── labels/                 # Task labels and splits
│   │   ├── biomarker/          # Biomarker prediction labels
│   │   ├── molecular/          # Molecular subtyping labels
│   │   └── signatures/         # Pathway signatures
│   ├── scripts/                # Training scripts
│   └── utils/                  # Utility functions

🔧 Data Preparation

1. Download TCGA Data

Install GDC Data Transfer Tool (v2.3):

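# Option 1: Install from the bioconda channel
# (the PyPI package `gdctools` is a separate project and does not ship the gdc-client binary)
conda install -c bioconda gdc-client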

# Option 2: Download pre-compiled binary
wget https://gdc.cancer.gov/system/files/public/file/gdc-client_2.3_Ubuntu_x64-py3.8-ubuntu-20.04.zip
unzip gdc-client_2.3_Ubuntu_x64-py3.8-ubuntu-20.04.zip
export PATH=$PATH:$(pwd)

# Verify installation
gdc-client --version

Download WSI slides:

# 1. Create manifest file from GDC Portal (https://portal.gdc.cancer.gov/)
#    - Go to Repository → Cases
#    - Filter: Data Category = Biospecimen, Data Type = Slide Image
#    - Select cases and download manifest

# 2. Download slides using manifest
gdc-client download -m gdc_manifest_slides.txt -d ./tcga_slides/
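
Large slide downloads are often interrupted, so it can help to check which manifest entries actually arrived on disk. The sketch below is not part of PathLUPI. It assumes the usual GDC manifest layout (a tab-separated file with `id` and `filename` columns) and that `gdc-client` stores each file under `<download_dir>/<id>/<filename>`; the function name `find_missing_slides` is hypothetical.

```python
import csv
from pathlib import Path


def find_missing_slides(manifest_path, download_dir):
    """Return (id, filename) pairs listed in the manifest but absent on disk.

    Assumes a tab-separated manifest with `id` and `filename` columns and a
    `<download_dir>/<id>/<filename>` layout (both are assumptions).
    """
    download_dir = Path(download_dir)
    missing = []
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            expected = download_dir / row["id"] / row["filename"]
            if not expected.is_file():
                missing.append((row["id"], row["filename"]))
    return missing


# Example call (paths taken from the commands above):
# find_missing_slides("gdc_manifest_slides.txt", "./tcga_slides/")
```

An empty list means every manifest entry was found; otherwise the returned IDs indicate which entries still need to be downloaded.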

Download RNA-seq data from cBioPortal:

# RNA-seq data is already processed and available on cBioPortal
# Download from: https://www.cbioportal.org/datasets

# Example: Download TCGA breast cancer RNA-seq data
wget https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz
tar -xzf brca_tcga_pub2015.tar.gz

# The data_mrna_seq_v2_rsem.txt file contains processed gene expression data
# Format: Hugo_Symbol, Entrez_Gene_Id, patient_id1, patient_id2, ...

2. Extract WSI Features using CONCH

Install CONCH:

# Clone CONCH repository
git clone https://github.com/mahmoodlab/CONCH.git
cd CONCH

# Install dependencies
pip install -r requirements.txt
pip install -e .

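Extract patch features (the script names and flags below match the CLAM preprocessing pipeline, so run them from a checkout that contains these scripts):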

# Step 1: Segment and patch WSIs
python create_patches_fp.py \
    --source /path/to/tcga/slides \
    --save_dir /path/to/patches \
    --patch_size 512 \
    --step_size 512 \
    --preset tcga.csv \
    --seg \
    --patch \
    --stitch

# Step 2: Extract features using CONCH
python extract_features_fp.py \
    --data_h5_dir /path/to/patches \
    --data_slide_dir /path/to/tcga/slides \
    --csv_path /path/to/patches/process_list_autogen.csv \
    --feat_dir /path/to/features \
    --batch_size 512 \
    --slide_ext .svs

Expected output structure:

features/
├── TCGA-XX-XXXX-01Z-00-DX1.h5    # Patch features (N_patches x 512)
├── TCGA-YY-YYYY-01Z-00-DX1.h5
└── ...
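
To pair each feature file with a row of the expression matrix, the slide name has to be reduced to a patient ID. A minimal sketch, assuming file names start with a standard TCGA barcode whose first three dash-separated fields (e.g. `TCGA-XX-XXXX`) identify the patient. The helper names are hypothetical, and PathLUPI's own dataset loaders may handle this differently.

```python
from collections import defaultdict
from pathlib import Path


def patient_id_from_slide(filename):
    """Return the patient barcode (TCGA-XX-XXXX) for a slide or feature file."""
    stem = Path(filename).name.split(".")[0]  # drop extension and any UUID suffix
    fields = stem.split("-")
    if len(fields) < 3 or fields[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {filename}")
    return "-".join(fields[:3])


def group_features_by_patient(feature_dir):
    """Map patient ID -> sorted list of .h5 feature file names in feature_dir."""
    groups = defaultdict(list)
    for path in sorted(Path(feature_dir).glob("*.h5")):
        groups[patient_id_from_slide(path.name)].append(path.name)
    return dict(groups)
```

Grouping by patient matters because one patient can have several diagnostic slides, while the expression table has one row per patient.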

Expected gene expression format:

patient_id,A1BG,A1CF,A2M,...,ZZEF1
TCGA-XX-XXXX,5.234,3.123,7.456,...,2.891
TCGA-YY-YYYY,4.567,2.789,6.123,...,3.456
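
The cBioPortal file is gene-by-sample, while the format above is patient-by-gene, so a transpose is needed. The following is a sketch under stated assumptions, not the authors' preprocessing. It assumes the input is tab-separated with `Hugo_Symbol` and `Entrez_Gene_Id` as the two leading columns, truncates sample barcodes to the 12-character patient ID, keeps the first sample per patient and the first row per gene symbol, and applies no normalization or log transform (the README does not say which, if any, PathLUPI expects). The function name `rsem_to_patient_table` is hypothetical.

```python
import csv


def rsem_to_patient_table(rsem_path, out_csv):
    """Transpose a gene-by-sample RSEM table into a patient-by-gene CSV."""
    with open(rsem_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[2:]  # skip Hugo_Symbol and Entrez_Gene_Id
        genes, columns, seen_genes = [], [], set()
        for row in reader:
            symbol = row[0] if row else ""
            if not symbol or symbol in seen_genes:
                continue  # skip unnamed or duplicated gene symbols
            seen_genes.add(symbol)
            genes.append(symbol)
            columns.append(row[2:])

    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["patient_id"] + genes)
        seen_patients = set()
        for idx, sample in enumerate(samples):
            patient = sample[:12]  # TCGA-XX-XXXX
            if patient in seen_patients:
                continue  # keep only the first sample per patient
            seen_patients.add(patient)
            writer.writerow([patient] + [col[idx] for col in columns])


# Example call (file name from the cBioPortal archive described earlier):
# rsem_to_patient_table("data_mrna_seq_v2_rsem.txt", "gene_expression.csv")
```

Values are copied through as strings, so any transform or gene filtering can be applied afterwards once the expected scale is confirmed.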

3. Prepare Pathway Signatures

Please refer to this link.

Key Parameters

Parameter	Description	Default / example values
`--model`	Model architecture	PathLUPI
`--task`	Prediction task	BRCA_ERSub, CRC_BRAFSub, etc.
`--fold`	CV folds	0,1,2,3,4
`--region_num`	Number of pathways	50
`--lr`	Learning rate	2e-4
`--epochs`	Training epochs	30
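
For orientation, the table can be read as an `argparse` declaration. This is a hypothetical sketch that mirrors the listed defaults; the real `main.py` may declare these flags differently, and `parse_folds` and `build_parser` are names introduced here purely for illustration.

```python
import argparse


def parse_folds(value):
    """Turn a comma-separated string such as '0,1,2,3,4' into a list of ints."""
    return [int(item) for item in value.split(",") if item.strip()]


def build_parser():
    """Build a parser whose defaults follow the Key Parameters table (sketch)."""
    parser = argparse.ArgumentParser(description="PathLUPI-style training options (sketch)")
    parser.add_argument("--model", default="PathLUPI", help="model architecture")
    parser.add_argument("--task", default=None, help="prediction task, e.g. BRCA_ERSub")
    parser.add_argument("--fold", type=parse_folds, default=[0, 1, 2, 3, 4], help="CV folds")
    parser.add_argument("--region_num", type=int, default=50, help="number of pathways")
    parser.add_argument("--lr", type=float, default=2e-4, help="learning rate")
    parser.add_argument("--epochs", type=int, default=30, help="training epochs")
    return parser
```

With this sketch, `build_parser().parse_args(["--task", "BRCA_ERSub", "--fold", "0,2"])` would select folds 0 and 2 and leave the other options at the defaults shown in the table.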

Ethical Considerations

This study adhered to the Declaration of Helsinki and received ethical approval from the Human and Artifact Research Ethics Committee of The Hong Kong University of Science and Technology (HREP-2024-0423). All data used were anonymized and obtained through appropriate data use agreements.

License

This project is licensed under the CC-BY-NC-ND 4.0 License - see the LICENSE file for details.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 62202403), Innovation and Technology Commission (Project No. MHP/002/22 and ITCPD/17-9), Research Grants Council of the Hong Kong Special Administrative Region, China (Project No: T45-401/22-N) and National Key R&D Program of China (Project No. 2023YFE0204000).
