MedLink Bounty #728

Rian354 · 2025-12-08T20:23:29Z

PR for MedLink bounty

Tests:
To run the MedLink unit tests, from the project root run:

pytest tests/core/test_medlink.py (locally: 3 passed, 1 warning)

Model Implementation:

Implemented the MedLink retrieval model on top of the current BaseModel / dataset API.
Added unit tests with small synthetic data for MedLink.
Added a Jupyter notebook that trains and evaluates MedLink on the MIMIC-III demo dataset.

Additions to "pyhealth/models/medlink/model.py":

BaseModel-compatible "MedLink" class that takes a task-generated dataset (e.g., "SampleDataset" from "set_task") and "feature_keys".
Vocabulary construction from the underlying task dataset using "dataset.get_all_tokens(...)" for queries and documents.
Query and corpus encoders ("encode_queries", "encode_corpus") that produce sparse multi-hot representations.
BM25-style scoring in "compute_scores", compatible with the IR-format data produced by the MedLink utilities.
Combined retrieval and prediction loss in forward / get_loss, returning a scalar loss for training.
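For readers unfamiliar with the sparse multi-hot representation mentioned above, here is a minimal pure-Python sketch of building a vocabulary and encoding a record against it. The helper names (build_vocab, multi_hot), the "conditions" key, and the sample codes are hypothetical and chosen for illustration; they are not the functions in this PR, and the actual encoders presumably return tensors rather than lists.

```python
from typing import Dict, Iterable, List


def build_vocab(samples: Iterable[dict], key: str) -> Dict[str, int]:
    """Collect the unique tokens stored under `key` and give each an index."""
    tokens = sorted({tok for sample in samples for tok in sample[key]})
    return {tok: idx for idx, tok in enumerate(tokens)}


def multi_hot(codes: Iterable[str], vocab: Dict[str, int]) -> List[float]:
    """Return a vector with 1.0 at the index of every known code."""
    vec = [0.0] * len(vocab)
    for code in codes:
        idx = vocab.get(code)
        if idx is not None:  # codes outside the vocabulary are skipped
            vec[idx] = 1.0
    return vec


# Hypothetical toy samples; real samples come from the task dataset.
samples = [
    {"conditions": ["I10", "E11"]},
    {"conditions": ["E11", "J45"]},
]
vocab = build_vocab(samples, "conditions")
query_vec = multi_hot(["E11", "Z99"], vocab)
```

Sorting the tokens before indexing keeps the vocabulary deterministic across runs, which matters when queries and documents must share the same index space.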

Other changes:

Extended SampleDataset with get_all_tokens(key: str), which collects unique tokens across samples and is used by MedLink for vocabulary building.
Implemented BM25 and IR helpers in the pyhealth.models.medlink package:
- BM25Okapi
- convert_to_ir_format, tvt_split
- generate_candidates, filter_by_candidates
- get_bm25_hard_negatives, get_train_dataloader, get_eval_dataloader
Exported MedLink via pyhealth/models/__init__.py, so users can do: from pyhealth.models import MedLink
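As background for the BM25 helpers listed above, the following is a rough sketch of Okapi BM25 scoring over tokenized records. It is not the BM25Okapi class from this PR: the function name bm25_scores, the k1 and b defaults, and the non-negative IDF variant are assumptions made for illustration, and the last two lines show only one plausible reading of "hard negatives" (top-ranked candidates that are not the true match).

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document in `corpus` against `query` with Okapi BM25."""
    if not corpus:
        return []
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for tok in query:
            if tok not in tf:
                continue
            # Non-negative IDF variant (an assumption, not the PR's choice).
            idf = math.log((n_docs - doc_freq[tok] + 0.5)
                           / (doc_freq[tok] + 0.5) + 1.0)
            # Length normalization: longer-than-average docs are penalized.
            norm = tf[tok] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[tok] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus of diagnosis codes.
corpus = [["I10", "E11", "N18"], ["J45", "J30"], ["E11", "I10", "I10", "E78"]]
query = ["I10", "E11"]
scores = bm25_scores(query, corpus)
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

positive_idx = 0  # index of the true match, assumed known for training
hard_negatives = [i for i in ranked if i != positive_idx][:1]
```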

Added examples/medlink_mimic3.ipynb, a runnable notebook that:

Loads the MIMIC-III demo dataset via MIMIC3Dataset.

Defines a patient linkage task to generate query–candidate pairs.

Uses the MedLink helpers to build IR-format data and PyTorch dataloaders.

Trains and evaluates MedLink and reports ranking metrics.
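The description does not say which ranking metrics the notebook reports. As an illustration only, two metrics commonly used for retrieval with a single correct candidate per query are mean reciprocal rank (MRR) and hit rate at k. The function names and toy rankings below are hypothetical.

```python
from typing import Dict, List


def reciprocal_rank(ranked_ids: List[str], true_id: str) -> float:
    """Return 1 / rank of the true candidate, or 0.0 if it was not retrieved."""
    for pos, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == true_id:
            return 1.0 / pos
    return 0.0


def hit_at_k(ranked_ids: List[str], true_id: str, k: int) -> float:
    """Return 1.0 if the true candidate is among the top k, else 0.0."""
    return 1.0 if true_id in ranked_ids[:k] else 0.0


# Hypothetical model output: candidates per query, best first.
rankings: Dict[str, List[str]] = {
    "q1": ["d3", "d1", "d2"],
    "q2": ["d2", "d4", "d1"],
}
truth = {"q1": "d1", "q2": "d2"}

mrr = sum(reciprocal_rank(rankings[q], truth[q]) for q in truth) / len(truth)
hit_at_1 = sum(hit_at_k(rankings[q], truth[q], 1) for q in truth) / len(truth)
```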

Locally ran:

examples/medlink_mimic3.ipynb runs end-to-end on the MIMIC-III demo dataset.

The notebook includes a note on how to run the MedLink unit tests from project root.

Files to review:

pyhealth/datasets/sample_dataset.py – SampleDataset.get_all_tokens helper for vocabulary construction.

pyhealth/models/medlink/model.py – core MedLink model implementation.

pyhealth/models/medlink/bm25.py – BM25Okapi implementation used in the retrieval pipeline.

pyhealth/models/medlink/utils.py – IR-format conversion, TVT split, candidate generation, dataloaders.

pyhealth/models/init.py – export of MedLink.

tests/core/test_medlink.py – synthetic unit tests for MedLink (forward pass, encoders, score shapes).

examples/medlink_mimic3.ipynb – Jupyter notebook for training and evaluating MedLink on the MIMIC-III demo dataset.

jhnwu3

I'll probably add more comments once I have time to dig deeper into this, but this is a nice first attempt at what is actually a pretty hard bounty.

jhnwu3 · 2025-12-11T23:39:03Z

examples/medlink_mimic3.ipynb

Some quick thoughts:

Can we move the MedLink task into the pyhealth.tasks module too? It would also be really helpful to have detailed documentation of the query/document identifiers, and to link the task to the original paper's task of mapping records to a known master patient record.

It would also be nice to have it in docs/, since that would be useful for anyone working on record linkage problems.

jhnwu3 · 2025-12-11T23:46:39Z

pyhealth/models/medlink/model.py

Can we also see whether we can build new processors here to pass to the MedLink model?

Actually, I think the sequence processors should have built-in vocabularies here. But it would be nice to update the EmbeddingModel to better support things like pretrained GloVe vectors, or to just use randomly initialized embeddings for now. That way MedLink can be better integrated with the rest of PyHealth, and I think it would be a nice lesson in replicating the original implementation. (A lot of the techniques are pretty relevant to clinical predictive modeling, so I think it's a good learning exercise.)

Example of a PR working with the processors instead of the old PyHealth tokenizer approach: https://github.com/sunlabuiuc/PyHealth/pull/610/files

GloVe vectors from the original implementation: https://github.com/zzachw/MedLink
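To make the embedding suggestion concrete, here is a small sketch of building an embedding table that uses pretrained vectors where they exist and falls back to random initialization otherwise. The function init_embeddings, its parameters, and the toy vectors are all hypothetical; this is not the EmbeddingModel API, and it assumes vocabulary indices are contiguous starting at 0.

```python
import random
from typing import Dict, List


def init_embeddings(vocab: Dict[str, int],
                    pretrained: Dict[str, List[float]],
                    dim: int,
                    seed: int = 0) -> List[List[float]]:
    """Build one row per token: pretrained if available, else random."""
    rng = random.Random(seed)  # seeded so the fallback rows are reproducible
    table = []
    # Iterate in index order so row i belongs to the token with index i.
    for token, _ in sorted(vocab.items(), key=lambda kv: kv[1]):
        vec = pretrained.get(token)
        if vec is None or len(vec) != dim:
            # Token missing from the pretrained set (or wrong dimension).
            vec = [rng.gauss(0.0, 0.1) for _ in range(dim)]
        table.append(list(vec))
    return table


# Hypothetical vocabulary and pretrained vectors.
vocab = {"I10": 0, "E11": 1}
pretrained = {"I10": [0.1, 0.2, 0.3]}
table = init_embeddings(vocab, pretrained, dim=3)
```

In a PyTorch model, a table like this could be converted to a tensor and loaded with torch.nn.Embedding.from_pretrained, with freeze=False if the vectors should be fine-tuned.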

Rian354 added 5 commits December 8, 2025 03:08

medlink bounty implementation

4b2b80f

Merge branch 'master' of https://github.com/Rian354/PyHealth

8fe5675

Notebook Clean Up

b142938

Further notebook modification

8369ed3

Removed redecleration of methods

456b5bc

jhnwu3 requested changes Dec 11, 2025
