ProMeTEXT

Centre for PROcessing MEdieval TEXTs (ProMeText) — medieval corpora & alignment tools

ProMeTEXT — the Centre for PROcessing MEdieval TEXTs — develops corpora, methods, and tools for the segmentation and multilingual alignment of medieval texts.

Our work focuses primarily on medieval romance texts from the 13th to 16th centuries, with the aim of supporting research in textual transmission, multilingual corpora, computational humanities, and historical language processing.

Scope

ProMeTEXT provides resources for working with medieval texts in multiple languages, including:

  • phrase-level segmentation data
  • multilingual alignment tools
  • evaluation datasets
  • notebooks and reproducible workflows
  • resources for medieval textual transmission studies

Supported Languages

Current resources cover the following medieval languages:

  • Latin
  • French
  • Castilian
  • Portuguese
  • Catalan
  • English
  • Italian

We welcome collaboration on additional languages. Please feel free to contact us if you would like to contribute data, methods, or language-specific expertise.

Data

Gold-Standard Segmentation Data

We provide a gold-standard dataset for phrase-level segmentation of medieval corpora.
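As an illustration of how gold-standard segmentation data of this kind is commonly used, the sketch below scores a predicted segmentation against a reference by comparing segment boundaries. The `|` delimiter, the function names `boundaries` and `boundary_prf`, and the sample strings are assumptions made for this example only; they do not describe the released file format, so please check the dataset documentation before adapting it.

```python
from typing import Set, Tuple

# Hypothetical segment delimiter, chosen for this example only.
DELIM = "|"


def boundaries(segmented: str, delim: str = DELIM) -> Set[int]:
    """Return the token offsets at which a new segment begins."""
    positions: Set[int] = set()
    seen_tokens = 0
    for segment in segmented.split(delim):
        tokens = segment.split()
        if not tokens:
            continue  # ignore empty segments such as a trailing delimiter
        if seen_tokens > 0:
            positions.add(seen_tokens)
        seen_tokens += len(tokens)
    return positions


def boundary_prf(gold: str, pred: str, delim: str = DELIM) -> Tuple[float, float, float]:
    """Boundary-level precision, recall and F1 of `pred` against `gold`."""
    gold_b = boundaries(gold, delim)
    pred_b = boundaries(pred, delim)
    true_positives = len(gold_b & pred_b)
    precision = true_positives / len(pred_b) if pred_b else 0.0
    recall = true_positives / len(gold_b) if gold_b else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the same token sequence, segmented in two different ways.
gold = "in principio erat verbum | et verbum erat | apud deum"
pred = "in principio | erat verbum | et verbum erat apud deum"

print(boundaries(gold))          # {4, 7}
print(boundary_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Both strings must contain the same tokens in the same order for the boundary offsets to be comparable; only the placement of the delimiter differs.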

Dataset Citation

Please cite the dataset itself as follows when using the released data:

APA
Ing, L., Gille Levenson, M., & Macedo, C. (2025). Multilingual Segmentation Dataset for Historical Prose (13th–16th c.) (Version 1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.16992629

BibTeX

@dataset{ing2025multilingual,
  author       = {Ing, L. and Gille Levenson, M. and Macedo, C.},
  title        = {Multilingual Segmentation Dataset for Historical Prose (13th--16th c.)},
  year         = {2025},
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.16992629},
  url          = {https://doi.org/10.5281/zenodo.16992629},
  license      = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}

Related Publication

For the method, experiments, and scientific context, please cite the related publication:

APA
Ing, L., Gille Levenson, M., & Macedo, C. (2026). Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts. In Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) (pp. 936–946). European Language Resources Association (ELRA). https://doi.org/10.63317/32huzuuokpfr

BibTeX

@inproceedings{ing-etal-2026-phrase,
  title = {Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts},
  author = {Ing, Lucence and Gille Levenson, Matthias and Macedo, Carolina},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  month = {May},
  year = {2026},
  pages = {936--946},
  address = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association (ELRA)},
  editor = {Piperidis, Stelios and Bel, Núria and van den Heuvel, Henk and Ide, Nancy and Krek, Simon and Toral, Antonio},
  doi = {10.63317/32huzuuokpfr}
}

Gold-Standard Alignment Data

We provide gold-standard alignment data for parallel medieval and historical texts through the following repository:

Other parallel corpora and alignment datasets are currently under construction.

Tools

Aquilign

Aquilign is a tool for aligning multilingual texts at phrase level. It was designed for medieval textual traditions and supports multilingual alignment workflows across related witnesses and translations.
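To give a rough idea of what phrase-level alignment involves, the sketch below aligns two already segmented witnesses monotonically with dynamic programming. It is not Aquilign's implementation or API. The names `similarity`, `align`, and `GAP` are hypothetical, the sample segments are invented, and the character-overlap ratio from Python's `difflib` is only a simple stand-in for whatever similarity measure a real aligner uses.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Hypothetical penalty for leaving a segment without a counterpart.
GAP = -0.2


def similarity(a: str, b: str) -> float:
    """Stand-in similarity: character-overlap ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(src: List[str], tgt: List[str],
          gap: float = GAP) -> List[Tuple[Optional[int], Optional[int]]]:
    """Monotonic one-to-one alignment; None marks an unaligned segment."""
    n, m = len(src), len(tgt)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = score[i - 1][0] + gap, "skip_src"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = score[0][j - 1] + gap, "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + similarity(src[i - 1], tgt[j - 1]), "match"),
                (score[i - 1][j] + gap, "skip_src"),
                (score[i][j - 1] + gap, "skip_tgt"),
            ]
            # On ties, max() keeps the first option, so matches are preferred.
            score[i][j], back[i][j] = max(options, key=lambda option: option[0])

    # Follow the stored moves back from the bottom-right cell.
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


# Invented toy segments: the second witness lacks the middle segment.
witness_a = ["li rois", "dist", "a ses barons"]
witness_b = ["li roys", "a ses barons"]

print(align(witness_a, witness_b))  # [(0, 0), (1, None), (2, 1)]
```

Each pair holds the index of a segment in the first witness and the index of its counterpart in the second. The sketch only handles one-to-one links and omissions, whereas real witnesses and translations often also need one-to-many links.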

Please cite Aquilign as follows:

APA
Gille Levenson, M., Ing, L., & Camps, J.-B. (2024). Textual Transmission without Borders: Multiple Multilingual Alignment and Stemmatology of the “Lancelot en prose” (Medieval French, Castilian, Italian). In Proceedings of the Computational Humanities Research Conference 2024, CEUR Workshop Proceedings, 3834, 65–92. https://ceur-ws.org/Vol-3834/paper104.pdf

BibTeX

@inproceedings{gillelevensonTextualTransmissionBorders2024,
  title = {Textual Transmission without Borders: Multiple Multilingual Alignment and Stemmatology of the ``Lancelot en prose'' (Medieval French, Castilian, Italian)},
  shorttitle = {Textual Transmission without Borders},
  booktitle = {Proceedings of the Computational Humanities Research Conference 2024},
  author = {Gille Levenson, Matthias and Ing, Lucence and Camps, Jean-Baptiste},
  editor = {Haverals, Wouter and Koolen, Marijn and Thompson, Laure},
  year = {2024},
  series = {CEUR Workshop Proceedings},
  volume = {3834},
  pages = {65--92},
  publisher = {CEUR},
  address = {Aarhus},
  issn = {1613-0073},
  url = {https://ceur-ws.org/Vol-3834/paper104.pdf},
  langid = {english}
}

Apps

Aquilign Explorer

A demo app for Aquilign Explorer is available on Hugging Face Spaces:

Aquilign Explorer

It provides an early interface for testing and demonstrating multilingual medieval text alignment workflows.

Notebooks

Example notebooks and reproducible workflows are available here:

multilingual-medieval-aligner-notebooks

These notebooks can be used to test ProMeTEXT tools on sample data or on your own medieval textual corpora.

Citation

Please cite the relevant dataset, tool, or paper depending on the resource you use:

  • For the released segmentation dataset, cite Ing, Gille Levenson, and Macedo 2025.
  • For the phrase-level segmentation method and experiments, cite Ing, Gille Levenson, and Macedo 2026.
  • For Aquilign and multilingual alignment workflows, cite Gille Levenson, Ing, and Camps 2024.

Contact and Contributions

We welcome collaboration with researchers, developers, and institutions working on medieval texts, historical languages, digital philology, computational humanities, and multilingual corpora.

You can contribute by:

  • suggesting or adding support for additional medieval languages
  • contributing annotated corpora
  • improving segmentation or alignment workflows
  • testing the tools on new textual traditions
  • reporting issues in the relevant repositories

For questions, feedback, or collaboration proposals, please open an issue in the relevant repository or contact the ProMeTEXT team.

Pinned

  1. Aquilign (Public)

    AQUILIGN is a multilingual alignment and collation tool for medieval texts. It uses phrase-level segmentation and contextual alignment based on BERT models, with applications in historical linguist…

    Python

  2. multilingual-medieval-aligner-notebooks (Public)

    Forked from ProMeText/Aquilign

    A workshop to start using our multilingual medieval aligner (and automatic segmenter)

    Jupyter Notebook

  3. multilingual-segmentation-dataset (Public)

    Segmentation data used in multilingual alignment tasks across English, French, Spanish, and other languages. Includes raw and segmented text files for training and evaluation.

    Python
