
feat: MarkdownHeaderSplitter #9660

Merged
sjrl merged 112 commits into deepset-ai:main from OGuggenbuehl:feature/md-header-splitter
Feb 11, 2026

Conversation

@OGuggenbuehl OGuggenbuehl commented Jul 29, 2025

Proposed Changes:

Implement MarkdownHeaderSplitter, a component that splits Markdown (.md) Documents based on their headers.
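
To illustrate the idea of header-based splitting, here is a minimal standalone sketch. It is not the code added in this PR: the function name `split_by_headers`, the dictionary layout, and the regex are assumptions made for illustration only. It recognizes ATX-style headers (`#` to `######`), emits one chunk per header that has content, and records the enclosing headers for each chunk.

```python
import re

# ATX-style headers only: 1-6 '#' characters, a space, then the header text.
# Hypothetical pattern for this sketch; it does not skip fenced code blocks.
HEADER_PATTERN = re.compile(r"^(#{1,6}) (.+)$", re.MULTILINE)


def split_by_headers(text: str) -> list[dict]:
    """Split Markdown text into chunks, one per header that has content."""
    matches = list(HEADER_PATTERN.finditer(text))
    if not matches:
        # No headers found: return the text as a single unsplit chunk.
        return [{"content": text, "header": None, "parent_headers": []}]

    chunks = []
    stack: list[tuple[int, str]] = []  # (level, title) of the currently open headers
    for i, match in enumerate(matches):
        level, title = len(match.group(1)), match.group(2).strip()
        # Close headers at the same or a deeper level; what remains are the parents.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parents = [parent_title for _, parent_title in stack]
        stack.append((level, title))

        # The chunk body runs from the end of this header to the next header.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        if body:  # headers without content of their own produce no chunk
            chunks.append({"content": body, "header": title, "parent_headers": parents})
    return chunks


sample = "# Intro\nHello\n## Details\nMore text\n# Other\nBye\n"
chunks = split_by_headers(sample)
```

In this sketch, any text before the first header is dropped and the header line itself is not kept in the chunk content; a real component would need to decide how to handle both.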

How did you test it?

unit tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

CLAassistant commented Jul 29, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the topic:tests and type:documentation labels Jul 29, 2025
@OGuggenbuehl OGuggenbuehl changed the title from "Feature/md header splitter" to "feat:MarkdownHeaderSplitter" Jul 29, 2025
@sjrl sjrl self-assigned this Aug 19, 2025
sjrl commented Aug 19, 2025

@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to further review I'd appreciate if you could add a set of tests like the ones we have for the DocumentSplitter https://github.com/deepset-ai/haystack/blob/main/test/components/preprocessors/test_document_splitter.py

This will help me be able to review the actual algorithm for splitting since it's easier to understand with examples.

@sjrl sjrl changed the title from "feat:MarkdownHeaderSplitter" to "feat: MarkdownHeaderSplitter" Aug 27, 2025
@OGuggenbuehl OGuggenbuehl force-pushed the feature/md-header-splitter branch from 61a8396 to bcbbf9a Compare September 16, 2025 13:57
sjrl commented Sep 18, 2025

Thanks for your continued work on this @OGuggenbuehl!

Some general comments. Could you:

  • Add a release note for this PR following the instructions here
  • Make sure to include our license header at the beginning of each file you've added. You can find an example of the license header here
  • Please make sure to sign the CLA agreement (docs about it here) from this comment
  • If you haven't already please also set up pre-commit hooks using pre-commit install. You can find more info about that in this section of our contribution guidelines.
  • Also in the future feel free to open branches directly in Haystack instead of using a fork. This makes it slightly easier to pull down your code to review locally.

coveralls commented Sep 19, 2025

Pull Request Test Coverage Report for Build 21869478600

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 160 unchanged lines in 16 files lost coverage.
  • Overall coverage increased (+0.2%) to 92.606%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/rankers/hugging_face_tei.py | 1 | 98.63% |
| components/preprocessors/document_cleaner.py | 2 | 98.32% |
| components/rankers/sentence_transformers_similarity.py | 2 | 97.53% |
| components/retrievers/multi_query_embedding_retriever.py | 3 | 93.75% |
| components/rankers/lost_in_the_middle.py | 4 | 88.37% |
| core/type_utils.py | 4 | 97.18% |
| components/retrievers/multi_query_text_retriever.py | 5 | 85.0% |
| core/pipeline/pipeline.py | 5 | 94.02% |
| components/rankers/sentence_transformers_diversity.py | 8 | 94.7% |
| components/rankers/transformers_similarity.py | 9 | 91.51% |

| Totals | |
|---|---|
| Change from base Build 21517269097 | 0.2% |
| Covered Lines | 15104 |
| Relevant Lines | 16310 |

💛 - Coveralls

@OGuggenbuehl OGuggenbuehl marked this pull request as ready for review September 19, 2025 16:05
@OGuggenbuehl OGuggenbuehl requested review from a team as code owners September 19, 2025 16:05
@OGuggenbuehl OGuggenbuehl removed the request for review from a team September 19, 2025 16:05
sjrl commented Feb 6, 2026

@OGuggenbuehl apologies I didn't mention this before but could you also update this pydoc file https://github.com/deepset-ai/haystack/blob/main/pydoc/preprocessors_api.yml to make sure your new component appears in our API docs?

@julian-risch julian-risch added this to the 2.24.0 milestone Feb 6, 2026
OGuggenbuehl and others added 11 commits February 9, 2026 13:29
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
vercel bot commented Feb 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| haystack-docs | Ready | Preview, Comment | Feb 11, 2026 9:17am |

@sjrl sjrl left a comment

Looks good! Thanks for all the work on this.

@sjrl sjrl merged commit 8fc71a8 into deepset-ai:main Feb 11, 2026
22 checks passed
julian-risch pushed a commit that referenced this pull request Feb 11, 2026
* implement md-header-splitter and add tests

* rework md-header splitter to rewrite md-header levels

* remove deprecated test

* Update haystack/components/preprocessors/markdown_header_splitter.py

use haystack logging

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* use native types

* move to haystack logging

* docstrings improvements

* Update haystack/components/preprocessors/markdown_header_splitter.py

remove temp toc

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* fix CustomDocumentSplitter arguments

* remove header prefix from content

* rework split_id assignment to avoid collisions

* remove unneeded dese methods

* cleanup

* cleanup

* add tests

cleanup

* move initialization of secondary-splitter out of run method

* move _custom_document_splitter to class method

* removed the _CustomDocumentSplitter class. splitting logic is now encapsulated within the MarkdownHeaderSplitter class as private methods.

* return to standard form-feed character and add tests for page break handling

* quit exposing splitting_function param since it shouldn't be changed anyway

* remove test section in module

* add license header

* add release note

* minor refactor for type safety

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* remove unneeded release notes entries

* improved documentation for methods

* improve method naming

* improved page-number assignment & added return in docstring

minor cleanup

* unified page-counting

* simplify conditional secondary-split initialization and usage

* fix linting error

* clearly specify the use of ATX-style headers (#) only

* reference doc_id when logging no headers found

* initialize md-header pattern as private variable once

* add example for inferring header levels to docstring

* improve empty document handling

add more logging for empty documents

* more explicit testing for inferred headers

* fix linting issue

* improved empty content handling test cases

* remove all functionality related to inferring md-header levels

* compile regex-pattern in init for performance gains

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* change all "none" to proper None values

* fix minor

* explicitly test doc content

* rename parentheaders to parent_headers

* test split_id, doc length

* check meta content

* remove unneeded test

* make split_id testing more robust

* more realistic overlap test sample

* assign split_id globally to all output docs

* test page numbers explicitly

* cleanup pagebreak test

* minor

* return doc unchunked if no headers have content

* add doc-id to logging statement for unsplit documents

* remove unneeded logs

* minor cleanup

* simplify page-number tracking method to not return count, just the updated page number

* add dev comment to mypy check for doc.content is None

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* remove split meta flattening

* keep empty meta return consistent

* remove unneeded content is none check

* update tests to reflect empty meta dict for unsplit docs

* clean up total_page counts

* remove unneeded meta check

* Update test/components/preprocessors/test_markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* implement keep_headers parameter

* remove meta-dict flattening

* add minor sanity checks

* Update test/components/preprocessors/test_markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* add warmup

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* fix splitting when keeping headers

* test cleanup to cover keep_headers=True

* add tests for keep_headers=False splitting

* remove strip()

* simplify doc handling

* fix split id assignment

* test cleanup

* test splits more explicitly

* cleanup tests

minor commenting

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* Update test/components/preprocessors/test_markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* Update releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* Update releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* Update releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/markdown_header_splitter.py

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>

* fix test now that meta is always kept regardless of headers

* update tests to consider headers always part of meta

* remove trailing whitespace removal

* remove redundant test

* make test_split_multiple_documents more explicit

* make tests more explicit

* remove header level inference from release notes

* improve splitting log message

* add split ids and more explicit string assertions

* add fixture & appropriate assertions for sample text with page breaks

* add md-header-splitter to preprocessors api

---------

Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
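
Several of the commits above concern page-number tracking and globally assigned split IDs. The following is a rough sketch of how such bookkeeping could work. It assumes page breaks are marked with the form-feed character (`\f`); the function name `assign_page_numbers` and the output layout are hypothetical and are not taken from the merged code.

```python
def assign_page_numbers(chunks: list[str], start_page: int = 1) -> list[dict]:
    """Attach a running split_id and the starting page number to each chunk."""
    docs = []
    page = start_page
    for split_id, content in enumerate(chunks):
        # The chunk is labelled with the page on which it starts.
        docs.append({"content": content, "split_id": split_id, "page_number": page})
        # Each form feed inside this chunk pushes the following text one page further.
        page += content.count("\f")
    return docs


docs = assign_page_numbers(["Page one text\fPage two start", "still page two", "more\f\fskipped a page"])
page_numbers = [d["page_number"] for d in docs]
```

Only the updated page number is carried forward between chunks, so a chunk that spans several pages is still labelled with the page where it begins.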
kacperlukawski pushed a commit that referenced this pull request Feb 12, 2026
5 participants