Conversation
|
@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to further review I'd appreciate if you could add a set of tests like the ones we have for the This will help me be able to review the actual algorithm for splitting since it's easier to understand with examples. |
61a8396 to
bcbbf9a
Compare
|
Thanks for your continued work on this @OGuggenbuehl! Some general comments. Could you:
|
Pull Request Test Coverage Report for Build 21869478600Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml
Outdated
Show resolved
Hide resolved
|
@OGuggenbuehl apologies I didn't mention this before but could you also update this pydoc file https://github.com/deepset-ai/haystack/blob/main/pydoc/preprocessors_api.yml to make sure your new component appears in our API docs? |
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
…enbuehl/haystack into feature/md-header-splitter
sjrl
left a comment
There was a problem hiding this comment.
Looks good! thanks for all the work on this
* implement md-header-splitter and add tests * rework md-header splitter to rewrite md-header levels * remove deprecated test * Update haystack/components/preprocessors/markdown_header_splitter.py use haystack logging Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * use native types * move to haystack logging * docstrings improvements * Update haystack/components/preprocessors/markdown_header_splitter.py remove temp toc Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * fix CustomDocumentSplitter arguments * remove header prefix from content * rework split_id assignment to avoid collisions * remove unneeded dese methods * cleanup * cleanup * add tests cleanup * move initialization of secondary-splitter out of run method * move _custom_document_splitter to class method * removed the _CustomDocumentSplitter class. splitting logic is now encapsulated within the MarkdownHeaderSplitter class as private methods. * return to standard feed-forward character and add tests for page break handling * quit exposing splitting_function param since it shouldn't be changed anyway * remove test section in module * add license header * add release note * minor refactor for type safety * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * remove unneeded release notes entries * improved documentation for methods * improve method naming * improved page-number assignment & added return in docstring minor cleanup * unified page-counting * simplify conditional secondary-split initialization and usage * fix linting error * clearly specify the use of ATX-style headers (#) only * reference doc_id when logging no headers found * initialize md-header pattern as private variable once * add example to for inferring header levels to docstring * improve empty document handling add more logging for empty documents * more explicit testing for inferred headers * fix linting issue * improved empty content handling test cases * remove all functionality related to inferring md-header levels * compile regex-pattern in init for performance gains * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * change all "none" to proper None values * fix minor * explicitly test doc content * rename parentheaders to parent_headers * test split_id, doc length * check meta content * remove unneeded test * make split_id testing more robust * more realistic overlap test sample * assign split_id globally to all output docs * taste page numbers explicitly * cleanup pagebreak test * minor * return doc unchunked if no headers have content * add doc-id to logging statement for unsplit documents * remove unneeded logs * minor cleanup * simplify page-number tracking method to not return count, just the updated page number * add dev comment to mypy check for doc.content is None * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * remove split meta flattening * keep empty meta return consistent * remove unneeded content is none check * update tests to reflect empty meta dict for unsplit docs * clean up total_page counts * remove unneeded meta check * Update test/components/preprocessors/test_markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * implement keep_headers parameter * remove meta-dict flattening * add minor sanity checks * Update test/components/preprocessors/test_markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * add warmup * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * fix splitting when keeping headers * test cleanup to cover keep_headers=True * add tests for keep_headers=False splitting * remove strip() * simplify doc handling * fix split id assignment * test cleanup * test splits more explicitly * cleanup tests minor commenting * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update test/components/preprocessors/test_markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * fix test now that meta is always kept regardless of headers * update tests to consider headers always part of meta * remove trailing whitespace removal * remove redundant test * make test_split_multiple_documents more explicit * make tests more explicit * remove header level inference from release notes * improve splitting log message * add split ids and more explicit string assertions * add fixture & appropriate assertions for sample text with page breaks * add md-header-splitter to preprocessors api --------- Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
* implement md-header-splitter and add tests * rework md-header splitter to rewrite md-header levels * remove deprecated test * Update haystack/components/preprocessors/markdown_header_splitter.py use haystack logging Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * use native types * move to haystack logging * docstrings improvements * Update haystack/components/preprocessors/markdown_header_splitter.py remove temp toc Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * fix CustomDocumentSplitter arguments * remove header prefix from content * rework split_id assignment to avoid collisions * remove unneeded dese methods * cleanup * cleanup * add tests cleanup * move initialization of secondary-splitter out of run method * move _custom_document_splitter to class method * removed the _CustomDocumentSplitter class. splitting logic is now encapsulated within the MarkdownHeaderSplitter class as private methods. * return to standard feed-forward character and add tests for page break handling * quit exposing splitting_function param since it shouldn't be changed anyway * remove test section in module * add license header * add release note * minor refactor for type safety * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * remove unneeded release notes entries * improved documentation for methods * improve method naming * improved page-number assignment & added return in docstring minor cleanup * unified page-counting * simplify conditional secondary-split initialization and usage * fix linting error * clearly specify the use of ATX-style headers (#) only * reference doc_id when logging no headers found * initialize md-header pattern as private variable once * add example to for inferring header levels to docstring * improve empty document handling add more logging for empty documents * more explicit testing for inferred headers * fix linting issue * improved empty content handling test cases * remove all functionality related to inferring md-header levels * compile regex-pattern in init for performance gains * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * change all "none" to proper None values * fix minor * explicitly test doc content * rename parentheaders to parent_headers * test split_id, doc length * check meta content * remove unneeded test * make split_id testing more robust * more realistic overlap test sample * assign split_id globally to all output docs * taste page numbers explicitly * cleanup pagebreak test * minor * return doc unchunked if no headers have content * add doc-id to logging statement for unsplit documents * remove unneeded logs * minor cleanup * simplify page-number tracking method to not return count, just the updated page number * add dev comment to mypy check for doc.content is None * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * remove split meta flattening * keep empty meta return consistent * remove unneeded content is none check * update tests to reflect empty meta dict for unsplit docs * clean up total_page counts * remove unneeded meta check * Update test/components/preprocessors/test_markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * implement keep_headers parameter * remove meta-dict flattening * add minor sanity checks * Update test/components/preprocessors/test_markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * add warmup * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * fix splitting when keeping headers * test cleanup to cover keep_headers=True * add tests for keep_headers=False splitting * remove strip() * simplify doc handling * fix split id assignment * test cleanup * test splits more explicitly * cleanup tests minor commenting * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update test/components/preprocessors/test_markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/markdown_header_splitter.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com> * fix test now that meta is always kept regardless of headers * update tests to consider headers always part of meta * remove trailing whitespace removal * remove redundant test * make test_split_multiple_documents more explicit * make tests more explicit * remove header level inference from release notes * improve splitting log message * add split ids and more explicit string assertions * add fixture & appropriate assertions for sample text with page breaks * add md-header-splitter to preprocessors api --------- Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
Proposed Changes:
Implement MarkdownHeaderSplitter to split Documents written in .md based on their headers
How did you test it?
unit tests
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes.