Replies: 8 comments 13 replies
-
xref #726
-
I think there's a case for running less as part of a typical 'test' run, and ensuring good coverage within those core tests is critical. We have another issue to convert some e2e tests into integration/unit tests (sometimes splitting them and retaining the e2e element).

The examples are a slightly awkward fit, and in many ways I agree it doesn't make sense to rely on them to test the core. HOWEVER, I would also assert that we should aim to automate the examples even further so that we can validate that they don't degrade; they are also valuable 'e2e' tests (in some cases). There's also a discussion about using an LLM as a judge, or other semantic checks (which could apply to both tests and examples). This requires more scaffolding, and in some cases composition (for example, tutorials in the docs where multiple sections each contain part of the instructions) or interaction. 'Not crashing' is a starting point, but clearly insufficient.

I guess what I'm saying is: I'd be happy to start on the first of these.
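To make the LLM-as-a-judge / semantic-check idea above concrete, here is a minimal sketch. The `Judge` interface, `check_semantic` helper, and threshold are all made up for illustration; a real version would wrap an actual model call behind the judge callable.

```python
from typing import Callable

# A "judge" is any callable scoring how well `output` satisfies `criterion`,
# returning a float in [0, 1]. In practice it would wrap an LLM call; it is
# injected here so the check itself stays cheap to unit-test.
Judge = Callable[[str, str], float]

def check_semantic(output: str, criterion: str, judge: Judge,
                   threshold: float = 0.7) -> bool:
    """Pass when the judge scores the output at or above the threshold."""
    score = judge(output, criterion)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge returned out-of-range score {score}")
    return score >= threshold
```

In a test or example, `assert check_semantic(result, "mentions three colors", judge)` would then replace a bare "it didn't crash" run.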
-
Some data from the last round of testing I ran on LSF an hour or two ago, this time also looking at coverage. Zooming in on the coverage from that LSF run: backends/e2e are going to be harder, but if we concentrate on the core code we can write good mock tests and categorize.

Going a step further, trying to get high coverage even without e2e tests would be a good goal, since these can be run very efficiently (the risk is that the mocking is incorrect/insufficient, or conversely that we end up testing only internal implementation details).
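The "good mock tests" direction above might look like the following sketch. The `summarize` function and the backend's `generate` method are hypothetical, not mellea's real API; the point is that core logic runs without an LLM, and the stated risk is exactly that the mock drifts from real backend behavior.

```python
from unittest.mock import Mock

def summarize(text: str, backend) -> str:
    """Core logic under test: build a prompt, delegate generation, clean up."""
    prompt = f"Summarize in one sentence:\n{text}"
    reply = backend.generate(prompt)  # `generate` is an assumed backend method
    return reply.strip()

def test_summarize_with_mock_backend():
    backend = Mock()
    backend.generate.return_value = "  A short summary.  "
    # Assert on the observable contract, not on internal implementation details.
    assert summarize("some long text", backend) == "A short summary."
    (prompt,), _ = backend.generate.call_args
    assert prompt.startswith("Summarize in one sentence:")
```

Tests like this run in milliseconds, which is what makes the high-coverage-without-e2e goal cheap to pursue.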
-
To refocus: the question isn't really which examples to drop, it's whether test/ can stand on its own. Right now it can't in a few key areas; the coverage data above shows the gaps that matter. The granite formatters and safety/guardian are the highest priority: no backend needed, pure logic, highest ROI. Once those are covered properly in test/, the examples become what they should be: runnable documentation, not accidental test infrastructure. I can start on the granite formatter coverage, since that's the biggest hole by far, if we feel this is an important area.

Back to your original point: yes, we could save. Firstly, by not running the examples locally. Secondly, by potentially not (always) running the backend tests locally (which saves dev time and avoids memory issues) while still testing a lot of the code paths, especially in parsing.
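The "don't run examples locally" idea could be wired up with a standard pytest marker plus an opt-in flag. A sketch of a `conftest.py`, where the `example` marker name and the `RUN_EXAMPLES` environment variable are made-up conventions, not existing project settings:

```python
# conftest.py sketch: skip anything marked `example` unless explicitly opted in.
import os

RUN_EXAMPLES = os.environ.get("RUN_EXAMPLES") == "1"  # hypothetical opt-in flag

def should_skip(keywords: set[str]) -> bool:
    """True for example-marked items when the opt-in flag is absent."""
    return "example" in keywords and not RUN_EXAMPLES

def pytest_collection_modifyitems(config, items):
    import pytest  # imported lazily so the helper above has no pytest dependency
    marker = pytest.mark.skip(reason="examples skipped; set RUN_EXAMPLES=1 to run")
    for item in items:
        if should_skip(set(item.keywords)):
            item.add_marker(marker)
```

CI would export `RUN_EXAMPLES=1`, while local runs skip the expensive examples by default; the same pattern could gate the backend tests.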
-
Related issues:
The granite formatter coverage gap and the safety/guardian gap don't have dedicated issues yet; those would be new work items under #726.
-
See #813 and its children for some in-progress improvements to unit/integration (i.e. non-e2e) test coverage, which would be an enabler for this proposal.
-
2 points...
-
I think the general conclusion is that we don't need to remove the examples from pytest, as they are informational only.
-
Summary
We have 100 example files in `docs/examples/`, collected by pytest via `# pytest:` comment markers. After auditing every file, 74 have zero assertions: they run a real LLM backend, print the output, and validate nothing beyond "the code didn't crash."

However, a deeper investigation revealed that not all print-only examples are redundant. ~15 are the only test coverage for significant public APIs, and ~10 more provide the only real-backend integration testing for features that test/ covers only with mocks. The remaining ~47 are genuinely redundant with existing test/ coverage and safe to remove.
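The audit described above can be approximated with a short script. This is a sketch: the exact spelling of the `# pytest:` marker and the `docs/examples` layout are assumptions, and a plain `"assert"` substring check is only a rough proxy for real programmatic validation.

```python
from pathlib import Path

MARKER = "# pytest:"  # assumed spelling of the collection marker

def audit_examples(root: Path) -> dict[str, list[Path]]:
    """Split pytest-collected examples into asserting vs. print-only files."""
    report: dict[str, list[Path]] = {"asserting": [], "print_only": []}
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        if MARKER not in text:
            continue  # not collected by pytest, so out of scope
        bucket = "asserting" if "assert" in text else "print_only"
        report[bucket].append(path)
    return report
```

Calling `audit_examples(Path("docs/examples"))` and printing the bucket sizes reproduces the 74-vs-26 style split reported here.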
Proposal: remove the `# pytest:` markers from the 47 redundant print-only examples (they stay as runnable docs).

Why this matters
The 47 files safe to remove from pytest
These exercise code paths already covered by test/ with equivalent or better coverage:
- `instruct()` + requirements + sampling, tested across multiple test/ files

The ~25 files that must keep running (~15 irreplaceable + ~10 real-backend integration)
Irreplaceable API coverage (~15 files)
These are the only test coverage for significant public APIs:
- `plugins/session_scoped.py`: `start_session(plugins=[...])` code path
- `plugins/class_plugin.py`
- `plugins/tool_hooks.py`: `_call_tools()`, `uses_tool()`, TOOL_PRE/POST_INVOKE hooks
- `plugins/plugin_set_composition.py`: `start_session(plugins=[...])`
- `safety/guardian.py`: `GuardianCheck`/`GuardianRisk` classes
- `safety/repair_with_guardian.py`: `RepairTemplateStrategy`
- `safety/guardian_huggingface.py`
- `agents/react/react_using_mellea.py`: `mellea.stdlib.frameworks.react` module
- `sessions/creating_a_new_type_of_session.py`
- `tools/interpreter_example.py`: `code_interpreter`, `local_code_interpreter`, `tool_arg_validator`
- `intrinsics/guardian_core.py`: `guardian_check` with `social_bias`, `answer_relevance`, custom criteria
- `instruct_validate_repair/101_email_comparison.py`: `grounding_context=` parameter
- `instruct_validate_repair/multiturn_strategy_example.py`
- `mini_researcher/researcher.py`
- `sofai/sofai_graph_coloring.py`

Follow-up work: these should eventually be migrated into proper test/ files with assertions, so the examples can become pure documentation.
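The migration described in the follow-up is mostly mechanical: wherever an example prints a result for a human to eyeball, the test version asserts a property of it. A minimal sketch of the pattern, where `classify_risk` is a keyword-heuristic stand-in and not mellea's actual guardian API:

```python
def classify_risk(text: str) -> str:
    """Stand-in for a real guardian/safety call; keyword heuristic only."""
    return "risky" if "attack" in text.lower() else "safe"

# Example style: runs and prints, validated only by a human reading the output.
def example_style() -> None:
    print(classify_risk("plan an attack"))

# Test style: the same call, but with programmatic checks pytest can enforce.
def test_classify_risk() -> None:
    assert classify_risk("plan an attack") == "risky"
    assert classify_risk("bake a cake") == "safe"
```

Once the asserting version lives in test/, the example file can keep the print-style body and drop its `# pytest:` marker.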
Real-backend integration tests (~10 files)
The entire `test/plugins/` suite uses mock backends. These examples are the only place where plugins run against a real LLM:

- `plugins/execution_modes.py`
- `plugins/payload_modification.py`
- `plugins/plugin_scoped.py`
- `plugins/quickstart.py`
- `plugins/standalone_hooks.py`
- `tutorial/compositionality_with_generative_slots.py` (slot chaining; test/ only tests individual slots)
- `generative_slots/inter_module_composition/summarize_and_decide.py` (multi-module composition)
- `telemetry/metrics_example.py` (end-to-end metrics with a real session)
- `telemetry/telemetry_example.py` (end-to-end tracing with a real session)

Git history confirms this coverage matters: multiple `fix:` commits were triggered by example breakage (aLora, RAG intrinsics, tools, MultiTurnStrategy).

Examples with real assertions (26 files, no change needed)
These already have programmatic validation and should continue running as-is:
`plugins/testing_plugins.py`, `plugins/standalone_hooks.py`, `plugins/tool_hooks.py`, `mify/mify.py`, `melp/simple_example.py`, `melp/lazy_fib_sample.py`, `mobject/table.py`, `tutorial/table_mobject.py`, `tutorial/document_mobject.py`, `sofai/sofai_graph_coloring.py`, `library_interop/langchain_messages.py`, `information_extraction/advanced_with_m_instruct.py`, `generative_slots/investment_advice.py`, `generative_slots/generative_slots_with_requirements.py`, `tools/interpreter_example.py`, `tools/smolagents_example.py`, `async/async-with-lazy-compute.py`, `telemetry/metrics_example.py`, `qiskit_code_validation/*` (2 files), `aLora/make_training_data.py`

Cleanup in test/ itself

- `test/stdlib/components/test_hello_world.py`: `print()`-only
- `test/backends/test_tool_calls.py` (`test_ollama.py`, `test_litellm_ollama.py`, `test_openai_ollama.py`, `test_openai_vllm.py`)

Impact