Replies: 8 comments 13 replies
-
xref #726
-
I think there's a case for running less as part of a typical 'test' run, and ensuring good coverage within those core tests is critical. We have another issue to convert some e2e tests into integration/unit tests (sometimes splitting them and retaining the e2e element).

The examples are a slightly awkward fit, and in many ways I agree it doesn't make sense to rely on them to test the core. HOWEVER, I would also assert that we should aim to automate the examples even further so that we can validate that they don't degrade; they are also valuable 'e2e' tests (in some cases). There's also a discussion about using an LLM as a judge, or other semantic checks (which could apply to both tests and examples). This requires more scaffolding, and in some cases composition (for example, tutorials in the docs where multiple sections each contain part of the instructions) or interaction. 'Not crashing' is a starting point, but clearly insufficient.

I guess what I'm saying is: I'd be happy to start on the first of these.
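To make the LLM-as-a-judge / semantic-check idea above concrete, here is a minimal sketch. The `Judge` interface, `check_semantic` helper, and threshold are all made up for illustration; a real version would wrap an actual model call behind the judge callable.

```python
from typing import Callable

# A "judge" is any callable scoring how well `output` satisfies `criterion`,
# returning a float in [0, 1]. In practice it would wrap an LLM call; it is
# injected here so the check itself stays cheap to unit-test.
Judge = Callable[[str, str], float]

def check_semantic(output: str, criterion: str, judge: Judge,
                   threshold: float = 0.7) -> bool:
    """Pass when the judge scores the output at or above the threshold."""
    score = judge(output, criterion)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge returned out-of-range score {score}")
    return score >= threshold
```

In a test or example, `assert check_semantic(result, "mentions three colors", judge)` would then replace a bare "it didn't crash" run.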
-
Some data from the last round of testing I ran on LSF an hour or two ago, this time also looking at coverage. Zooming in on the coverage from that LSF run: backends/e2e are going to be harder, but if we concentrate on the core code we can write good mock tests and categorize.

Going a step further, trying to get high coverage even without e2e tests would be a good goal, since these can be run very efficiently (the risk is that the mocking is incorrect/insufficient, or conversely that we end up testing only internal implementation details).
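The "good mock tests" direction above might look like the following sketch. The `summarize` function and the backend's `generate` method are hypothetical, not mellea's real API; the point is that core logic runs without an LLM, and the stated risk is exactly that the mock drifts from real backend behavior.

```python
from unittest.mock import Mock

def summarize(text: str, backend) -> str:
    """Core logic under test: build a prompt, delegate generation, clean up."""
    prompt = f"Summarize in one sentence:\n{text}"
    reply = backend.generate(prompt)  # `generate` is an assumed backend method
    return reply.strip()

def test_summarize_with_mock_backend():
    backend = Mock()
    backend.generate.return_value = "  A short summary.  "
    # Assert on the observable contract, not on internal implementation details.
    assert summarize("some long text", backend) == "A short summary."
    (prompt,), _ = backend.generate.call_args
    assert prompt.startswith("Summarize in one sentence:")
```

Tests like this run in milliseconds, which is what makes the high-coverage-without-e2e goal cheap to pursue.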
-
To refocus: the question isn't really which examples to drop, it's whether test/ can stand on its own. Right now it can't in a few key areas; the coverage data above shows the gaps that matter. The granite formatters and safety/guardian are the highest priority: no backend needed, pure logic, highest ROI. Once those are covered properly in test/, the examples become what they should be: runnable documentation, not accidental test infrastructure. I can start on the granite formatter coverage, since that's the biggest hole by far, if we feel this is an important area.

Back to your original point: yes, we could save. Firstly, by not running the examples locally. Secondly, by potentially not (always) running the backend tests locally (which saves dev time and avoids memory issues) while still testing a lot of the code paths, especially in parsing.
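The "don't run examples locally" idea could be wired up with a standard pytest marker plus an opt-in flag. A sketch of a `conftest.py`, where the `example` marker name and the `RUN_EXAMPLES` environment variable are made-up conventions, not existing project settings:

```python
# conftest.py sketch: skip anything marked `example` unless explicitly opted in.
import os

RUN_EXAMPLES = os.environ.get("RUN_EXAMPLES") == "1"  # hypothetical opt-in flag

def should_skip(keywords: set[str]) -> bool:
    """True for example-marked items when the opt-in flag is absent."""
    return "example" in keywords and not RUN_EXAMPLES

def pytest_collection_modifyitems(config, items):
    import pytest  # imported lazily so the helper above has no pytest dependency
    marker = pytest.mark.skip(reason="examples skipped; set RUN_EXAMPLES=1 to run")
    for item in items:
        if should_skip(set(item.keywords)):
            item.add_marker(marker)
```

CI would export `RUN_EXAMPLES=1`, while local runs skip the expensive examples by default; the same pattern could gate the backend tests.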
-
Related issues:
The granite formatter coverage gap and the safety/guardian gap don't have dedicated issues yet; those would be new work items under #726.
-
See #813 and its children for some in-progress improvements to unit/integration (i.e. non-e2e) test coverage, which would be an enabler for this proposal.
-
2 points...
-
I think the general conclusion is that we don't need to remove the examples from pytest, as they are informational only.
-
Summary
We have 100 example files in `docs/examples/`, collected by pytest via `# pytest:` comment markers. After auditing every file, 74 have zero assertions: they run a real LLM backend, print the output, and validate nothing beyond "the code didn't crash."

However, a deeper investigation revealed that not all print-only examples are redundant. ~15 are the only test coverage for significant public APIs, and ~10 more provide the only real-backend integration testing for features that test/ covers only with mocks. The remaining ~47 are genuinely redundant with existing test/ coverage and safe to remove.
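The audit described above can be approximated with a short script. This is a sketch: the exact spelling of the `# pytest:` marker and the `docs/examples` layout are assumptions, and a plain `"assert"` substring check is only a rough proxy for real programmatic validation.

```python
from pathlib import Path

MARKER = "# pytest:"  # assumed spelling of the collection marker

def audit_examples(root: Path) -> dict[str, list[Path]]:
    """Split pytest-collected examples into asserting vs. print-only files."""
    report: dict[str, list[Path]] = {"asserting": [], "print_only": []}
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        if MARKER not in text:
            continue  # not collected by pytest, so out of scope
        bucket = "asserting" if "assert" in text else "print_only"
        report[bucket].append(path)
    return report
```

Calling `audit_examples(Path("docs/examples"))` and printing the bucket sizes reproduces the 74-vs-26 style split reported here.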
Proposal: remove the `# pytest:` markers from the 47 redundant print-only examples (they stay as runnable docs).

Why this matters
The 47 files safe to remove from pytest
These exercise code paths already covered by test/ with equivalent or better coverage:
- `instruct()` + requirements + sampling, tested across multiple test/ files

The ~25 files that must keep running (~15 irreplaceable + ~10 real-backend integration)
Irreplaceable API coverage (~15 files)
These are the only test coverage for significant public APIs:
- `plugins/session_scoped.py`: `start_session(plugins=[...])` code path
- `plugins/class_plugin.py`
- `plugins/tool_hooks.py`: `_call_tools()`, `uses_tool()`, TOOL_PRE/POST_INVOKE hooks
- `plugins/plugin_set_composition.py`: `start_session(plugins=[...])`
- `safety/guardian.py`: `GuardianCheck`/`GuardianRisk` classes
- `safety/repair_with_guardian.py`: `RepairTemplateStrategy`
- `safety/guardian_huggingface.py`
- `agents/react/react_using_mellea.py`: `mellea.stdlib.frameworks.react` module
- `sessions/creating_a_new_type_of_session.py`
- `tools/interpreter_example.py`: `code_interpreter`, `local_code_interpreter`, `tool_arg_validator`
- `intrinsics/guardian_core.py`: `guardian_check` with `social_bias`, `answer_relevance`, custom criteria
- `instruct_validate_repair/101_email_comparison.py`: `grounding_context=` parameter
- `instruct_validate_repair/multiturn_strategy_example.py`
- `mini_researcher/researcher.py`
- `sofai/sofai_graph_coloring.py`

Follow-up work: these should eventually be migrated into proper test/ files with assertions, so the examples can become pure documentation.
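The migration described in the follow-up is mostly mechanical: wherever an example prints a result for a human to eyeball, the test version asserts a property of it. A minimal sketch of the pattern, where `classify_risk` is a keyword-heuristic stand-in and not mellea's actual guardian API:

```python
def classify_risk(text: str) -> str:
    """Stand-in for a real guardian/safety call; keyword heuristic only."""
    return "risky" if "attack" in text.lower() else "safe"

# Example style: runs and prints, validated only by a human reading the output.
def example_style() -> None:
    print(classify_risk("plan an attack"))

# Test style: the same call, but with programmatic checks pytest can enforce.
def test_classify_risk() -> None:
    assert classify_risk("plan an attack") == "risky"
    assert classify_risk("bake a cake") == "safe"
```

Once the asserting version lives in test/, the example file can keep the print-style body and drop its `# pytest:` marker.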
Real-backend integration tests (~10 files)
The entire `test/plugins/` suite uses mock backends. These examples are the only place where plugins run against a real LLM:

- `plugins/execution_modes.py`
- `plugins/payload_modification.py`
- `plugins/plugin_scoped.py`
- `plugins/quickstart.py`
- `plugins/standalone_hooks.py`
- `tutorial/compositionality_with_generative_slots.py` (slot chaining; test/ only tests individual slots)
- `generative_slots/inter_module_composition/summarize_and_decide.py` (multi-module composition)
- `telemetry/metrics_example.py` (end-to-end metrics with a real session)
- `telemetry/telemetry_example.py` (end-to-end tracing with a real session)

Git history confirms this coverage matters: multiple `fix:` commits were triggered by example breakage (aLora, RAG intrinsics, tools, MultiTurnStrategy).

Examples with real assertions (26 files, no change needed)
These already have programmatic validation and should continue running as-is:
`plugins/testing_plugins.py`, `plugins/standalone_hooks.py`, `plugins/tool_hooks.py`, `mify/mify.py`, `melp/simple_example.py`, `melp/lazy_fib_sample.py`, `mobject/table.py`, `tutorial/table_mobject.py`, `tutorial/document_mobject.py`, `sofai/sofai_graph_coloring.py`, `library_interop/langchain_messages.py`, `information_extraction/advanced_with_m_instruct.py`, `generative_slots/investment_advice.py`, `generative_slots/generative_slots_with_requirements.py`, `tools/interpreter_example.py`, `tools/smolagents_example.py`, `async/async-with-lazy-compute.py`, `telemetry/metrics_example.py`, `qiskit_code_validation/*` (2 files), `aLora/make_training_data.py`

Cleanup in test/ itself

- `test/stdlib/components/test_hello_world.py`: `print()`-only
- `test/backends/test_tool_calls.py` (`test_ollama.py`, `test_litellm_ollama.py`, `test_openai_ollama.py`, `test_openai_vllm.py`)

Impact