feat(xsd): add strict XSD validation with comprehensive schema resolution #27

Open
AlexanderWillner wants to merge 20 commits into jonwiggins:main from AlexanderWillner:main

@AlexanderWillner

Summary

Adds validate_xsd_strict — a strict-mode XSD validation API that reports all unknown elements/attributes as errors (unlike the existing lax validate_xsd which silently ignores them).

Motivation

The existing lax validator silently accepts unknown elements and attributes, making it unsuitable for catching real schema violations in production data. Strict mode is needed for validating NAS (AAA) XML files against the full AdV schema set (~200 schemas, ~1800 types).

Changes

New public API

  • validate_xsd_strict(schema, xml) -> ValidationResult in validation::xsd
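The difference between lax and strict handling of unknown elements can be illustrated with a toy model. This is a minimal sketch and not the crate's API: `validate_children`, its parameters, and the error text are hypothetical, and the "schema" is reduced to a flat list of declared child names.

```rust
use std::collections::HashSet;

/// Hypothetical toy validator. `declared` stands in for the child element
/// names a schema knows about, `children` for the names found in the instance.
fn validate_children(declared: &[&str], children: &[&str], strict: bool) -> Vec<String> {
    let known: HashSet<&str> = declared.iter().copied().collect();
    let mut errors = Vec::new();
    for name in children {
        // Lax mode silently accepts unknown names; strict mode reports each one.
        if strict && !known.contains(name) {
            errors.push(format!("unexpected element <{}>", name));
        }
    }
    errors
}
```

With `declared = ["position", "lage"]` and `children = ["position", "unbekannt"]`, the lax call returns no errors while the strict call returns one error for `unbekannt`.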

Strict validation functions

  • validate_element_strict — type-resolves elements, reports unresolvable types
  • validate_attributes_strict — reports unknown/undeclared attributes
  • validate_complex_element_strict — validates content models with error reporting
  • validate_sequence_strict — returns consumed count, supports report_unexpected flag
  • validate_group_content_strict — validates named model group content
  • validate_any_wildcard_strict — validates xsd:any with processContents=strict/lax

Schema resolution improvements (benefit both lax and strict)

  • Named model group inlining: Two-pass parsing resolves <group ref="..."> by inlining referenced content as XsdParticle::Group
  • Substitution group member resolution: Members validated against their own type declaration (not the abstract head's empty type)
  • Compositor minOccurs propagation: Sequence/choice-level minOccurs=0 propagated to child particles
  • attributeGroup ref resolution: Iterative transitive expansion with __attr_group__ sentinel
  • Attribute ref declarations: Handles ref="prefix:localName" (e.g. xlink:href)
  • Base attribute merging: Extension base types' attributes merged into derived types
  • Cross-include prefix merging: xmlns declarations from included schemas merged into root prefix_map
  • Imported prefix_map: ImportedSchema stores its own prefix_map for QName resolution
  • SimpleContent attribute inheritance: Walks extension chains (e.g. gml:AreaType → gml:MeasureType → uom)
  • Imported schema fallback: Unprefixed refs and unknown-prefix types scan imported schemas
  • Choice Group alternatives: Choice validation handles XsdParticle::Group alternatives
  • Group repetition: Unbounded groups loop while children are consumed

Test results

  • 1075 existing tests pass (no regressions)
  • Validated against 48 real-world NAS test files across 13 German federal states:
    • 30 pass with 0 errors
    • 10 fail with genuine source data defects (old NAS versions, GID6 files)
    • 4 fail with 0-6 errors in synthetic test constructs
  • WFS Transaction files went from 91,000+ false-positive errors → 0 errors

AlexanderWillner and others added 20 commits May 26, 2026 23:29

Add XSD 1.0 section 3.3.6 substitution group support to the XSD validator.

When element B declares substitutionGroup='A', B can appear anywhere A is
expected in a content model. This is transitive: if C substitutes for B,
C also substitutes for A.

Changes:
- Add substitution_group and is_abstract fields to XsdElement
- Add substitution_groups index to XsdSchema (head -> members map)
- Parse substitutionGroup/abstract attributes in parse_element_decl
- Build substitution index after schema parse via build_substitution_index
- Extend element_matches_decl to accept substitution group members
- Add is_substitution_member for transitive chain resolution
- Resolve instance element type in validate_sequence_element for correct
  content validation of substituted elements
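The transitive lookup described in this commit can be sketched as a walk up the chain of `substitutionGroup` heads. This is an illustrative model only: the flat `head_of` map, the function signature, and the cycle guard are assumptions, and namespaces are ignored.

```rust
use std::collections::{HashMap, HashSet};

/// `head_of` maps an element name to the head named in its
/// `substitutionGroup` attribute (hypothetical flat representation).
/// Returns true if `candidate` may substitute for `head`, directly or
/// through a chain of intermediate heads.
fn is_substitution_member(head_of: &HashMap<&str, &str>, candidate: &str, head: &str) -> bool {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut current = candidate;
    while let Some(&next) = head_of.get(current) {
        if next == head {
            return true;
        }
        // Guard against malformed schemas with cyclic substitution groups.
        if !seen.insert(next) {
            return false;
        }
        current = next;
    }
    false
}
```

If `B` substitutes for `A` and `C` substitutes for `B`, then both `B` and `C` are accepted where `A` is expected, but `A` is not accepted where `C` is expected.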
Parse <xs:complexContent><xs:extension base='...'> in complex type
definitions. After all schemas are loaded, merge base-type content
model particles with extension particles in derivation order.

Post-processing step merge_extension_bases() resolves the full
inheritance chain recursively (with cycle detection) and prepends
base-type particles to the derived type's sequence.

Adds parse_complex_content() handler, extension_base field on
ComplexType, resolve_base_particles_impl() with visited-set guard,
and 3 unit tests covering simple extension, multi-level chains,
and empty-base extension.
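A rough sketch of the merge step, assuming a simplified type table: base particles are resolved recursively and prepended, and a visited set stops cyclic derivations. `ComplexTypeSketch` and `merged_particles` are hypothetical names, not the crate's `ComplexType` or `resolve_base_particles_impl`.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a parsed complex type: an optional extension
/// base and the particles declared directly on the type.
struct ComplexTypeSketch {
    extension_base: Option<&'static str>,
    particles: Vec<&'static str>,
}

/// Returns the particles of `name` with all inherited particles prepended
/// in derivation order (base first).
fn merged_particles(
    types: &HashMap<&'static str, ComplexTypeSketch>,
    name: &str,
    visited: &mut HashSet<String>,
) -> Vec<&'static str> {
    // A type seen twice means a cyclic extension chain: stop descending.
    if !visited.insert(name.to_string()) {
        return Vec::new();
    }
    let ty = match types.get(name) {
        Some(ty) => ty,
        None => return Vec::new(), // unknown base: nothing to inherit
    };
    let mut merged = match ty.extension_base {
        Some(base) => merged_particles(types, base, visited),
        None => Vec::new(),
    };
    merged.extend(ty.particles.iter().copied());
    merged
}
```

For a chain `Leaf` extends `Mid` extends `Base`, the result lists the particles of `Base`, then `Mid`, then `Leaf`.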
When a schema uses targetNamespace and elementFormDefault='qualified',
type references like adv:DerivedType now correctly resolve to local
types instead of only searching imported namespaces.

Adds targetNamespace self-check in resolve_type_name and
resolve_element_ref, plus a last-resort local-name fallback in
resolve_type_name. Also adds find_complex_type helper that searches
both local and imported types for base particle resolution.

New tests: complex content extension with targetNamespace,
optional element ordering detection.

Three bugs prevented substitution group members declared in imported
schemas from being recognized during XSD validation:

1. build_substitution_index() only scanned local schema.elements,
   missing imported elements that declare substitutionGroup membership.
   Fix: also iterate imported_namespaces.*.elements.

2. element_matches_decl() rejected same-named elements from different
   namespaces without checking substitution group membership.
   Fix: when namespace differs but local name matches, fall back to
   is_substitution_member() check.

3. is_substitution_member() only looked up transitive member
   declarations in local schema.elements.
   Fix: also search imported_namespaces.*.elements for member decls.

Fixes: FeatureCollection substitution group, AbstractCRS abstract element.

element_matches_decl() now resolves the namespace of element
declarations referenced via ref= attributes (e.g. ref="wfs:FeatureCollection")
instead of always checking against the main schema's targetNamespace.

This fixes validation of documents where imported elements have
different namespaces than the main schema, such as WFS FeatureCollection
in NAS/AAA schemas.

Also:
- Allow unqualified child elements for element_ref declarations
- build_substitution_index scans imported elements
- is_substitution_member looks up transitive members in imports

Verifies that FeatureCollection substitution group is correctly
resolved when validating NAS/AAA files. Known remaining limitations
documented: AbstractCRS via xlink:href, boundedBy in FeatureCollection.

validate_sequence() now detects when elements appear in wrong order
within a sequence. When a child doesn't match the current particle,
checks if it matches a later particle. If not, reports an ordering
error instead of silently skipping.

This catches cases like hatDirektUnten appearing before optional
extension properties (bauwerksfunktion, ergebnisDerUeberpruefung,
qualitaetsangaben) in AAA/NAS schemas.

Also removes debug eprintln from element_matches_decl.

merge_extension_bases() now also processes complexContent extension
chains in imported namespaces, not just the main schema. This fixes
FeatureCollectionType (WFS) which extends SimpleFeatureCollectionType
to include boundedBy + member particles.

Also adds sequence order validation that detects misplaced elements
within xs:sequence (e.g. hatDirektUnten before optional extension
properties).

Removes debug eprintln statements.

Adds XsdParticle::Any variant with namespace constraints (##any,
##other, explicit list) and processContents modes (strict/lax/skip).

- parse_any_wildcard() parses <xsd:any> declarations
- validate_any_wildcard() consumes matching child elements
- Choice validation accepts wildcard as valid alternative
- matches_later_particle() treats Any as always matching

This unblocks validation of NAS features inside <wfs:member> which
uses <xsd:any processContents="lax" namespace="##other"/>.
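The namespace constraints named above can be modelled as a small enum. This is a simplified sketch under XSD 1.0 semantics, where `##other` matches any namespace except the schema's target namespace and does not match unqualified elements. The type and function names are hypothetical, and the special list tokens `##targetNamespace` and `##local` are left out.

```rust
/// Hypothetical model of the `namespace` attribute on `<xsd:any>`.
enum NamespaceConstraint {
    /// `##any`
    Any,
    /// `##other`
    Other,
    /// Explicit list of namespace URIs.
    List(Vec<String>),
}

/// Decides whether an element in `element_ns` (None = no namespace) is
/// admitted by the wildcard of a schema whose target namespace is `target_ns`.
fn wildcard_allows(
    constraint: &NamespaceConstraint,
    target_ns: &str,
    element_ns: Option<&str>,
) -> bool {
    match constraint {
        NamespaceConstraint::Any => true,
        NamespaceConstraint::Other => match element_ns {
            Some(ns) => ns != target_ns,
            None => false,
        },
        NamespaceConstraint::List(allowed) => match element_ns {
            Some(ns) => allowed.iter().any(|a| a.as_str() == ns),
            None => false,
        },
    }
}
```

The `processContents` mode is orthogonal to this check: it decides what happens to an element after the wildcard has admitted it.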
Expose XsdSchema, XsdElement, ComplexType, ImportedSchema fields
as pub so downstream consumers can query element ordering.

Add get_type_element_order() to retrieve the ordered list of element
names from a complex type's merged sequence (including extension base
inheritance). This enables XSD-based serialization ordering.

validate_xsd now searches imported_namespaces for root element
declarations when not found in the main schema elements map.

This fixes validation of documents whose root element is declared
in an imported schema (e.g., AX_Bestandsdatenauszug in
NAS-Operationen.xsd imported by AAA-Basisschema.xsd).

Also adds test_root_element_from_imported_schema covering both
correct root lookup and element ordering validation against the
full AAA schema chain.

Strict mode reports:
- Unknown/undeclared attributes as errors
- Elements with unresolvable type declarations
- xsd:any processContents=strict actually validates elements
- All remaining unconsumed children after sequence matching

Public API: xmloxide::validation::xsd::validate_xsd_strict

When a sequence/choice/all compositor has minOccurs=0, all direct
element children become effectively optional. Previously, compositor-
level minOccurs was ignored, causing false positives like requiring
AbstractCRS inside gml:CRSPropertyType even though the wrapping
sequence is minOccurs=0.

Also adds tests for the propagation and a GML-style property type
scenario with substitution groups.
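The propagation rule amounts to one pass over the compositor's direct children. The sketch below uses hypothetical `Compositor` and `ElementParticle` types and only handles direct element children, as the commit message describes.

```rust
/// Hypothetical element particle inside a compositor.
#[derive(Debug, Clone, PartialEq)]
struct ElementParticle {
    name: String,
    min_occurs: u32,
}

/// Hypothetical sequence/choice/all compositor with its own minOccurs.
struct Compositor {
    min_occurs: u32,
    children: Vec<ElementParticle>,
}

/// If the compositor itself is optional, none of its direct children can
/// be required, so their minOccurs is lowered to 0.
fn propagate_min_occurs(compositor: &mut Compositor) {
    if compositor.min_occurs == 0 {
        for child in &mut compositor.children {
            child.min_occurs = 0;
        }
    }
}
```

A compositor with `min_occurs = 1` is left untouched, so required children stay required.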
Three fixes to improve XSD schema resolution:

1. parse_attribute_decl now handles ref="prefix:localName" attributes
   (e.g. xlink:href, gml:nilReason) instead of silently skipping them.

2. attributeGroup ref= is now parsed in both parse_complex_type and
   parse_attributes, creating placeholders for deferred resolution.

3. resolve_attribute_groups() iteratively expands attributeGroup refs
   (handles transitive refs like AssociationAttributeGroup →
   xlink:simpleAttrs) into actual attributes on complex types.
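The iterative expansion in step 3 can be sketched as repeated substitution until no group references remain. The `AttrItem` enum, the function name, and the pass limit used to stop on cyclic references are assumptions for illustration. The group contents in the usage note are an illustrative subset.

```rust
use std::collections::HashMap;

/// Hypothetical attribute list entry: either a concrete attribute or a
/// placeholder for an `attributeGroup ref`.
#[derive(Debug, Clone, PartialEq)]
enum AttrItem {
    Attribute(String),
    GroupRef(String),
}

/// Replaces group references by the group's members, pass by pass, so that
/// references nested inside groups are expanded as well.
fn expand_attribute_groups(
    items: &[AttrItem],
    groups: &HashMap<String, Vec<AttrItem>>,
) -> Vec<AttrItem> {
    let mut current = items.to_vec();
    // One pass per group plus one is enough for any acyclic nesting;
    // the bound also guarantees termination on cyclic references.
    for _ in 0..=groups.len() {
        if !current.iter().any(|i| matches!(i, AttrItem::GroupRef(_))) {
            break;
        }
        let mut next = Vec::new();
        for item in current {
            match item {
                AttrItem::GroupRef(name) => {
                    // References to unknown groups are dropped in this sketch.
                    if let Some(members) = groups.get(&name) {
                        next.extend(members.iter().cloned());
                    }
                }
                attr => next.push(attr),
            }
        }
        current = next;
    }
    current
}
```

With `AssociationAttributeGroup` defined as a reference to `xlink:simpleAttrs` plus `nilReason`, a type that references `AssociationAttributeGroup` ends up with the `xlink` attributes and `nilReason` directly in its attribute list after two passes.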

Also fixes strict mode xmlns/xsi filtering:
- xmlns:* declarations use attr.prefix="xmlns" not attr.name
- Default namespace stored as attr.name="xmlns"
- xsi:* attributes are standard XSI, not user schema

…oups

Five changes to resolve the remaining 4 NAS strict-mode errors:

1. merge_extension_bases now also merges base-type attributes
   (not just content model particles) into derived types via
   resolve_base_attributes, fixing timeStamp/numberMatched/
   numberReturned on wfs:FeatureCollection

2. parse_complex_content now handles attributeGroup refs inside
   complexContent extension/restriction, creating __attr_group__
   placeholders for later resolution by resolve_attribute_groups

3. validate_choice now matches XsdParticle::Group alternatives
   (sequences/choices nested in a choice), fixing lowerCorner/
   upperCorner in gml:EnvelopeType

4. validate_any_wildcard_strict Lax case now also checks imported
   elements (same as Strict case), ensuring gml:Envelope found in
   imported namespace gets type-validated

5. element_matches_decl allows local element declarations (no ref)
   to match child elements from any imported schema's namespace,
   since local elements inherit namespace from their type's schema

…attrs

Extends strict XSD validation with several fixes that eliminate false
positives on real-world NAS files (91k+ errors on WFS Transaction files
reduced to 0):

- Merge xmlns prefix declarations from included schemas into root
  prefix_map so QName resolution (e.g. gmd:LI_Lineage) works when the
  prefix is declared in an include, not the root schema document
- Store prefix_map on ImportedSchema for cross-namespace resolution
- Inherit attributes from simpleContent extension chains via
  resolve_simple_content_base_attributes (fixes gml:MeasureType uom,
  gml:CodeWithAuthorityType codeSpace)
- Handle <restriction> branch in collect_simple_content_attributes
  (not just <extension>)
- Search imported schemas in resolve_element_ref for unprefixed refs
- Scan imported types by local name when prefix is unknown
- Treat elements with no type_ref/inline_type/element_ref as anyType
  (valid per XSD spec) instead of reporting unresolved-type errors
- Resolve element namespace via imported prefix_maps before
  element-based fallback
- Inline named model groups via two-pass parsing and group_defs map
- Resolve substitution group members to their own type declarations
  (not abstract head) for correct content model validation
- Loop on unbounded Group particles in sequence validation
- report_unexpected parameter avoids double-reporting in group context
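The prefix merging from the first bullet can be sketched with plain maps. The function name is hypothetical, and the conflict rule (a prefix already declared in the root schema keeps its root binding) is an assumption that the commit message does not state.

```rust
use std::collections::HashMap;

/// Copies prefix-to-namespace bindings from an included schema document
/// into the root map. Assumption: existing root bindings are not overwritten.
fn merge_prefix_maps(root: &mut HashMap<String, String>, included: &HashMap<String, String>) {
    for (prefix, uri) in included {
        root.entry(prefix.clone()).or_insert_with(|| uri.clone());
    }
}
```

After the merge, a QName such as `gmd:LI_Lineage` can be resolved through the root map even if only the included document declared the `gmd` prefix.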
Implement cvc-complex-type.2.4.a compliance for strict XSD
validation. The previous implementation allowed elements to appear
in any order within xs:sequence as long as all expected elements
were present, which violated the XSD specification.

Changes:
- Rewrite validate_sequence_strict to track particle and child
  indices independently with proper lookahead via find_later_match
- Add group-mode awareness: when report_unexpected=false (group
  content), non-matching children are left for the parent to handle
  instead of being consumed and silently discarded
- Add helper functions: handle_repeat_occurrences_strict,
  find_later_match, describe_expected_sequence_strict
- Make validate_element_strict pub for direct element validation
- Add test_seq_order.rs with simple and AAA schema ordering tests

Key behavior:
- Optional particles (minOccurs=0) may be skipped when a later
  child matches a later particle
- Required particles cannot be skipped -> cvc-complex-type.2.4.a
- Group particles consume only matching children, leaving others
  for the parent sequence
- Lax mode (validate_sequence) unchanged for backward compat
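The key behavior listed above can be approximated with a greedy matcher that advances the particle index and the child index independently. This is a deliberately simplified sketch: it handles only named element particles, omits groups, wildcards, and the `find_later_match` lookahead, and all names and error strings are hypothetical.

```rust
/// Hypothetical element particle of an xs:sequence.
struct SeqParticle {
    name: &'static str,
    min_occurs: usize,
    max_occurs: usize,
}

/// Checks `children` (element names in document order) against `particles`.
fn check_sequence(particles: &[SeqParticle], children: &[&str]) -> Vec<String> {
    let mut errors = Vec::new();
    let mut child_idx = 0;
    for particle in particles {
        // Consume as many consecutive matching children as maxOccurs allows.
        let mut seen = 0;
        while seen < particle.max_occurs
            && child_idx < children.len()
            && children[child_idx] == particle.name
        {
            child_idx += 1;
            seen += 1;
        }
        // Optional particles may be skipped; required ones may not.
        if seen < particle.min_occurs {
            errors.push(format!("missing required element <{}>", particle.name));
        }
    }
    // Anything left over did not fit the sequence at its position.
    for extra in &children[child_idx..] {
        errors.push(format!("unexpected element <{}>", extra));
    }
    errors
}
```

For a sequence `a, b?, c`, the children `a, c` are valid, while `a, c, b` yields an error for `b` because it appears after `c`.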
Convenience type wrapping NodeId + Document with methods:
tag_name(), child_by_name(), attribute(), children(), text().
Needed by konverter benutzungsauftrag parser.