[codex] fix mergeable cid encoding#1002

Draft
zxch3n wants to merge 1 commit into main from feat/ensuremergeable
@zxch3n zxch3n commented Jun 7, 2026

Summary

Replace the unpublished recursive hex mergeable container id payload with a flattened path encoding.

The previous design encoded mergeable child identity by recursively embedding parent.to_bytes() in the synthetic root name. When a mergeable map was nested under another mergeable map, that parent cid already contained its own parent payload, so the cid size grew roughly exponentially with nesting depth. This PR changes the synthetic root name to encode the nearest non-mergeable map ancestor once, followed by the escaped map keys along the path. Nested mergeable cid size is now linear in the logical path length.
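
To make the growth claim concrete, here is a toy size model. It is only a sketch: `recursive_hex_len` and `flattened_len` are hypothetical helpers, and the old format is approximated as "hex-encode the whole parent name, then append the key", which ignores any other fields the real old payload carried. The root map name "state" is taken from the example later in this description.

```rust
/// Toy model of the old scheme (an assumption, not the real old encoder):
/// every level hex-encodes the full parent name, which doubles its length,
/// and then appends the key.
fn recursive_hex_len(depth: usize, key_len: usize) -> usize {
    let mut len = "state".len();
    for _ in 0..depth {
        len = 2 * len + key_len;
    }
    len
}

/// Size of the flattened name in bytes:
/// "🤝:" + "$state" + depth * (">" + key).
fn flattened_len(depth: usize, key_len: usize) -> usize {
    "🤝:".len() + "$state".len() + depth * (1 + key_len)
}
```

In this model each extra level adds a constant `1 + key_len` bytes to the flattened name, while the recursive variant at least doubles.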

Design: Synthetic Root Container ID

A mergeable child container is still represented as a synthetic ContainerID::Root:

ContainerID::Root {
  name: "🤝:" + payload,
  container_type: child_container_type,
}

The child container type is not encoded in payload. It is carried by Root.container_type, just like ordinary root containers. This means two mergeable cids with the same parent/key but different child types have the same root name string but remain different ContainerID values because container_type differs.
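
A minimal sketch of that property, using simplified stand-in types (the real Loro `ContainerID` and `ContainerType` have more variants and fields; `synthetic_root` is a hypothetical helper used only for illustration):

```rust
// Simplified stand-ins for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ContainerType {
    Map,
    Text,
    List,
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum ContainerID {
    Root { name: String, container_type: ContainerType },
    Normal { peer: u64, counter: i32, container_type: ContainerType },
}

/// Builds a synthetic root from an already-encoded payload.
/// The child type lives in `container_type`, never in the name.
pub fn synthetic_root(payload: &str, container_type: ContainerType) -> ContainerID {
    ContainerID::Root {
        name: format!("🤝:{payload}"),
        container_type,
    }
}
```

Two ids built from the same payload with `ContainerType::Map` and `ContainerType::Text` share a name string but compare unequal, because equality also covers `container_type`.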

There is no version byte in the payload. This replaces an unpublished format, so no old-format decoder is kept.

Payload Grammar

The payload is a flattened map path:

payload      = base-parent ">" key-segment *( ">" key-segment )
base-parent  = "$" escaped-root-name
             | "@" peer-base36 ":" counter-base36
key-segment  = escaped string segment

base-parent is the nearest non-mergeable map ancestor:

  • $<escaped-root-name> means the base parent is a root map.
  • @<peer-base36>:<counter-base36> means the base parent is a normal op-created map.

Every segment after base-parent is one map key. Intermediate mergeable parents are always maps, so their type is omitted. The final container's type is Root.container_type.

Example:

Root map "state", key "note-1", child map:
  Root { name: "🤝:$state>note-1", container_type: Map }

Nested key "body" under that mergeable map, child text:
  Root { name: "🤝:$state>note-1>body", container_type: Text }

Parsing Root { name: "🤝:$state>note-1>body", container_type: Text } returns:

parent = Root { name: "🤝:$state>note-1", container_type: Map }
key = "body"
container_type = Text

Encoding Algorithm

ContainerID::new_mergeable(parent, key, child_type) does:

  1. Assert parent.container_type() == Map. The assertion is enforced in release builds, not only in debug builds, because the payload omits the parent type.
  2. Start name with MERGEABLE_NAMESPACE_PREFIX, currently 🤝:.
  3. Append the parent payload:
    • If parent is a valid mergeable map root, reuse its existing payload without re-encoding the full parent cid.
    • Else if parent is a root map, append $ plus the escaped root name.
    • Else if parent is a normal map, append @, the peer id in canonical lowercase base36, :, then the counter in canonical lowercase base36.
  4. Append >.
  5. Append the escaped key.
  6. Return ContainerID::Root { name, container_type: child_type }.

This is the key property that prevents recursive growth: a nested mergeable parent contributes only its already-flattened payload, not serialized parent bytes.
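
The steps above can be sketched roughly as follows. This is not the code from the PR: the types are simplified stand-ins, `escape_segment` and `base36` are hypothetical helpers, and the "valid mergeable map root" check is reduced to a prefix test, whereas the description says the real code validates the payload against the grammar.

```rust
const MERGEABLE_NAMESPACE_PREFIX: &str = "🤝:";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContainerType { Map, Text }

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ContainerID {
    Root { name: String, container_type: ContainerType },
    Normal { peer: u64, counter: i32, container_type: ContainerType },
}

fn escape_segment(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '>' => out.push_str("\\>"),
            '/' => out.push_str("\\s"),
            '\0' => out.push_str("\\0"),
            _ => out.push(c),
        }
    }
    out
}

/// Canonical lowercase base36 without leading zeroes.
fn base36(mut n: u64) -> String {
    if n == 0 {
        return "0".to_string();
    }
    let mut digits = Vec::new();
    while n > 0 {
        digits.push(char::from_digit((n % 36) as u32, 36).expect("digit < 36"));
        n /= 36;
    }
    digits.into_iter().rev().collect()
}

pub fn new_mergeable(parent: &ContainerID, key: &str, child_type: ContainerType) -> ContainerID {
    // Step 1: hard assertion, because the payload omits the parent type.
    let parent_type = match parent {
        ContainerID::Root { container_type, .. }
        | ContainerID::Normal { container_type, .. } => *container_type,
    };
    assert_eq!(parent_type, ContainerType::Map, "mergeable parent must be a map");

    // Step 2: start with the namespace prefix.
    let mut name = String::from(MERGEABLE_NAMESPACE_PREFIX);

    // Step 3: append the parent payload.
    match parent {
        ContainerID::Root { name: parent_name, .. } => {
            match parent_name.strip_prefix(MERGEABLE_NAMESPACE_PREFIX) {
                // Simplification: a prefix test stands in for full grammar validation.
                Some(payload) => name.push_str(payload),
                None => {
                    name.push('$');
                    name.push_str(&escape_segment(parent_name));
                }
            }
        }
        ContainerID::Normal { peer, counter, .. } => {
            let sign = if *counter < 0 { "-" } else { "" };
            let magnitude = base36(u64::from(counter.unsigned_abs()));
            name.push_str(&format!("@{}:{}{}", base36(*peer), sign, magnitude));
        }
    }

    // Steps 4 to 6: separator, escaped key, synthetic root.
    name.push('>');
    name.push_str(&escape_segment(key));
    ContainerID::Root { name, container_type: child_type }
}
```

Calling the sketch with the root map "state" and key "note-1", and then with that result and key "body", reproduces the two names from the example section: the second call only appends ">body" to the first payload.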

Escaping Rules

Segments are escaped before being placed in the synthetic root name:

\    -> \\
>    -> \>
/    -> \s
NUL  -> \0

> is the only structural delimiter between segments. \ introduces an escape. / and NUL are escaped so synthetic root names keep the old safety property that raw slash and raw NUL do not appear in root names.

The parser rejects:

  • dangling backslash
  • unknown escapes
  • raw slash
  • raw NUL
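
The escaping table and the rejection rules can be sketched as a pair of functions. The names `escape_segment`, `unescape_segment`, and `EscapeError` are hypothetical. The sketch also rejects a raw `>` inside a single segment, on the assumption that the caller has already split the payload on unescaped separators.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum EscapeError {
    DanglingBackslash,
    UnknownEscape(char),
    RawSlash,
    RawNul,
    /// Assumption: segments are split before decoding, so a raw `>` is an error here.
    RawSeparator,
}

pub fn escape_segment(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '>' => out.push_str("\\>"),
            '/' => out.push_str("\\s"),
            '\0' => out.push_str("\\0"),
            _ => out.push(c),
        }
    }
    out
}

pub fn unescape_segment(s: &str) -> Result<String, EscapeError> {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        match c {
            // A backslash must be followed by one of the four known escape codes.
            '\\' => match chars.next() {
                Some('\\') => out.push('\\'),
                Some('>') => out.push('>'),
                Some('s') => out.push('/'),
                Some('0') => out.push('\0'),
                Some(other) => return Err(EscapeError::UnknownEscape(other)),
                None => return Err(EscapeError::DanglingBackslash),
            },
            '/' => return Err(EscapeError::RawSlash),
            '\0' => return Err(EscapeError::RawNul),
            '>' => return Err(EscapeError::RawSeparator),
            _ => out.push(c),
        }
    }
    Ok(out)
}
```

Any string should survive `unescape_segment(&escape_segment(s))` unchanged, including the empty string and keys that contain the 🤝: prefix.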

Base36 Rules

Normal op-created map parents use canonical lowercase base36:

@<peer-base36>:<counter-base36>

Rules:

  • Digits are 0-9a-z only.
  • Uppercase digits are rejected.
  • Leading zeroes are rejected except for the literal 0.
  • Negative counters are encoded with a leading - followed by the positive magnitude.
  • -0 is rejected.
  • Overflow while parsing u64 peer or i32 counter rejects the payload.

Example:

peer = u64::MAX, counter = i32::MIN
@3w5e11264sgsf:-zik0zk
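
A sketch of a codec that follows these rules. The function names are hypothetical and the code in the PR may be structured differently. Overflow is handled with checked arithmetic for the u64 peer and a range check for the i32 counter.

```rust
/// Canonical lowercase base36 without leading zeroes.
pub fn encode_u64(mut n: u64) -> String {
    if n == 0 {
        return "0".to_string();
    }
    let mut digits = Vec::new();
    while n > 0 {
        // For radix 36, from_digit yields '0'-'9' and then lowercase 'a'-'z'.
        digits.push(char::from_digit((n % 36) as u32, 36).expect("digit < 36"));
        n /= 36;
    }
    digits.into_iter().rev().collect()
}

/// Negative counters are a leading '-' followed by the positive magnitude.
pub fn encode_counter(c: i32) -> String {
    let magnitude = encode_u64(u64::from(c.unsigned_abs()));
    if c < 0 { format!("-{magnitude}") } else { magnitude }
}

pub fn parse_u64(s: &str) -> Option<u64> {
    // Reject empty input and leading zeroes (the literal "0" is allowed).
    if s.is_empty() || (s.len() > 1 && s.starts_with('0')) {
        return None;
    }
    let mut value: u64 = 0;
    for b in s.bytes() {
        let digit = u64::from(match b {
            b'0'..=b'9' => b - b'0',
            b'a'..=b'z' => b - b'a' + 10,
            _ => return None, // uppercase and any other byte are rejected
        });
        value = value.checked_mul(36)?.checked_add(digit)?;
    }
    Some(value)
}

pub fn parse_counter(s: &str) -> Option<i32> {
    match s.strip_prefix('-') {
        Some(magnitude) => {
            let m = parse_u64(magnitude)?;
            if m == 0 {
                return None; // "-0" is not canonical
            }
            // Negate in i64 so that the magnitude of i32::MIN fits.
            i32::try_from(-(i64::try_from(m).ok()?)).ok()
        }
        None => i32::try_from(parse_u64(s)?).ok(),
    }
}
```

The negative branch negates in i64 before narrowing because the magnitude of i32::MIN is one larger than i32::MAX and would not fit in an i32 on its own.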

Decoding Algorithm

ContainerID::parse_mergeable() does:

  1. Require the id to be ContainerID::Root.
  2. Strip MERGEABLE_NAMESPACE_PREFIX; reject if absent.
  3. Validate the payload structure and escaping.
  4. Find the last unescaped >.
  5. Decode the segment after that separator as key.
  6. Decode the parent:
    • If the prefix before the last separator contains another unescaped >, the parent is a mergeable map root:
      Root { name: "🤝:" + parent_payload, container_type: Map }.
    • Otherwise decode the single base-parent segment as either a root map ($...) or normal map (@peer:counter).
  7. Return (parent, key, Root.container_type).

ContainerID::is_mergeable() uses the same structural validation. A root whose name merely starts with 🤝: but does not match the grammar is treated as an ordinary root, not as a mergeable container.
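
The decoding steps can be sketched as below, again with simplified stand-in types and hypothetical helper names (`split_segments`, `unescape`, `parse_base36`, `parse_base_parent`). Errors are collapsed into `None` to keep the sketch short; the real parser may report them differently.

```rust
const PREFIX: &str = "🤝:";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContainerType { Map, Text }

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ContainerID {
    Root { name: String, container_type: ContainerType },
    Normal { peer: u64, counter: i32, container_type: ContainerType },
}

/// Splits on unescaped '>'; returns None on a dangling backslash.
fn split_segments(payload: &str) -> Option<Vec<&str>> {
    let (mut segments, mut start, mut escaped) = (Vec::new(), 0, false);
    for (i, c) in payload.char_indices() {
        if escaped {
            escaped = false;
        } else if c == '\\' {
            escaped = true;
        } else if c == '>' {
            segments.push(&payload[start..i]);
            start = i + 1;
        }
    }
    if escaped { return None; }
    segments.push(&payload[start..]);
    Some(segments)
}

fn unescape(seg: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = seg.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => out.push(match chars.next()? {
                '\\' => '\\',
                '>' => '>',
                's' => '/',
                '0' => '\0',
                _ => return None,
            }),
            '/' | '\0' | '>' => return None,
            _ => out.push(c),
        }
    }
    Some(out)
}

fn parse_base36(s: &str) -> Option<u64> {
    let canonical = !s.is_empty()
        && !(s.len() > 1 && s.starts_with('0'))
        && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'z'));
    if canonical { u64::from_str_radix(s, 36).ok() } else { None }
}

fn parse_base_parent(seg: &str) -> Option<ContainerID> {
    if let Some(root) = seg.strip_prefix('$') {
        return Some(ContainerID::Root { name: unescape(root)?, container_type: ContainerType::Map });
    }
    let (peer, counter) = seg.strip_prefix('@')?.split_once(':')?;
    let counter = match counter.strip_prefix('-') {
        Some(magnitude) => {
            let m = parse_base36(magnitude)?;
            if m == 0 { return None; } // "-0" is rejected
            i32::try_from(-(i64::try_from(m).ok()?)).ok()?
        }
        None => i32::try_from(parse_base36(counter)?).ok()?,
    };
    Some(ContainerID::Normal { peer: parse_base36(peer)?, counter, container_type: ContainerType::Map })
}

pub fn parse_mergeable(id: &ContainerID) -> Option<(ContainerID, String, ContainerType)> {
    let ContainerID::Root { name, container_type } = id else { return None; };
    let payload = name.strip_prefix(PREFIX)?;
    let segments = split_segments(payload)?;
    if segments.len() < 2 { return None; }
    // Validate the whole payload so that a malformed ancestor also rejects.
    let base = parse_base_parent(segments[0])?;
    for seg in &segments[1..] { unescape(seg)?; }
    let last = segments[segments.len() - 1];
    let key = unescape(last)?;
    let parent = if segments.len() > 2 {
        // Everything before the last unescaped '>' is the parent's payload.
        let parent_len = payload.len() - last.len() - 1;
        ContainerID::Root {
            name: format!("{}{}", PREFIX, &payload[..parent_len]),
            container_type: ContainerType::Map,
        }
    } else {
        base
    };
    Some((parent, key, *container_type))
}
```

In this sketch an `is_mergeable` check would simply be `parse_mergeable(id).is_some()`, which matches the statement that a root whose name starts with 🤝: but fails the grammar is treated as an ordinary root.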

Map Slot Marker Relationship

This PR does not change the map slot marker format.

The parent LoroMap slot still stores a compact binary activation marker:

MAGIC[4] + KIND[1] + CRC24(parent_id, key, kind)[3]

The marker binds (parent container id, key, child type) and controls visibility of the mergeable child under that map key. The synthetic root cid controls deterministic child identity. These two layers remain separate:

  • cid payload answers: "which CRDT container is this logical child?"
  • map slot marker answers: "is this child currently visible at this key, and which type is active?"
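
For orientation, the 8-byte marker layout can be sketched as follows. Several details here are assumptions, since this description does not state them: the `MAGIC` bytes are placeholders, the CRC-24 variant is the one from OpenPGP (RFC 4880), and the CRC input is taken to be the parent bytes, the key bytes, and the kind byte concatenated in that order.

```rust
/// Placeholder magic bytes; the real value is not given in this description.
const MAGIC: [u8; 4] = *b"MRG0";

/// CRC-24 as defined for OpenPGP (init 0xB704CE, polynomial 0x1864CFB).
/// Which CRC-24 variant Loro uses is an assumption here.
fn crc24(data: &[u8]) -> u32 {
    let mut crc: u32 = 0x00B7_04CE;
    for &byte in data {
        crc ^= u32::from(byte) << 16;
        for _ in 0..8 {
            crc <<= 1;
            if crc & 0x0100_0000 != 0 {
                crc ^= 0x0186_4CFB;
            }
        }
    }
    crc & 0x00FF_FFFF
}

/// MAGIC[4] + KIND[1] + CRC24(parent_id, key, kind)[3]
pub fn marker(parent_bytes: &[u8], key: &str, kind: u8) -> [u8; 8] {
    // Assumed CRC input layout: parent bytes, then key bytes, then kind.
    let mut input = Vec::with_capacity(parent_bytes.len() + key.len() + 1);
    input.extend_from_slice(parent_bytes);
    input.extend_from_slice(key.as_bytes());
    input.push(kind);

    let crc = crc24(&input).to_be_bytes(); // [0, high, mid, low]
    let mut out = [0u8; 8];
    out[..4].copy_from_slice(&MAGIC);
    out[4] = kind;
    out[5..].copy_from_slice(&crc[1..]);
    out
}
```

The marker is derived from the parent id, key, and kind but does not contain them, which is consistent with the separation described above: it gates visibility at a key, while the cid payload carries the identity.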

The only remaining parent.to_bytes() in this area is inside marker CRC input. That is intentional and is not part of synthetic root name encoding.

Compatibility

This changes an unpublished mergeable cid payload format before release. Existing public setContainer / regular child container identity is unchanged. Existing marker bytes are unchanged.

Old hex/recursive mergeable cid name decoding is intentionally not retained. User-created root names are still rejected if they start with the reserved 🤝: namespace.

Tests Added / Updated

Coverage includes:

  • Exact flattened payload string for nested mergeable map/text ids.
  • Escaping round trips for empty keys, long keys, >, \, /, NUL, and embedded 🤝: substrings.
  • Root base parent round trip, including escaped root names.
  • Normal op-created map base parent round trip with peer/counter base36.
  • Malformed payload rejection.
  • Non-canonical base36 rejection: uppercase, leading zeroes, and -0.
  • Non-map parent rejection in new_mergeable.
  • Linear nested cid size growth.
  • Existing mergeable container convergence/path/snapshot/public API behavior.

Validation

  • rustfmt --check crates/loro-common/src/lib.rs crates/loro-internal/tests/mergeable_cid_encoding.rs crates/loro-wasm/src/lib.rs crates/loro-internal/tests/mergeable_container/events_and_paths.rs
  • cargo test -p loro-internal --test mergeable_cid_encoding -- --nocapture
  • cargo test -p loro-common mergeable -- --nocapture
  • cargo test -p loro-internal --test mergeable_container -- --nocapture
  • cargo test -p loro --test mergeable_public_api -- --nocapture
  • Repository grep audit for old hex/recursive mergeable cid encoding terms returned no matches.

Note: full-file rustfmt --check on crates/loro-internal/src/handler.rs reports pre-existing unrelated formatting differences, so this PR avoids formatting that whole file.
