Skip to content
Open
179 changes: 179 additions & 0 deletions docs/devnotes/feat-v3-scale-offset-cast.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,179 @@
# scale_offset and cast_value codecs

Source: https://github.com/zarr-developers/zarr-extensions/pull/43

## Overview

Two array-to-array codecs for zarr v3, designed to work together for the
common pattern of storing floating-point data as compressed integers.

---

## scale_offset

**Type:** array -> array (does NOT change dtype)

**Encode:** `out = (in - offset) * scale`
**Decode:** `out = (in / scale) + offset`

### Parameters
- `offset` (optional, float): scalar subtracted during encoding. Default: 0.
- `scale` (optional, float): scalar multiplied during encoding (after offset subtraction). Default: 1.

### Key rules
- Arithmetic uses the input array's own data type semantics (no implicit promotion).
- If neither scale nor offset is given, `configuration` may be omitted (codec is a no-op).
- Fill value is transformed through the codec (encode direction).
- Only valid for real-number data types (int/uint/float families). Complex dtypes are rejected at validation time.

### JSON
```json
{"name": "scale_offset", "configuration": {"offset": 5, "scale": 0.1}}
```
When both offset and scale are defaults: `{"name": "scale_offset"}` (no configuration key).

---

## cast_value

**Type:** array -> array (CHANGES dtype)

**Purpose:** Value-convert (not binary-reinterpret) array elements to a new data type.

### Parameters
- `data_type` (required): target zarr v3 data type name (e.g. `"uint8"`, `"float32"`).
Internally stored as a `ZDType` instance, resolved via `get_data_type_from_json`.
- `rounding` (optional): how to round when casting float to int.
Values: `"nearest-even"` (default), `"towards-zero"`, `"towards-positive"`,
`"towards-negative"`, `"nearest-away"`.
- `out_of_range` (optional): what to do when a value is outside the target's range.
Values: `"clamp"`, `"wrap"`. If absent, out-of-range values raise an error.
`"wrap"` is only valid for integer target types.
- `scalar_map` (optional): explicit value overrides.
`{"encode": [[input, output], ...], "decode": [[input, output], ...]}`.
Applied BEFORE rounding/out_of_range. Each entry's source is deserialized using the
source dtype and target using the target dtype (via `ZDType.from_json_scalar`),
preserving full precision for both sides.

### Cast procedure (`_cast_array_impl`)

Dispatches on `(src_type, tgt_type, has_map)` where src/tgt are `"int"` or `"float"`:

| Source | Target | scalar_map | Procedure |
|--------|--------|------------|-----------|
| any | float | no | `arr.astype(target_dtype)` |
| int | float | yes | widen to float64, apply map, cast |
| float | float | yes | copy, apply map, cast |
| int | int | no | range check, then astype |
| int | int | yes | widen to int64, apply map, range check |
| float | int | any | widen to float64, apply map (if any), reject NaN/Inf, round, range check |

All casts are wrapped in `np.errstate(over='raise', invalid='raise')` to convert
numpy overflow/invalid warnings to hard errors.

### Validation checks
- Only integer and floating-point dtypes are allowed (both source and target).
- `out_of_range='wrap'` is rejected for non-integer target types.
- Int-to-float casts are rejected if the float type's mantissa cannot exactly represent
the full integer range (e.g. int64 -> float64 is rejected because float64 has only
52 mantissa bits, but int64 has values up to 2^63-1). Same check applies for the
float-to-int decode direction.

### Special values
- NaN: detected dynamically via `isinstance(src, (float, np.floating)) and np.isnan(src)`.
NaN-to-integer casts error unless `scalar_map` provides a mapping.
Hex-encoded NaN strings (e.g. `"0x7fc00001"`) preserve NaN payloads per the zarr v3 spec.
- `_check_int_range` handles out-of-range integer values with clamp (via `np.clip`) or
wrap (via modular arithmetic).

### Fill value
- Cast using the same `_cast_array` path as array elements, including scalar_map and rounding.
- Done in `resolve_metadata`, which also changes the chunk spec's dtype to the target.

### JSON
```json
{
"name": "cast_value",
"configuration": {
"data_type": "uint8",
"rounding": "nearest-even",
"out_of_range": "clamp",
"scalar_map": {
"encode": [["NaN", 0], ["+Infinity", 0], ["-Infinity", 0]],
"decode": [[0, "NaN"]]
}
}
}
```
Only non-default fields are serialized (rounding and out_of_range are omitted when default).

---

## Typical combined usage

```json
{
"data_type": "float64",
"fill_value": "NaN",
"codecs": [
{"name": "scale_offset", "configuration": {"offset": -10, "scale": 0.1}},
{"name": "cast_value", "configuration": {
"data_type": "uint8",
"rounding": "nearest-even",
"scalar_map": {"encode": [["NaN", 0]], "decode": [[0, "NaN"]]}
}},
"bytes"
]
}
```

---

## Implementation notes

### Module structure
- `src/zarr/codecs/scale_offset.py` — `ScaleOffset` class
- `src/zarr/codecs/cast_value.py` — `CastValue` class and casting helpers
- `tests/test_codecs/test_scale_offset.py` — ScaleOffset tests
- `tests/test_codecs/test_cast_value.py` — CastValue tests + combined pipeline tests

### scale_offset
- `@dataclass(kw_only=True, frozen=True)`, subclasses `ArrayArrayCodec`.
- Uses `ScaleOffsetJSON` (a `NamedConfig` TypedDict) for typed serialization.
- `from_dict` uses `parse_named_configuration(data, "scale_offset", require_configuration=False)`.
- `to_dict` omits the `configuration` key entirely when both offset=0 and scale=1.
- `resolve_metadata`: transforms fill_value via `(fill - offset) * scale`, dtype unchanged.
- `_encode_sync`: `(arr - offset) * scale` using the array's own dtype.
- `_decode_sync`: `(arr / scale) + offset` using the array's own dtype.
- `is_fixed_size = True`, `compute_encoded_size` returns input size unchanged.

### cast_value
- `@dataclass(frozen=True)` with custom `__init__` (accepts `data_type: str | ZDType`).
- Stores `dtype: ZDType` (not a string). String data_type is resolved via `get_data_type_from_json`.
- `from_dict` uses `parse_named_configuration(data, "cast_value", require_configuration=True)`.
- `to_dict` serializes dtype via `self.dtype.to_json(zarr_format=3)`, only includes
non-default rounding/out_of_range/scalar_map.
- `resolve_metadata`: casts fill value, changes chunk spec dtype to target.
- `_encode_sync` / `_decode_sync`: delegate to `_cast_array`, threading the appropriate
scalar_map direction ("encode" or "decode") and the correct src/tgt ZDType pair for
scalar map deserialization.
- `compute_encoded_size`: scales by `target_itemsize / source_itemsize`.

### Key helpers (cast_value.py)
- `_cast_array` — public entry point, wraps `_cast_array_impl` with `np.errstate`.
- `_cast_array_impl` — match-based dispatch on `(src_type, tgt_type, has_map)`.
- `_check_int_range` — integer range check with clamp/wrap/error.
- `_round_inplace` — rounding dispatch (rint, trunc, ceil, floor, nearest-away).
- `_apply_scalar_map` — in-place value remapping with NaN-aware matching.
- `_parse_map_entries` — deserializes scalar_map JSON using separate src/tgt ZDType instances.
- `_extract_raw_map` — extracts "encode" or "decode" direction from ScalarMapJSON.

### Key design decisions
1. Encode = `(in - offset) * scale` (subtract, not add) — matches HDF5 and numcodecs.
2. No implicit precision promotion — arithmetic stays in the input dtype.
3. `out_of_range` defaults to error (not clamp).
4. `scalar_map` entries are typed: each side is deserialized with its own ZDType,
so int64 scalars don't lose precision through float64 intermediaries.
5. Fill value is cast through the same `_cast_array` path as data elements.
6. Int-to-float precision loss is caught at validate time (mantissa bit check).
7. Runtime overflow/invalid is caught via `np.errstate(over='raise', invalid='raise')`.
6 changes: 6 additions & 0 deletions src/zarr/codecs/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

from zarr.codecs.blosc import BloscCname, BloscCodec, BloscShuffle
from zarr.codecs.bytes import BytesCodec, Endian
from zarr.codecs.cast_value import CastValue
from zarr.codecs.crc32c_ import Crc32cCodec
from zarr.codecs.gzip import GzipCodec
from zarr.codecs.numcodecs import (
Expand All @@ -27,6 +28,7 @@
Zlib,
Zstd,
)
from zarr.codecs.scale_offset import ScaleOffset
from zarr.codecs.sharding import ShardingCodec, ShardingCodecIndexLocation
from zarr.codecs.transpose import TransposeCodec
from zarr.codecs.vlen_utf8 import VLenBytesCodec, VLenUTF8Codec
Expand All @@ -38,9 +40,11 @@
"BloscCodec",
"BloscShuffle",
"BytesCodec",
"CastValue",
"Crc32cCodec",
"Endian",
"GzipCodec",
"ScaleOffset",
"ShardingCodec",
"ShardingCodecIndexLocation",
"TransposeCodec",
Expand All @@ -61,6 +65,8 @@
register_codec("vlen-utf8", VLenUTF8Codec)
register_codec("vlen-bytes", VLenBytesCodec)
register_codec("transpose", TransposeCodec)
register_codec("scale_offset", ScaleOffset)
register_codec("cast_value", CastValue)

# Register all the codecs formerly contained in numcodecs.zarr3

Expand Down
Loading