Description
RocksDB blocks all writes during IngestExternalFile() to maintain sequence number consistency between ingested data and normal writes, which causes write stalls. In DBImpl::IngestExternalFiles(), the ingestion path calls write_thread_.EnterUnbatched() to block all incoming writes and WaitForPendingWrites() to drain writes that are already in flight (code reference).
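For context, the blocking pattern looks roughly like the following simplified sketch; this is illustrative only, not the actual RocksDB source, though the names follow the code referenced above:

```cpp
// Simplified sketch of the blocking pattern in DBImpl::IngestExternalFiles()
// (illustrative; the real function handles errors, multiple column families, etc.).
Status DBImpl::IngestExternalFiles(/* ... */) {
  WriteThread::Writer w;
  // Stop the write queue: no new writes are admitted past this point.
  write_thread_.EnterUnbatched(&w, &mutex_);
  // Drain writes already in flight so the ingested files can be assigned a
  // sequence number consistent with all prior writes.
  WaitForPendingWrites();

  // ... assign sequence numbers and install the ingested files ...

  // Resume normal writes.
  write_thread_.ExitUnbatched(&w);
  return Status::OK();
}
```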
The existing implementation is robust and handles sequence number consistency well, which works great for most use cases. However, this becomes problematic when multiple data shards (logical data partitions) share the same RocksDB instance. When one shard triggers a write stall during ingestion, it blocks writes to all other shards on that instance, even if they're completely unrelated. This causes widespread latency spikes during bulk data loading operations.
To address this, we've added an allow_write option to IngestExternalFileOptions that allows writes to continue during ingestion. When it is set to true, RocksDB skips blocking writes, eliminating the write stall entirely. In exchange, the caller must guarantee that no concurrent writes target the key ranges being ingested. The change is minimal: the write stall simply becomes conditional on this flag. You can see the full diff here: https://github.com/tikv/rocksdb/pull/400/files
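Assuming the flag lands under this name, application-side usage would look roughly like this (the SST path is a placeholder):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// db is an already-open rocksdb::DB*.
rocksdb::IngestExternalFileOptions ingest_opts;
// Proposed flag: skip the write stall. The caller guarantees that no
// concurrent writes touch the key ranges covered by the ingested files.
ingest_opts.allow_write = true;
rocksdb::Status s =
    db->IngestExternalFile({"/path/to/shard-123.sst"}, ingest_opts);
if (!s.ok()) {
  // Handle ingestion failure.
}
```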
This works well for TiKV because each data shard uses the Raft consensus protocol, which serializes operations within a shard: ingest and write operations never run concurrently on the same shard, so ingestion and writes to the same key ranges cannot overlap, making it safe to use allow_write = true. Additionally, TiKV frequently performs operations that rely on ingestion, such as data shard scheduling, data import, and index creation, which makes this optimization particularly valuable.
We've tested this in production with TiKV (a distributed key-value store built on RocksDB) with significant results. Before the change, write stalls could last hundreds of milliseconds during large data migrations, with P9999 wait times around 25ms and P99 latency at 2-4ms with spikes. After enabling allow_write = true, write stalls are eliminated entirely, P9999 dropped to 2ms (over 90% reduction), and P99 latency dropped to 1ms (over 50% reduction). Performance is now consistent even under heavy load. The implementation maintains backward compatibility by defaulting allow_write to false, and it includes validation and tests.
We'd like to propose this for upstream inclusion. It's been running in TiKV production for a while and working well. We believe many other services that share a single RocksDB instance across multiple shards could benefit from this optimization as well. More details are available in my blog post: https://www.pingcap.com/blog/tikv-write-latency-solved-smoother-performance-without-compromises/
If there are no concerns, I'll open a pull request to merge this change. Thanks for considering this!