Description
RocksDB blocks all writes during IngestExternalFile() to maintain sequence number consistency between ingested data and normal writes, which causes write stalls. In DBImpl::IngestExternalFiles(), the ingestion path calls write_thread_.EnterUnbatched() to block all incoming writes and WaitForPendingWrites() to drain writes that are already in flight (code reference).
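For context, the blocking pattern looks roughly like the following simplified sketch; this is illustrative only, not the actual RocksDB source, though the names follow the code referenced above:

```cpp
// Simplified sketch of the blocking pattern in DBImpl::IngestExternalFiles()
// (illustrative; the real function handles errors, multiple column families, etc.).
Status DBImpl::IngestExternalFiles(/* ... */) {
  WriteThread::Writer w;
  // Stop the write queue: no new writes are admitted past this point.
  write_thread_.EnterUnbatched(&w, &mutex_);
  // Drain writes already in flight so the ingested files can be assigned a
  // sequence number consistent with all prior writes.
  WaitForPendingWrites();

  // ... assign sequence numbers and install the ingested files ...

  // Resume normal writes.
  write_thread_.ExitUnbatched(&w);
  return Status::OK();
}
```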
The existing implementation is robust and handles sequence number consistency well, which works great for most use cases. However, this becomes problematic when multiple data shards (logical data partitions) share the same RocksDB instance. When one shard triggers a write stall during ingestion, it blocks writes to all other shards on that instance, even if they're completely unrelated. This causes widespread latency spikes during bulk data loading operations.
To address this, we've added an allow_write option to IngestExternalFileOptions that allows writes to continue during ingestion. When it is set to true, RocksDB skips blocking writes, eliminating the write stall entirely. In exchange, the caller must guarantee that no concurrent writes target the key ranges being ingested. The change is minimal: the write stall simply becomes conditional on this flag. You can see the full diff here: https://github.com/tikv/rocksdb/pull/400/files
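Assuming the flag lands under this name, application-side usage would look roughly like this (the SST path is a placeholder):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// db is an already-open rocksdb::DB*.
rocksdb::IngestExternalFileOptions ingest_opts;
// Proposed flag: skip the write stall. The caller guarantees that no
// concurrent writes touch the key ranges covered by the ingested files.
ingest_opts.allow_write = true;
rocksdb::Status s =
    db->IngestExternalFile({"/path/to/shard-123.sst"}, ingest_opts);
if (!s.ok()) {
  // Handle ingestion failure.
}
```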
This works well for TiKV because each data shard uses the Raft consensus protocol, which serializes operations within a shard: ingest and write operations never run concurrently on the same shard, so ingestion and writes to the same key ranges cannot overlap, making it safe to use allow_write = true. Additionally, TiKV frequently performs operations that rely on ingestion, such as data shard scheduling, data import, and index creation, which makes this optimization particularly valuable.
We've tested this in production with TiKV (a distributed key-value store built on RocksDB) with significant results. Before the change, write stalls could last hundreds of milliseconds during large data migrations, with P9999 wait times around 25ms and P99 latency at 2-4ms with spikes. After enabling allow_write = true, write stalls are eliminated entirely, P9999 dropped to 2ms (over 90% reduction), and P99 latency dropped to 1ms (over 50% reduction). Performance is now consistent even under heavy load. The implementation maintains backward compatibility by defaulting allow_write to false, and it includes validation and tests.
We'd like to propose this for upstream inclusion. It's been running in TiKV production for a while and working well. We believe many other services that share a single RocksDB instance across multiple shards could benefit from this optimization as well. More details are available in my blog post: https://www.pingcap.com/blog/tikv-write-latency-solved-smoother-performance-without-compromises/
If there are no concerns, I'll open a pull request to merge this change. Thanks for considering this!