From 4850d0f69cd1d253ba021acfe8c231ec5071cb39 Mon Sep 17 00:00:00 2001 From: "Jorge C. Leitao" Date: Sat, 12 Dec 2020 18:12:41 +0100 Subject: [PATCH 1/6] Added discussion about usage of unsaee --- rust/arrow/README.md | 73 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 73 insertions(+) diff --git a/rust/arrow/README.md b/rust/arrow/README.md index 3dc1c2e1ba1..38dc0842aae 100644 --- a/rust/arrow/README.md +++ b/rust/arrow/README.md @@ -96,6 +96,79 @@ Arrow uses the following features: Other than `simd` all the other features are enabled by default. Disabling `prettyprint` might be necessary in order to compile Arrow to the `wasm32-unknown-unknown` WASM target. +## Guidelines in usage of `unsafe` + +[`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the primary cause of undefined behavior in a program written in Rust `[citation needed]`. + +This crate only accepts the usage of `unsafe` code upon careful consideration, and strives to avoid it to the largest possible extent. + +### When can `unsafe` be used? + +Generally, `unsafe` can only be used when a `safe` counterpart is not available or has performance implications. The following is a summary of the current components of the crate that require `unsafe`: + +* alloc, dealloc and realloc of buffers along cache lines +* Interpreting bytes as certain rust types, for access, representation and compute +* Foreign interfaces (C data interface) +* Inter-process communication (IPC) +* SIMD +* Performance + +#### cache-line aligned memory management + +The arrow format recommends storing buffers aligned with cache lines, and this crate adopts this behavior. +However, Rust's global allocator does not allocates memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`. + +Usage of `unsafe` for the purposes of supporting allocations aligned with cache lines is allowed. + +#### Interpreting bytes + +The arrow format is specified in bytes (`u8`), which can be logically represented as certain types +depending on the DataType. +For many operations, such as access, representation, numerical computation and string manipulation, +it is often necessary to interpret bytes as other physical types (e.g. `i32`). + +Usage of `unsafe` for the purpose of interpreting bytes in their corresponding type (according to the arrow specification) is allowed. Specifically, the pointer to the byte slice must be aligned to the type that it intends to represent and the length of the slice is a multiple of the size of the target type of the transmutation. + +#### FFI + +The arrow format declares an ABI for zero-copy from and to libraries that implement the specification +(foreign interfaces). In Rust, receiving and sending pointers via FFI requires usage of `unsafe` due to +the impossibility of the compiler to derive the invariants (such as lifetime, null pointers, and pointer alignment), from the source code alone as they are part of the FFI contract. + +Usage of `unsafe` for the purposes of supporting FFI is allowed. + +#### IPC + +The arrow format declares a IPC protocol, which this crate supports. IPC is equivalent to a FFI in that the rust compiler can't reason about the contract's invariants. + +Usage of `unsafe` for the purposes of supporting IPC is allowed. + +#### SIMD + +The API provided by the `packed_simd` library is currently `unsafe`. However, SIMD offers a significant performance improvement over non-SIMD operations. + +Usage of `unsafe` for the purposes of supporting SIMD is allowed. + +#### Performance + +Some operations are significantly faster when `unsafe` is used. + +A common usage of `unsafe` is to offer an API to access the `i`th element of an array (e.g. `UInt32Array`). +This requires accessing the values buffer e.g. `array.buffers()[0]`, picking the slice +`[i * size_of(), (i + 1) * size_of()]`, and then transmuting it to `i32`. In safe Rust, +this operation requires boundary checks that are detrimental to performance. + +Usage of `unsafe` for performance reasons is justified only when the performance difference of a publicly available API is estatistically significantly larger than 10%, as demonstrated by a `bench`. + +### Considerations when introducing `unsafe` + +Usage of `unsafe` in this crate *must*: + +* not expose a public API as `safe` when there are necessary invariants for that API to be defined behavior. +* have code documentation for why a `safe` is not used / possible (e.g. `// 30% performance degradation if the safe counterpart is used, see bench X`) +* have code documentation about which invariant the user needs to enforce to ensure no undefined behavior (e.g. `// this buffer must be constructed according to the arrow specification`) +* if applicable, have the necessary `assert`s (not `debug_assert`!) of invariants. + # Publishing to crates.io An Arrow committer can publish this crate after an official project release has From 3462c3b06de59d83751f4bf023602b6c1a3f414c Mon Sep 17 00:00:00 2001 From: Jorge Leitao Date: Mon, 14 Dec 2020 12:47:58 +0000 Subject: [PATCH 2/6] Update rust/arrow/README.md Co-authored-by: Andrew Lamb --- rust/arrow/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rust/arrow/README.md b/rust/arrow/README.md index 38dc0842aae..ca8cec542f8 100644 --- a/rust/arrow/README.md +++ b/rust/arrow/README.md @@ -99,7 +99,7 @@ compile Arrow to the `wasm32-unknown-unknown` WASM target. ## Guidelines in usage of `unsafe` [`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the primary cause of undefined behavior in a program written in Rust `[citation needed]`. - +For two real world examples of where `unsafe` has consumed time in the past in this project see [#8545](https://github.com/apache/arrow/pull/8645) and [8829](https://github.com/apache/arrow/pull/8829) This crate only accepts the usage of `unsafe` code upon careful consideration, and strives to avoid it to the largest possible extent. ### When can `unsafe` be used? From e26e5fb0139a8067ee09318d4788ae2abcc961e1 Mon Sep 17 00:00:00 2001 From: Jorge Leitao Date: Mon, 14 Dec 2020 12:48:17 +0000 Subject: [PATCH 3/6] Update rust/arrow/README.md Co-authored-by: Andrew Lamb --- rust/arrow/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rust/arrow/README.md b/rust/arrow/README.md index ca8cec542f8..bfcc96289ca 100644 --- a/rust/arrow/README.md +++ b/rust/arrow/README.md @@ -104,7 +104,7 @@ This crate only accepts the usage of `unsafe` code upon careful consideration, a ### When can `unsafe` be used? -Generally, `unsafe` can only be used when a `safe` counterpart is not available or has performance implications. The following is a summary of the current components of the crate that require `unsafe`: +Generally, `unsafe` should only be used when a `safe` counterpart is not available and there is no `safe` way to achieve additional performance in that area. The following is a summary of the current components of the crate that require `unsafe`: * alloc, dealloc and realloc of buffers along cache lines * Interpreting bytes as certain rust types, for access, representation and compute From 8da1cbaef9973082044060cd45bd8bd055cc9093 Mon Sep 17 00:00:00 2001 From: Jorge Leitao Date: Mon, 14 Dec 2020 12:48:32 +0000 Subject: [PATCH 4/6] Update rust/arrow/README.md Co-authored-by: Andrew Lamb --- rust/arrow/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rust/arrow/README.md b/rust/arrow/README.md index bfcc96289ca..df66fbdf65c 100644 --- a/rust/arrow/README.md +++ b/rust/arrow/README.md @@ -165,7 +165,7 @@ Usage of `unsafe` for performance reasons is justified only when the performance Usage of `unsafe` in this crate *must*: * not expose a public API as `safe` when there are necessary invariants for that API to be defined behavior. -* have code documentation for why a `safe` is not used / possible (e.g. `// 30% performance degradation if the safe counterpart is used, see bench X`) +* have code documentation for why `safe` is not used / possible (e.g. `// 30% performance degradation if the safe counterpart is used, see bench X`) * have code documentation about which invariant the user needs to enforce to ensure no undefined behavior (e.g. `// this buffer must be constructed according to the arrow specification`) * if applicable, have the necessary `assert`s (not `debug_assert`!) of invariants. From 0274a17294e3cad2d584c9e3286893612ac3137b Mon Sep 17 00:00:00 2001 From: "Jorge C. Leitao" Date: Mon, 14 Dec 2020 13:50:26 +0000 Subject: [PATCH 5/6] Re-phrased assert. --- rust/arrow/README.md | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/rust/arrow/README.md b/rust/arrow/README.md index df66fbdf65c..57a4a2a59fd 100644 --- a/rust/arrow/README.md +++ b/rust/arrow/README.md @@ -98,7 +98,7 @@ compile Arrow to the `wasm32-unknown-unknown` WASM target. ## Guidelines in usage of `unsafe` -[`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the primary cause of undefined behavior in a program written in Rust `[citation needed]`. +[`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the [primary cause of undefined behavior](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) in a program written in Rust. For two real world examples of where `unsafe` has consumed time in the past in this project see [#8545](https://github.com/apache/arrow/pull/8645) and [8829](https://github.com/apache/arrow/pull/8829) This crate only accepts the usage of `unsafe` code upon careful consideration, and strives to avoid it to the largest possible extent. @@ -111,15 +111,13 @@ Generally, `unsafe` should only be used when a `safe` counterpart is not availab * Foreign interfaces (C data interface) * Inter-process communication (IPC) * SIMD -* Performance +* Performance (e.g. omit bounds checks, use of pointers to avoid bound checks) #### cache-line aligned memory management The arrow format recommends storing buffers aligned with cache lines, and this crate adopts this behavior. However, Rust's global allocator does not allocates memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`. -Usage of `unsafe` for the purposes of supporting allocations aligned with cache lines is allowed. - #### Interpreting bytes The arrow format is specified in bytes (`u8`), which can be logically represented as certain types @@ -135,20 +133,14 @@ The arrow format declares an ABI for zero-copy from and to libraries that implem (foreign interfaces). In Rust, receiving and sending pointers via FFI requires usage of `unsafe` due to the impossibility of the compiler to derive the invariants (such as lifetime, null pointers, and pointer alignment), from the source code alone as they are part of the FFI contract. -Usage of `unsafe` for the purposes of supporting FFI is allowed. - #### IPC The arrow format declares a IPC protocol, which this crate supports. IPC is equivalent to a FFI in that the rust compiler can't reason about the contract's invariants. -Usage of `unsafe` for the purposes of supporting IPC is allowed. - #### SIMD The API provided by the `packed_simd` library is currently `unsafe`. However, SIMD offers a significant performance improvement over non-SIMD operations. -Usage of `unsafe` for the purposes of supporting SIMD is allowed. - #### Performance Some operations are significantly faster when `unsafe` is used. @@ -167,7 +159,7 @@ Usage of `unsafe` in this crate *must*: * not expose a public API as `safe` when there are necessary invariants for that API to be defined behavior. * have code documentation for why `safe` is not used / possible (e.g. `// 30% performance degradation if the safe counterpart is used, see bench X`) * have code documentation about which invariant the user needs to enforce to ensure no undefined behavior (e.g. `// this buffer must be constructed according to the arrow specification`) -* if applicable, have the necessary `assert`s (not `debug_assert`!) of invariants. +* if applicable, use `debug_assert`s to relevant invariants (e.g. bound checks) # Publishing to crates.io From ff35ce8988a51193add3beb9b53645e68eef5e16 Mon Sep 17 00:00:00 2001 From: "Jorge C. Leitao" Date: Wed, 16 Dec 2020 06:14:20 +0000 Subject: [PATCH 6/6] Typos and incorporated changes. --- rust/arrow/README.md | 26 ++++++++++++++++++++------ 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/rust/arrow/README.md b/rust/arrow/README.md index 57a4a2a59fd..8646daae63b 100644 --- a/rust/arrow/README.md +++ b/rust/arrow/README.md @@ -116,12 +116,12 @@ Generally, `unsafe` should only be used when a `safe` counterpart is not availab #### cache-line aligned memory management The arrow format recommends storing buffers aligned with cache lines, and this crate adopts this behavior. -However, Rust's global allocator does not allocates memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`. +However, Rust's global allocator does not allocate memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`. #### Interpreting bytes The arrow format is specified in bytes (`u8`), which can be logically represented as certain types -depending on the DataType. +depending on the `DataType`. For many operations, such as access, representation, numerical computation and string manipulation, it is often necessary to interpret bytes as other physical types (e.g. `i32`). @@ -131,7 +131,7 @@ Usage of `unsafe` for the purpose of interpreting bytes in their corresponding t The arrow format declares an ABI for zero-copy from and to libraries that implement the specification (foreign interfaces). In Rust, receiving and sending pointers via FFI requires usage of `unsafe` due to -the impossibility of the compiler to derive the invariants (such as lifetime, null pointers, and pointer alignment), from the source code alone as they are part of the FFI contract. +the impossibility of the compiler to derive the invariants (such as lifetime, null pointers, and pointer alignment) from the source code alone as they are part of the FFI contract. #### IPC @@ -150,17 +150,31 @@ This requires accessing the values buffer e.g. `array.buffers()[0]`, picking the `[i * size_of(), (i + 1) * size_of()]`, and then transmuting it to `i32`. In safe Rust, this operation requires boundary checks that are detrimental to performance. -Usage of `unsafe` for performance reasons is justified only when the performance difference of a publicly available API is estatistically significantly larger than 10%, as demonstrated by a `bench`. +Usage of `unsafe` for performance reasons is justified only when all other alternatives have been exhausted and the performance benefits are sufficiently large (e.g. >~10%). ### Considerations when introducing `unsafe` Usage of `unsafe` in this crate *must*: * not expose a public API as `safe` when there are necessary invariants for that API to be defined behavior. -* have code documentation for why `safe` is not used / possible (e.g. `// 30% performance degradation if the safe counterpart is used, see bench X`) -* have code documentation about which invariant the user needs to enforce to ensure no undefined behavior (e.g. `// this buffer must be constructed according to the arrow specification`) +* have code documentation for why `safe` is not used / possible +* have code documentation about which invariant the user needs to enforce to ensure [soundness](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#soundness-of-code--of-a-library), or which +* invariant is being preserved. * if applicable, use `debug_assert`s to relevant invariants (e.g. bound checks) +Example of code documentation: + +```rust +// JUSTIFICATION +// Benefit +// Describe the benefit of using unsafe. E.g. +// "30% performance degradation if the safe counterpart is used, see bench X." +// Soundness +// Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g. +// "We bounded check these values at initialization and the array is immutable." +let ... = unsafe { ... }; +``` + # Publishing to crates.io An Arrow committer can publish this crate after an official project release has