|
| 1 | ++++ |
| 2 | +title = "Rust Proto Design Decisions" |
| 3 | +weight = 782 |
| 4 | +linkTitle = "Design Decisions" |
| 5 | +description = "Explains some of the design choices that the Rust Proto implementation makes." |
| 6 | +type = "docs" |
| 7 | +toc_hide = "true" |
| 8 | ++++ |
| 9 | + |
| 10 | +As with any library, Rust Protobuf is designed considering the needs of both |
| 11 | +Google's first-party usage of Rust as well as that of external users. Choosing a |
| 12 | +path in that design space means that some choices made will not be optimal for |
| 13 | +some users in some cases, even if it is the right choice for the implementation |
| 14 | +overall. |
| 15 | + |
| 16 | +This page covers some of the larger design decisions that the Rust Protobuf |
| 17 | +implementation makes and the considerations which led to those decisions. |
| 18 | + |
| 19 | +## Designed to Be ‘Backed’ by Other Protobuf Implementations, Including C++ Protobuf {#backed-by-cpp} |
| 20 | + |
| 21 | +Protobuf Rust is not a pure Rust implementation of protobuf, but a safe Rust API |
| 22 | +implemented on top of existing protobuf implementations, or as we call these |
| 23 | +implementations: kernels. |
| 24 | + |
| 25 | +The biggest factor that went into this decision was to enable zero-cost addition |
| 26 | +of Rust to a preexisting binary which already uses non-Rust Protobuf. By |
| 27 | +enabling the implementation to be ABI-compatible with the C++ Protobuf generated |
| 28 | +code, it is possible to share Protobuf messages across the language boundary |
| 29 | +(FFI) as plain pointers, avoiding the need to serialize in one language, pass |
| 30 | +the byte array across the boundary, and deserialize in the other language. This |
| 31 | +also reduces binary size for these use cases by avoiding having redundant schema |
| 32 | +information embedded in the binary for the same messages for each language. |
| 33 | + |
| 34 | +Protobuf Rust currently supports three kernels: |
| 35 | + |
| 36 | +* C++ kernel - the generated code is backed by C++ Protocol Buffers (the |
| 37 | + "full" implementation, typically used for servers). This kernel offers |
| 38 | + in-memory interoperability with C++ code that uses the C++ runtime. This is |
| 39 | + the default for servers within Google. |
| 40 | +* C++ Lite kernel - the generated code is backed by C++ Lite Protocol Buffers |
| 41 | + (typically used for mobile). This kernel offers in-memory interoperability |
| 42 | + with C++ code that uses the C++ Lite runtime. This is the default for |
| 43 | + mobile apps within Google. |
| 44 | +* upb kernel - the generated code is backed by |
| 45 | + [upb](https://github.com/protocolbuffers/protobuf/tree/main/upb), |
| 46 | + a highly performant and small-binary-size Protobuf library written in C. upb |
| 47 | + is designed to be used as an implementation detail by Protobuf runtimes in |
| 48 | + other languages. This is the default in open source builds where we expect |
| 49 | + static linking with code already using C++ Protobuf to be more rare. |
| 50 | + |
| 51 | +The decision to support multiple non-Rust kernels significantly influences |
| 52 | +our public API decisions, including the types used on getters (discussed later |
| 53 | +in this document). |
| 54 | + |
| 55 | +### No Pure Rust Kernel {#no-pure-rust} |
| 56 | + |
| 57 | +Given that we designed the API to be implementable by multiple backing |
| 58 | +implementations, a natural question is why the only supported kernels are |
| 59 | +written in the memory-unsafe languages of C and C++ today. |
| 60 | + |
| 61 | +While Rust being a memory-safe language can significantly reduce exposure to |
| 62 | +critical security issues, no language is immune to security issues. The Protobuf |
| 63 | +implementations that we support as kernels have been scrutinized and fuzzed to |
| 64 | +the extent that Google is comfortable using those implementations to perform |
| 65 | +unsandboxed parsing of untrusted inputs in our own servers and apps. A |
| 66 | +greenfield binary parser written in Rust at this time would be understood to be |
| 67 | +much more likely to contain critical vulnerabilities than the preexisting C++ |
| 68 | +Protobuf parser. |
| 69 | + |
| 70 | +There are legitimate arguments for supporting a pure Rust implementation in |
| 71 | +the long term, including toolchain difficulties for developers using our |
| 72 | +implementation in open source. |
| 73 | + |
| 74 | +It is a reasonable assumption that Google will support a pure Rust |
| 75 | +implementation at some later date, but we are not investing in it today and have |
| 76 | +no concrete roadmap for it at this time. |
| 77 | + |
| 78 | +## View/Mut Proxy Types {#view-mut-proxy-types} |
| 79 | + |
| 80 | +The Rust Proto API is designed with opaque "Proxy" types. For a .proto file that |
| 81 | +defines `message SomeMsg {}`, we generate the Rust types `SomeMsg`, |
| 82 | +`SomeMsgView<'_>` and `SomeMsgMut<'_>`. The simple rule of thumb is that we |
| 83 | +expect the View and Mut types to stand in for `&SomeMsg` and `&mut SomeMsg` in |
| 84 | +all usages by default, while still getting all of the borrow checking/Send/etc. |
| 85 | +behavior that you would expect from those types. |
| 86 | + |
| 87 | +### Another Lens to Understand These Types {#another-lens} |
| 88 | + |
| 89 | +To better understand the nuances of these types, it may be useful to think of |
| 90 | +these types as follows: |
| 91 | + |
| 92 | +```rust |
| 93 | +struct SomeMsg(Box<cpp::SomeMsg>); |
| 94 | +struct SomeMsgView<'a>(&'a cpp::SomeMsg); |
| 95 | +struct SomeMsgMut<'a>(&'a mut cpp::SomeMsg); |
| 96 | +``` |
| 97 | + |
| 98 | +Under this lens you can see that: |
| 99 | + |
| 100 | +- Given a `&SomeMsg` it is possible to get a `SomeMsgView` (similar to how |
| 101 | + given a `&Box<T>` you can get a `&T`). |
| 102 | +- Given a `SomeMsgView` it is *not* possible to get a `&SomeMsg` (similar to |
| 103 | + how given a `&T` you couldn't get a `&Box<T>`). |
| 104 | + |
| 105 | +Just like with the `&Box` example, this means that on function arguments, it is |
| 106 | +generally better to default to using `SomeMsgView<'a>` rather than |
| 107 | +`&'a SomeMsg`, as it will allow a superset of callers to use the function. |
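
The "superset of callers" point can be sketched with plain Rust stand-ins. This is only a model of the lens above, not the generated code: `Inner`, `Parent`, `as_view`, `child`, `read_id`, and `read_id_from_ref` are hypothetical names introduced for illustration, and `Inner` plays the role of `cpp::SomeMsg`.

```rust
// Hypothetical stand-ins that model the "lens" above. The real generated
// types are opaque and backed by a kernel; none of these names are the real API.
struct Inner {
    id: i32,
}

// Owned message: owns its storage, like `Box<T>`.
struct SomeMsg(Box<Inner>);

// View proxy: behaves like `&T`.
#[derive(Clone, Copy)]
struct SomeMsgView<'a>(&'a Inner);

// A parent that stores its child inline, so no owned `SomeMsg` exists for it.
struct Parent {
    child: Inner,
}

impl SomeMsg {
    // `&SomeMsg` -> view, like `&Box<T>` -> `&T`.
    fn as_view(&self) -> SomeMsgView<'_> {
        SomeMsgView(&self.0)
    }
}

impl Parent {
    // Only a view can be handed out here: there is no `SomeMsg` to borrow.
    fn child(&self) -> SomeMsgView<'_> {
        SomeMsgView(&self.child)
    }
}

// Accepts every caller that can produce a view.
fn read_id(msg: SomeMsgView<'_>) -> i32 {
    msg.0.id
}

// Accepts only callers that hold an owned `SomeMsg`.
fn read_id_from_ref(msg: &SomeMsg) -> i32 {
    msg.0.id
}

fn main() {
    let owned = SomeMsg(Box::new(Inner { id: 1 }));
    let parent = Parent { child: Inner { id: 2 } };

    assert_eq!(read_id(owned.as_view()), 1);
    assert_eq!(read_id(parent.child()), 2);
    assert_eq!(read_id_from_ref(&owned), 1);
    // `read_id_from_ref(parent.child())` would not compile: a view cannot be
    // turned back into a `&SomeMsg`.
}
```

In this model the view-taking function serves both the owned message and the child of a parent, while the reference-taking function serves only the former.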
| 108 | + |
| 109 | +### Why {#why} |
| 110 | + |
| 111 | +There are two main reasons for this design: to unlock possible optimization |
| 112 | +benefits, and as an inherent outcome of the kernel design. |
| 113 | + |
| 114 | +#### Optimization Opportunity Benefit {#optimization} |
| 115 | + |
| 116 | +Because Protobuf is such a core and widespread technology, it is unusually |
| 117 | +prone to having every observable behavior depended on by someone, and |
| 118 | +relatively small optimizations can have an unusually large net impact at scale. |
| 119 | +We have found that more opaque types give us an unusually high amount of |
| 120 | +leverage: they permit us to be more deliberate about exactly which behaviors |
| 121 | +are exposed, and give us more room to optimize the implementation. |
| 122 | + |
| 123 | +A `SomeMsgMut<'_>` provides those opportunities where a `&mut SomeMsg` would |
| 124 | +not: namely that we can construct them lazily and with an implementation detail |
| 125 | +which is not the same as the owned message representation. It also inherently |
| 126 | +allows us to control certain behaviors that we couldn't otherwise limit or |
| 127 | +control: for example, any `&mut` can be used with `std::mem::swap()`, which is a |
| 128 | +behavior that would place strong limits on what invariants you are able to |
| 129 | +maintain between a parent and child struct if `&mut SomeChild` is given to |
| 130 | +callers. |
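
The `std::mem::swap()` concern can be illustrated with a small self-contained model. It is a sketch under assumptions rather than how any kernel works: the `model` module, the `owner_id` bookkeeping, and the `demo` function are hypothetical, standing in for whatever parent/child invariant an implementation might want to keep.

```rust
mod model {
    // Hypothetical child whose `owner_id` must always match its parent's id.
    // Fields are private, so callers cannot edit `owner_id` directly.
    pub struct Child {
        owner_id: u32,
        value: i32,
    }

    impl Child {
        pub fn set_value(&mut self, v: i32) {
            self.value = v;
        }
        pub fn value(&self) -> i32 {
            self.value
        }
    }

    pub struct Parent {
        id: u32,
        child: Child,
    }

    impl Parent {
        pub fn new(id: u32) -> Self {
            Parent { id, child: Child { owner_id: id, value: 0 } }
        }
        // Handing out `&mut Child` is what exposes the invariant to `swap`.
        pub fn child_mut(&mut self) -> &mut Child {
            &mut self.child
        }
        pub fn invariant_holds(&self) -> bool {
            self.child.owner_id == self.id
        }
    }
}

use model::Parent;

// Returns whether the invariant held before and after swapping two children.
fn demo() -> (bool, bool) {
    let mut p1 = Parent::new(1);
    let mut p2 = Parent::new(2);

    // Ordinary mutation through the child's own API keeps the invariant.
    p1.child_mut().set_value(10);
    let before = p1.invariant_holds();

    // Any two `&mut Child` can be swapped wholesale, private fields included.
    std::mem::swap(p1.child_mut(), p2.child_mut());
    let after = p1.invariant_holds();

    // The child carrying value 10 now lives in `p2`.
    assert_eq!(p2.child_mut().value(), 10);
    (before, after)
}

fn main() {
    let (before, after) = demo();
    assert!(before);
    assert!(!after);
}
```

A proxy type that is not a plain `&mut` gives the library a place to decide whether such a wholesale exchange is offered at all.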
| 131 | + |
| 132 | +#### Inherent to Kernel Design {#kernel-design} |
| 133 | + |
| 134 | +The other reason for the proxy types is more of an inherent limitation to our |
| 135 | +kernel design; when you have a `&T` there must be a real Rust `T` type in memory |
| 136 | +somewhere. |
| 137 | + |
| 138 | +Our C++ kernel design allows you to parse a message which contains nested |
| 139 | +messages, and create only a small Rust stack-allocated object to represent |
| 140 | +the root message, with all other memory being stored on the C++ heap. When you |
| 141 | +later access a child message, there will be no already-allocated Rust object |
| 142 | +which corresponds to that child, and so there's no Rust instance to borrow at |
| 143 | +that moment. |
| 144 | + |
| 145 | +By using proxy types, we're able to create, on demand, Rust proxy types that |
| 146 | +semantically act as borrows, without there being any eagerly allocated Rust |
| 147 | +memory for those instances ahead of time. |
| 148 | + |
| 149 | +## Non-Std Types {#non-std} |
| 150 | + |
| 151 | +### Simple Types Which May Have a Directly Corresponding Std Type {#corresponding-std} |
| 152 | + |
| 153 | +In some cases the Rust Protobuf API defines its own type even where a |
| 154 | +corresponding std type with the same name exists, and the current |
| 155 | +implementation may even simply wrap the std type, for example |
| 156 | +`proto::Utf8Error`. |
| 157 | + |
| 158 | +Using these types rather than std types gives us more flexibility in optimizing |
| 159 | +the implementation in the future. While our current implementation uses the Rust |
| 160 | +std UTF-8 validation today, creating our own `proto::Utf8Error` type |
| 161 | +enables us to change the implementation to use the highly optimized C++ |
| 162 | +implementation of UTF-8 validation that we use from C++ Protobuf, which is faster |
| 163 | +than Rust's std UTF-8 validation. |
| 164 | + |
| 165 | +### ProtoString {#proto-string} |
| 166 | + |
| 167 | +Rust's `str` and `std::string::String` types maintain a strict invariant that |
| 168 | +they only contain valid UTF-8, but C++ Protobuf and C++'s `std::string` type |
| 169 | +generally do not enforce any such guarantee. `string` typed Protobuf fields are |
| 170 | +intended to only ever contain valid UTF-8, but the enforcement of this has many |
| 171 | +holes where a `string` field may end up containing invalid UTF-8 contents at |
| 172 | +runtime. |
| 173 | + |
| 174 | +To deliver on zero-cost message sharing between C++ and Rust while minimizing |
| 175 | +costly validations or risk of undefined behavior in Rust, we chose not to use |
| 176 | +the `str`/`String` types for `string` field getters, and introduced the types |
| 177 | +`ProtoStr` and `ProtoString` instead, which are equivalent types except that they |
| 178 | +may contain invalid UTF-8 in rare situations. Those types let the application |
| 179 | +code choose whether to perform the validation on demand to get a `&str`, or |
| 180 | +operate on the raw bytes to avoid any validation. |
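
The choice between validating on demand and working with raw bytes can be sketched using only the standard library. `FieldBytes` below is a hypothetical stand-in and not the real `ProtoStr`; its method names `as_bytes` and `to_str` are assumptions chosen for illustration, and it reports errors with `std::str::Utf8Error` rather than a Protobuf-specific error type.

```rust
use std::str::Utf8Error;

// Hypothetical stand-in for a borrowed `string` field value that is
// expected, but not guaranteed, to hold valid UTF-8.
struct FieldBytes<'a>(&'a [u8]);

impl<'a> FieldBytes<'a> {
    // Raw access: no validation cost, no UTF-8 guarantee.
    fn as_bytes(&self) -> &'a [u8] {
        self.0
    }

    // On-demand validation: pays for the check only when a `&str` is needed.
    fn to_str(&self) -> Result<&'a str, Utf8Error> {
        std::str::from_utf8(self.0)
    }
}

fn main() {
    let good = FieldBytes("héllo".as_bytes());
    // 0xff can never appear in valid UTF-8.
    let bad = FieldBytes(&[0x66, 0x6f, 0xff]);

    assert_eq!(good.to_str(), Ok("héllo"));
    assert!(bad.to_str().is_err());

    // The raw bytes stay available even when validation would fail.
    assert_eq!(bad.as_bytes().len(), 3);
}
```

Code that only forwards or hashes the field can stay on the byte path, while code that needs a `&str` handles the error case explicitly.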
| 181 | + |
| 182 | +We are aware that vocabulary types like `str` are very important to idiomatic |
| 183 | +usage, and intend to keep an eye on whether this decision is the right one as |
| 184 | +usage details of Rust evolve. |