---
layout: post
nav-class: dark
categories: ruben
title: "A postgres library for Boost"
author-id: ruben
author-name: Rubén Pérez Hidalgo
---

Do you know Boost.MySQL? If you've been reading my posts, you probably do.
Many people have wondered 'why not Postgres?'. Well, the time is now.
TL;DR: I'm writing the equivalent of Boost.MySQL, but for PostgreSQL.
You can find the code [here](https://github.com/anarthal/nativepg).

Since libpq is already a good library, the NativePG project intends
to be more ambitious than Boost.MySQL. In addition to the expected
Asio interface, I intend to provide a sans-io API that exposes primitives
like message serialization.

Throughout this post, I will go through the intended library design
and the rationale behind it.

## The lowest level: message serialization

PostgreSQL clients communicate with the server using
a binary protocol on top of TCP, termed [the frontend/backend protocol](https://www.postgresql.org/docs/current/protocol.html).
The protocol defines a set of messages used for interactions. For example, when running a query, the following happens:

```
┌────────┐                                    ┌────────┐
│ Client │                                    │ Server │
└───┬────┘                                    └───┬────┘
    │                                             │
    │                   Query                     │
    │ ──────────────────────────────────────────> │
    │                                             │
    │               RowDescription                │
    │ <────────────────────────────────────────── │
    │                                             │
    │                   DataRow                   │
    │ <────────────────────────────────────────── │
    │                                             │
    │               CommandComplete               │
    │ <────────────────────────────────────────── │
    │                                             │
    │                ReadyForQuery                │
    │ <────────────────────────────────────────── │
    │                                             │
```

In the lowest layer, this library provides functions to serialize and parse
such messages. The goal here is to be as efficient as possible.
Parsing functions are non-allocating, and use an approach inspired by
Boost.Url collections.

## Parsing database types

The PostgreSQL type system is quite rich. In addition to the usual SQL built-in types,
it supports advanced types like UUIDs, arrays, and user-defined aggregates.

When running a query, libpq exposes retrieved data as either raw text or bytes.
This is what the server sends in the `DataRow` packets shown above.
To do something useful with the data, users will likely need to parse and serialize
such types.

The next layer of NativePG is in charge of providing such functions.
This will likely contain some extension points for users to plug in their own types.
This is the general form of such functions:

```cpp
system::error_code parse(span<const std::byte> from, T& to, const connection_state&);
void serialize(const T& from, dynamic_buffer& to, const connection_state&);
```

Note that some types might require access to session configuration.
For instance, dates may be expressed using different wire formats depending
on the connection's runtime settings.

At the time of writing, only ints and strings are supported,
but this will be extended soon.

## Composing requests

Efficiency in database communication is achieved with pipelining.
A network round-trip to the server is worth a thousand allocations in the client.
It is thus critical that:

- The protocol properly supports pipelining. This is the case with PostgreSQL.
- The client exposes an interface to it and makes it very easy to use.

libpq does the first; NativePG intends to achieve the second.

NativePG pipelines by default: a `request` object is always a pipeline:

```cpp
// Create a request
request req;

// These two queries will be executed as part of a pipeline
req.add_query("SELECT * FROM libs WHERE author = $1", {"Ruben"});
req.add_query("DELETE FROM libs WHERE author <> $1", {"Ruben"});
```

Everything you may ask of the server can be added to a `request`.
This includes preparing and executing statements, establishing
pipeline synchronization points, and so on.
It aims to be close enough to the protocol to be powerful,
while also exposing high-level functions to make things easier.

## Reading responses

Like `request`, the core response mechanism aims to be as close
to the protocol as possible. Since use cases here are much more varied,
there is no single `response` class, but a concept instead.
This is what a `response_handler` looks like:

```cpp
struct my_handler {
    // Check that the handler is compatible with the request,
    // and prepare any required data structures. Called once at the beginning
    handler_setup_result setup(const request& req, std::size_t pipeline_offset);

    // Called once for every message received from the server
    // (e.g. `RowDescription`, `DataRow`, `CommandComplete`)
    void on_message(const any_request_message& msg);

    // The overall result of the operation (error_code + diagnostic string).
    // Called after the operation has finished.
    const extended_error& result() const;
};
```

Note that `on_message` is not allowed to report errors.
Even if a handler encounters a problem with a message
(imagine finding a `NULL` for a field where the user isn't expecting one),
this is a user error, rather than a protocol error.
Subsequent steps in the pipeline must not be affected by it.

This is powerful but very low-level. Using this mechanism, the library
exposes an interface to parse the result of a query into a user-supplied
struct, using Boost.Describe:

```cpp
struct library
{
    std::int32_t id;
    std::string name;
    std::string cpp_version;
};
BOOST_DESCRIBE_STRUCT(library, (), (id, name, cpp_version))

// ...
std::vector<library> libs;
auto handler = nativepg::into(libs); // this is a valid response_handler
```

## Network algorithms

Given a user request and response handler, how do we send these to the server?
We need a set of network algorithms to achieve this. Some of these are trivial:
sending a request to the server is an `asio::write` on the request's buffer.
Others, however, are more involved:

- Reading a pipeline response needs to verify that the message
  sequence is what we expected, for security, and handle errors gracefully.
- The handshake algorithm, in charge of authentication when we connect to the
  server, needs to respond to server authentication challenges, which may
  come in different forms.

Writing these using `asio::async_compose` is problematic because:

- They become tied to Boost.Asio.
- They are difficult to test.
- They result in long compile times and code bloat due to templating.

At the moment, these are written as finite state machines, similar to
how OpenSSL behaves in non-blocking mode:

```cpp
// Reads the response of a pipeline (simplified).
// This is a hand-wired generator.
class read_response_fsm {
public:
    // User-supplied arguments: request and response
    read_response_fsm(const request& req, response_handler_ref handler);

    // Yielded to signal that we should read from the server
    struct read_args { span<std::byte> buffer; };

    // Yielded to signal that we're done
    struct done_args { system::error_code result; };

    variant<read_args, done_args>
    resume(connection_state&, system::error_code io_result, std::size_t bytes_transferred);
};
```

The idea is that higher-level code should call `resume` until it returns
a `done_args` value. This allows decoupling from the underlying I/O runtime.

Since NativePG targets C++20, I'm considering rewriting this as a coroutine.
Boost.Capy (currently under development, and hopefully part of Boost soon)
could be a good candidate for this.

## Putting everything together: the Asio interface

At the end of the day, most users just want a `connection` object they can easily
use. Once all the sans-io parts are working, writing it is pretty straightforward.
This is what end-user code looks like:

```cpp
// Create a connection
connection conn{co_await asio::this_coro::executor};

// Connect
co_await conn.async_connect(
    {.hostname = "localhost", .username = "postgres", .password = "", .database = "postgres"}
);
std::cout << "Startup complete\n";

// Compose our request and response
request req;
req.add_query("SELECT * FROM libs WHERE author = $1", {"Ruben"});
std::vector<library> libs;

// Run the request
co_await conn.async_exec(req, into(libs));
```

## Auto-batch connections

While `connection` is good, experience has shown me that it's still
too low-level for most users:

- Connection establishment is manual with `async_connect`.
- No built-in reconnection or health checks.
- No built-in concurrent execution of requests.
  That is, `async_exec` first writes the request, then reads the response.
  Other requests may not be executed during this period.
  This limits the connection's throughput.

For this reason, NativePG will provide some higher-level interfaces
that will make server communication easier and more efficient.
To get a feel for what we need, we should first understand
the two main usage patterns that we expect.

Most of the time, connections are used in a **stateless** way.
For example, consider querying data from the server:

```cpp
request req;
req.add_query("SELECT * FROM libs WHERE author = $1", {"Ruben"});
co_await conn.async_exec(req, res);
```

This query does not mutate connection state in any way.
Other queries could be inserted before and after it without
making any difference.

I plan to add a higher-level connection type, similar to
`redis::connection` in Boost.Redis, that automatically
batches concurrent requests and handles reconnection.
The key differences with `connection` would be:

- Several independent tasks can share an auto-batch connection.
  This is an error for `connection`.
- If several requests are queued at the same time,
  the connection may send them together to the server using a single system call.
- There is no `async_connect` in an auto-batch connection.
  Reconnection is handled automatically.

Note that this pattern is not exclusive to read-only or
individual queries. Transactions can work by using protocol features:

```cpp
request req;
req.set_autosync(false); // All subsequent queries are part of the same transaction
req.add_query("UPDATE table1 SET x = $1 WHERE y = 2", {42});
req.add_query("UPDATE table2 SET x = $1 WHERE y = 42", {2});
req.add_sync(); // The two updates run atomically
co_await conn.async_exec(req, res);
```

## Connection pools

I mentioned there were two main usage patterns in the library.
Sometimes, it is required to use connections in a **stateful** way:

```cpp
request req;
req.add_simple_query("BEGIN"); // start a transaction manually
req.add_query("SELECT * FROM library WHERE author = $1 FOR UPDATE", {"Ruben"}); // lock rows
co_await conn.async_exec(req, into(lib));

// Do something in the client that depends on lib
if (lib.name == "Boost.MySQL")
    co_return; // don't

// Now compose another request that depends on what we read from lib
req.clear();
req.add_query("UPDATE library SET status = 'deprecated' WHERE id = $1", {lib.id});
req.add_simple_query("COMMIT");
co_await conn.async_exec(req, ignore);
```

The key point here is that this pattern requires exclusive access to `conn`.
No other requests should be interleaved between the first and the second
`async_exec` invocations.

The best way to solve this is by using a connection pool.
This is what client code could look like:

```cpp
co_await pool.async_exec([&] (connection& conn) -> asio::awaitable<system::error_code> {
    request req;
    req.add_simple_query("BEGIN");
    req.add_query("SELECT balance, status FROM accounts WHERE user_id = $1 FOR UPDATE", {user_id});

    account_info acc;
    co_await conn.async_exec(req, into(acc));

    // Check that the account has sufficient funds and is active
    if (acc.balance < payment_amount || acc.status != "active")
        co_return error::insufficient_funds;

    // Call an external payment gateway API - this CANNOT be done in SQL
    auto result = co_await payment_gateway.process_charge(user_id, payment_amount);

    // Compose the next request based on the external API response
    req.clear();
    if (result.success) {
        req.add_query(
            "UPDATE accounts SET balance = balance - $1 WHERE user_id = $2",
            {payment_amount, user_id}
        );
        req.add_simple_query("COMMIT");
    } else {
        req.add_simple_query("ROLLBACK"); // don't leave the transaction open
    }
    co_await conn.async_exec(req, ignore);

    // The connection is automatically returned to the pool when this coroutine completes
    co_return result.success ? system::error_code{} : error::payment_failed;
});
```

I explicitly want to avoid having a `connection_pool::async_get_connection()`
function, like in Boost.MySQL. That function returns a proxy object that grants access
to a free connection. When destroyed, the proxy returns the connection to the pool.
This pattern looks great on paper, but runs into severe complications in
multi-threaded code. The proxy object's destructor needs to mutate the pool's state,
thus requiring at least an `asio::dispatch` to the pool's executor, which may or may not
be a strand. It is so easy to get wrong that Boost.MySQL added a `pool_params::thread_safe` boolean
option to take care of this automatically, adding extra complexity. Definitely something to avoid.

## SQL formatting

As we've seen, the protocol has built-in support for adding
parameters to queries (see placeholders like `$1`). These placeholders
are expanded securely by the server.

While this covers most cases, sometimes we need to generate SQL
that is too dynamic to be handled by the server. For instance,
a website might allow multiple optional filters, translating into
`WHERE` clauses that might or might not be present.

These use cases require SQL to be generated in the client. To do so,
we need a way of formatting user-supplied values without
running into SQL injection vulnerabilities. The final piece
of the library is thus a `format_sql` function akin to the
one in Boost.MySQL.

## Final thoughts

While the plan is clear, there is still much to be done here.
There are dedicated APIs for high-throughput data copying and
push notifications that need to be implemented. Some of the described
APIs have a solid working implementation, while others still need
some work. All in all, I hope that this library can soon reach a state
where it can be useful to people.
