This is the repository that will coordinate the 1 Billion Row Challenge for Object Pascal.

Submit your implementation and become part of the leaderboard!

## Rounding
While I recognize that Székely's rounding code was a good effort, it was not simple, and it made a lot of people doubt whether it was even correct for negative temperatures.\
In a discussion with [Mr. Packman](https://pack.ac/) themselves, we came up with a simpler solution. They even added some _Unit Testing_ :D.

This will be the official way to round the output values, so pick your poison:

```pas
// Ceil lives in the Math unit; IntToStr in SysUtils.
function RoundEx(x: Double): Double; inline;
begin
  Result := Ceil(x * 10) / 10;
end;

function RoundExInteger(x: Double): Integer; inline;
begin
  Result := Ceil(x * 10);
end;

function RoundExString(x: Double): String; inline;
var
  V, Q, R: Integer;
begin
  V := RoundExInteger(x);
  if V < 0 then
  begin
    Result := '-';
    V := -V;
  end
  else
    Result := '';
  Q := V div 10;
  R := V - (Q * 10);
  // Result already holds the sign prefix, so append the digits to it
  Result := Result + IntToStr(Q) + '.' + IntToStr(R);
end;

procedure Test;
var
  F: Double;
begin
  for F in [10.01, 10.04, -10.01, -10.0, 0, -0, -0.01] do
    WriteLn(RoundExString(F)); // assumed test body: print each rounded sample
end;
```

> We are still waiting for the Delphi version to be completed in order for us to have an official `SHA256` hash for the output.
>
> Until then, this is the current one: `4256d19d3e134d79cc6f160d428a1d859ce961167bd01ca528daca8705163910`
> There's also an archived version of the [baseline output](./data/baseline.output.gz)

## Differences From Original
I'd like to thank [@paweld](https://github.com/paweld) for taking us from my mis\
I'd like to thank [@mobius](https://github.com/mobius1qwe) for taking the time to provide the Delphi version of the generator.\
I'd like to thank [@dtpfl](https://github.com/dtpfl) for his invaluable work on keeping the `README.md` file up to date with everything.\
I'd like to thank Székely Balázs for providing many patches to make everything compliant with the original challenge.\
I'd like to thank [@corneliusdavid](https://github.com/corneliusdavid) for giving some of the information files a once-over and making things more legible and clear.\
I'd like to thank Mr. **Pack**man, aka O, for clearing the fog around the rounding issues.

## Links
The original repository: https://github.com/gunnarmorling/1brc\

**entries/abouchez/README.md**

I am very happy to share decades of server-side performance coding techniques us

Here are the main ideas behind this implementation proposal:

- **mORMot** makes cross-platform and cross-compiler support simple - e.g. `TMemMap`, `TDynArray`, `TTextWriter`, `SetThreadCpuAffinity`, `crc32c`, `ConsoleWrite` or command-line parsing;
- The entire 16GB file is `memmap`ed at once into memory - this won't work on a 32-bit OS, but it avoids any `read` syscall or memory copy;
- Process the file in parallel using several threads - configurable via the `-t=` switch, the default being the total number of CPUs reported by the OS;
- Input is fed to each thread as 64MB chunks (see the sketch after this list): because thread scheduling is unbalanced, it is inefficient to pre-divide the whole input file by the number of threads;
- Parse temperatures with dedicated code (expecting single-decimal input values);
- The station names are stored as UTF-8 pointers to the memmap location where they first appear, in `StationName[]`, to be emitted eventually for the final output, not during temperature parsing;
- No memory allocation (e.g. no transient `string` or `TBytes`) nor any syscall is done during the parsing process, to reduce contention and ensure the process is only CPU-bound and RAM-bound (we checked this with `strace` on Linux);
- Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target) - perhaps making it less readable, because we used pointer arithmetic where it matters (I like to think of such low-level Pascal code as [portable assembly](https://sqlite.org/whyc.html#performance), similar to "unsafe" code in managed languages);
- Can optionally output timing statistics and the resultset hash value on the console, to debug and refine settings (with the `-v` command line switch);
- Can optionally set each thread's affinity to a single core (with the `-a` command line switch).
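
As mentioned in the chunk-feeding item above, each worker grabs the next ~64MB slice and aligns it on line boundaries, so no measurement line is ever split between two threads. The following is only a minimal sketch of that boundary adjustment - the names (`ChunkBounds`, `MapBase`, `Index`) are illustrative, not this entry's actual API; in the real code a shared counter would hand out increasing chunk indexes to the threads:

```pas
// Returns the byte range of chunk number Index (0-based), aligned so that
// every chunk starts and ends on a line boundary of the memory-mapped file.
procedure ChunkBounds(MapBase: PAnsiChar; MapSize: Int64; Index: Int64;
  out ChunkStart, ChunkEnd: PAnsiChar);
const
  CHUNK_SIZE = 64 shl 20; // 64MB
var
  StartOfs, EndOfs: Int64;
begin
  StartOfs := Index * CHUNK_SIZE;
  if StartOfs > MapSize then
    StartOfs := MapSize;
  EndOfs := StartOfs + CHUNK_SIZE;
  if EndOfs > MapSize then
    EndOfs := MapSize;
  // every chunk but the first skips the partial line at its start...
  if StartOfs <> 0 then
    while (StartOfs < MapSize) and (MapBase[StartOfs - 1] <> #10) do
      inc(StartOfs);
  // ...and owns the whole line that crosses its nominal end
  while (EndOfs < MapSize) and (MapBase[EndOfs - 1] <> #10) do
    inc(EndOfs);
  ChunkStart := MapBase + StartOfs;
  ChunkEnd := MapBase + EndOfs;
end;
```
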
If you are not convinced by the "perfect hash" trick, you can define the `NOPERFECTHASH` conditional, which forces full name comparison, but is noticeably slower. Our algorithm is safe with the official dataset, and gives the expected final result - which was the goal of this challenge: compute the right data reduction in as little time as possible, with all possible hacks and tricks. A "perfect hash" is a well-known hacking pattern, usable when the dataset is validated in advance. And since our CPUs offer `crc32c`, which happens to be perfect for our dataset... let's use it! https://en.wikipedia.org/wiki/Perfect_hash_function ;)
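
As a rough, self-contained sketch of the idea - not this entry's actual code, which relies on mORMot's hardware-accelerated `crc32c` - the 32-bit hash can be used as the station's identity in an open-addressed table, so the name bytes never need to be compared during parsing. The table size, slot layout and the slow software `Crc32cOf` helper below are assumptions made for illustration only:

```pas
const
  HASH_BITS = 17;                          // 2^17 = 131072 slots: ample for 41343 stations
  HASH_MASK = (1 shl HASH_BITS) - 1;

var
  // one 32-bit hash per slot; 0 means "empty" (good enough for a sketch)
  HashTable: array[0 .. HASH_MASK] of cardinal;

// Plain bitwise CRC32C, only here to keep the sketch self-contained;
// the real entry uses the SSE4.2-accelerated crc32c shipped with mORMot.
function Crc32cOf(buf: PAnsiChar; len: PtrInt): cardinal;
var
  i, b: PtrInt;
begin
  Result := $FFFFFFFF;
  for i := 0 to len - 1 do
  begin
    Result := Result xor ord(buf[i]);
    for b := 1 to 8 do
      if (Result and 1) <> 0 then
        Result := (Result shr 1) xor $82F63B78 // Castagnoli polynomial
      else
        Result := Result shr 1;
  end;
  Result := not Result;
end;

// Returns the slot index for a station name, claiming a slot on first sight.
// With a collision-free ("perfect") hash, comparing the 32-bit value is enough:
// the name bytes themselves are never compared while parsing.
function FindStation(name: PAnsiChar; len: PtrInt): PtrInt;
var
  h: cardinal;
begin
  h := Crc32cOf(name, len);
  Result := h and HASH_MASK;
  while (HashTable[Result] <> 0) and (HashTable[Result] <> h) do
    Result := (Result + 1) and HASH_MASK;  // linear probing on slot clashes
  HashTable[Result] := h;                  // no-op if the station was already seen
end;
```
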
## Why L1 Cache Matters
Taking special care of the "64-byte cache line" is quite unusual among all the "1brc" implementations I have seen in any language - and it does make a noticeable difference in performance.

The L1 cache is well known in the performance-hacking literature to be the main bottleneck for any efficient in-memory process. If you want things to go fast, you should flatter your CPU's L1 cache.

Min/max values are reduced to a 16-bit `smallint` - resulting in a temperature range of -3276.8..+3276.7, which seems fair on our planet according to the IPCC. ;)

As a result, each `Station[]` entry takes only 16 bytes, so we can fit exactly 4 entries in a single CPU L1 cache line. To be fair, if we put some more data into the record (e.g. use `Int64` instead of `smallint`/`integer`), the performance degrades by only a few percent. The main factor seems to be that each entry is likely to fit into a single cache line, even if filling two cache lines may sometimes be needed for misaligned data.
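
For illustration, such a 16-byte entry could look like the following - the field names and exact layout are assumptions for this sketch, not the actual record used by the entry:

```pas
type
  // Hypothetical 16-byte station entry: exactly 4 of them fill one 64-byte cache line.
  TStationEntry = packed record
    NameHash: cardinal;   // 4 bytes - crc32c "perfect hash" of the station name
    Count: cardinal;      // 4 bytes - number of measurements seen so far
    Sum: integer;         // 4 bytes - sum of temperatures, in tenths of a degree
    Min, Max: smallint;   // 2 + 2 bytes - extremes, in tenths of a degree
  end;
// SizeOf(TStationEntry) = 16, so a 64-byte L1 cache line holds 4 entries
```
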
In our first attempt (see "Old Version" below), we stored the name into the `Station[]` array, so that each entry is 64 bytes long exactly. But since `crc32c` is a perfect hash function for our dataset, it is enough to just store the 32-bit hash instead, and not the actual name.

Note that if we reduce the number of stations from 41343 to 400, the performance is much higher, even with a 16GB file as input. The reason is that 400 x 16 = 6400 bytes, so the whole dataset could fit entirely in each core's L1 cache. No slower L2/L3 cache is involved, therefore performance is better. The cache memory seems to be the bottleneck of our code - which is a good sign.

Experiments show that forcing thread affinity is not a good idea: it is always much better to let any modern Operating System schedule the threads onto the CPU cores, because it has much better knowledge of the actual system load and status - even on a "fair" CPU architecture like AMD Zen. For a "pure CPU" process, affinity may help a little, but for our "old" process, working outside of the L1 cache limits, we had better let the OS decide.

So with this "old" version, it was decided to use `-t=16`. The "old" version uses a whole cache line (64 bytes) for its `Station[]` record, so it may be responsible for consuming too much CPU cache, which is why more than 16 threads makes no difference with it. Our "new" version, with its `Station[]` entries of only 16 bytes, can use `-t=32` with benefit. Cache memory access is likely to be the bottleneck from now on.