| From: | Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
| Subject: | Addressing buffer private reference count scalability issue |
| Date: | 2026-03-08 16:09:07 |
| Message-ID: | CAE8JnxNTETEUiAOF31=_yo=pvyAi9npOeJfcTvEJJbi4vomtYA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi Hackers,
This patch series addresses a performance issue pointed out by Andres Freund [1]. It consists of five parts:
1.
Benchmark buffer pinning: new benchmark code implementing a few
functions that can be used in Postgres queries, plus a Python script that
runs them and produces CSV files and SVG plots for the current build.
2.
Refactoring reference counting: before changing code and
potentially breaking things, I considered it prudent to isolate the
reference-counting logic to limit the damage; it was previously part of an
8k+ LOC file.
3.
Using simplehash: replacing the HTAB with a simplehash, and adding
a new set of macros (SH_ENTRY_EMPTY, SH_MAKE_EMPTY, SH_MAKE_IN_USE) that
allow using the InvalidBuffer special value as the empty marker instead of
allocating extra space for a validity flag. Here I assume that the buffer
numbers are independent enough of the array size, so I use the buffer
number directly as the hash key, omitting a hash function call.
4.
Compact PrivateRefCountEntry: the original implementation used a
4-byte key and an 8-byte value. The reference count uses 32 bits, yet it
is unreasonable to expect one backend to pin the same buffer a billion
times; the lock mode also uses 32 bits but can only take 4 values. So I
packed them into a single uint32, giving 30 bits to the count and 2 bits
to the lock mode. This makes each entry 8 bytes long, which on 64-bit CPUs
represents more than a 1/3 reduction in memory. It also aligns the array
on 64-bit words, so every entry is aligned and copying one entry can be
completed in a single instruction.
5.
REFCOUNT_ARRAY_ENTRIES=0: since the simplehash lookup is basically an
array lookup itself, it is worth trying to remove the small fixed array
completely and keep only the hash. For small working sets we would be
trading a few branches for a buffer % SIZE computation; for the prefetch
use case, where buffers are pinned and unpinned in FIFO fashion, it saves
an 8-entry array scan and some extra data moves.
In addition to the patch I am including
- A bash script to apply and benchmark the patches sequentially. You might
have to adjust REPO_ROOT; in my case it is derived from the script path,
which lives under $REPO_ROOT/.patches/pins/.
- A compare-patches.py script that can be copied to
src/test/modules/test_buffer_pin to turn the benchmark CSVs into figures
showing one metric across different patches, instead of different metrics
for one patch as benchmark.py produces.
- A nicely formatted post about this [2]
Regards,
Alexandre
[1]
https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5mw4
[2] https://afelipe.hashnode.dev/postgres-backend-buffer-pinning-algorithm
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0003-Using-simplehash.patch | application/octet-stream | 12.8 KB |
| v1-0004-Compact-PrivateRefCountEntry.patch | application/octet-stream | 10.8 KB |
| v1-0002-Refactoring-reference-counting.patch | application/octet-stream | 50.6 KB |
| v1-0001-Benchmark-buffer-pinning.patch | application/octet-stream | 26.6 KB |
| v1-0005-REFCOUNT_ARRAY_ENTRIES-0.patch | application/octet-stream | 6.5 KB |
| run-all.sh | text/x-sh | 2.7 KB |
| compare-patches.py | text/x-python-script | 3.4 KB |