Optimizing ResourceOwner to speed up COPY

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Optimizing ResourceOwner to speed up COPY
Date: 2025-10-16 17:46:49
Message-ID: 84f20db7-7d57-4dc0-8144-7e38e0bbe75d@vondra.me
Lists: pgsql-hackers

Hi,

While reviewing and testing a nearby patch (using COPY for batching in
postgres_fdw), I noticed that some of the COPY queries spend a
substantial amount of time in ResourceOwnerAddToHash(). The exact figure
depends on the amount of data in the COPY, but it was often close to 50%
(according to perf record/report).

The reason is pretty simple - ResourceOwner tracks resources in a very
simple hash table, with O(n^2) behavior for duplicates. This happens
with COPY, because COPY creates an array of 1000 tuple slots, and each
slot references the same tuple descriptor. The descriptor is therefore
added to the ResourceOwner once for each slot.
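
In a (very hand-wavy) sketch, the pattern looks like this - this is not
the actual copyfrom.c code, the batch size and the use of
MakeSingleTupleTableSlot() / TTSOpsVirtual are just stand-ins for the
illustration:

    /*
     * Illustration only. Each slot pins the same reference-counted tuple
     * descriptor, and every pin is remembered by the current ResourceOwner,
     * so the owner ends up with ~1000 entries for the very same descriptor.
     */
    #define COPY_BATCH_SIZE 1000

    TupleTableSlot *slots[COPY_BATCH_SIZE];

    for (int i = 0; i < COPY_BATCH_SIZE; i++)
        slots[i] = MakeSingleTupleTableSlot(tupdesc, &TTSOpsVirtual);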

The trouble is that ResourceOwner uses a simple open-addressing hash
table, and the duplicate values end up in a long dense "train" of
entries, slowing down both additions and deletions. Starting with an
empty hash table, adding the first entry is fine - the first slot is
available. But each following addition needs to skip over the existing
entries - the n-th addition needs to skip over n-1 entries.

With N entries this means N * (N-1) / 2 probes in total. And then the
same thing happens for deletions, in reverse.
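
Here's a tiny standalone model of the problem - just linear probing
where every key hashes to the same bucket, not the actual ResourceOwner
code:

    #include <stdio.h>
    #include <stdint.h>

    #define CAPACITY 2048
    #define N        1000

    int
    main(void)
    {
        uint64_t    table[CAPACITY] = {0};  /* 0 = empty slot */
        uint64_t    key = 42;               /* every insert uses the same key */
        long        probes = 0;

        for (int i = 0; i < N; i++)
        {
            uint32_t    idx = key % CAPACITY;   /* same hash every time */

            /* linear probing: skip over all the previously added duplicates */
            while (table[idx] != 0)
            {
                idx = (idx + 1) % CAPACITY;
                probes++;
            }
            table[idx] = key;
        }

        /* prints 499500, i.e. N * (N-1) / 2 */
        printf("total probes: %ld\n", probes);
        return 0;
    }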

This is not the first time I noticed ResourceOwner in profiles, but I
never investigated it. I don't know if all those cases were caused by
this same behavior, but it's likely at least some were ...

There's an easy way to improve this by allowing a single hash entry to
represent multiple references to the same resource. The attached patch
adds a "count" to the ResourceElem, tracking how many times that
resource was added. So if you add 1000 tuple slots, the descriptor will
have just one ResourceElem entry with count=1000.
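
Roughly speaking, the entry looks like this (a simplified sketch based
on the current resowner.c layout, not the patch verbatim):

    typedef struct ResourceElem
    {
        Datum       item;
        const ResourceOwnerDesc *kind;  /* NULL means a free hash table slot */
        uint32      count;              /* how many times the item was added */
    } ResourceElem;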

This reduces the insert/delete complexity - not quite to O(1), but much
lower than O(n^2). It also reduces the memory usage with duplicates, and
postpones when the hash table needs to be enlarged. Of course, if there
are no duplicates, it'll use a bit more memory.

The deduplication is "opportunistic" in the sense that you may still end
up with multiple entries for the same value/kind. If adding a resource
finds an empty slot before hitting the existing entry, it'll use that
slot. Also, only the hash table is deduplicated, not the initial array
(which is, however, quite small).

This is a reasonable trade-off, because it means the code does not get
more expensive for cases with no duplicates, while still improving cases
with many duplicate values.
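
In pseudo-C, the insert path then looks roughly like this (again just a
sketch of the idea - the field names follow the current resowner.c, the
actual patch may differ in the details):

    static void
    ResourceOwnerAddToHash(ResourceOwner owner, Datum value,
                           const ResourceOwnerDesc *kind)
    {
        uint32      mask = owner->capacity - 1;
        uint32      idx = hash_resource_elem(value, kind) & mask;

        for (;;)
        {
            ResourceElem *elem = &owner->hash[idx];

            if (elem->kind == NULL)
            {
                /* empty slot - store a new entry with count = 1 */
                elem->item = value;
                elem->kind = kind;
                elem->count = 1;
                owner->nhash++;
                break;
            }

            if (elem->kind == kind && elem->item == value)
            {
                /* duplicate found first - just bump the count */
                elem->count++;
                break;
            }

            /* occupied by a different resource - keep probing */
            idx = (idx + 1) & mask;
        }
    }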

To demonstrate the benefits, here's the throughput (tps) of a COPY
loading a given number of rows from a file into an UNLOGGED table, with
1, 4 and 8 clients.

         |           master          |          patched
    rows |       1        4        8 |       1        4        8
---------|---------------------------|--------------------------
      10 |  112090   418421   594905 |  112871   419043   589646
     100 |   35265   121903   168238 |   42536   138578   183740
    1000 |    1450     5700    10725 |    5847    21594    30713
   10000 |     531     1943     2988 |     743     2619     3557
  100000 |      72      244      324 |      76      256      332

Or, as relative throughput:

         |   patched / master
    rows |      1       4       8
---------|-----------------------
      10 |   101%    100%     99%
     100 |   121%    114%    109%
    1000 |   403%    379%    286%
   10000 |   140%    135%    119%
  100000 |   106%    105%    102%

Those are some nice improvements, especially with 1000 rows. That makes
sense, because 1000 is the internal batch size in COPY - the overhead
gradually increases up to 1000 rows, and then it amortizes over many
batches. It has very little impact on large COPY runs.

I've done the same test with a logged table. The results depend
significantly on the storage (how fast it can handle commits), which is
not surprising. But with good storage the overall behavior is almost
exactly the same.

I haven't yet done any benchmarking focused on cases with no/few
duplicates (although the COPY with 10 rows could qualify). But as
explained earlier, the behavior should not change at all - there's an
extra uint32 per entry, but that's it.

regards

--
Tomas Vondra

Attachment Content-Type Size
v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch text/x-patch 6.6 KB
