Re: Optimising compactify_tuples()

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimising compactify_tuples()
Date: 2020-09-08 16:30:15
Message-ID: CAApHDvq7H=o065VKgyaGe+Wjn_i8RGsPuaPuunsbk758xC35ig@mail.gmail.com
Lists: pgsql-hackers

On Mon, 7 Sep 2020 at 19:47, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> I wondered if we could get around that just by having another buffer
> somewhere and memcpy the tuples into that first then copy the tuples
> out that buffer back into the page. No need to worry about the order
> we do that in as there's no chance to overwrite memory belonging to
> other tuples.
>
> Doing that gives me 79.166 seconds in the same recovery test. Or about
> 46% faster, instead of 22% (I mistakenly wrote 28% yesterday)

I did some more thinking about this and considered if there's a way to
just get rid of the sorting version of compactify_tuples() completely.
In the version from yesterday, I fell back on the sort version when
more than half of the tuples on the page were being pruned. I'd
thought that in that case copying out *all* of the page from pd_upper
up to pd_special (the tuple portion of the page) might be more costly,
since that would needlessly copy all the pruned tuples too. The sort
also becomes cheaper in that case, since there are fewer items to
sort, so it seemed like a good idea to keep the old version for some
cases. However, I now think we can just fix this by choosing the copy
strategy conditionally: when not many tuples have been pruned, copy
the whole tuple area in one big memcpy; when more tuples have been
pruned, do an individual memcpy per surviving tuple into the separate
buffer.
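
Roughly, the shape I have in mind is the sketch below (illustrative
only, not the attached patch: the struct and field names are made up,
and the cutoff discussed further down is just wired in as an example):

#include "postgres.h"
#include "storage/bufpage.h"

/* one entry per surviving tuple (illustrative names) */
typedef struct
{
    uint16      offsetindex;    /* linp array index of this tuple */
    uint16      itemoff;        /* tuple's current offset within the page */
    uint16      alignedlen;     /* MAXALIGN(length of the tuple) */
} itemIdCompactData;

static void
compactify_tuples_sketch(itemIdCompactData *itemidbase, int nitems, Page page)
{
    PageHeader  phdr = (PageHeader) page;
    PGAlignedBlock scratch;     /* page-sized, suitably aligned, on the stack */
    int         upper = phdr->pd_special;
    int         totaltups = PageGetMaxOffsetNumber(page);
    int         i;

    if (nitems < totaltups / 4)
    {
        /* most tuples pruned: copy only the survivors, at their old offsets */
        for (i = 0; i < nitems; i++)
            memcpy(scratch.data + itemidbase[i].itemoff,
                   (char *) page + itemidbase[i].itemoff,
                   itemidbase[i].alignedlen);
    }
    else
    {
        /* few tuples pruned: one big memcpy of the whole tuple area */
        memcpy(scratch.data + phdr->pd_upper,
               (char *) page + phdr->pd_upper,
               phdr->pd_special - phdr->pd_upper);
    }

    /*
     * Lay the surviving tuples back out from pd_special downwards.  Since
     * the source is a separate buffer, the order no longer matters and no
     * sort is needed.
     */
    for (i = 0; i < nitems; i++)
    {
        /* offsetindex is 0-based; line pointer numbers are 1-based */
        ItemId      lp = PageGetItemId(page, itemidbase[i].offsetindex + 1);

        upper -= itemidbase[i].alignedlen;
        memcpy((char *) page + upper,
               scratch.data + itemidbase[i].itemoff,
               itemidbase[i].alignedlen);
        lp->lp_off = upper;
    }

    phdr->pd_upper = (LocationIndex) upper;
}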

I wrote a little .c program to try to figure out if there's some good
cutoff point where one method becomes better than the other. I find
that, generally, if we're pruning away about 75% of the tuples, then
doing a memcpy() per non-pruned tuple is faster; otherwise, it seems
better to just copy the entire tuple area of the page. See the
attached compact_test.c.

I ran this and charted the cutoff at (nitems < totaltups / 4),
(nitems < totaltups / 2), and (nitems < 16):
./compact_test 32 192
./compact_test 64 96
./compact_test 128 48
./compact_test 256 24
./compact_test 512 12

The / 4 one gives me the graphs with the smallest step when the method
switches. See attached 48_tuples_per_page.png for comparison.
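
The attachment isn't reproduced inline, but the measurement is easy
to sketch. Something along these lines (not the actual compact_test.c;
the arguments and output format here are made up) is enough to see the
crossover between the two methods:

/*
 * Rough standalone sketch of the kind of measurement compact_test.c makes.
 * For a given tuple width and tuple count it times copying the whole tuple
 * area in one memcpy versus one memcpy per surviving tuple, for every
 * possible number of survivors.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BLCKSZ  8192
#define LOOPS   200000

int
main(int argc, char **argv)
{
    static char page[BLCKSZ];
    static char scratch[BLCKSZ];
    int         offsets[BLCKSZ / 4];
    int         width,
                ntups;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s tuple_width tuples_per_page\n", argv[0]);
        return 1;
    }
    width = atoi(argv[1]);
    ntups = atoi(argv[2]);

    /* lay the "tuples" out back to back at the end of the page */
    for (int i = 0; i < ntups; i++)
        offsets[i] = BLCKSZ - (i + 1) * width;

    for (int nkeep = 1; nkeep <= ntups; nkeep++)
    {
        clock_t     start;
        double      whole,
                    pertup;

        /* method 1: one big memcpy of the entire tuple area */
        start = clock();
        for (int l = 0; l < LOOPS; l++)
            memcpy(scratch + offsets[ntups - 1],
                   page + offsets[ntups - 1],
                   (size_t) ntups * width);
        whole = (double) (clock() - start) / CLOCKS_PER_SEC;

        /* method 2: one memcpy per surviving tuple */
        start = clock();
        for (int l = 0; l < LOOPS; l++)
            for (int i = 0; i < nkeep; i++)
                memcpy(scratch + offsets[i], page + offsets[i], width);
        pertup = (double) (clock() - start) / CLOCKS_PER_SEC;

        printf("%d\t%.3f\t%.3f\n", nkeep, whole, pertup);
    }

    /* read the buffer so the copies can't be optimized away entirely */
    return scratch[0];
}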

I've so far come up with the attached
compactify_tuples_dgr_v2.patch.txt. Thomas pointed out to me off-list
that PGAlignedBlock is the standard way to allocate page-sized data
on the stack. I'm still classing this patch as PoC grade. I'll
need to look a bit harder at the correctness of it all.
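
(For reference, PGAlignedBlock in src/include/c.h is just a
BLCKSZ-sized char array inside a union that forces suitable alignment,
which is what makes it safe to use as a temporary page buffer on the
stack:)

typedef union PGAlignedBlock
{
    char        data[BLCKSZ];
    double      force_align_d;
    int64       force_align_i64;
} PGAlignedBlock;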

I did spend quite a bit of time trying to find a case where this is
slower than master's version. I can't find a case where there's any
noticeable slowdown. Using the same script from [1] I tried a few
variations of the t1 table by adding an additional column to pad out
the tuple to make it wider. Obviously, a wider tuple means fewer
tuples on the page, and therefore fewer tuples for master's qsort to
sort during compactify_tuples(). I did manage to squeeze a bit more
performance
out of the test cases. Yesterday I got 79.166 seconds. This version
gets me 76.623 seconds.

Here are the results of the various tuple widths:

narrow width row test: insert into t1 select x,0 from
generate_series(1,10000000) x; (32 byte tuples)

patched: 76.623
master: 137.593

medium width row test: insert into t1 select x,0,md5(x::text) ||
md5((x+1)::Text) from generate_series(1,10000000) x; (about 64 byte
tuples)

patched: 64.411
master: 95.576

wide row test: insert into t1 select x,0,(select
string_agg(md5(y::text),'') from generate_Series(x,x+30) y) from
generate_series(1,1000000)x; (1024 byte tuples)

patched: 88.653
master: 90.077

I also changed the test so that, instead of having 10 million rows in
the table and updating a random row 12 million times, I put just 10
rows in the table and updated them 12 million times. This results in
compactify_tuples() pruning all but 1 row (since autovacuum can't keep
up with this, each row does end up on a page by itself). I wanted to
ensure I didn't regress a workload that master's qsort() version would
have done very well at: qsorting 1 element is pretty fast.

10-row narrow test:

patched: 10.633 <--- small regression
master: 10.366

I could special case this and do a memmove without copying the tuple
to another buffer, but I don't think the slowdown is enough to warrant
having such a special case.
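
If it ever did seem worth it, the special case would be tiny;
something like the following (hypothetical, not in the attached patch;
the names follow the sketch further up):

/* hypothetical single-survivor fast path, not in the attached patch */
if (nitems == 1)
{
    ItemId      lp = PageGetItemId(page, itemidbase[0].offsetindex + 1);
    int         newoff = phdr->pd_special - itemidbase[0].alignedlen;

    /* source and destination can overlap, hence memmove, not memcpy */
    memmove((char *) page + newoff,
            (char *) page + itemidbase[0].itemoff,
            itemidbase[0].alignedlen);
    lp->lp_off = newoff;
    phdr->pd_upper = (LocationIndex) newoff;
    return;
}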

Another thing I tried was, instead of compacting the page in
compactify_tuples(), getting rid of that function entirely and doing
the compacting in the existing loop in PageRepairFragmentation().
This does require changing the ERROR check to a PANIC, since we may
have already started shuffling tuples around by the time we find the
corrupted line pointer. However, I was just unable to make this
faster than the attached version. I'm still surprised at that, as it
lets me get rid of the itemidbase array completely. The best run-time
I got with this method on the original test was 86 seconds, so 10
seconds slower than what the attached can do. So I threw that idea
away.
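
For completeness, the discarded approach looked roughly like the
following (again just a sketch of the idea, not the code I actually
benchmarked; it reuses the PGAlignedBlock scratch buffer and omits the
line pointer truncation and hint-bit bookkeeping that
PageRepairFragmentation() also does):

/*
 * Sketch only: compact while walking the line pointer array instead of
 * building itemidbase[] for compactify_tuples().  Because line pointers
 * have already been rewritten by the time a bad one is found, the sanity
 * check has to PANIC rather than ERROR.
 */
static void
repair_fragmentation_inline_sketch(Page page)
{
    PageHeader  phdr = (PageHeader) page;
    PGAlignedBlock scratch;
    int         upper = phdr->pd_special;
    OffsetNumber nline = PageGetMaxOffsetNumber(page);

    for (OffsetNumber i = FirstOffsetNumber; i <= nline; i++)
    {
        ItemId      lp = PageGetItemId(page, i);

        if (ItemIdIsUsed(lp) && ItemIdHasStorage(lp))
        {
            int         itemoff = ItemIdGetOffset(lp);
            int         alignedlen = MAXALIGN(ItemIdGetLength(lp));

            if (itemoff < (int) phdr->pd_lower ||
                itemoff >= (int) phdr->pd_special)
                ereport(PANIC,  /* earlier line pointers already rewritten */
                        (errcode(ERRCODE_DATA_CORRUPTED),
                         errmsg("corrupted line pointer: %d", itemoff)));

            upper -= alignedlen;
            memcpy(scratch.data + upper, (char *) page + itemoff, alignedlen);
            lp->lp_off = upper;
        }
        else
            ItemIdSetUnused(lp);
    }

    /* copy the compacted tuple area back into place */
    memcpy((char *) page + upper, scratch.data + upper,
           phdr->pd_special - upper);
    phdr->pd_upper = (LocationIndex) upper;
}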

David

[1] https://www.postgresql.org/message-id/CAApHDvoKwqAzhiuxEt8jSquPJKDpH8DNUZDFUSX9P7DXrJdc3Q@mail.gmail.com

Attachment Content-Type Size
compact_test.c text/plain 3.8 KB
48_tuples_per_page.png image/png 89.6 KB
compactify_tuples_dgr_v2.patch.txt text/plain 3.8 KB
