Re: Batch TIDs lookup in ambulkdelete

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Batch TIDs lookup in ambulkdelete
Date: 2025-06-06 22:58:02
Message-ID: CAD21AoCGzTu66ZjatOR2yxu6nVkpK7=XBh2reZXiJ819rT-38Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jun 1, 2025 at 11:01 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Thu, May 1, 2025 at 5:36 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > HEAD PATCHED DIFF
> > case-1: 3,021 ms 2.818 ms 93.29%
> > case-2: 5, 697 ms 5.545 ms 97.34%
> > case-3: 2,833 ms 2.790 ms 98.48%
> > case-4: 2,564 ms 2.279 ms 88.86%
> > case-5: 4,657 ms 4.706 ms 101.04%
>
> 1 and 4 look significant -- do the other cases have reproducible
> differences or is it just noise?

These results are the average of 3 times executions, so they are
reproducible differences.

>
> > Here are the summary of each attached patch:
> >
> > 0001: Introduce TIdStoreIsMemberMulti() where we can do IsMember check
> > for multiple TIDs in one function call. If the given TIDs are sorted
> > (at least in block number), we can save radix tree lookup for the same
> > page entry.
>
> My only comment is that TidStoreIsMember() is now unused in core (or
> maybe just the tests?). It seems like we could just change the API for
> it rather than introduce a new function?

Good point, changed in the latest patch I posted[1].

>
> > 0003: Use batch TIDs lookup in btree index bulk-deletion.
> >
> > In patch 0003, we implement batch TID lookups for both each
> > deduplicated index tuple and remaining all regular index tuples, which
> > seems to be the most straightforward approach.
>
> Seems like a good approach. btvacuumpage() needs to sort if there is a
> mix of posting tuples and regular index tuples. Was that covered by
> any of the tests above?

Good point, I think this case was not covered. I've measured the
performance with the following queries:

# Case-6:
create unlogged table test (c int) with (autovacuum_enabled = off);
insert into test select (2 * c - 1) from generate_series(1, ${NROWS}) c;
insert into test select c from generate_series(1, ${NROWS}) c;
create index on test (c);
select pg_relation_size('test') / 8192 as relpages \gset
delete from test where c < (${NROWS} * 0.3)::int
vacuum test;

And here is the result:

HEAD PATCHED DIFF
case-6: 3,320 ms 3,617 ms 108.94%

I'll consider how to deal with overheads of sorting TIDs.

> > While further
> > optimizations are possible, such as performing batch TID lookups for
> > all index tuples on a single page, these could introduce additional
> > overhead from sorting and re-sorting TIDs. Moreover, considering that
> > radix tree lookups are relatively inexpensive, the benefits of sorting
> > TIDs and using TidStoreIsMemberMulti() might be minimal. Nevertheless,
> > these potential optimizations warrant further evaluation to determine
> > their actual impact on performance.
>
> My guess is that always sorting by TID and than back by index tuple
> offset is too much overhead to be worth it, but I'm not sure.

Agreed. Given the above test results, it's unlikely always sorting the
array helps speedups.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAv55DhJ%2B19zaemx-_eO7z%2Bu4gtFmeADsMBFqtHhyUySQ%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2025-06-06 23:27:46 Re: Batch TIDs lookup in ambulkdelete
Previous Message Jeff Davis 2025-06-06 22:47:16 Re: CREATE DATABASE command for non-libc providers