Re: Use virtual tuple slot for Unique node

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Денис Смирнов <darthunix(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use virtual tuple slot for Unique node
Date: 2023-10-25 09:48:01
Message-ID: CAExHW5vLUg=OG-QWv+3t1cSwPj9_iDS9n3kyt7bQyvpj+M8oJQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 24, 2023 at 4:30 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
> On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat
> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> > I ran my experiments again. It seems on my machine the execution times
> > do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took
> > average of execution times. I did this three times. For each run the
> > standard deviation was within 2%. Here are the numbers
> > master: 13548.33, 13878.88, 14572.52
> > master + 0001: 13734.58, 14193.83, 14574.73
> >
> > So for me, I would say, this particular query performs the same with
> > or without patch.
>
> I'm not really sure which of the 7 queries you're referring to here.
> The times you're quoting seem to align best to Q7 from your previous
> results, so I'll assume you mean Q7.
>
> I'm not really concerned with Q7 as both patched and unpatched use
> TTSOpsMinimalTuple.

It's Q7, yes. I was responding to your statement: "However, Q7 does
seem to be above noise level faster and I'm not sure why." Anyway, we
can set that aside.

>
> I also think you need to shrink the size of your benchmark down. With
> 1 million tuples, you're more likely to be also measuring the time it
> takes to get cache lines from memory into the CPU. A smaller scale
> test will make this less likely. Also, you'd be better timing SELECT
> rather than the time it takes to EXPLAIN ANALYZE. They're not the same
> thing. EXPLAIN ANALYZE has additional timing going on and we may end
> up not de-toasting toasted Datums.

I ran experiments with 10K rows and measured timing using \timing in
psql. The measurements are much more flaky than with a larger row count
and EXPLAIN ANALYZE, but I think your observations still hold.

>
> > On Thu, Oct 19, 2023 at 4:26 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> > > We can see that Q2 and Q3 become a bit slower. This makes sense as
> > > tts_virtual_materialize() is quite a bit more complex than
> > > heap_copy_minimal_tuple() which is a simple palloc/memcpy.
> > >
> >
> > If the source slot is a materialized virtual slot,
> > tts_virtual_copyslot() could perform a memcpy of the materialized data
> > itself rather than materialising from datums. That might be more
> > efficient.
>
> I think you're talking about just performing a memcpy() of the
> VirtualTupleTableSlot->data field. Unfortunately, you'd not be able
> to do just that as you'd also need to repoint the non-byval Datums in
> tts_values at the newly memcpy'd memory. If you skipped that part,
> those would remain pointing to the original memory. If that memory
> goes away, then bad things will happen. I think you'd still need to do
> the 2nd loop in tts_virtual_materialize()

Yes, we will need to repoint non-byval Datums, of course.
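
To be concrete, that repointing is what the second loop in
tts_virtual_materialize() already does today; roughly like this
(simplified, with the expanded-object branch omitted):

    /* second loop of tts_virtual_materialize(), simplified */
    for (int natt = 0; natt < desc->natts; natt++)
    {
        Form_pg_attribute att = TupleDescAttr(desc, natt);
        Datum       val;
        Size        data_length;

        if (att->attbyval || slot->tts_isnull[natt])
            continue;

        val = slot->tts_values[natt];

        data = (char *) att_align_nominal(data, att->attalign);
        data_length = att_addlength_datum(0, att->attlen, val);
        memcpy(data, DatumGetPointer(val), data_length);

        /*
         * Without this assignment the datum would keep pointing at the
         * source memory, which may go away.
         */
        slot->tts_values[natt] = PointerGetDatum(data);
        data += data_length;
    }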

>
> > Maybe we should fix the above-mentioned inefficiency in
> > tts_virtual_copyslot()?
>
> I don't have any bright ideas on how to make tts_virtual_materialize()
> itself faster. If there were some way to remember !attbyval
> attributes for the 2nd loop, that might be good, but creating
> somewhere to store that might result in further overheads.

We could save the size of the data in VirtualTupleTableSlot, thus
avoiding the first loop. I assume that when allocating
VirtualTupleTableSlot->data we always know the size we are allocating,
so it should just be a matter of saving it in a
VirtualTupleTableSlot->size field. That would avoid the first (sizing)
loop in tts_virtual_materialize() and give some speedup. We would still
need a loop to repoint the non-byval Datums, but each pointer can be
computed as dest_slot->tts_values[natt] = dest_vslot->data +
(src_slot->tts_values[natt] - src_vslot->data). This works as long as
all the non-byval datums in the source slot are stored flattened in the
source slot's data, which I am assuming is true for a materialized
virtual slot. The byval datums are copied as is. I think this would
replace the multiple memcpy calls, one per non-byval attribute, with a
single memcpy of the whole buffer and hence give some speedup. I may be
wrong though. A rough sketch of what I have in mind is below.
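
Rough sketch, assuming the source slot is an already-materialized
virtual slot whose non-byval datums all point into src_vslot->data; the
"size" field and the function name are hypothetical (neither exists
today), the destination slot is assumed to be cleared and to have
matching tts_values/tts_isnull arrays, and expanded-datum handling
would need the usual care:

    static void
    tts_virtual_copy_materialized(TupleTableSlot *dstslot,
                                  TupleTableSlot *srcslot)
    {
        VirtualTupleTableSlot *src_vslot = (VirtualTupleTableSlot *) srcslot;
        VirtualTupleTableSlot *dst_vslot = (VirtualTupleTableSlot *) dstslot;
        TupleDesc   srcdesc = srcslot->tts_tupleDescriptor;

        /*
         * One allocation and one memcpy replace the sizing loop and the
         * per-attribute copies.
         */
        dst_vslot->data = MemoryContextAlloc(dstslot->tts_mcxt,
                                             src_vslot->size);
        memcpy(dst_vslot->data, src_vslot->data, src_vslot->size);
        dst_vslot->size = src_vslot->size;   /* hypothetical new field */
        dstslot->tts_flags |= TTS_FLAG_SHOULDFREE;

        for (int natt = 0; natt < srcdesc->natts; natt++)
        {
            Form_pg_attribute att = TupleDescAttr(srcdesc, natt);

            dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];

            if (att->attbyval || srcslot->tts_isnull[natt])
            {
                /* byval and null datums are copied verbatim */
                dstslot->tts_values[natt] = srcslot->tts_values[natt];
            }
            else
            {
                /* repoint at the same offset within the new buffer */
                Size    off = (char *) DatumGetPointer(srcslot->tts_values[natt]) -
                    src_vslot->data;

                dstslot->tts_values[natt] =
                    PointerGetDatum(dst_vslot->data + off);
            }
        }

        dstslot->tts_nvalid = srcdesc->natts;
        dstslot->tts_flags &= ~TTS_FLAG_EMPTY;
    }

The repointing loop stays, but the sizing loop and the per-attribute
memcpy calls go away.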

--
Best Wishes,
Ashutosh Bapat
