From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use generation memory context for tuplestore.c
Date: 2024-05-10 21:56:26
Message-ID: CAApHDvobfoHnWB8mmWm=3b6ZUaGh-Bmh7_fYRXgd2hJZ4xDUAQ@mail.gmail.com

Thanks for having a look at this.

On Sat, 11 May 2024 at 04:34, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> Do I understand correctly, that the efficiency of generation memory
> context could be measured directly via counting number of malloc/free
> calls? In those experiments I've conducted the numbers were indeed
> visibly lower for the patched version (~30%), but I would like to
> confirm my interpretation of this difference.

For the test case I had in the script, I imagine the reduction in
malloc/free calls is just down to the elimination of the power-of-2
round-up. If the test case had resulted in tuplestore_trim() being
used, e.g. if Material were used to allow mark and restore for a Merge
Join or WindowAgg, then tuplestore_trim() would pfree() tuples and
further reduce the number of mallocs for new blocks. That's because
AllocSet records pfree'd non-large chunks on a freelist element and
makes no attempt to free() blocks.

In the tuplestore_trim() scenario with the patched version, the
pfree'ing of chunks happens in first-in-first-out order, which allows
an entire block to become empty of palloc'd chunks, letting it become
the freeblock and be reused rather than malloc'ing another block when
there's not enough space on the current block. If tuplestore_trim()
always manages to keep the number of tuples down to something that
fits on one GenerationBlock, then we'll only malloc() 2 blocks and the
generation code will just alternate between the two, reusing the empty
one when it needs a new block rather than calling malloc.
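
That alternation can be sketched with another toy model (again, not
generation.c itself, just the counting scheme I'm describing, with
invented names and a tiny block size): each block tracks how many of
its chunks are still live, a block that empties out is kept aside as
the freeblock, and a FIFO alloc/free workload ends up ping-ponging
between just two malloc'd blocks.

#include <stdio.h>
#include <stdlib.h>

#define CHUNKS_PER_BLOCK 4      /* keep the model tiny */

typedef struct ToyBlock
{
    int         nallocated;     /* chunks handed out from this block */
    int         nfreed;         /* chunks already pfree'd again */
} ToyBlock;

static ToyBlock *current = NULL;    /* block we're currently filling */
static ToyBlock *freeblock = NULL;  /* one completely-empty block kept aside */
static int      nmalloc = 0;        /* how many blocks we had to malloc */

static ToyBlock *
get_block(void)
{
    if (freeblock != NULL)
    {
        ToyBlock   *blk = freeblock;    /* reuse the empty block */

        freeblock = NULL;
        blk->nallocated = blk->nfreed = 0;
        return blk;
    }
    nmalloc++;
    return calloc(1, sizeof(ToyBlock));
}

/* allocate one chunk; returns the block it lives in */
static ToyBlock *
toy_alloc(void)
{
    if (current == NULL || current->nallocated == CHUNKS_PER_BLOCK)
        current = get_block();
    current->nallocated++;
    return current;
}

/* free one chunk previously allocated from blk */
static void
toy_free(ToyBlock *blk)
{
    blk->nfreed++;
    if (blk != current && blk->nfreed == blk->nallocated)
    {
        if (freeblock == NULL)
            freeblock = blk;    /* keep it around for reuse ... */
        else
            free(blk);          /* ... otherwise really release it */
    }
}

int
main(void)
{
    /* FIFO workload: allocate a chunk, free the oldest one, repeat. */
    ToyBlock   *owner[64];
    int         head = 0,
                tail = 0;

    for (int i = 0; i < 64; i++)
    {
        owner[head++] = toy_alloc();
        if (head - tail > CHUNKS_PER_BLOCK)     /* tuplestore_trim-ish */
            toy_free(owner[tail++]);
    }
    printf("blocks malloc'd: %d\n", nmalloc);   /* stays at 2 */
    return 0;
}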

> > 5. Higher likelihood of neighbouring tuples being stored consecutively
> > in memory, resulting in better CPU memory prefetching.
>
> I guess this roughly translates into better CPU cache utilization.
> Measuring cache hit ratio for unmodified vs patched versions in my case
> indeed shows about 10% less cache misses.

Thanks for testing that. In simple test cases, that's likely to come
from removing the power-of-2 round-up behaviour of AllocSet, allowing
the allocations to be more dense. In more complex cases, where
AllocSet would be making occasional use of chunks from its
freelist[], the Generation context's access pattern will be more
predictable and allow the CPU to prefetch memory more efficiently,
resulting in a better CPU cache hit ratio.

> The query you use in the benchmark, is it something like a "best-case"
> scenario for generation memory context? I was experimenting with
> different size of the b column, lower values seems to produce less
> difference between generation and aset (although still generation
> context is distinctly faster regarding final query latencies, see the
> attached PDF estimation, ran for 8192 rows). I'm curious what could be a
> "worst-case" workload type for this patch?

It's not purposefully "best-case". Likely there'd be more improvement
if I'd scanned the Material node more than twice. However, the tuple
size I picked is close to the best case as it's just over 1024 bytes.
Generation will just round the size up to the next 8-byte alignment
boundary, whereas AllocSet will round that up to 2048 bytes.
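
To put rough numbers on that, assuming a hypothetical tuple needing
1040 bytes (the exact size in my script will differ a little, and this
ignores chunk header overhead):

#include <stddef.h>
#include <stdio.h>

int
main(void)
{
    size_t      tuple = 1040;   /* assumed: just over 1024 bytes */
    size_t      aset = 2048;    /* AllocSet: next power of 2 */
    size_t      gen = (tuple + 7) & ~(size_t) 7;    /* Generation: next
                                                     * 8-byte boundary */

    printf("aset: %zu bytes/tuple, generation: %zu bytes/tuple (%.0f%% saved)\n",
           aset, gen, 100.0 * (aset - gen) / aset);
    return 0;
}

That works out to 2048 vs 1040 bytes per tuple, i.e. close to half the
memory for this tuple width.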

A case where the patched version would do even better is one where
the unpatched version spills to disk but the patched version does not.
I imagine there's room for a speedup of hundreds of percent in that case.

> I've also noticed the first patch disables materialization in some tests.
>
> --- a/src/test/regress/sql/partition_prune.sql
> +++ b/src/test/regress/sql/partition_prune.sql
>
> +set enable_material = 0;
> +
> -- UPDATE on a partition subtree has been seen to have problems.
> insert into ab values (1,2);
> explain (analyze, costs off, summary off, timing off)
>
> Is it due to the volatility of Maximum Storage values? If yes, could it
> be covered with something similar to COSTS OFF instead?

I don't think hiding this under COSTS OFF would be very consistent
with how Sort and Memoize show their memory stats. I opted to disable
the Material node for that plan as it seemed like the easiest way to
make the output stable. There are other ways that could be done; see
explain_memoize() in memoize.sql. I thought about doing that, but it
seemed like overkill when there was a much simpler way to get what I
wanted. I'd certainly live to regret that if disabling Material put
the Nested Loop costs dangerously close to the costs of some other
join method and the plan became unstable on the buildfarm.

David
