Re: Todo: Teach planner to evaluate multiple windows in the optimal order

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Ankit Kumar Pandey <itsankitkp(at)gmail(dot)com>, pghackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Vik Fearing <vik(at)postgresfriends(dot)org>
Subject: Re: Todo: Teach planner to evaluate multiple windows in the optimal order
Date: 2023-02-16 03:02:44
Message-ID: CAApHDvp20eyP-thT7HiuoNZmmTa4b9sw2oefqAbKjDAKYxqsMg@mail.gmail.com
Lists: pgsql-hackers

On Thu, 16 Feb 2023 at 00:45, John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> Okay, here's a rerun including the sort hack, and it looks like incremental sort is only ahead with the smallest set, otherwise same or maybe slightly slower:
>
> HEAD:
>
> 4 ^ 8: latency average = 113.461 ms
> 5 ^ 8: latency average = 786.080 ms
> 6 ^ 8: latency average = 3948.388 ms
> 7 ^ 8: latency average = 15733.348 ms
>
> tiebreaker:
>
> 4 ^ 8: latency average = 106.556 ms
> 5 ^ 8: latency average = 734.834 ms
> 6 ^ 8: latency average = 3640.507 ms
> 7 ^ 8: latency average = 14470.199 ms
>
> tiebreaker + incr sort hack:
>
> 4 ^ 8: latency average = 93.998 ms
> 5 ^ 8: latency average = 740.120 ms
> 6 ^ 8: latency average = 3715.942 ms
> 7 ^ 8: latency average = 14749.323 ms

Sad news :( the sort hacks are still quite a bit faster for 4 ^ 8.

I was fooling around with the attached (very quickly and crudely put
together) patch just there. The idea is to sort during
tuplesort_puttuple_common() whenever memory consumption crosses some
approximation of the L3 cache size, in the hope that the cache lines
for the tuples in question are still warm. The code there is not
ideal, as we track availMem rather than the memory actually used, so
we could cross a boundary without having added very much; we also
USEMEM for other reasons, so maybe I need to do that tracking better.
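As a toy model of what the patch does (the class, method names, and
threshold here are invented for illustration; the real code works on
SortTuples and availMem in tuplesort.c):

```python
L3_SIZE = 64  # stand-in for an approximation of L3 cache size, in "bytes"

class IncrementalSorter:
    """Sort the newly-added tuples each time accumulated memory
    crosses the threshold, while their cache lines are hopefully
    still warm; the final sort still re-sorts the whole array."""

    def __init__(self):
        self.memtuples = []
        self.memtupsortedto = 0  # index up to which the array is sorted
        self.mem_since_sort = 0

    def puttuple(self, tup, width):
        self.memtuples.append(tup)
        self.mem_since_sort += width
        if self.mem_since_sort >= L3_SIZE:
            # sort only the unsorted tail, as in the patch
            self.memtuples[self.memtupsortedto:] = sorted(
                self.memtuples[self.memtupsortedto:])
            self.memtupsortedto = len(self.memtuples)
            self.mem_since_sort = 0

    def performsort(self):
        # like the attached patch: the final sort has no knowledge
        # of the pre-sorted runs and sorts everything again
        self.memtuples.sort()
        return self.memtuples
```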

I found that the patch didn't really help:

create table t (a int not null, b int not null);
insert into t select (x*random())::int % 100, (x*random())::int % 100
from generate_series(1,10000000) x;
vacuum freeze t;
select pg_prewarm('t');
show work_mem;
work_mem
----------
4GB

explain (analyze, timing off) select * from t order by a,b;

master:
Execution Time: 5620.704 ms
Execution Time: 5506.705 ms

patched:
Execution Time: 6801.421 ms
Execution Time: 6762.130 ms

I suspect it's slower because the final sort must still sort the
entire array without any knowledge that portions of it are
pre-sorted. It would be very interesting to improve this with some
additional work: keep track of the "memtupsortedto" indexes by
pushing them onto a List each time we cross the availMem boundary,
then in tuplesort_performsort() qsort just the final portion of the
array and do a k-way merge of the segments rather than qsorting the
entire thing again. I suspect this would be faster when work_mem
exceeds L3 by some large amount.
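A quick Python sketch of that improved scheme (function name and run
size are invented for illustration): sort each run as it fills, then
k-way merge the presorted runs instead of re-sorting everything:

```python
import heapq

def sort_with_run_merge(tuples, run_size):
    # Stand-in for the proposed improvement: each time we cross the
    # memory boundary we remember where the sorted run ends, and the
    # final step k-way merges the presorted runs rather than qsorting
    # the whole array again.
    runs = [sorted(tuples[i:i + run_size])
            for i in range(0, len(tuples), run_size)]
    return list(heapq.merge(*runs))
```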

David

Attachment Content-Type Size
tuplesort_cache_aware_hacks.patch text/plain 5.5 KB
