Re: Add a greedy join search algorithm to handle large join problems

From: Chengpeng Yan <chengpeng_yan(at)Outlook(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, John Naylor <johncnaylorls(at)gmail(dot)com>
Subject: Re: Add a greedy join search algorithm to handle large join problems
Date: 2025-12-12 09:50:38
Message-ID: 16E7617D-823A-4F24-BB99-D3DCDCCC1CA8@Outlook.com
Lists: pgsql-hackers


> On Dec 10, 2025, at 18:20, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> I can confirm v2 makes it work for planning all 99 TPC-DS queries, i.e.
> there are no more crashes during EXPLAIN.

Thanks a lot for testing this — much appreciated.

> It's also tricky as plan choices depend on GUCs like random_page_cost,
> and if those values are not good enough, the optimizer may still end up
> with a bad plan. I'm not sure what's the best approach.

I agree that many other cost-related parameters can influence the join
order. For now, I’m still running my tests with the default settings,
and I’m not entirely sure how large the impact of those cost parameters
will be. This is something I’ll probably consider at a later stage.

> I did however notice an interesting thing - running EXPLAIN on the 99
> queries (for 3 scales and 0/4 workers, so 6x 99) took this much time:
>
> master: 8s
> master/geqo: 20s
> master/goo: 5s
>
> It's nice that "goo" seems to be faster than "geqo" - assuming the plans
> are comparable or better.

One additional advantage of goo is that its memory usage should be
lower than that of DP or GEQO (though I haven't done any measurements
yet). But I don't think this is the main concern at the moment; I'm
just mentioning it briefly here. What matters more right now is the
plan quality itself.
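
To make the memory argument a bit more concrete, here is a toy sketch
of the greedy idea (standalone C with made-up cardinalities and
selectivities, not the data structures the patch actually uses): it
keeps a single candidate per remaining sub-plan and repeatedly merges
the pair with the smallest estimated join size, so the working set
stays linear in the number of relations, whereas DP has to remember a
joinrel for every relation subset it enumerates.

#include <stdio.h>

#define NREL 4

int
main(void)
{
	/* made-up base relation cardinalities, purely for illustration */
	double		card[NREL] = {1000.0, 50.0, 200000.0, 300.0};

	/*
	 * Made-up pairwise join selectivities (1.0 means "no join clause",
	 * i.e. a cross join).  The matrix is symmetric.
	 */
	double		sel[NREL][NREL] = {
		{1.000, 0.010, 0.001, 1.000},
		{0.010, 1.000, 1.000, 0.050},
		{0.001, 1.000, 1.000, 0.002},
		{1.000, 0.050, 0.002, 1.000},
	};
	int			alive[NREL] = {1, 1, 1, 1};
	int			nalive = NREL;

	/*
	 * Greedy loop: at each step, join the pair of remaining sub-plans
	 * with the smallest estimated output cardinality.  Only one entry
	 * per surviving sub-plan is kept, so memory stays linear in the
	 * number of relations.
	 */
	while (nalive > 1)
	{
		int			besti = -1;
		int			bestj = -1;
		double		bestcard = 0.0;

		for (int i = 0; i < NREL; i++)
		{
			if (!alive[i])
				continue;
			for (int j = i + 1; j < NREL; j++)
			{
				double		rows;

				if (!alive[j])
					continue;
				rows = card[i] * card[j] * sel[i][j];
				if (besti < 0 || rows < bestcard)
				{
					besti = i;
					bestj = j;
					bestcard = rows;
				}
			}
		}

		printf("join {%d} with {%d}: estimated rows %.0f\n",
			   besti, bestj, bestcard);

		/*
		 * Merge bestj into besti: the combined sub-plan keeps slot
		 * besti.  Combining selectivities by multiplication is the
		 * usual independence assumption, again just for illustration.
		 */
		card[besti] = bestcard;
		for (int k = 0; k < NREL; k++)
		{
			if (k == besti || k == bestj)
				continue;
			sel[besti][k] = sel[k][besti] = sel[besti][k] * sel[bestj][k];
		}
		alive[bestj] = 0;
		nalive--;
	}

	return 0;
}

Again, this only shows the shape of the idea; I still need to measure
the actual patch before claiming anything about memory.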

> On Dec 12, 2025, at 01:30, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> A very simple summary of the results is the total duration of the run,
> for all 99 queries combined:
>
> scale  workers  caching  master  master-geqo  master-goo
> ========================================================
>     1        0     cold     816          399        1124
>                    warm     784          369        1097
>              4     cold     797          384        1085
>                    warm     774          366        1069
> --------------------------------------------------------
>    10        0     cold    2760         2653        2340
>                    warm    2580         2470        2177
>              4     cold    2563         2423        1969
>                    warm    2439         2325        1859
>
> This is interesting, and also a bit funny.
>
> The funny part is that geqo seems to do better than master - on scale 1
> it's pretty clear, on scale 10 the difference is much smaller.
>
> The interesting part is that "goo" is doing worse than master (or geqo)
> on scale 1, and better on scale 10. I wonder how it would do on larger
> scales, but I don't have such results.

It’s interesting to see that goo performs better at scale 10. I’ll try
to dig into the results and understand the reasons behind this in
follow-up work.

> It might be interesting to look at some of the queries that got worse,
> and check why. Maybe that'd help you with picking the heuristics?
>
> FWIW I still think no heuristics can be perfect, so getting slower plans
> for some queries should not be treated as "hard failure".

I agree that no set of heuristics can be perfect. The best we can do is
to look at existing workloads and understand why certain heuristics
don’t work there, and then evolve them toward strategies that are more
general and more aligned with the nature of joins themselves. There
doesn’t seem to be a silver bullet here — it really comes down to
analyzing specific SQL queries case by case.

Finally, thank you very much for the testing you did and for all the
insights you shared — it helped a lot.

--
Best regards,
Chengpeng Yan
