Re: sqlsmith crash incremental sort

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sqlsmith crash incremental sort
Date: 2020-04-22 22:59:19
Message-ID: 20200422225919.ucs5xnlgnr5ekhod@development
Lists: pgsql-hackers

On Mon, Apr 20, 2020 at 01:47:29AM +0200, Tomas Vondra wrote:
>On Sat, Apr 18, 2020 at 02:23:25PM -0400, James Coleman wrote:
>>On Thu, Apr 16, 2020 at 9:26 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>
>>>Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
>>>> I think we have essentially three options:
>>>> 1) assuming there's just a single group
>>>> 2) assuming each row is a separate group
>>>> 3) something in between
>>>> If (1) and (2) are worst/best-case scenarios, maybe we should pick
>>>> something in between. We have DEFAULT_NUM_DISTINCT (200) which
>>>> essentially says "we don't know what the number of groups is" so maybe
>>>> we should use that.
>>>
>>>I wouldn't recommend picking either the best or worst cases.
>>>
>>>Possibly DEFAULT_NUM_DISTINCT is a sane choice, though it's fair to
>>>wonder if it's quite applicable to the case where we already know
>>>we've grouped by some columns.
>>
>>Do you think defining a new default, say,
>>DEFAULT_NUM_DISTINCT_PRESORTED is preferred then? And choose some
>>value like "1/2 of the normal DEFAULT_NUM_DISTINCT groups" or some
>>such?
>>
>
>If we had a better intuition what a better value is, maybe. But I don't
>think we have that at all, so I'd just use the existing one.
>

I've pushed a fix using DEFAULT_NUM_DISTINCT. The input comes from a
set operation (which is where we call generate_append_tlist), so it's
probably fairly unique and maybe we could use input_tuples. But that's
not guaranteed, so DEFAULT_NUM_DISTINCT seems reasonably defensive.

One detail I've changed is that instead of matching the expression
directly to a Var, it now calls pull_varnos() to also detect Vars nested
deeper in the expression. Looking at examine_variable(), it calls
find_base_rel() in that case too, but I haven't tried constructing a
query that triggers the issue.
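
For illustration, here's a rough sketch of what the costing logic looks
like with this fix (not the committed patch verbatim - the unknown_varno
flag name is mine, and root, pathkeys, presorted_keys, input_tuples and
input_groups are assumed to come from the surrounding
cost_incremental_sort() context):

    List       *presortedExprs = NIL;
    ListCell   *l;
    bool        unknown_varno = false;
    int         i = 0;

    foreach(l, pathkeys)
    {
        PathKey    *key = (PathKey *) lfirst(l);
        EquivalenceMember *member = (EquivalenceMember *)
            linitial(key->pk_eclass->ec_members);

        /* pull_varnos() also finds Vars nested deeper in the expression */
        if (bms_is_member(0, pull_varnos((Node *) member->em_expr)))
        {
            unknown_varno = true;
            break;
        }

        presortedExprs = lappend(presortedExprs, member->em_expr);

        i++;
        if (i >= presorted_keys)
            break;
    }

    /*
     * Estimate the number of groups with equal presorted keys. If any
     * expression references a Var with varno 0 (which examine_variable()
     * can't handle), fall back to DEFAULT_NUM_DISTINCT.
     */
    if (!unknown_varno)
        input_groups = estimate_num_groups(root, presortedExprs,
                                           input_tuples, NULL);
    else
        input_groups = DEFAULT_NUM_DISTINCT;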

One improvement I can think of is handling lists where only some of the
expressions contain varno 0. We could still call estimate_num_groups()
for the expressions with varno != 0, and multiply that by the estimate
for the other part (be it DEFAULT_NUM_DISTINCT). This might produce a
higher estimate than just using DEFAULT_NUM_DISTINCT directly, resulting
in a lower incremental sort cost. But it's not clear to me whether this
can even happen - AFAICS either all Vars have varno 0 or none do, so I
haven't done this.
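
If it turned out to be possible, I imagine it would boil down to
something like this (hypothetical, not implemented - knownExprs would be
the subset of presorted expressions without varno-0 Vars, and
has_unknown_exprs would say whether we skipped any):

    /*
     * Hypothetical mixed-list estimate: estimate groups for the
     * expressions we can handle and scale by DEFAULT_NUM_DISTINCT for
     * the rest, clamping to the number of input tuples.
     */
    if (knownExprs != NIL)
        input_groups = estimate_num_groups(root, knownExprs,
                                           input_tuples, NULL);
    else
        input_groups = 1.0;

    if (has_unknown_exprs)
        input_groups = Min(input_groups * DEFAULT_NUM_DISTINCT,
                           input_tuples);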

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
