Re: sqlsmith crash incremental sort

From: James Coleman <jtc331(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sqlsmith crash incremental sort
Date: 2020-04-17 01:02:39
Message-ID: CAAaqYe8wcgS8++9_n-Ci+ugQPmCDmCPHu+5aisJk9DfYKxzNcg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 16, 2020 at 8:54 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Wed, Apr 15, 2020 at 11:26:12AM -0400, James Coleman wrote:
> >On Wed, Apr 15, 2020 at 10:47 AM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >>
> >> ...
> >>
> >> Yeah. And I'm not even sure having that information would allow good
> >> estimates e.g. for UNIONs of multiple relations etc.
> >>
> >> >> Another option is to use something as simple as checking for Vars with
> >> >> varno==0 in cost_incremental_sort() and ignoring them somehow. We could
> >> >> simply use some arbitrary estimate - by assuming the rows are unique or
> >> >> something like that. Yes, I agree it's pretty ugly and I'd much rather
> >> >> find a way to generate something sensible, but I'm not even sure we can
> >> >> generate good estimate when doing UNION of data from different relations
> >> >> and so on. The attached (ugly) patch does this.
> >> >
> >> >...therefore I think this is worth proceeding with.
> >> >
> >>
> >> OK, then the question is what estimate to use in this case. Should we
> >> assume 1 group or uniqueness? I'd assume a single group produces costs
> >> slightly above regular sort, right?
> >
> >Originally I'd intuitively leaned towards assuming they were unique.
> >But that would be the best case for memory/disk space usage, for
> >example, and the costing for incremental sort is always (at least
> >mildly) higher than regular sort if the number of groups is 1. That
> >also guarantees the startup cost is higher than for a regular sort.
> >
> >So I think that with a number-of-groups estimate of 1, we just
> >wouldn't ever choose an incremental sort in this case.
> >
> >Maybe that's the right choice? It'd certainly be the conservative
> >choice. What are your thoughts on the trade-offs there?
> >
>
> I think we have essentially three options:
>
> 1) assuming there's just a single group
>
> This should produce a cost estimate higher than plain sort, disabling
> incremental sort. I'd say this is the "worst case" assumption. I think this
> might be overly pessimistic, though.
>
> 2) assuming each row is a separate group
>
> If (1) is worst case scenario, then this is probably the best case,
> particularly when the query is sensitive to startup cost.
>
>
> 3) something in between
>
> If (1) and (2) are worst/best-case scenarios, maybe we should pick
> something in between. We have DEFAULT_NUM_DISTINCT (200) which
> essentially says "we don't know what the number of groups is" so maybe
> we should use that. Another option would be nrows/10, which is the cap
> we use in estimate_num_groups without extended stats.
>
>
> I was leaning towards (1), as the "worst case" choice seems natural to
> prevent possible regressions. But consider this:
>
> create table small (i int);
> create table large (i int);
>
> insert into small select mod(i,10) from generate_series(1,100) s(i);
> insert into large select mod(i,10) from generate_series(1,100000) s(i);
>
> analyze small;
> analyze large;
>
> explain select i from large union select i from small;
>
> QUERY PLAN
> -------------------------------------------------------------------------------
>  Unique  (cost=11260.35..11760.85 rows=100100 width=4)
>    ->  Sort  (cost=11260.35..11510.60 rows=100100 width=4)
>          Sort Key: large.i
>          ->  Append  (cost=0.00..2946.50 rows=100100 width=4)
>                ->  Seq Scan on large  (cost=0.00..1443.00 rows=100000 width=4)
>                ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=4)
> (6 rows)
>
> The estimate of the number of groups is clearly bogus - we know there are
> only 10 groups in each relation, but here we end up with 100100.
>
> So perhaps we should do (2) to keep the behavior consistent?

First of all, I agree that we shouldn't (at this point in the cycle)
try to apply pathkey reordering etc. We already have ideas about
additional ways to apply and improve usage of incremental sort, and it
seems like this naturally fits into that list.

Second of all, I like the idea of keeping it consistent (even if
consistency boils down to "we're just guessing").
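
To make that concrete, here's a rough sketch of how I'm picturing the
fallback in cost_incremental_sort() (just an illustration, not the
patch you attached): skip any presorted-key expression that references
varno 0, and if nothing usable is left, assume each input row is its
own group, i.e. option (2):

/*
 * Sketch only -- this would live inside cost_incremental_sort(), using
 * its existing parameters (root, pathkeys, presorted_keys, input_tuples).
 */
List       *presortedExprs = NIL;
double      input_groups;
ListCell   *l;
int         i = 0;

foreach(l, pathkeys)
{
    PathKey            *key = (PathKey *) lfirst(l);
    EquivalenceMember  *member = (EquivalenceMember *)
        linitial(key->pk_eclass->ec_members);

    /* ignore expressions referencing the bogus varno 0 */
    if (!bms_is_member(0, pull_varnos((Node *) member->em_expr)))
        presortedExprs = lappend(presortedExprs, member->em_expr);

    if (++i >= presorted_keys)
        break;
}

if (presortedExprs != NIL)
    input_groups = estimate_num_groups(root, presortedExprs,
                                       input_tuples, NULL);
else
    input_groups = input_tuples;    /* option (2): one group per row */

That would keep the estimate consistent with the (over)estimated row
counts in the UNION plan above, and it only changes behavior in cases
where estimate_num_groups() would otherwise see the bogus Vars.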

James
