Re: Yet another abort-early plan disaster on 9.3

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 02:12:00
Message-ID: 542A1170.2010600@archidevsys.co.nz
Lists: pgsql-hackers pgsql-performance

On 30/09/14 12:00, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> The way I'm seeing it, you can't assume the LIMIT will apply to any
>> IndexScan that doesn't have an index condition. If it has just a
>> filter, or nothing at all, just an ordering then it could easily scan
>> the whole index if the stats are wrong.
> That statement applies with equal force to *any* plan with a LIMIT;
> it's not just index scans.
>
> The real question is to what extent are the tuples satisfying the extra
> filter condition randomly distributed with respect to the index order
> (or physical order, if it's a seqscan). The existing cost estimation
> code effectively assumes that they're perfectly uniformly distributed;
> which is a good average-case assumption but can be horribly wrong in
> the worst case.
>
> If we could settle on some other model for the probable distribution
> of the matching tuples, we could adjust the cost estimates for LIMIT
> accordingly. I have not enough statistics background to know what a
> realistic alternative would be.
>
> Another possibility is to still assume a uniform distribution but estimate
> for, say, a 90% probability instead of 50% probability that we'll find
> enough tuples after scanning X amount of the table. Again, I'm not too
> sure what that translates to in terms of the actual math, but it sounds
> like something a statistics person could do in their sleep.
>
> I do not think we should estimate for the worst case though. If we do,
> we'll hear cries of anguish from a lot of people, including many of the
> same ones complaining now, because the planner stopped picking fast-start
> plans even for cases where they are orders of magnitude faster than the
> alternatives.
>
> regards, tom lane
>
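To put the 50% vs 90% suggestion in concrete terms, here is a rough back-of-the-envelope sketch (in Python with scipy, modelling the number of matching tuples seen in a scanned prefix as hypergeometric; purely illustrative, with made-up numbers, and obviously not the planner's actual cost model):

    from scipy.stats import hypergeom

    def rows_to_scan(total_rows, matching_rows, limit, prob):
        """Smallest prefix X with P(>= limit matches in first X rows) >= prob,
        assuming the matching rows are uniformly scattered."""
        lo, hi = limit, total_rows
        while lo < hi:
            mid = (lo + hi) // 2
            # P(at least `limit` matches among the first `mid` rows scanned)
            if hypergeom.sf(limit - 1, total_rows, matching_rows, mid) >= prob:
                hi = mid
            else:
                lo = mid + 1
        return lo

    N, M, limit = 10_000_000, 10_000, 20   # 0.1% selectivity, LIMIT 20 (made-up)
    for p in (0.5, 0.9, 0.99):
        print("P=%.2f: scan about %d rows" % (p, rows_to_scan(N, M, limit, p)))

The 50% point lands near the familiar limit/selectivity estimate; the 90% and 99% points are only modestly deeper when the uniform assumption actually holds.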

If you analyzed the tables in most production databases, you would find that the data in them is almost invariably not uniformly distributed.

Most likely the values will be clumped: if you plotted the frequency of values of a given column falling in a given range against block number, you would likely see one or more distinct peaks. If the table has been CLUSTERed on that column, then the matching rows will very likely be in one clump spanning contiguous blocks.

I suspect that there are two distinct populations: one of values present before the last VACUUM, and another of values added since.

There are so many factors to consider (pattern of CRUD operations, range of values in the query, ...) that very sophisticated approaches are probably ruled out, but I would be happy to be proved wrong!

Though I am fairly confident that if the distribution is not known in advance for a given table, the fraction of the table that has to be processed to satisfy the LIMIT is likely to be a lot larger than a uniform distribution would suggest.
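As a toy illustration of how far off the uniform assumption can be when the matching rows are clumped (a throwaway simulation with made-up numbers, not a claim about any particular workload):

    import random

    def rows_scanned(match_positions, total_rows, limit):
        """Rows read, in index/physical order, before `limit` matches are seen."""
        matches = sorted(match_positions)
        return matches[limit - 1] + 1 if len(matches) >= limit else total_rows

    N, M, limit = 1_000_000, 1_000, 20
    random.seed(1)

    uniform = random.sample(range(N), M)                  # matches scattered everywhere
    clumped = random.sample(range(int(0.9 * N), N), M)    # all matches in the last 10%

    print("uniform: ~%d rows" % rows_scanned(uniform, N, limit))  # roughly limit*N/M = 20,000
    print("clumped: ~%d rows" % rows_scanned(clumped, N, limit))  # over 900,000

Same table, same selectivity, but the abort-early scan reads almost the whole table in the clumped case.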

Would it be feasible to get a competent statistician to advise what data to collect, and to analyze it? Maybe it is possible to get a better estimate of how much of a table needs to be scanned, based on some fairly simple statistics. But unless that research is done, it is probably impossible to determine what statistics might be useful, and how effective a better estimate could be.

I have a nasty feeling that assuming a uniform distribution may still
end up being the best we can do - but I may be being unduly pessimistic!

Cheers,
Gavin
