Re: [HACKERS] CLUSTER command progress monitor

From: Antonin Houska <ah(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuro Yamada <yamada(dot)tatsuro(at)lab(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] CLUSTER command progress monitor
Date: 2017-11-21 09:14:11
Message-ID: 18222.1511255651@localhost
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Antonin Houska <ah(at)cybertec(dot)at> writes:
> > Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> These two phases overlap, though. I believe progress reporting for
> >> sorts is really hard.
>
> > Whatever complexity is hidden in the sort, cost_sort() should have taken it
> > into consideration when called via plan_cluster_use_sort(). Thus I think that
> > once we have both startup and total cost, the current progress of the sort
> > stage can be estimated from the current number of input and output
> > rows. Please remind me if my proposal appears to be too simplistic.
>
> Well, even if you assume that the planner's cost model omits nothing
> (which I wouldn't bet on), its result is only going to be as good as the
> planner's estimate of the number of rows to be sorted. And, in cases
> where people actually care about progress monitoring, it's likely that
> the planner got that wrong, maybe horribly so. I think it's a bad idea
> for progress monitoring to depend on the planner's estimates in any way
> whatsoever.

The general idea was that some prediction of the total cost is needed anyway
if we are to tell, during execution, what fraction of the work has already
been done. Also, the cost computation that we perform during execution should
ideally not differ from cost_sort(). So I thought that it would be easier to
refine cost_sort() than to implement the same computation from scratch
elsewhere.
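
To make the idea concrete, here is a rough sketch in Python (for brevity; it
is not PostgreSQL code). The function name sort_progress and all of its
parameters are hypothetical. It assumes that startup_cost and total_cost are
the two values produced by the cost model, that the startup cost is spent
evenly over the input rows, and that the remaining run cost is spent evenly
over the output rows. A real sort is not linear in its input, so this is only
an approximation under those assumptions.

```python
def sort_progress(startup_cost, total_cost, rows_in, rows_out, est_rows):
    """Estimate the fraction (0.0 to 1.0) of the sort stage completed.

    Assumption (not from PostgreSQL): startup cost is attributed linearly
    to rows consumed, run cost linearly to rows emitted.
    """
    if total_cost <= 0 or est_rows <= 0:
        return 0.0

    # Cost of returning the sorted rows once the sort itself is finished.
    run_cost = total_cost - startup_cost

    # Clamp the fractions: the row estimate may turn out to be too low.
    in_frac = min(rows_in / est_rows, 1.0)
    out_frac = min(rows_out / est_rows, 1.0)

    done = startup_cost * in_frac + run_cost * out_frac
    return min(done / total_cost, 1.0)
```

Clamping keeps the reported value within range when est_rows is too low, but
it does not make the estimate any more accurate; that depends entirely on the
quality of est_rows.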

Besides that, I see two circumstances that make the estimate of the number of
input tuples simpler in the CLUSTER case:

* There's only one input relation, without any kind of clause.

* CLUSTER uses SnapshotAny, so pg_class.reltuples is closer to the actual
number of input rows than it would be in the general case. (Of course,
pg_class would only be useful for the initial estimate.)

Unlike the planner, the executor could recalculate the cost estimate at some
point(s), once it recognizes that the actual number of tuples per page differs
from the density initially derived from pg_class. Still wrong?
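
The re-estimation could look roughly like the following sketch (again Python,
hypothetical names, not PostgreSQL code). It assumes the scan knows how many
pages it has read and how many tuples it has seen so far, and it simply
extrapolates the observed density to the pages not yet scanned. Before any
page has been read, it falls back to the reltuples value from pg_class.

```python
def estimate_total_tuples(reltuples, relpages, pages_scanned, tuples_seen):
    """Return a revised estimate of the total number of input tuples.

    Assumption: tuple density on the unscanned pages matches the density
    observed on the pages scanned so far.
    """
    if pages_scanned <= 0 or relpages <= 0:
        # Nothing observed yet: use the initial estimate from pg_class.
        return float(reltuples)

    observed_density = tuples_seen / pages_scanned
    remaining_pages = max(relpages - pages_scanned, 0)

    # Tuples already counted plus the extrapolation for the remaining pages.
    return tuples_seen + observed_density * remaining_pages
```

The revised count could then be fed back into the cost computation to update
the total cost against which progress is reported.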

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
