Re: progress report for ANALYZE

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tatsuro Yamada <tatsuro(dot)yamada(dot)tf(at)nttcom(dot)co(dot)jp>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Anthony Nowocien <anowocien(at)gmail(dot)com>
Subject: Re: progress report for ANALYZE
Date: 2019-08-01 17:48:06
Message-ID: CA+Tgmoaq0aWMw6zbNfcGjR8jxD94Ad_Ns_u3mQqJJL7xJhw9AA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 1, 2019 at 4:45 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Tue, Jul 23, 2019 at 4:51 PM Tatsuro Yamada
> <tatsuro(dot)yamada(dot)tf(at)nttcom(dot)co(dot)jp> wrote:
> > Attached v4 patch file only includes this fix.
>
> I've moved this to the September CF, where it is in "Needs review" state.

/me reviews.

+ <entry><structfield>scanning_table</structfield></entry>

I think this should be retitled to something that ends in 'relid',
like all of the corresponding cases in existing progress views.
Perhaps 'active_relid' or 'current_relid'.

+ The command is computing extended stats from the samples
obtained in the previous phase.

I think you should change this (and the previous one) to say "from the
samples obtained during the table scan."

+ Total number of heap blocks in the scanning_table.

Perhaps I'm confused, but it looks to me like what you are advertising
is the number of blocks that will be sampled, not the total number of
blocks in the table. I think that's the right thing to advertise, but
then the column should be named and documented that way.

+ {
+ const int index[] = {
+ PROGRESS_ANALYZE_TOTAL_BLOCKS,
+ PROGRESS_ANALYZE_SCANREL
+ };
+ const int64 val[] = {
+ nblocks,
+ RelationGetRelid(onerel)
+ };
+
+ pgstat_progress_update_multi_param(2, index, val);
+ }

This block seems to be introduced just so you can declare variables; I
don't think that's good style. It's arguably unnecessary because we
now are selectively allowing variable declarations within functions,
but I think you should just move the first array to the top of the
function and the second declaration to the top of the function
dropping const, and then just do val[0] = nblocks and val[1] =
RelationGetRelid(onerel). Maybe you can also come up with better
names than 'index' and 'val'. Same comment applies to another place
where you have something similar.

Patch seems to need minor rebasing.

Maybe "scanning table" should be renamed "acquiring sample rows," to
match the names used in the code?

I'm not a fan of the way you set the scan-table phase and inh flag in
one place, and then slightly later set the relation OID and block
count. That creates a race during which users could see the first bit
of data set and the second not set. I don't see any reason not to set
all four fields together.

Please be sure to make the names of the constants you use match up
with the external names as far as it reasonably makes sense, e.g.

+#define PROGRESS_ANALYZE_PHASE_SCAN_TABLE 1
+#define PROGRESS_ANALYZE_PHASE_COMPUTING 2
+#define PROGRESS_ANALYZE_PHASE_COMPUTING_EXTENDED 3
+#define PROGRESS_ANALYZE_PHASE_FINALIZE 4

vs.

+ WHEN 0 THEN 'initializing'::text
+ WHEN 1 THEN 'scanning table'::text
+ WHEN 2 THEN 'computing stats'::text
+ WHEN 3 THEN 'computing extended stats'::text
+ WHEN 4 THEN 'finalizing analyze'::text

Not terrible, but it could be closer.

Similarly with the column names (include_children vs. INH).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-08-01 17:50:45 Re: pgbench - implement strict TPC-B benchmark
Previous Message Andres Freund 2019-08-01 17:19:51 Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)