Re: [Proposal] Allow users to specify multiple tables in VACUUM commands

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: [Proposal] Allow users to specify multiple tables in VACUUM commands
Date: 2017-08-18 05:56:55
Message-ID: CAB7nPqQDyLXa8=+Xm7b_ixnpPia8QOPUo801xUvGuCD_Mo+FGw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 15, 2017 at 4:05 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Tue, Aug 15, 2017 at 8:28 AM, Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
>> I’ve rebased this patch with master to create v7, which is attached.
>
> Thanks for the rebased patch. I am switching into review mode actively
> now, so I'll look at it soon.

Another pass for this patch.

+ /*
+ * We have already checked the column list in vacuum(...),
+ * but the columns may have disappeared since then. If
+ * this happens, emit a nice warning message and skip the
+ * undefined column.
+ */
I think that this would be reworded. "nice" is cute is this context.
Why not just saying something like:
"Do not issue an ERROR if a column is missing but use a WARNING
instead. At the beginning of the VACUUM run, the code already checked
for undefined columns and informed about an ERROR, but we want the
processing to move on for existing columns."

+ /*
+ * Now dedupe the list to avoid any redundant work (e.g. user specifies
+ * the same relation twice). We also take care of combining any
+ * separate column lists for duplicate relations.
+ *
+ * We do this after resolving the OIDs so that we do not miss entries
+ * that have unequal RangeVars but resolve to the same set of OIDs.
+ * For example, "foo" and "public.foo" could be the same relation.
+ */
+ relations = dedupe_relations(relations);
This has been introduced in v5. If we would want to put some effort
for that, I think that it could be a separate patch and a separate
discussion. This patch does not make things worse than they are, see
HEAD for example with the same column specified twice:
=# create table aa as select generate_series(1, 10000) as a;
SELECT 10000
=# vacuum (analyze) aa (a, a);
ERROR: 23505: duplicate key value violates unique constraint
"pg_statistic_relid_att_inh_index"
DETAIL: Key (starelid, staattnum, stainherit)=(16385, 1, f) already exists.
SCHEMA NAME: pg_catalog
TABLE NAME: pg_statistic
CONSTRAINT NAME: pg_statistic_relid_att_inh_index
LOCATION: _bt_check_unique, nbtinsert.c:434

And actually, your patch does not seem to work, and makes things worse:
=# analyze aa (a, a);
ERROR: XX000: tuple already updated by self
LOCATION: simple_heap_update, heapam.c:4482

+/*
+ * This is used to keep track of a relation and an optional list of
+ * column names, as may be specified in VACUUM and ANALYZE.
+ */
+typedef struct RelationAndColumns
+{
+ NodeTag type;
+ RangeVar *relation; /* single table to process */
+ List *va_cols; /* list of column names, or NIL for all */
+ List *oids; /* corresponding OIDs (filled in by
[auto]vacuum.c) */
+} RelationAndColumns;
This name is a bit awkward. Say VacuumRelation? I also don't
understand why there are multiple OIDs here. There should be only one,
referring to the relation of this RangeVar. Even for inherited
relations what should be done is to add one entry RelationAndColumns
(or VacuumRelation) for each child relation.

+ /*
+ * Check that all specified columns exist so that we can fast-fail
+ * commands with multiple tables. If the column disappears before we
+ * actually process it, we will emit a WARNING and skip it later on.
+ */
+ foreach(relation, relations)
+ check_columns_exist(lfirst(relation));
The full list could be processed in there.

+static void
+check_columns_exist(RelationAndColumns *relation)
[...]
+ Relation rel;
+ ListCell *lc;
+
+ rel = try_relation_open(lfirst_oid(oid), NoLock);
+ if (!rel)
This really meritates a comment. In short: why is it fine to not take
a lock here? The answer I think is that even if the relation does not
exist vacuum would do nothing, but this should be written out.

+ foreach(relation, relations)
+ {
+ if (((RelationAndColumns *) lfirst(relation))->va_cols != NIL &&
+ !(options & VACOPT_ANALYZE))
Saving the data in a variable makes for a better style and easier
debugging. When doing a bitwise operation, please use as well != 0 or
== 0 as you are looking here for a boolean result.

+ relinfo = makeNode(RelationAndColumns);
+ relinfo->oids = list_make1_oid(HeapTupleGetOid(tuple));
+ *vacrels = lappend(*vacrels, relinfo);
Assigning variables even for nothing is good practice for readability,
and shows the intention behind the code.

- * relid, if not InvalidOid, indicate the relation to process; otherwise,
- * the RangeVar is used. (The latter must always be passed, because it's
- * used for error messages.)
+ * If we intend to process all relations, the 'relations' argument may be
+ * NIL.
This comment actually applies to RelationAndColumns. If the OID is
invalid, then RangeVar is used, and should always be set. You are
breaking that promise actually for autovacuum. The comment here should
say that if relations is NIL all the relations of the database are
processes, and for an ANALYZE all the columns are done.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeevan Ladhe 2017-08-18 05:57:41 Re: Adding support for Default partition in partitioning
Previous Message Amit Langote 2017-08-18 05:25:10 Re: expanding inheritance in partition bound order