Re: Document atthasmissing default optimization avoids verification table scan

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Document atthasmissing default optimization avoids verification table scan
Date: 2022-03-28 03:12:16
Message-ID: CAKFQuwZyBaJjNepdTM3kO8PLaCpRdRd8+mtLT8QdE73oAsGv8Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 27, 2022 at 11:17 AM James Coleman <jtc331(at)gmail(dot)com> wrote:

> Hmm, I didn't realize that was project policy,

Guideline/Rule of Thumb is probably a better concept.

> but I'm a bit
> surprised given that the sentence which 0001 replaces seems like a
> direct violation of that also: "In neither case is a rewrite of the
> table required."
>
>
IMO mostly due to the structuring of the paragraphs; something like the
following makes it less problematic (and as shown below may be
sufficient to address the purpose of this patch)

"""
[...]
The following alterations of the table require the entire table, and/or its
indexes, to be rewritten; which may take a significant amount of time for a
large table, and will temporarily require as much as double the disk space.

Changing the type of an existing column will require the entire table and
its indexes to be rewritten. As an exception, if the USING clause does not
change the column contents and the old type is either binary coercible to
the new type, or an unconstrained domain over the new type, a table rewrite
is not needed; but any indexes on the affected columns must still be
rewritten.

Adding a column with a volatile DEFAULT also requires the entire table and
its indexes to be rewritten.

The reason a non-volatile (or absent) DEFAULT does not require a rewrite of
the table is because the DEFAULT expression (or NULL) is evaluated at the
time of the statement and the result is stored in the table's metadata.

The following alterations of the table require that it be scanned in its
entirety to ensure that no existing values are contrary to the new
constraints placed on the table. Constraints backed by indexes will scan
the table as a side-effect of populating the new index with data.

Adding a CHECK constraint requires scanning the table to verify that
existing rows meet the constraint. The same goes for adding a NOT NULL
constraint to an existing column.

A newly attached partition requires scanning the table to verify that
existing rows meet the partition constraint.

A foreign key constraint requires scanning the table to verify that all
existing values exist on the referenced table.

The main reason for providing the option to specify multiple changes in a
single ALTER TABLE is that multiple table scans or rewrites can thereby be
combined into a single pass over the table.

Scanning a large table to verify a new constraint can take a long time, and
other updates to the table are locked out until the ALTER TABLE ADD
CONSTRAINT command is committed. For CHECK and FOREIGN KEY constraints
there is an option, NOT VALID, that reduces the impact of adding a
constraint on concurrent updates. With NOT VALID, the ADD CONSTRAINT
command does not scan the table and can be committed immediately. After
that, a VALIDATE CONSTRAINT command can be issued to verify that existing
rows satisfy the constraint. The validation step does not need to lock out
concurrent updates, since it knows that other transactions will be
enforcing the constraint for rows that they insert or update; only
pre-existing rows need to be checked. Hence, validation acquires only a
SHARE UPDATE EXCLUSIVE lock on the table being altered. (If the constraint
is a foreign key then a ROW SHARE lock is also required on the table
referenced by the constraint.) In addition to improving concurrency, it can
be useful to use NOT VALID and VALIDATE CONSTRAINT in cases where the table
is known to contain pre-existing violations. Once the constraint is in
place, no new violations can be inserted, and the existing problems can be
corrected at leisure until VALIDATE CONSTRAINT finally succeeds.

The DROP COLUMN form does not physically remove the column, but simply
makes it invisible to SQL operations. Subsequent insert and update
operations in the table will store a null value for the column. Thus,
dropping a column is quick but it will not immediately reduce the on-disk
size of your table, as the space occupied by the dropped column is not
reclaimed. The space will be reclaimed over time as existing rows are
updated.

To force immediate reclamation of space occupied by a dropped column, you
can execute one of the forms of ALTER TABLE that performs a rewrite of the
whole table. This results in reconstructing each row with the dropped
column replaced by a null value.

The rewriting forms of ALTER TABLE are not MVCC-safe. After a table
rewrite, the table will appear empty to concurrent transactions, if they
are using a snapshot taken before the rewrite occurred. See Section 13.5
for more details.
[...]
"""

I'm liking the idea of breaking out multiple features into their own
sentences or paragraphs instead of saying:

"Adding a column with a volatile DEFAULT or changing the type of an
existing column"

"Adding a CHECK or NOT NULL constraint"

This later combination probably doesn't catch my attention except for this
discussion and the fact that there are multiple ways to add these
constraints and we might as well be clear about whether ALTER COLUMN or ADD
COLUMN makes a difference. On that note, the behavior implied by this
wording is that adding a check constraint even during ADD COLUMN will
result in scanning the table even when a table rewrite is not required. If
that is the case at present nothing actually says that - if one agrees that
the exact same sentence doesn't imply that a table scan is performed when
adding a NOT NULL constraint during ADD COLUMN (which doesn't happen).
That seems like enough material to extract out from the ALTER TABLE page
and stick elsewhere if one is so motivated. There may be other stuff too -
but the next paragraph covers some SET DATA TYPE nuances which seem like a
different dynamic.

David J.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2022-03-28 03:12:38 Re: Add a pg_get_query_def function (was Re: Deparsing rewritten query)
Previous Message Thomas Munro 2022-03-28 03:11:24 Re: A test for replay of regression tests