
Re: Avoiding rewrite in ALTER TABLE ALTER TYPE

From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding rewrite in ALTER TABLE ALTER TYPE
Date: 2010-12-29 23:46:13
Message-ID: 20101229234613.GB30520@tornado.gateway.2wire.net
Lists: pgsql-hackers
On Wed, Dec 29, 2010 at 10:56:39AM -0500, Robert Haas wrote:
> On Dec 29, 2010, at 7:56 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > Having thought on it
> > more, though, it actually seems best to attempt the verification scan *every*
> > time.  In most ineligible conversions, an inequality will appear very early, so
> > the scan is effectively O(1) in the negative case.  A notable exception is
> > something like char(6)->varchar(6) in a table with a billion tuples having
> > length(col) = 6 and one with length(col) = 5.  The verification scan might read
> > most of the table before finding the one tuple that forces a rewrite. That
> > isn't a particularly regular scenario in my experience, so the "just do the
> > right thing" aspect of preceding every potential rewrite with a verification
> > scan seems to win out.
> 
> I think this scenario will be more common than you might think.  Tables don't contain random data; they contain data that the DBA thinks is valid.  The situation where the data is mostly as you expect but with a few kooky rows is, in my experience, extremely common.

Perhaps.  A few kooky rows is indeed common, but we're talking about a specific
breed of kookiness: 99.9% of the rows have identical bits after an ALTER TYPE
transformation expression, and 0.1% have different bits.  Is that common?
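A sketch of that specific breed, restating the char(6)->varchar(6) scenario from my earlier mail (table name and row count hypothetical): every 6-char value converts bit-for-bit identically, while the lone 5-char value loses its pad space under the cast, so the verification scan reads nearly the whole table before going negative and falling back to a rewrite:

CREATE TABLE pad6 (c char(6));
INSERT INTO pad6 SELECT 'abcdef' FROM generate_series(1, 1000000);
INSERT INTO pad6 VALUES ('abcde');          -- stored padded as 'abcde '
ALTER TABLE pad6 ALTER c TYPE varchar(6);   -- scan reads ~all rows first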

In case it was not obvious, I'll note that any error thrown by a transformation
expression during the verification scan still aborts the ALTER TABLE.  A
varchar(20)->varchar(10) that finds an 11-char string will fail permanently
during the verification scan.  Indeed, the primary value of the verification
scan is to distinguish a positive outcome from an error, not a positive from a
negative.
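For instance (a sketch; the table name is hypothetical), one 11-char string
makes varchar(20)->varchar(10) fail during the verification scan exactly as it
would fail during a rewrite:

CREATE TABLE t20 (c varchar(20));
INSERT INTO t20 VALUES ('short'), ('elevenchars');
ALTER TABLE t20 ALTER c TYPE varchar(10);
-- ERROR:  value too long for type character varying(10)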

Expanding on my introduction, none of the following can yield a negative
verification scan; the result is always positive or an error:

CREATE DOMAIN loosedom AS text;
CREATE DOMAIN tightdom AS text CHECK (value LIKE '<%/>');
CREATE TABLE t (c varchar(6));
INSERT INTO t VALUES ('<abc/>'),('<de/>');
ALTER TABLE t ALTER c TYPE varchar(8);
ALTER TABLE t ALTER c TYPE text;
ALTER TABLE t ALTER c TYPE loosedom;
ALTER TABLE t ALTER c TYPE xml USING c::xml;
ALTER TABLE t ALTER c TYPE varchar(64);
ALTER TABLE t ALTER c TYPE tightdom;

Adding a bpchar into the mix makes a negative verification scan possible, as
does a USING clause having a truncating effect.  Continuing the example, these
can and would get a negative verification scan:
ALTER TABLE t ALTER c TYPE character(6);
ALTER TABLE t ALTER c TYPE varchar(5) USING c::varchar(5);
Plenty of academic USING clause examples exist:
ALTER TABLE t ALTER c TYPE varchar(8) USING CASE c WHEN '<de/>' THEN 'foo' ELSE c END;

Verification scans for conversions between fundamentally different types will
generally end in the negative at the first tuple.  Consider {timestamp,bigint,
real,numeric,varbit}->text, int->bigint, interval->reltime, etc.  I can't think
of a decent non-academic example where heterogeneous conversions like these will
dupe the verification scan for even a handful of tuples.  Granted, one can flip
that around as an argument for declaring the conversions that have a chance.
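To make the first-tuple point concrete (sketch; table name hypothetical): an
int and a bigint datum never share a representation, so the very first tuple
settles the question and the scan costs effectively nothing:

CREATE TABLE n (i int);
INSERT INTO n SELECT g FROM generate_series(1, 1000000) g;
ALTER TABLE n ALTER i TYPE bigint;  -- negative at tuple one: 4-byte vs 8-byte datum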

> And it makes the worst case a LOT worse.

I suppose the absolute worst case would involve an ALTER TABLE adding brutally
expensive CHECK constraints, such that the cost of computing those constraints
would dominate the cost of both the verification scan and the rewrite, yielding
a 100% slower ALTER TABLE run.  A more realistic bad case might be a table much
larger than memory with no indexes, and the verification scan adds a full seq
scan for nothing.  A crude test here has rewriting such a table taking 7x as
long as a seq scan on it.  By the back of that envelope, we'll take about 15%
more time.  The right mix of expensive expressions will raise that percentage,
and the presence of indexes will drop it.  Remember though, we're still only
talking about the relatively-rare cases that even can get a negative
verification scan.

> I really doubt this is worth the complexity anyway -	

We'd have the verification scan regardless of how we choose when to use it,
because how else would we implement no-rewrite varchar(8)->varchar(4) or
text->xml?  An unconditional verification scan is merely the most trivial
algorithm for deciding when to employ it.  Those conversions are semantically
similar to adding CHECK constraints, and in that sense we already have an
initial verification scan implementation: ATRewriteTable(..., InvalidOid, ...).
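To illustrate the semantic similarity (sketch; names hypothetical): both
statements below require a full scan that either passes every row or aborts
with an error, yet today only the second also rewrites the table:

CREATE TABLE s (c varchar(8));
INSERT INTO s VALUES ('abc');
ALTER TABLE s ADD CONSTRAINT c_short CHECK (length(c) <= 4);  -- scan only
ALTER TABLE s ALTER c TYPE varchar(4);                        -- scan plus rewrite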

> converting between two types that are mostly-but-not-quite binary compatible seems like an edge case in every sense.

Yes.  Indeed, that's the intuitive basis for my hypothesis that the verification
scan will usually either succeed or fail early.  I don't advocate this approach to pick up
edge cases, but to pick up reasonable cases _without explicit annotations_
showing them to be achievable.  Take the text->xml example, certainly of genuine
value if not top-frequency.  I see three ways to ensure we do a verification
scan for it:

1. Have the user tell us: ALTER TABLE t ALTER c TYPE xml USING c::xml IMPLICIT
2. Mark the text->xml cast as "possibly no-rewrite" and look for that
3. Do a verification scan every time

(1) gets the job done, but quality of user experience is lacking.  (2) is best
for the user, as long as the annotations are well-maintained, but it adds the
most maintenance burden.  (3) has the lowest maintenance burden and, for common
use cases, the user-experience quality of (2), but it can significantly add to
the ALTER TABLE runtime in rare cases.  One can also do (1) with (2) to provide
an override when the annotations are incomplete.  Ultimately, any of these would
work for my own needs.  What's your preference?  Are there other notable options
that preserve full functionality?

Thanks,
nm
