Errors with schema migration and logical replication — expected?

From: Mike Lissner <mlissner(at)michaeljaylissner(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Errors with schema migration and logical replication — expected?
Date: 2018-12-09 07:26:46
Message-ID: CAMp9=Eydz258bTK-7=upANs+Pff4wYymQuG=uNjeupCCrbpKFQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi, first time poster.

I just ran into a rather messy problem when doing a schema migration with
logical replication. I'm not entirely sure what went wrong, why, or how to
prevent it in the future. The migration I ran was pretty simple (though
auto-generated by Django):

BEGIN;ALTER TABLE "search_docketentry" ADD COLUMN
"pacer_sequence_number" smallint NULL;ALTER TABLE "search_docketentry"
ALTER COLUMN "pacer_sequence_number" DROP DEFAULT;ALTER TABLE
"search_docketentry" ADD COLUMN "recap_sequence_number" varchar(50)
DEFAULT '' NOT NULL;ALTER TABLE "search_docketentry" ALTER COLUMN
"recap_sequence_number" DROP DEFAULT;ALTER TABLE "search_docketentry"
ALTER COLUMN "entry_number" DROP NOT NULL;ALTER TABLE
"search_recapdocument" ALTER COLUMN "document_number" SET DEFAULT
'';ALTER TABLE "search_recapdocument" ALTER COLUMN "document_number"
DROP DEFAULT;ALTER TABLE "search_docketentry" DROP CONSTRAINT
"search_docketentry_docket_id_12fd448b9aa007ca_uniq";CREATE INDEX
"search_docketentry_recap_sequence_number_1c82e51988e2d89f_idx" ON
"search_docketentry" ("recap_sequence_number", "entry_number");CREATE
INDEX "search_docketentry_eb19fcf7" ON "search_docketentry"
("pacer_sequence_number");CREATE INDEX "search_docketentry_bff4d47b"
ON "search_docketentry" ("recap_sequence_number");CREATE INDEX
"search_docketentry_recap_sequence_number_d700f0391e8213a_like" ON
"search_docketentry" ("recap_sequence_number" varchar_pattern_ops);
COMMIT;
BEGIN;ALTER TABLE "search_docketentry"
ALTER COLUMN "pacer_sequence_number" TYPE integer;
COMMIT;

And after running this migration, I started getting this error on the
subscriber:

2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply
worker for subscription "replicasubscription" has started
2018-12-09 05:59:45 UTC::@:[13373]:ERROR: null value in column
"recap_sequence_number" violates not-null constraint
2018-12-09 05:59:45 UTC::@:[13373]:DETAIL: Failing row contains
(48064261, 2018-12-07 04:48:40.388377+00, 2018-12-07
04:48:40.388402+00, null, 576, , 4571214, null, null).
2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process: logical
replication worker for subscription 18390 (PID 13373) exited with exit
code 1

So, my migration created a new column with a null constraint and somehow
the subscriber got data that violated that. I don't know how that's
possible since this was a new column and it was never nullable.

I applied the above migration simultaneously on my publisher and subscriber
thinking that postgresql was smart enough to do the right thing. I think
the subscriber finished first (it has less traffic).

The docs hint that postgresql might be smart enough to not worry about the
order you do migrations:

> *Logical replication is robust when schema definitions change in a live
database:* When the schema is changed on the publisher and replicated data
starts arriving at the subscriber but does not fit into the table schema,
replication will error until the schema is updated.

And it even hints that doing a migration on the subscriber first is a good
thing in some cases:

> In many cases, intermittent errors can be avoided by applying additive
schema changes to the subscriber first.

But I'm now supremely skeptical that doing anything at the subscriber first
is a good idea. Are the docs wrong? Does the above error make sense? Is the
process for schema migrations documented somewhere beyond the above?

I have lots of questions because I thought this would have gone smoother
than it did.

As for the fix: I made the column nullable on the subscriber and I'm
waiting for it to catch up. Once it does I'll re-sync its schema with the
publisher. Anybody interested in following along with all this (or finding
this later and having questions) can follow the issue here:

https://github.com/freelawproject/courtlistener/issues/919

Thank you for the lovely database! I hope this is helpful.

Mike

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andrew Dunstan 2018-12-09 12:51:33 Re: amazon aroura config - seriously overcommited defaults? (May be Off Topic)
Previous Message Jeremy Schneider 2018-12-09 00:47:10 Re: Transaction Id Space, Freezing and Wraparound