Re: Inconsistent error handling in START_REPLICATION command

From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Subject: Re: Inconsistent error handling in START_REPLICATION command
Date: 2016-01-05 10:35:50
Message-ID: CACACo5T+Ri140A3GqpB6SrivXD1cSHhtaymvst1J8nOZxgKKGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 5, 2016 at 10:39 AM, Shulgin, Oleksandr <
oleksandr(dot)shulgin(at)zalando(dot)de> wrote:

> Hackers,
>
> It looks like there's an inconsistency in error handling during
> START_REPLICATION command of replication protocol:
>
> $ psql postgres://localhost/psycopg2test?replication=database
> psql (9.6devel)
> Type "help" for help.
>
> psycopg2test=# IDENTIFY_SYSTEM;
> systemid | timeline | xlogpos | dbname
> ---------------------+----------+-----------+--------------
> 6235978519197579707 | 1 | 0/2CE0F78 | psycopg2test
> (1 row)
>
> psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid"
> "value");
> ERROR: syntax error
>
> 1) syntax errors are reported and client can retry with corrected command:
>
> psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid"
> 'value');
> ERROR: replication slot name "TEST1" contains invalid character
> HINT: Replication slot names may only contain lower case letters,
> numbers, and the underscore character.
>
> 2) further errors are reported and we can retry:
>
> psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
> 'value');
> ERROR: replication slot "test1" does not exist
>
> psycopg2test=# CREATE_REPLICATION_SLOT "test1" LOGICAL "test_decoding";
> slot_name | consistent_point | snapshot_name | output_plugin
> -----------+------------------+---------------+---------------
> test1 | 0/2CF5340 | 0000088C-1 | test_decoding
> (1 row)
>
> psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
> 'value');
> unexpected PQresultStatus: 8
>
> The last command results in the following output sent to the server log:
>
> ERROR: option "invalid" = "value" is unknown
> CONTEXT: slot "test1", output plugin "test_decoding", in the startup
> callback
>
> But the client has no way to learn about the error, nor can it restart
> with correct one (the server has entered COPY protocol mode):
>
> psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
> PQexec not allowed during COPY BOTH
>
> I recall Craig Rigner mentioning this issue in context of the
> pglogical_output plugin, but I thought that was something to do with the
> startup packet. The behavior above doesn't strike me as very consistent,
> we should be able to detect and report errors during output plugin startup
> and let the client retry with the corrected command as we do for syntax or
> other errors.
>
> I didn't look in the code yet, but if someone knows off top of the head
> the reason to this behavior, I'd be glad to learn it.
>

Looks like the attached patch fixes it for me:

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
'value');
ERROR: option "invalid" = "value" is unknown
CONTEXT: slot "test1", output plugin "test_decoding", in the startup
callback

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
unexpected PQresultStatus: 8

There was a similar problem with PHYSICAL replication--if
the requested start LSN is in the future, the error is not reported to the
client and connection is unusable:

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 1FF0/0;
unexpected PQresultStatus: 8

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 0/0;
PQexec not allowed during COPY BOTH

After the patch:

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 1FF0/0;
ERROR: requested starting point 1FF0/0 is ahead of the WAL flush position
of this server 0/2DE6C28

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 0/0;
unexpected PQresultStatus: 8

Actually, would it make sense to move the block that sends CopyBothResponse
message to the WalSndLoop() instead? I think you can't get any closer than
that...

--
Alex

Attachment Content-Type Size
0001-Move-CopyBothResponse-closer-to-WalSndLoop.patch text/x-patch 2.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Haribabu Kommi 2016-01-05 10:49:01 Re: Function and view to retrieve WAL receiver status
Previous Message Christoph Berg 2016-01-05 10:16:37 Re: Building pg_xlogdump reproducibly