From: | Korry Douglas <korry(dot)douglas(at)enterprisedb(dot)com> |
---|---|
To: | Victor Wagner <vitus(at)wagner(dot)pp(dot)ru> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Patch: Implement failover on libpq connect level. |
Date: | 2015-12-07 20:26:33 |
Message-ID: | 5665EB79.9070805@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers pgsql-jdbc |
> I've tried to deal with some of these problems.
>
> My patch have support for following things:
>
> 1. Check whether database instance is in the recovery/standby mode and
> try to find another one if so.
> 2. Let cluster management software to have some time to promote one of
> the standbys to master. I.e. there can be failover timeout specified to
> allow retry after some time if no working master found.
>
> Really there is room for some improvements in handling of connect
> timeout (which became much more important thing when ability to try
> next host appears). Now it is handled only by blocking-mode connect
> functions, not by async state machine. But I decided to publish patch
> without these improvements to get feedback from community.
A bit of testing on this turns up a problem.
Consider a connection string that specifies two hosts and a read/write
connection:
postgresql://korry(at)127(dot)0(dot)0(dot)1:5301,127.0.0.1:5300/edb?readonly=0
If the first host is a healthy standby (meaning that I can connect to it
but pg_is_in_recovery() returns 't'), the state machine will never move
on to the second host.
The problem seems to be in PQconnectPoll() in the case for
CONNECTION_AUTH_OK, specifically this code:
/* We can release the address list now. */
pg_freeaddrinfo_all(conn->addrlist_family, conn->addrlist);
conn->addrlist = NULL;
conn->addr_cur = NULL;
That frees up the list of alternative host addresses. The state machine
then progresses to CONNECTION_CHECK_RO (which invokes
pg_is_in_recovery()), then CONNECTION_CHECK_RW (waiting for the response
from the server). Since we just connected to a standby replica,
pg_is_in_recovery() returns 't' and the state changes to
CONNECTION_NEEDED. The next call to try_next_address() will fail to
find a next address because we freed the list in the case for
CONNECTION_AUTH_OK.
A related issue: when the above sequence occurs, no error message is
returned (because the case for CONNECTION_NEEDED thinks "an appropriate
error message is already set up").
In short, if you successfully connect to standby replica (and specify
readonly=0), the remaining hosts are ignored, even though one of those
hosts is a master.
And one comment about the code itself - in connectDBStart(), you've
added quite a bit of code to parse multiple hosts/ports. I would
recommend adding a comment that shows the expected format, and then
choosing better variable names (better than 'p', 'q', and 'r'); perhaps
the variable names could refer to components of the connection string
that you are parsing (like 'host', 'port', 'delimiter', ...). That
would make the code much easier to read/maintain.
Thanks.
-- Korry
From | Date | Subject | |
---|---|---|---|
Next Message | Korry Douglas | 2015-12-07 21:12:41 | Re: Patch: Implement failover on libpq connect level. |
Previous Message | Gavin Flower | 2015-12-07 20:09:33 | Re: [PATCH] Equivalence Class Filters |
From | Date | Subject | |
---|---|---|---|
Next Message | Korry Douglas | 2015-12-07 21:12:41 | Re: Patch: Implement failover on libpq connect level. |
Previous Message | Sehrope Sarkuni | 2015-12-05 23:17:35 | Re: How to avoid SET application_name = '' |