Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-12-27 10:13:44
Message-ID: CAJpy0uCrd6kO5ghkVqhBjtqmGO=7NN7LBDj37AncAWQrrtd=Fw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> Hi,
>
> Thank you for working on this.
>
> On Tue, Dec 26, 2023 at 9:27 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Tue, Dec 26, 2023 at 4:41 PM Zhijie Hou (Fujitsu)
> > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > >
> > > On Wednesday, December 20, 2023 7:37 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Wed, Dec 20, 2023 at 3:29 PM shveta malik <shveta(dot)malik(at)gmail(dot)com>
> > > > wrote:
> > > > >
> > > > > On Wed, Dec 20, 2023 at 9:12 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > > > wrote:
> > > > > >
> > > > > > On Tue, Dec 19, 2023 at 5:30 PM shveta malik <shveta(dot)malik(at)gmail(dot)com>
> > > > wrote:
> > > > > > >
> > > > > > > Thanks for reviewing. I have addressed these in v50.
> > > > > > >
> > > > > >
> > > > > > I was looking at this patch to see if something smaller could be
> > > > > > independently committable. I think we can extract
> > > > > > pg_get_slot_invalidation_cause() and commit it as that function
> > > > > > could be independently useful as well. What do you think?
> > > > > >
> > > > >
> > > > > Sure, forked another thread [1]
> > > > > [1]:
> > > > >
> > > > https://www.postgresql.org/message-id/CAJpy0uBpr0ym12%2B0mXpjcRFA6
> > > > N%3D
> > > > > anX%2BYk9aGU4EJhHNu%3DfWykQ%40mail.gmail.com
> > > > >
> > > >
> > > > Thanks, thinking more, we can split the patch into the following three patches
> > > > which can be committed separately (a) Allowing the failover property to be set
> > > > for a slot via SQL API and subscription commands
> > > > (b) sync slot worker infrastructure (c) GUC standby_slot_names and the the
> > > > corresponding wait logic in server-side.
> > > >
> > > > Thoughts?
> > >
> > > I agree. Here is the V54 patch set which was split based on the suggestion.
> > > The commit message in each patch is also improved.
> > >
> >
> > I would like to revisit the current dependency of slotsync worker on
> > dbname used in 002 patch. Currently we accept dbname in
> > primary_conninfo and thus the user has to make sure to provide one (by
> > manually altering it) even in case of a conf file auto-generated by
> > "pg_basebackup -R".
> > Thus I would like to discuss if there are better ways to do it.
> > Complete background is as follow:
> >
> > We need dbname for 2 purposes:
> >
> > 1) to connect to remote db in order to run SELECT queries to fetch the
> > info needed by slotsync worker.
> > 2) to make connection in slot-sync worker itself in order to be able
> > to use libpq APIs for 1)
> >
> > We run 3 kind of select queries in slot-sync worker currently:
> >
> > a) To fetch all failover slots (logical slots) info at once in
> > synchronize_slots().
> > b) To fetch a particular slot info during
> > wait_for_primary_slot_catchup() logic (logical slot).
> > c) To validate primary slot (physical one) and also to distinguish
> > between standby and cascading standby by running pg_is_in_recovery().
> >
> > 1) One approach to avoid dependency on dbname is using commands
> > instead of SELECT. This will need implementing LIST_SLOTS command for
> > a), and for b) we can use LIST_SLOTS and fetch everything (even though
> > it is not needed) or have LIST_SLOTS with a filter on slot-name or
> > extend READ_REPLICATION_SLOT, and for c) we can have some other
> > command to get pg_is_in_recovery() info. But, I feel by relying on
> > commands we will be making the extension of the slot-sync feature
> > difficult. In future, if there is some more requirement to fetch any
> > other info,
> > then there too we have to implement a command. I am not sure if it is
> > good and extensible approach.
> >
> > 2) Another way to avoid asking for a dbname in primary_conninfo is to
> > use the default dbname internally. This brings us to two questions:
> > 'How' and 'Which default db'?
> >
> > 2.1) To answer 'How':
> > Using default dbname is simpler for the purpose of slot-sync worker
> > having its own db-connection, but is a little tricky for the purpose
> > of connection to remote_db. This is because we have to inject this
> > dbname internally in our connection-info.
> >
> > 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname),
> > then currently it could have 2 formats:
> >
> > a) The simple "=" format for key-value pairs, example:
> > 'user=replication host=127.0.0.1 port=5433 dbname=postgres'.
> > b) URI format, example:
> > postgresql://other(at)localhost/otherdb?connect_timeout=10&application_name=myapp
> >
> > We can distinguish between the 2 formats using 'uri_prefix_length' but
> > injecting the dbname part will be messy specially for URI format. If
> > we want to do it w/o injecting and only by changing libpq interfaces
> > to accept dbname separately apart from conninfo, then there is no
> > current simpler way available. It will need a good amount of changes
> > in libpq.
> >
> > 2.1.2) Another way is to not rely on primary_conninfo directly but
> > rely on 'WalRcv->conninfo' in order to connect to remote_db. This is
> > because the latter is never URI format, it is some parsed format and
> > appending may work. As an example, primary_conninfo =
> > 'postgresql://replication(at)localhost:5433', WalRcv->conninfo loaded
> > internally is:
> > "user=replication passfile=/home/shveta/.pgpass channel_binding=prefer
> > dbname=replication host=localhost port=5433
> > fallback_application_name=walreceiver sslmode=prefer sslcompression=0
> > sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2
> > gssencmode=disable krbsrvname=postgres gssdelegation=0
> > target_session_attrs=any load_balance_hosts=disable", '\000'
> >
> > So we can try appending our default dbname to this. But all the
> > defaults loaded in WalRcv->conninfo need some careful analysis to
> > figure out if they work for slot-sync worker case.
> >
> > 2.2) Now coming to 'Which default db':
> >
> > 2.2.1) If we use 'template1' as default db, it may block 'create db'
> > operations on primary for the time when the slot-sync worker is
> > connected to remote using this dbname. Example:
> >
> > postgres=# create database newdb1;
> > ERROR: source database "template1" is being accessed by other users
> > DETAIL: There is 1 other session using the database.
> >
> > 2.2.2) If we use 'postgres' as default db, there are chances that it
> > can be dropped as unlike 'template1', it is allowed to be dropped by
> > user, and if slotsync worker is connected to it, user may see:
> > newdb1=# drop database postgres;
> > ERROR: database "postgres" is being accessed by other users
> > DETAIL: There is 1 other session using the database.
> >
> > But once the slot-sync worker or standby goes down, user can always
> > drop this and next time slot-sync worker may not be able to come up.
> >
>
> Other random ideas for discussion are:
>
> 3) The slotsync worker uses primary_conninfo but also uses a new GUC
> parameter, say slot_sync_dbname, to specify the database to connect.
> The slot_sync_dbname overwrites the dbname if primary_conninfo also
> specifies it. If both don't have a dbname, raise an error.
>
> 4) The slotsync worker uses a new GUC parameter, say
> slot_sync_conninfo, to specify the connection string to the primary
> aside from primary_conninfo. And pg_basebackup -R generates
> slot_sync_conninfo as well if required (new option required).
>
> BTW given that the slotsync worker executes only normal SQL queries,
> is there any reason why it uses a replication connection?

Thank You for the feedback.
Do you mean why are we using libpqwalreceiver.c APIs instead of using
libpq directly? I was not aware if there is any way to connect if we
want to run SQL queries. I initially tried using 'PQconnectdbParams'
but couldn't make it work. Perhaps it is to be used only by front-end
and extensions as the header files indicate as well:
* libpq-fe.h : This file contains definitions for structures and
externs for functions used by frontend postgres applications.
* libpq-be-fe-helpers.h: Helper functions for using libpq in
extensions . Code built directly into the backend is not allowed to
link to libpq directly.

Do you mean some other kind of connection here?

thanks
Shveta

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-12-27 10:42:53 Re: Synchronizing slots from primary to standby
Previous Message shveta malik 2023-12-27 09:44:51 Re: Function to get invalidation cause of a replication slot.