Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Amir Rohan <amir(dot)rohan(at)zoho(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
Date: 2015-10-08 13:47:53
Message-ID: CAB7nPqQJ7X4Q+hDPxvVHY5Ucic3E7pGb335u7k_qq-yqCdSGaw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 8, 2015 at 6:03 PM, Amir Rohan wrote:
> On 10/08/2015 10:39 AM, Michael Paquier wrote:
>>> Someone mentioned a daisy chain setup which sounds fun. Anything else in
>>> particular? Also, it would be nice to have some canned way to measure
>>> end-to-end replication latency for variable number of nodes.
>>
>> Hm. Do you mean comparing the LSN position between two nodes even if
>> both nodes are not connected to each other? What would you use it for?
>>
>
> In a cascading replication setup, the typical _time_ it takes for a
> COMMIT on master to reach the slave (assuming constant WAL generation
> rate) is an important operational metric.

Hm. You mean the exact amount of time it takes to be sure that a given
WAL position has been flushed on a cascading standby, possibly across
multiple layers. Er, that's a bit tough without patching the backend,
where I guess we would need to keep track of when an LSN position has
been flushed. And calls to gettimeofday() are expensive, so that does
not sound like a plausible alternative here to me...
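For what it's worth, lag in bytes (rather than time) can already be measured without touching the backend, by comparing the LSNs the two nodes report and subtracting them. A minimal sketch of the arithmetic, assuming the positions come from queries like pg_current_xlog_location() on the master and pg_last_xlog_replay_location() on the standby (the helper itself is just hex parsing):

```python
# Convert a PostgreSQL LSN of the form "X/Y" (two hex numbers) into an
# absolute byte position, so positions from two nodes can be compared.
def lsn_to_bytes(lsn):
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def replication_lag_bytes(master_lsn, standby_lsn):
    # master_lsn: e.g. from SELECT pg_current_xlog_location() on the master
    # standby_lsn: e.g. from SELECT pg_last_xlog_replay_location() on the standby
    return lsn_to_bytes(master_lsn) - lsn_to_bytes(standby_lsn)

print(replication_lag_bytes("1/3000158", "1/3000000"))  # 344 bytes behind
```

That gives a byte-lag metric per polling interval; converting it to a time-lag metric is where the backend instrumentation problem above comes in.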

> It would be useful to catch future regressions for that metric,
> which may happen even when a patch doesn't outright break cascading
> replication. Just automating the measurement could be useful if
> there's no pg facility that tracks performance over time in
> a regimented fashion. I've seen multiple projects which consider
> a "benchmark suite" to be part of its testing strategy.

Ah, OK. I see. That's a bit out of scope for this patch, and it's
really OS-dependent, but as long as the comparisons are done on the
same OS it would make sense.

> As for the "daisy chain" thing, it was (IIRC) mentioned in a josh berkus
> talk I caught on youtube. It's possible to setup cascading replication,
> take down the master, and then reinsert it as replicating slave, so that
> you end up with *all* servers replicating from the
> ancestor in the chain, and no master. I think it was more
> a fun hack than anything, but also an interesting corner case to
> investigate.

Ah, yes. I recall this one. I am sure it made the audience smile. All
the nodes link to each other in a closed circle.

>>> What about going back through the commit log and writing some regression
>>> tests for the real stinkers, if someone care to volunteer some candidate
>>> bugs
>>
>> I have drafted a list with a couple of items upthread:
>> http://www.postgresql.org/message-id/CAB7nPqSgffSPhOcrhFoAsDAnipvn6WsH2nYkf1KayRm+9_MTGw@mail.gmail.com
>> So based on the existing patch the list becomes as follows:
>> - wal_retrieve_retry_interval with a high value, say setting it to
>> for example 2 or 3 seconds, and looping until it is applied by
>> checking every second that it has been received by the standby.
>> - recovery_target_action
>> - archive_cleanup_command
>> - recovery_end_command
>> - pg_xlog_replay_pause and pg_xlog_replay_resume
>> In the list of things that could have a test, I recall that we should
>> test as well 2PC with the recovery delay, look at a1105c3d. This could
>> be included in 005.
>
> a1105c3 Mar 23 Fix copy & paste error in 4f1b890b137. Andres Freund
> 4f1b890 Mar 15 Merge the various forms of transaction commit & abort
> records. Andres Freund
>
> Is that the right commit?

Yes, that's the one. a1105c3 was actually rather tricky... The idea is
to simply check the WAL replay delay with COMMIT PREPARED.
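The check such a test would make is simple once the timestamps are in hand: note when the transaction was committed on the master, note when the standby replayed it, and verify that at least recovery_min_apply_delay elapsed in between. A sketch of just that comparison (the actual COMMIT PREPARED and the polling of the standby would go through the test framework, this is only the timing logic):

```python
from datetime import datetime, timedelta

def delay_was_honored(commit_time, replay_time, min_apply_delay_ms):
    # A standby with recovery_min_apply_delay set must not replay the
    # commit record earlier than commit_time + the configured delay.
    return replay_time - commit_time >= timedelta(milliseconds=min_apply_delay_ms)

committed = datetime(2015, 10, 8, 13, 0, 0)
replayed = datetime(2015, 10, 8, 13, 0, 5)
print(delay_was_honored(committed, replayed, 2000))  # True: 5s >= 2s
```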

>> The advantage of implementing that now is that we could see if the
>> existing routines are solid enough or not.
>
> I can do this if you point me at a self-contained thread/#issue.

Hm. This patch is already 900 lines; perhaps it would be wiser not to
make it more complicated for now.
--
Michael
