Re: WAL logging of SELECT ... INTO command

From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: WAL logging of SELECT ... INTO command
Date: 2006-03-22 16:19:58
Message-ID: 1143044398.23597.80.camel@coppola.muc.ecircle.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches pgsql-performance

On Wed, 2006-03-22 at 16:35, Tom Lane wrote:
> "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> > Oh, so in other words, SELECT * INTO temp FROM table is inherently
> > non-deterministic at the physical level, so the only way to be able to
> > allow PITR to work is to duplicate all the physical changes. Darn.
>
> Well, lemme put it this way: I'm not prepared to require that PG be
> deterministic at the physical level. One obvious source of
> non-determinancy is the FSM, which is likely to hand out different free
> space to different transactions depending on what else is going on at
> the same time. There are others, such as deliberately random
> tie-breaking during btree index insertion.

While you're at talking about WAL and PITR... I see from the aboce
discussion that PITR is already demanding special handling in the code
(I hope I got this one right, as the following are based on this).

What if the PITR logging would be disconnected from the WAL logging
completely ?

What I mean is to introduce a WAL subscription mechanism, which
basically means some incoming connections where we stream the log
records. We don't need to write them to disk at all in the normal case,
I guess usually PITR will store the records on some other machine so it
means network, not disk. And it doesn't need to be done synchronously,
it can lag behind the running transactions, and we can do it in batches
of WAL records.

It also would mean that the local WAL does not need to log the things
which are only needed for the PITR... that would likely mean some spared
WAL disk activity. Of course it also would mean that the local WAL and
PITR WAL are not the same, but that is not an issue I guess.

It would also permit immediate recycling of the WAL files if the current
archiving style is not used.

The drawbacks I can see (please add yours):
1) the need for the subscription management code with the added
complexity it implies;
2) problems if the WAL stream lags too much behind;
3) problems if the subscribed client's connection is interrupted;

Nr. 2 could be solved by saving the PITR WAL separately if the lag grows
over a threshold, and issue a warning. This could still be acceptable,
as the writing doesn't have to be synchronous and can be made in
relatively large blocks.
There could be a second bigger lag threshold which completely cancels
the subscription. All these thresholds should be configurable, as it
depends on the application what's more important, to have the standby
available all the time or have the primary faster if loaded...

Nr. 3. can be solved by either canceling the subscription on connection
drop, or by allowing a certain amount of time after which the
subscription is canceled. The client can reconnect before this timeout
expires. In the meantime the primary can store the PITR WAL on disk as
mentioned above...

Cheers,
Csaba.

In response to

Browse pgsql-patches by date

  From Date Subject
Next Message Mark Wong 2006-03-22 18:42:39 WIP: splitting BLCKSZ
Previous Message Luke Lonergan 2006-03-22 16:19:40 Re: Automatically setting work_mem

Browse pgsql-performance by date

  From Date Subject
Next Message Guillaume Cottenceau 2006-03-22 16:25:56 Re: planner with index scan cost way off actual cost,
Previous Message Scott Marlowe 2006-03-22 16:16:14 Re: planner with index scan cost way off actual cost,