Re: [Bizgres-general] WAL bypass for INSERT, UPDATE and

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, bizgres-general <bizgres-general(at)pgfoundry(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [Bizgres-general] WAL bypass for INSERT, UPDATE and
Date: 2005-12-23 10:18:43
Message-ID: 1135333123.2964.589.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2005-12-22 at 17:36 -0500, Stephen Frost wrote:
> * Simon Riggs (simon(at)2ndquadrant(dot)com) wrote:
> > On Thu, 2005-12-22 at 21:18 +0100, Martijn van Oosterhout wrote:
> > > Considering "WAL bypass" is code for "breaks PITR"
> >
> > No it isn't. All of the WAL bypass logic does *not* operate when PITR is
> > active. The WAL bypass logic is aimed at Data Warehouses, which
> > typically never operate in PITR mode for performance reasons, however
> > the choice is yours.

OK, thanks for saying all of that; you probably speak for many in
raising these concerns. I'll answer each bit as we come to it. Suffice
to say, your concerns are good and so are the answers:

> Eh? PITR mode is bad for performance? Maybe I missed something but I
> wouldn't have thought PITR would degrade regular performance all that
> badly.

PITR mode is *not* bad for performance. On a very heavily loaded
write-intensive test system, the general PITR overhead on regular
performance was around 1% - so almost negligible.

We have been discussing a number of optimizations to specific commands
that would allow them to avoid writing WAL and thus speed up their
performance. If archive_command is set then WAL will always be written;
if it is not set then these commands will (or could) go faster:

- CREATE TABLE AS SELECT (in 8.1)
- COPY LOCK (patch submitted)
- COPY in same transaction as CREATE TABLE (patch submitted)
- INSERT SELECT in same transaction as CREATE TABLE (this discussion)

(There are a number of other conditions also, such as there must be no
indexes on a table. All of which now documented with the patch)

> So long as it doesn't take 15 minutes or some such to move the
> WAL to somewhere else (and I'm not sure that'd even slow things down..).
> For a Data Warehouse, have you got a better way of doing backups such
> that you don't lose at minimum most of a day's work?

Yes. Don't just use the backup facilities on their own. Think about how
the architecture of your systems will work and see if there is a better
way when you look at very large systems.

> I'm not exactly a
> big fan do doing a pg_dump every night either given that the database is
> 360GB. Much nicer to take a weekly dump of the database and then do
> PITR for a week or two before taking another dump of the db.

e.g. Keep your reference data (low volume) in an Operational Data Store
(ODS) database, protected by archiving. Keep your main fact data (high
volume) in the Data Warehouse, but save the data in slices as you load
it, so that a recovery is simply a reload of the database: no PITR or
pg_dump required, so high performance data transformation and load work
is possible. This is a commonly used architectural design pattern.

> I like the idea of making COPY go faster, but please don't break my
> backup system while you're at it.

On a personal note, I would only add that I spent a long time working on
PITR and I would never design anything that would intentionally break it
(nor would patches be accepted that did that). That probably gives me
the confidence to approach designs that might look like I'm doing that,
but without actually straying over the edge.

> I'm honestly kind of nervous about
> what you mean by checking it PITR is active- how is that done, exactly?
> Check if you have a script set to rotate the logs elsewhere? Or is it
> checking if you're in the taking-a-full-database-backup stage? Or what?

Internally, we use XLogArchivingActive(). Externally this will be set
when the admin sets archive_command to a particular value.

My original preference was for a parameter called archive_mode= ON | OFF
which would allow us to more easily discuss this, but this does not
currently exist.

> What's the performance decrease when using PITR, and what's it from? Is
> it just that COPY isn't as fast? Honestly, I could live with COPY being
> not as fast as it could be if my backups work. :)

These commands will not be optimized for speed when archive_command is set:
- CREATE TABLE AS SELECT (in 8.1)
- COPY LOCK (patch submitted)

> Sorry for sounding concerned but, well, backups are very important and
> so is performance and I'm afraid either I've not read all the
> documentation about the issues being discussed here or there isn't
> enough out there to make sense of it all yet. :)

If you choose PITR, then you are safe. If you do not, the crash recovery
of the database is not endangered by these optimizations.

Hope that covers all of your concerns?

I'm just writing a course that explains many of these techniques,
available in the New Year.

Best Regards, Simon Riggs

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message ohp 2005-12-23 13:19:09 Re: where is the output
Previous Message Tom Lane 2005-12-23 05:39:51 Re: [pgadmin-hackers] Client-side password encryption