Re: WALInsertLock contention

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WALInsertLock contention
Date: 2011-06-08 05:59:00
Message-ID: BANLkTik0tOwqsZ3n2JL6wW9arFpeviwpUw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 16, 2011 at 11:02 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I've been thinking about the problem of $SUBJECT, and while I know
> it's too early to think seriously about any 9.2 development, I want to
> get my thoughts down in writing while they're fresh in my head.
>
> It seems to me that there are two basic approaches to this problem.
> We could either split up the WAL stream into several streams, say one
> per database or one per tablespace or something of that sort, or we
> could keep it as a single stream but try not to do so much locking
> whilst in the process of getting it out the door.  Or we could try to
> do both, and maybe ultimately we'll need to.  However, if the second
> one is practical, it's got two major advantages: it'll probably be a
> lot less invasive, and it won't add any extra fsync traffic.  In
> thinking about how we might accomplish the goal of reducing lock
> contention, it occurred to me there's probably no need for the final
> WAL stream to reflect the exact order in which WAL is generated.
>
> For example, suppose transaction T1 inserts a tuple into table A;
> transaction T2 inserts a tuple into table B; T1 commits; T2 commits.
> The commit records need to be in the right order, and all the actions
> that are part of a given transaction need to precede the associated
> commit record, but, for example, I don't think it would matter if you
> emitted the commit record for T1 before T2's insert into B.  Or you
> could switch the order in which you logged the inserts, since they're
> not touching the same buffers.
>
> So here's the basic idea.  Each backend, if it so desires, is
> permitted to maintain a per-backend WAL buffer.  Per-backend WAL
> buffers live in shared memory and can be accessed by any backend, but
> the idea is that most of the time only one backend will be accessing
> them, so that the locks won't be heavily contended.  Any WAL written
> to a per-backend WAL buffer will eventually be transferred into the
> main WAL buffers, and flushed.  When a process writes to a per-backend
> WAL buffer, it writes (1) the actual WAL data and (2) the list of
> buffers affected.  Those buffers are stamped with a fake LSN that
> points back to the per-backend WAL buffer, and they can't be written
> until the WAL has been moved from the per-backend WAL buffers to the
> main WAL buffers.
>
> So, if a buffer with a fake LSN needs to be (a) written back to the OS
> or (b) modified by a backend other than the one that owns the fake
> LSN, this triggers a flush of the per-backend WAL buffers to the main
> WAL buffers.  When this happens, all the affected buffers get stamped
> with a real LSN and the entries are discarded from the per-backend WAL
> buffers.  Such a flush would also be needed when a backend commits or
> otherwise needs an XLOG flush, or when there's no more per-backend
> buffer space.  In theory, all of this taken together should mean that
> WAL gets pushed out in larger chunks: a transaction that does three
> inserts and commits should only need to grab WALInsertLock once,
> instead of once per heap insert, once per index insert, and again for
> the commit, though it'll have to write a bigger chunk of data when it
> does get the lock.  It'll have to repeatedly grab the lock on its
> per-backend WAL buffer, but ideally that's uncontended.
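The per-backend buffer scheme quoted above could be sketched roughly as below. This is a minimal illustration, not PostgreSQL code; the names (PrivateWalBuf, FAKE_LSN_BIT, private_wal_insert) and the fake-LSN encoding are assumptions made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of a per-backend WAL buffer: staged WAL data
 * plus the list of data-page buffers it touched.  All names here are
 * hypothetical, not actual PostgreSQL identifiers. */

#define PRIVATE_WAL_BYTES 4096
#define MAX_TOUCHED_BUFS  16

/* A "fake LSN" is tagged with a high bit and encodes the owning
 * backend, so anyone seeing it on a page knows the page cannot be
 * written out until that backend's private WAL is flushed. */
#define FAKE_LSN_BIT ((uint64_t)1 << 63)

typedef struct PrivateWalBuf {
    int      backend_id;
    uint32_t used;                       /* bytes of WAL data staged */
    uint8_t  data[PRIVATE_WAL_BYTES];
    int      ntouched;                   /* data-page buffers affected */
    int      touched[MAX_TOUCHED_BUFS];
} PrivateWalBuf;

/* Stage a WAL record and return the fake LSN to stamp on the page,
 * or 0 if the private buffer is full and a flush to the main WAL
 * buffers is required first. */
static uint64_t
private_wal_insert(PrivateWalBuf *w, const void *rec, uint32_t len, int bufid)
{
    if (w->used + len > PRIVATE_WAL_BYTES || w->ntouched >= MAX_TOUCHED_BUFS)
        return 0;                        /* caller must flush to main WAL */
    memcpy(w->data + w->used, rec, len);
    w->used += len;
    w->touched[w->ntouched++] = bufid;
    return FAKE_LSN_BIT | ((uint64_t)w->backend_id << 32) | w->used;
}
```

On a commit, an eviction, or a touch by another backend, everything staged here would be copied into the main WAL buffers under WALInsertLock and the touched pages restamped with real LSNs.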

There's probably an obvious explanation that I'm not seeing, but if
you're not delegating the work of writing the buffers out to someone
else, why do you need to lock the per-backend buffer at all? That is,
why does it have to be in shared memory? Suppose the following are
both true:
*) You are writing qualifying data (non-commit, non-switch)
*) There is room left in whatever you are copying to
Then you could trylock WALInsertLock, and on failing to get it, just
copy the qualifying data into a private buffer and punt; otherwise
just do the current behavior.

When you *do* get the lock, either because you got lucky or because
you had to wait anyway, you write out the data you previously staged,
fixing up the LSNs as you go. Even if you do have to write it to
shared memory, I think your idea is a winner -- probably a fair amount
of work can get done before a backend is ultimately forced to
wait...maybe enough to change the scaling dynamics.
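The trylock-and-stage idea could look something like this sketch, with pthread_mutex_trylock standing in for a conditional acquire of WALInsertLock. The names and the staging policy are illustrative assumptions, not actual PostgreSQL code.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define STAGE_BYTES 4096

/* Privately staged WAL, local to one backend. */
typedef struct StagedWal {
    uint32_t used;
    uint8_t  data[STAGE_BYTES];
} StagedWal;

static pthread_mutex_t wal_insert_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t insert_pos = 0;      /* stand-in for the WAL insert position */

/* Returns 1 if the record (plus anything staged earlier) reached the
 * shared WAL, 0 if it was merely staged because the lock was busy. */
static int
wal_insert_or_stage(StagedWal *st, const void *rec, uint32_t len,
                    int is_commit)
{
    if (!is_commit && st->used + len <= STAGE_BYTES &&
        pthread_mutex_trylock(&wal_insert_lock) != 0)
    {
        /* Lock busy, record is non-commit, and there is room: punt. */
        memcpy(st->data + st->used, rec, len);
        st->used += len;
        return 0;
    }
    if (is_commit || st->used + len > STAGE_BYTES)
        pthread_mutex_lock(&wal_insert_lock);    /* must wait this time */

    /* We hold the lock: drain what was staged earlier, fixing up LSNs
     * as we go, then write the new record. */
    insert_pos += st->used + len;
    st->used = 0;
    pthread_mutex_unlock(&wal_insert_lock);
    return 1;
}
```

Commits and full staging buffers fall through to a blocking acquire, which preserves the invariant that a commit record never lingers in private storage.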

> A further refinement would be to try to jigger things so that as a
> backend fills up per-backend WAL buffers, it somehow throws them over
> the fence to one of the background processes to write out.  For
> short-running transactions, that won't really make any difference,
> since the commit will force the per-backend buffers out to the main
> buffers anyway.  But for long-running transactions it seems like it
> could be quite useful; in essence, the task of assembling the final
> WAL stream from the WAL output of individual backends becomes a
> background activity, and ideally the background process doing the work
> is the only one touching the cache lines being shuffled around.  Of
> course, to make this work, backends would need a steady supply of
> available per-backend WAL buffers.  Maybe shared buffers could be used
> for this purpose, with the buffer header being marked in some special
> way to indicate that this is what the buffer's being used for.

That seems complicated -- plus I think the key is to distribute as
much of the work as possible. Why would the forward lateral to the
background processor not require a lock similar to WALInsertLock?
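One possible answer, sketched below under assumed names: the handoff lock is per-backend rather than global, so each lock is normally touched only by its owning backend and occasionally by the one background process sweeping the slots -- no single lock is shared by everyone the way WALInsertLock is.

```c
#include <pthread.h>
#include <stdint.h>

#define N_BACKENDS 8

typedef struct BackendWal {
    pthread_mutex_t lock;    /* per-backend, not global */
    uint32_t        used;    /* staged WAL bytes */
    int             ready;   /* full: ready for background pickup */
} BackendWal;

static BackendWal slots[N_BACKENDS];

static void handoff_init(void)
{
    for (int i = 0; i < N_BACKENDS; i++)
        pthread_mutex_init(&slots[i].lock, NULL);
}

/* Owning backend: stage data under its own lock only. */
static void backend_stage(int id, uint32_t nbytes)
{
    BackendWal *b = &slots[id];
    pthread_mutex_lock(&b->lock);
    b->used += nbytes;
    if (b->used >= 4096)
        b->ready = 1;        /* throw it over the fence */
    pthread_mutex_unlock(&b->lock);
}

/* Background process: sweep all slots, draining full ones.  It is the
 * only *other* party that ever takes these per-backend locks. */
static uint32_t background_sweep(void)
{
    uint32_t drained = 0;
    for (int i = 0; i < N_BACKENDS; i++) {
        BackendWal *b = &slots[i];
        pthread_mutex_lock(&b->lock);
        if (b->ready) {
            drained += b->used;   /* would copy into the main WAL here */
            b->used = 0;
            b->ready = 0;
        }
        pthread_mutex_unlock(&b->lock);
    }
    return drained;
}
```

The contended step -- pushing the drained data into the main WAL under WALInsertLock -- still exists, but it is taken by one background process in large chunks instead of by every backend for every record.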

merlin
