Re: cheaper snapshots redux

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Jim Nasby <jim(at)nasby(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: cheaper snapshots redux
Date: 2011-08-25 13:24:22
Message-ID: CA+Tgmoa0E0jNd=F1i9NtavjmKhg5cPWcyQHVVLkqH0woC7obfA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 25, 2011 at 1:55 AM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
>> One difference with snapshots is that only the latest snapshot is of
>> any interest.
>
> Theoretically, yes.  But as far as I understood, you proposed the
> backends copy that snapshot to local memory.  And copying takes some
> amount of time, possibly being interrupted by other backends which add
> newer snapshots...  Or do you envision the copying to restart whenever a
> new snapshot arrives?

My hope (and it might turn out that I'm an optimist) is that even with
a reasonably small buffer it will be very rare for a backend to
experience a wraparound condition. For example, consider a buffer
with ~6500 entries - roughly 64 * MaxBackends, which is about the
aggregate size of the current subxip arrays. I hypothesize
that a typical snapshot on a running system is going to be very small
- a handful of XIDs at most - because, on the average, transactions
are going to commit in *approximately* increasing XID order and, if
you take the regression tests as representative of a real workload,
only a small fraction of transactions will have more than one XID. So
it seems believable to think that the typical snapshot on a machine
with max_connections=100 might only be ~10 XIDs, even if none of the
backends are read-only. So the backend taking a snapshot only needs
to copy < ~64 bytes of information from the ring buffer before other
backends write ~27k of data into that buffer, which would likely
require hundreds of commits. Losing that race seems vanishingly
unlikely; memcpy() is very fast. If it does happen, you can recover by
retrying, but it should be a once-in-a-blue-moon kind of thing. I
hope.
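
Just to make that concrete, here's a rough standalone sketch of the
sort of thing I'm imagining - every name and number here (SnapRing,
snap_ring_read, RING_SIZE, the snap_start/snap_len fields) is made up
for illustration, and memory barriers and error handling are omitted.
Committers, who are serialized anyway, append XIDs and advance a
never-wrapping count of total entries written; a reader copies the
latest published snapshot with no lock at all and then checks whether
it got lapped:

#include <stdint.h>

#define RING_SIZE 6500              /* ~64 * MaxBackends entries */

typedef uint32_t TransactionId;

typedef struct SnapRing
{
    uint64_t        written;        /* total entries ever written; never wraps */
    uint64_t        snap_start;     /* logical position of latest snapshot */
    int             snap_len;       /* number of XIDs in latest snapshot */
    TransactionId   xids[RING_SIZE];
} SnapRing;

/*
 * Copy the latest published snapshot into 'dst' (which must have room
 * for RING_SIZE entries).  Returns the number of XIDs copied, or -1 if
 * committers overwrote part of that region while we were copying, in
 * which case the caller must retry.
 */
int
snap_ring_read(volatile SnapRing *ring, TransactionId *dst)
{
    uint64_t    start = ring->snap_start;
    int         len = ring->snap_len;
    int         i;

    for (i = 0; i < len; i++)
        dst[i] = ring->xids[(start + i) % RING_SIZE];

    /*
     * If more than a full ring's worth of entries has been written past
     * 'start', the oldest part of what we copied may have been clobbered
     * mid-copy: report wraparound.
     */
    if (ring->written - start > RING_SIZE)
        return -1;

    return len;
}

The exact detection scheme doesn't matter much; the point is just that
spotting a lap is a cheap comparison after the copy.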

Now, as the size of the snapshot gets bigger, things will eventually
become less good. For example if you had a snapshot with 6000 XIDs in
it then every commit would need to write over the previous snapshot
and things would quickly deteriorate. But you can cope with that
situation using the same mechanism we already use to handle big
snapshots: toss out all the subtransaction IDs, mark the snapshot as
overflowed, and just keep the toplevel XIDs. Now you've got at most
~100 XIDs to worry about, so you're back in the safety zone. That's
not ideal in the sense that you will cause more pg_subtrans lookups,
but that's the price you pay for having a gazillion subtransactions
floating around, and any system is going to have to fall back on some
sort of mitigation strategy at some point. There's no useful limit on
the number of subxids a transaction can have, so unless you're
prepared to throw an unbounded amount of memory at the problem you're
going to eventually have to punt.
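
For what it's worth, the fallback could look roughly like this - again
just an illustrative sketch with made-up names and an arbitrary trigger
condition, not the real SnapshotData/suboverflowed machinery:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct
{
    int             xcnt;           /* # of toplevel XIDs (<= MaxBackends) */
    TransactionId  *xip;
    int             subxcnt;        /* # of subtransaction XIDs */
    TransactionId  *subxip;
    bool            suboverflowed;  /* true => visibility checks hit pg_subtrans */
} SketchSnapshot;

void
compact_if_too_big(SketchSnapshot *snap, int ring_size)
{
    /*
     * If the snapshot would eat a big chunk of the ring (the threshold
     * here is arbitrary), drop the subxids and keep only the toplevel
     * XIDs.  The snapshot shrinks to at most ~MaxBackends entries; the
     * price is extra pg_subtrans lookups when checking visibility.
     */
    if (snap->xcnt + snap->subxcnt > ring_size / 4)
    {
        snap->subxcnt = 0;
        snap->suboverflowed = true;
    }
}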

It seems to me that the problem case is when you are just on the edge.
Say you have 1400 XIDs in the snapshot. If you compact the snapshot
down to toplevel XIDs, most of those will go away and you won't have
to worry about wraparound - but you will pay a performance penalty in
pg_subtrans lookups. On the other hand, if you don't compact the
snapshot, it's not that hard to imagine a wraparound occurring - four
snapshot rewrites could wrap the buffer. You would still hope that
memcpy() could finish in time, but if you're rewriting 1400 XIDs with
any regularity, it might not take that many commits to throw a spanner
into the works. If the system is badly overloaded and the backend
trying to take a snapshot gets descheduled for a long time at just the
wrong moment, it doesn't seem hard to imagine a wraparound happening.
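
(To spell out the arithmetic with the ~6500-entry buffer from above: a
1400-XID snapshot leaves roughly 6500 - 1400 = 5100 entries of slack
before the region a reader is copying starts getting clobbered, and
four rewrites of 1400 XIDs each is 5600 entries, so a reader that gets
stalled mid-copy can indeed be lapped.)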

Now, it's not hard to recover from a wraparound. In fact, we can
pretty easily guarantee that any given attempt to take a snapshot will
suffer a wraparound at most once. The writers (who are committing)
have to be serialized anyway, so anyone who suffers a wraparound can
just grab that same lock in shared mode and retry the snapshot. Now
concurrency decreases significantly, because no one else is allowed to
commit until that guy has got his snapshot, but right now that's true
*every time* someone wants to take a snapshot, so falling back to that
strategy occasionally doesn't seem prohibitively bad. However, you
don't want it to happen very often, because even leaving aside the
concurrency hit, it's double work: you have to try to take a snapshot,
realize you've had a wraparound, and then retry. It seems pretty
clear that with a big enough ring buffer the wraparound problem will
become so infrequent as to be not worth worrying about. I'm
theorizing that even with a quite small ring buffer the problem will
still be infrequent enough not to worry about, but that might be
optimistic. I think I'm going to need some kind of test case that
generates very large, frequently changing snapshots.
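
Continuing the earlier sketch, the retry path might look something like
this, using a pthread rwlock purely as a stand-in for whatever lock
serializes committers (in reality presumably an LWLock); take_snapshot
and commit_lock are made-up names:

#include <pthread.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* from the earlier sketch */
typedef struct SnapRing SnapRing;
extern int snap_ring_read(volatile SnapRing *ring, TransactionId *dst);

/* committers would hold this exclusively while appending to the ring */
static pthread_rwlock_t commit_lock = PTHREAD_RWLOCK_INITIALIZER;

int
take_snapshot(volatile SnapRing *ring, TransactionId *dst)
{
    /* Fast path: no lock at all; just copy and check for a lap. */
    int         n = snap_ring_read(ring, dst);

    if (n >= 0)
        return n;

    /*
     * Slow path: block committers and copy the (now frozen) latest
     * snapshot again.  Nothing can be written to the ring while we hold
     * the lock in shared mode, so this attempt cannot wrap; hence a
     * given snapshot retries at most once.
     */
    pthread_rwlock_rdlock(&commit_lock);
    n = snap_ring_read(ring, dst);
    pthread_rwlock_unlock(&commit_lock);
    return n;
}

Obviously the real version needs the committers to actually take that
lock exclusively around their ring writes, plus appropriate barriers,
but that's the general shape of it.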

Of course even if wraparound turns out not to be a problem there are
other things that could scuttle this whole approach, but I think the
idea has enough potential to be worth testing. If the whole thing
crashes and burns I hope I'll at least learn enough along the way to
design something better...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
