Re: Speed up Clog Access by increasing CLOG buffers

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-03-22 10:52:21
Message-ID: 20160322105221.GD3790@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi,

On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
> * Record the final state of transaction entries in the commit log for
> * all entries on a single page. Atomic only on this page.
> *
> + * Group the status update for transactions. This improves the efficiency
> + * of the transaction status update by reducing the number of lock
> + * acquisitions required for it. To achieve the group transaction status
> + * update, we need to populate the transaction status related information
> + * in shared memory and doing it for overflowed sub-transactions would need
> + * a big chunk of shared memory, so we are not doing this optimization for
> + * such cases. This optimization is only applicable if the transaction and
> + * all child sub-transactions belong to same page which we presume to be the
> + * most common case, we might be able to apply this when they are not on same
> + * page, but that needs us to map sub-transactions in proc's XidCache based
> + * on pageno for which each time a group leader needs to set the transaction
> + * status and that can lead to some performance penalty as well because it
> + * needs to be done after acquiring CLogControlLock, so let's leave that
> + * case for now. We don't do this optimization for prepared transactions
> + * as the dummy proc associated with such transactions doesn't have a
> + * semaphore associated with it and the same is required for group status
> + * update. We choose not to create a semaphore for dummy procs for this
> + * purpose as the advantage of using this optimization for prepared transactions
> + * is not clear.
> + *

I think you should try to break up some of the sentences, one of them
spans 7 lines.

I'm actually rather unconvinced that it's all that common that all
subtransactions are on one page. If you have concurrency - otherwise
there'd be not much point in this patch - they'll usually be heavily
interleaved, no? You can argue that you don't care about subxacts,
because they're more often used in less concurrent scenarios, but if
that's the argument, it should actually be made.

> * Otherwise API is same as TransactionIdSetTreeStatus()
> */
> static void
> TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
> TransactionId *subxids, XidStatus status,
> - XLogRecPtr lsn, int pageno)
> + XLogRecPtr lsn, int pageno,
> + bool all_xact_same_page)
> +{
> + /*
> + * If we can immediately acquire CLogControlLock, we update the status
> + * of our own XID and release the lock. If not, use group XID status
> + * update to improve efficiency and if still not able to update, then
> + * acquire CLogControlLock and update it.
> + */
> + if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> + {
> + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> + LWLockRelease(CLogControlLock);
> + }
> + else if (!all_xact_same_page ||
> + nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> + IsGXactActive() ||
> + !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
> + {
> + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> +
> + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> +
> + LWLockRelease(CLogControlLock);
> + }
> +}
>

This code is a bit arcane. I think it should be restructured to
a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
lock acquire first can be rather expensive.
b) I'd rather see an explicit fallback for the
!TransactionGroupUpdateXidStatus case, this way it's too hard to
understand. It's also harder to add probes to detect whether that

> +
> +/*
> + * When we cannot immediately acquire CLogControlLock in exclusive mode at
> + * commit time, add ourselves to a list of processes that need their XIDs
> + * status update.

At this point my "ABA Problem" alarm goes off. If it's not an actual
danger, can you please document close by, why not?

> The first process to add itself to the list will acquire
> + * CLogControlLock in exclusive mode and perform TransactionIdSetPageStatusInternal
> + * on behalf of all group members. This avoids a great deal of contention
> + * around CLogControlLock when many processes are trying to commit at once,
> + * since the lock need not be repeatedly handed off from one committing
> + * process to the next.
> + *
> + * Returns true, if transaction status is updated in clog page, else return
> + * false.
> + */
> +static bool
> +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> + XLogRecPtr lsn, int pageno)
> +{
> + volatile PROC_HDR *procglobal = ProcGlobal;
> + PGPROC *proc = MyProc;
> + uint32 nextidx;
> + uint32 wakeidx;
> + int extraWaits = -1;
> +
> + /* We should definitely have an XID whose status needs to be updated. */
> + Assert(TransactionIdIsValid(xid));
> +
> + /*
> + * Add ourselves to the list of processes needing a group XID status
> + * update.
> + */
> + proc->clogGroupMember = true;
> + proc->clogGroupMemberXid = xid;
> + proc->clogGroupMemberXidStatus = status;
> + proc->clogGroupMemberPage = pageno;
> + proc->clogGroupMemberLsn = lsn;
> + while (true)
> + {
> + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> +
> + /*
> + * Add the proc to list, if the clog page where we need to update the
> + * current transaction status is same as group leader's clog page.
> + * There is a race condition here such that after doing the below
> + * check and before adding this proc's clog update to a group, if the
> + * group leader already finishes the group update for this page and
> + * becomes group leader of another group which updates different clog
> + * page, then it will lead to a situation where a single group can
> + * have different clog page updates. Now the chances of such a race
> + * condition are less and even if it happens, the only downside is
> + * that it could lead to serial access of clog pages from disk if
> + * those pages are not in memory. Tests doesn't indicate any
> + * performance hit due to different clog page updates in same group,
> + * however in future, if we want to improve the situation, then we can
> + * detect the non-group leader transactions that tries to update the
> + * different CLOG page after acquiring CLogControlLock and then mark
> + * these transactions such that after waking they need to perform CLOG
> + * update via normal path.
> + */

Needs a good portion of polishing.

> + if (nextidx != INVALID_PGPROCNO &&
> + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> + return false;

I think we're returning with clogGroupMember = true - that doesn't look
right.

> + pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> +
> + if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> + &nextidx,
> + (uint32) proc->pgprocno))
> + break;
> + }

So this indeed has ABA type problems. And you appear to be arguing above
that that's ok. Need to ponder that for a bit.

So, we enqueue ourselves as the *head* of the wait list, if there's
other waiters. Seems like it could lead to the first element after the
leader to be delayed longer than the others.

FWIW, You can move the nextidx = part of out the loop,
pgatomic_compare_exchange will update the nextidx value from memory; no
need for another load afterwards.

> + /*
> + * If the list was not empty, the leader will update the status of our
> + * XID. It is impossible to have followers without a leader because the
> + * first process that has added itself to the list will always have
> + * nextidx as INVALID_PGPROCNO.
> + */
> + if (nextidx != INVALID_PGPROCNO)
> + {
> + /* Sleep until the leader updates our XID status. */
> + for (;;)
> + {
> + /* acts as a read barrier */
> + PGSemaphoreLock(&proc->sem);
> + if (!proc->clogGroupMember)
> + break;
> + extraWaits++;
> + }
> +
> + Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> +
> + /* Fix semaphore count for any absorbed wakeups */
> + while (extraWaits-- > 0)
> + PGSemaphoreUnlock(&proc->sem);
> + return true;
> + }
> +
> + /* We are the leader. Acquire the lock on behalf of everyone. */
> + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> +
> + /*
> + * Now that we've got the lock, clear the list of processes waiting for
> + * group XID status update, saving a pointer to the head of the list.
> + * Trying to pop elements one at a time could lead to an ABA problem.
> + */
> + while (true)
> + {
> + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> + if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> + &nextidx,
> + INVALID_PGPROCNO))
> + break;
> + }

Hm. It seems like you should should simply use pg_atomic_exchange_u32(),
rather than compare_exchange?

> diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
> index c4fd9ef..120b9c0 100644
> --- a/src/backend/access/transam/twophase.c
> +++ b/src/backend/access/transam/twophase.c
> @@ -177,7 +177,7 @@ static TwoPhaseStateData *TwoPhaseState;
> /*
> * Global transaction entry currently locked by us, if any.
> */
> -static GlobalTransaction MyLockedGxact = NULL;
> +GlobalTransaction MyLockedGxact = NULL;

Hm, I'm doubtful it's worthwhile to expose this, just so we can use an
inline function, but whatever.

> +#include "access/clog.h"
> #include "access/xlogdefs.h"
> #include "lib/ilist.h"
> #include "storage/latch.h"
> @@ -154,6 +155,17 @@ struct PGPROC
>
> uint32 wait_event_info; /* proc's wait information */
>
> + /* Support for group transaction status update. */
> + bool clogGroupMember; /* true, if member of clog group */
> + pg_atomic_uint32 clogGroupNext; /* next clog group member */
> + TransactionId clogGroupMemberXid; /* transaction id of clog group member */
> + XidStatus clogGroupMemberXidStatus; /* transaction status of clog
> + * group member */
> + int clogGroupMemberPage; /* clog page corresponding to
> + * transaction id of clog group member */
> + XLogRecPtr clogGroupMemberLsn; /* WAL location of commit record for
> + * clog group member */
> +

Man, we're surely bloating PGPROC at a prodigious rate.

That's my first pass over the code itself.

Hm. Details aside, what concerns me most is that the whole group
mechanism, as implemented, only works als long as transactions only span
a short and regular amount of time. As soon as there's some variance in
transaction duration, the likelihood of building a group, where all xids
are on one page, diminishes. That likely works well in benchmarking, but
I'm afraid it's much less the case in the real world, where there's
network latency involved, and where applications actually contain
computations themselves.

If I understand correctly, without having followed the thread, the
reason you came up with this batching on a per-page level is to bound
the amount of effort spent by the leader; and thus bound the latency?

I think it's worthwhile to create a benchmark that does something like
BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
completely realistic values for network RTT + application computation),
the success rate of group updates shrinks noticeably.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2016-03-22 11:35:49 Re: Odd system-column handling in postgres_fdw join pushdown patch
Previous Message Tatsuo Ishii 2016-03-22 10:41:01 Re: multivariate statistics v14