Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
Date: 2012-02-21 11:19:41
Message-ID: CAHGQGwHac95hTDjex14=ovPWNin7vbQ9RNiwLWeLRqmHZ=yFvA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Feb 18, 2012 at 12:36 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Attached is a new version, fixing that, and off-by-one bug you pointed out
> in the slot wraparound handling. I also moved code around a bit, I think
> this new division of labor between the XLogInsert subroutines is more
> readable.

This patch includes not only xlog scaling improvement but also other ones.
I think it's better to extract them as separate patches and commit them first.
If we do so, the main patch would become more readable.
For example, I think that the followings can be extracted as a separate patch.

(1) Make walwriter try to initialize as many of the no-longer-needed
WAL buffers
for future use as we can.

(2) Refactor the "update full_page_writes code".

(3) Get rid of XLogCtl->Write.LogwrtResult and XLogCtl->Insert.LogwrtResult.

(4) Call TRACE_POSTGRESQL_XLOG_SWITCH() even if the xlog switch has no
work to do.

Others?

I'm not sure if (3) makes sense. In current master, those two shared variables
are used to reduce the contention of XLogCtl->info_lck and WALWriteLock.
You think they have no effect on reducing the lock contention?

In some places, the spinlock "insertpos_lck" is taken while another
spinlock "info_lck"
is being held. Is this OK? What if unfortunately inner spinlock takes
long to be taken?

+ * An xlog-switch record consumes all the remaining space on the
+ * WAL segment. We have already reserved it for us, but we still need
+ * to make sure it's been allocated and zeroed in the WAL buffers so
+ * that when the caller (or someone else) does XLogWrite(), it can
+ * really write out all the zeros.

Why do we need to write out all the remaining space with zeros? In
current master,
we don't do that. A recovery code ignores the data following XLOG_SWITCH record,
so I don't think that's required.

+ /* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
+ * if the xlog switch had no work to do, ie. if we were already at the
+ * beginning of a new XLOG segment. You can check if RecPtr points to
+ * beginning of a segment if you want to keep the distinction.
+ */
+ TRACE_POSTGRESQL_XLOG_SWITCH();

If so, RecPtr (or the flag indicating whether the xlog switch has no
work to do) should
be given to TRACE_POSTGRESQL_XLOG_SWITCH() as an argument?

The followings are trivial comments:

+ * insertion, but ẃhile we're at it, advance lastslot as much as we

Typo: 'ẃ' should be 'w'

In XLogRecPtrToBufIdx() and XLogRecEndPtrToBufIdx(), why don't you use
XLogFileSize
instead of XLogSegsPerFile * XLogSegSize?

There are some source code comments which still refer to WALInsertLock.

It's cleaner if pg_start_backup_callback() uses Insert instead of
XLogCtl->Insert,
like do_pg_start_backup(), do_pg_stop_backup() and do_pg_abort_backup() do.

+ freespace = XLOG_BLCKSZ - EndRecPtr.xrecoff % XLOG_BLCKSZ;

Though this is not incorrect, it's better to use EndOfLog instead of EndRecPtr,
like near code does.

while (extraWaits-- > 0)
PGSemaphoreUnlock(&MyProc->sem);

This should be executed also in WaitForXLogInsertionSlotToBecomeFree()?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2012-02-21 11:25:25 Re: pgsql_fdw, FDW for PostgreSQL server
Previous Message Kyotaro HORIGUCHI 2012-02-21 11:15:07 Re: Speed dblink using alternate libpq tuple storage