|From:||Jeff Janes <jeff(dot)janes(at)gmail(dot)com>|
|To:||Andres Freund <andres(at)anarazel(dot)de>|
|Subject:||Re: subscription worker signalling wal writer too much|
|Views:||Raw Message | Whole Thread | Download mbox|
On Wed, Jun 14, 2017 at 4:29 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2017-06-14 16:24:27 -0700, Jeff Janes wrote:
> > On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres(at)anarazel(dot)de>
> > > On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
> > > > On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
> > > wrote:
> > > >
> > > > > If I publish a pgbench workload and subscribe to it, the
> > > > > worker is signalling the wal writer thousands of times a second,
> > > for
> > > > > every async commit. This has a noticeable performance cost.
> > > > >
> > > >
> > > > I've used a local variable to avoid waking up the wal writer more
> > > once
> > > > for the same page boundary. This reduces the number of wake-ups by
> > > > 7/8.
> > >
> > > Maybe I'm missing something here, but isn't that going to reduce our
> > > guarantees about when asynchronously committed xacts are flushed out?
> > > You can easily fit a number of commits into the same page... As this
> > > isn't specific to logical-rep, I don't think that's ok.
> > >
> > The guarantee is based on wal_writer_delay not on SIGUSR1, so I don't
> > this changes that. (Also, it isn't really a guarantee, the fsync can take
> > many seconds to complete once we do initiate it, and there is absolutely
> > nothing we can do about that, other than do the fsync synchronously in
> > first place).
> Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
> this afaics would allow wal writer to go into sleep mode with half a
> page filled, and it'd not be woken up again until the page is filled.
If it is taking the big sleep, then we always wake it up, since
I didn't change that part. I only changed what we do if it not hibernating.
> > > Have you chased down why there's that many wakeups? Normally I'd have
> > > expected that a number of the SetLatch() calls get consolidated
> > > together, but I guess walwriter is "too quick" in waking up and
> > > resetting the latch?
> > I'll have to dig into that some more. The 7/8 reduction I cited was just
> > in calls to SetLatch from that part of the code, I didn't measure whether
> > the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I
> > determined that reduction, so it wasn't truly wake ups I measured.
> > wake ups were measured only indirectly via the impact on performance.
> > need to figure out how to instrument that without distorting the
> > performance too much in the process..
> I'd suspect that just measuring the number of kill() calls should be
> doable, if measured via perf or something like hta.t
It looks like only limited consolidation was happening, the number of kills
invoked was more than half of the number of SetLatch. I think the reason
is what when kill(owner_pid, SIGUSR1) is called, the kernel immediately
suspends the calling process and gives the signalled process a chance to
run in that time slice. The Wal Writer gets woken up, sees that it only
has a partial page to write and decides not to do anything, but resets the
latch. It does this fast enough that the subscription worker hasn't
migrated CPUs yet.
But, if the WAL Writer sees it has already flushed up to the highest
eligible page boundary, why doesn't the SetAsync code see the same thing?
Because it isn't flushed that high, but only written that high. It looks
like a story of three git commits:
Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Date: Sun Nov 13 09:00:57 2011 +0000
Wakeup WALWriter as needed for asynchronous commit performance.
Author: Andres Freund <andres(at)anarazel(dot)de>
Date: Mon Feb 15 22:15:35 2016 +0100
Allow SetHintBits() to succeed if the buffer's LSN is new enough.
Author: Andres Freund <andres(at)anarazel(dot)de>
Date: Mon Feb 15 23:52:38 2016 +0100
Allow the WAL writer to flush WAL at a reduced rate.
The first change made the WALWriter wake up and do a write and flush
whenever an async commit occurred and there was a completed WAL page to be
written. This way the hint bits could be set while the heap page was still
hot, because they couldn't get set until the WAL covering the hinted-at
transaction commit is flushed.
The second change said we can set hint bits before the WAL covering the
hinted-at transaction is flushed, as long the page LSN is newer than that
(so the wal covering the hinted-at transaction commit must be flushed
before the page containing the hint bit can be written).
Then the third change makes the wal writer only write the full pages most
of the times it is woken up, not flush them. It only flushes them once
every wal_writer_delay or wal_writer_flush_after, regardless of how often
it is woken up.
It seems like the third change rendered the first one useless. Wouldn't it
better, and much simpler, just to have reverted the first change once the
second one was done? If that were done, would the third change still be
needed? (It did seem to add some other features as well, so I'm going to
assume we still want those...).
But now the first change is even worse than useless, it is positively
harmful. The only thing to stop it from waking the WALWriter for every
async commit is this line:
/* if we have already flushed that far, we're done */
if (WriteRqstPtr <= LogwrtResult.Flush)
Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
becomes false due to us waking walwriter, it only becomes false once the
timeout expires (which it would have done anyway with no help from us), or
once wal_writer_flush_after is exceeded. So now every async commit is
waking the walwriter. Most of those wake up do nothing (there is not a
completely new patch to write), some write one completed page but don't
flush anything, and very few do a flush, and that one would have been done
So for the sake of wal_writer_flush_after, I guess the simplest thing to do
would be to change repeat the flushbytes calculation and wake the walwriter
only if it looks like that will be exceeded? Attached patch does that.
Maybe the computation of flushbytes should be put into a macro, rather than
repeated in the code? Or even done on the fly, I don't see any reason it
needs to be stored in a variable before testing.
This new patch is simpler than the previous one, and more effective at
speeding up replication. I assume it would speed up pgbench with
synchronous_commit turned off (or against unlogged tables) as well, but I
don't think I have the hardware needed to test that.
|Next Message||Stephen Frost||2017-06-15 22:41:08||Re: WIP: Data at rest encryption|
|Previous Message||Robert Haas||2017-06-15 21:50:06||Re: intermittent failures in Cygwin from select_parallel tests|