Re: Snapbuild woes followup

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Snapbuild woes followup
Date: 2021-01-25 20:00:08
Message-ID: 20210125200008.nwk55c7elvlo2hf2@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

Thomas, CCing you because of the condvar bit below.

On 2021-01-25 19:28:33 +0200, Heikki Linnakangas wrote:
> In SnapBuildFindSnapshot(), we have this comment:
> > /*
> > * c) transition from START to BUILDING_SNAPSHOT.
> > *
> > * In START state, and a xl_running_xacts record with running xacts is
> > * encountered. In that case, switch to BUILDING_SNAPSHOT state, and
> > * record xl_running_xacts->nextXid. Once all running xacts have finished
> > * (i.e. they're all >= nextXid), we have a complete catalog snapshot. It
> > * might look that we could use xl_running_xact's ->xids information to
> > * get there quicker, but that is problematic because transactions marked
> > * as running, might already have inserted their commit record - it's
> > * infeasible to change that with locking.
> > */
>
> This was added in commit 955a684e040; before that we did in fact use the
> xl_running_xacts list of XIDs, but it was buggy. Commit 955a684e040 fixed
> that by waiting for *two* xl_running_xacts, such that the second
> xl_running_xact doesn't contain any of the XIDs from the first one.

It's not just that: we also don't believe ->xids to be consistent for
visibility purposes...

> To fix the case mentioned in that comment, where a transaction listed in
> xl_running_xacts is in fact already committed or aborted, wouldn't it be
> more straightforward to check each XID, if they are in fact already
> committed or aborted? The CLOG is up-to-date here, I believe.

Well, we can't *just* go to the clog, since it will show transactions as
committed/aborted even when they are not yet visible, because they are
still in the procarray. And I don't think it's easy to figure out how to
crosscheck between clog and procarray in this instance (since we don't
have the past procarray). This is different in the recovery path, because
there we know that changes to the procarray / knownassignedxids
machinery are only made by one backend.
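
To illustrate the ordering problem (this is the normal-path rule, not
something we can apply during decoding): clog reports a transaction as
committed before it drops out of the procarray, so a correct check has
to consult the procarray first - and we have no historic procarray to
consult for a point in the past. A minimal sketch, using the existing
procarray / transam helpers:

/*
 * Hypothetical helper, valid on the normal (non-decoding) path only:
 * the in-progress check has to come before trusting clog, because a
 * transaction is flagged committed in clog before it leaves the
 * procarray.  During decoding there is no historic procarray to ask.
 */
static bool
xact_committed_and_visible(TransactionId xid)
{
	if (TransactionIdIsInProgress(xid))
		return false;	/* still in the procarray, not visible yet */

	return TransactionIdDidCommit(xid);	/* now safe to trust clog */
}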

> I bumped into this after I noticed that read_local_xlog_page() has a pretty
> busy polling loop, with just 1 ms delay to keep it from hogging CPU.

Hm - but why is that really related to the initial snapshot building
logic? Logical decoding constantly waits for WAL outside of that too,
no?
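
(For context, the loop Heikki is referring to looks roughly like the
following - paraphrased from read_local_xlog_page() in xlogutils.c, not
quoted exactly:)

/* paraphrase of the wait loop in read_local_xlog_page() */
while (true)
{
	XLogRecPtr	read_upto;

	/* how far has WAL been made available locally? */
	if (!RecoveryInProgress())
		read_upto = GetFlushRecPtr();
	else
		read_upto = GetXLogReplayRecPtr(NULL);

	if (targetPagePtr + reqLen <= read_upto)
		break;			/* enough WAL is available */

	CHECK_FOR_INTERRUPTS();
	pg_usleep(1000L);	/* 1 ms - this is the busy polling */
}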

ISTM that we could improve the situation substantially in a fairly easy
way, something like:

1) Improve ConditionVariableBroadcast() so it doesn't take the spinlock
if there are no waiters - afaict that's pretty trivial (see the sketch
after the wait-loop code below).
2) Replace WalSndWakeup() with ConditionVariableBroadcast().
3) Replace places that need to wait for new WAL to be written with a
call to a function doing something like:

XLogRecPtr flushed_to = GetAppropriateFlushRecPtr(); // works for both normal / recovery

/* fast path: WAL is already flushed far enough, no need to sleep */
if (flush_requirement <= flushed_to)
	return;

ConditionVariablePrepareToSleep(&XLogCtl->flush_progress_cv);

while (true)
{
	flushed_to = GetAppropriateFlushRecPtr();

	if (flush_requirement <= flushed_to)
		break;

	ConditionVariableSleep(&XLogCtl->flush_progress_cv, 0 /* new wait event TBD */);
}

ConditionVariableCancelSleep();

This should end up being more efficient than the current WalSndWakeup()
mechanism, because we'll only wake up the processes that actually need
to be woken up, rather than checking/setting each walsender's latch.
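
For 1), a minimal sketch of the idea, written as a wrapper rather than
as the actual change to ConditionVariableBroadcast(); whether the
unlocked emptiness check needs a memory barrier would have to be
checked:

/*
 * Sketch only: skip the broadcast entirely when nobody is sleeping on
 * the CV.  Reading cv->wakeup without holding the spinlock is the whole
 * point; a stale "non-empty" answer just falls through to the normal
 * (locked) path.
 */
static void
ConditionVariableBroadcastIfWaiters(ConditionVariable *cv)
{
	if (proclist_is_empty(&cv->wakeup))
		return;			/* no waiters, nothing to do */

	ConditionVariableBroadcast(cv);
}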

> P.S. There's this in SnapBuildNextPhaseAt():
>
> > /*
> > * For backward compatibility reasons this has to be stored in the wrongly
> > * named field. Will be fixed in next major version.
> > */
> > return builder->was_running.was_xmax;
>
> We could fix that now... Andres, what did you have in mind for a proper
> name?

next_phase_at seems like it'd do the trick?
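
I.e. something like this, once we can actually rename the struct member
(a sketch, assuming was_running.was_xmax simply becomes a plain
next_phase_at field in SnapBuild):

/* accessor after the hypothetical rename; no compatibility hack needed */
static TransactionId
SnapBuildNextPhaseAt(SnapBuild *builder)
{
	return builder->next_phase_at;
}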

> P.P.S. Two thinkos in comments in snapbuild.c: s/write for/wait for/.

Will push a fix.

Greetings,

Andres Freund
