Re: less expensive pg_buffercache on big shmem

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ivan Kartyshov <i(dot)kartyshov(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: less expensive pg_buffercache on big shmem
Date: 2016-09-20 23:41:43
Message-ID: 4ffe7b75-e0c6-d996-dc81-264598821451@2ndquadrant.com
Lists: pgsql-hackers

On 09/05/2016 06:19 PM, Ivan Kartyshov wrote:
> On 09/03/2016 05:04 AM, Tomas Vondra wrote:
>> This patch needs a rebase, as 06d7fd6e bumped the version to 1.2.
> Thank you for a valuable hint.
>

So, will we get a rebased patch? I see the patch is back in 'needs
review' but there's no new version.

>> > If we will replace consistent method, then we should replace it with the
>> > partially consistent method (called "nonconsistent") because:
>> > 1) it's based on fast spinlocks (it's not fully up to its name, though)
>>
>> In other words, it does exactly the thing proposed up-thread, i.e.
>> locking only buffer headers.
>>
>> What do you mean by fast spinlocks? And why aren't they up to the name?
>
> Not the spinlocks - the name “nonconsistent” is somewhat misleading.
> At the moment we can't implement concurrent shared memory access
> without locks in general, so the least consistent method that has been
> proposed was the "nonconsistent" one. But roughly speaking,
> *nonconsistent* doesn't quite live up to its name, because it still
> contains a locking mechanism (spinlocks).
>

I'm a bit confused by the naming, but my understanding is that
nonconsistent only acquires (spin)locks on individual buffers, so it
does not have a consistent view on shared_buffers as a whole, but data
about individual buffers are consistent. I think that's fine and
perfectly acceptable for this extension.
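
Just to make sure we're talking about the same thing, the per-buffer
locking I have in mind is essentially the existing loop body with the
partition locks dropped - a rough sketch (9.6 field names):

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(i);
        uint32      buf_state;

        /* lock only this one buffer header while copying its fields */
        buf_state = LockBufHdr(bufHdr);

        fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
        fctx->record[i].relfilenode = bufHdr->tag.rnode.relNode;
        fctx->record[i].blocknum = bufHdr->tag.blockNum;
        /* ... copy the remaining fields as the current code does ... */

        UnlockBufHdr(bufHdr, buf_state);
    }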

>> > 2) it's *almost* the fastest one (the less time needed for execution of
>> > method, the less data core will change and as a consequence the more
>> > consistent snapshot will be)
>>
>> I'm not sure I understand what you're saying here? What do you mean by
>> "almost the fastest one"?
>
> I mean the fastest one that has been proposed ("consistent"|
> "semiconsistent"| "nonconsistent").
>
>> I'm a bit confused by the results you reported before, i.e. that
>>
>> 1) nonconsistent is 10% faster than consistent method
>> 2) semiconsistent is 5-times slower than nonconsistent method
>>
>> What does that mean? Are you referring to the duration of the queries
>> reading data from pg_buffercache, or to the impact on the other queries?
>
> Here I mean "working duration time".
>

I have no idea what "working duration time" is.

>> How can be semiconsistent 5x slower than nonconsistent? Wouldn't that
>> make it significantly slower than the consistent method?
>
> Firstly, when we want to measure the quality of pg_buffercache, we must
> measure several values:
> 1) Execution time (duration of the queries reading data from
> pg_buffercache)
> 2) How it influences the system (and other queries) during its work
>
> Secondly, the semiconsistent method is slower than the nonconsistent
> and consistent methods, but it has less influence on other queries than
> the consistent one.
>
> Thirdly, it is slower because it locks the partitions of shared memory
> in a different way than the consistent or nonconsistent methods.
> The semi-consistent strategy implies that only one buffer is locked at a
> time. Therefore it has no significant effect on the system, but it takes
> more time.
>

It'd be really useful if you could provide actual numbers, explain what
metrics you compare and how. I'm sure it makes sense to you but it's
utterly confusing for everyone else, and it also makes it impossible to
reproduce the benchmark.

I've looked at the code (applied on top of e3b607c) and I see this in
pg_buffercache_pages:

    for (j = 0; j < num_partit; j++)
    {
        if (snaptype == CONS_SNAP)
            for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
                LWLockAcquire(BufMappingPartitionLockByIndex(i),
                              LW_SHARED);
        else if (snaptype == SEMICONS_SNAP)
            LWLockAcquire(BufMappingPartitionLockByIndex(j), LW_SHARED);

        for (i = 0; i < NBuffers; i++)
        {
            ... walk through all shared buffers ...
        }
    }

How is the SEMICONS_SNAP case supposed to scan only the currently locked
partition, when the inner loop simply scans all shared buffers on every
iteration? That seems outright broken - it should inspect only the
buffers belonging to the currently locked partition.
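
For illustration, here is roughly what I'd expect the SEMICONS_SNAP
branch to look like - just a sketch, assuming a buffer belongs to the
partition given by its tag's hashcode modulo NUM_BUFFER_PARTITIONS
(which is how BufMappingPartitionLock picks the lock):

    /* only partition j is locked, so skip buffers mapped elsewhere */
    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(i);
        uint32      hashcode = BufTableHashCode(&bufHdr->tag);

        if (hashcode % NUM_BUFFER_PARTITIONS != j)
            continue;           /* not covered by the lock held on j */

        /* ... inspect and record this buffer as before ... */
    }

(Reading the tag before taking the buffer header lock is of course still
racy, this only illustrates the partition filter.)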

Secondly, I see this bit added to the loop over buffers:

    if (bufHdr->tag.forkNum == -1)
    {
        fctx->record[i].blocknum = InvalidBlockNumber;
        continue;
    }

and I have no idea why this is needed (when it was not needed before).

The patch also breaks a few comments, because it adds code between the
comment and the related code. For example, the (likely unnecessary) check
discussed above changes this:

    /* Lock each buffer header before inspecting. */
    buf_state = LockBufHdr(bufHdr);

into this

    /* Lock each buffer header before inspecting. */
    if (bufHdr->tag.forkNum == -1)
    {
        fctx->record[i].blocknum = InvalidBlockNumber;
        continue;
    }
    buf_state = LockBufHdr(bufHdr);

which is confusing (and I'd argue broken).

Similarly when releasing the locks, the original comment/code block
looks like this:

    /*
     * And release locks. We do this in reverse order for two reasons:
     * (1) Anyone else who needs more than one of the locks will be trying
     * to lock them in increasing order; we don't want to release the
     * other process until it can get all the locks it needs. (2) This
     * avoids O(N^2) behavior inside LWLockRelease.
     */
    for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
        LWLockRelease(BufMappingPartitionLockByIndex(i));

but the patch changes is to this:

    if (snaptype == CONS_SNAP)
        for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
            LWLockRelease(BufMappingPartitionLockByIndex(i));
    else if (snaptype == SEMICONS_SNAP)
        /*
         * And release locks. We do this in reverse order for two reasons:
         * (1) Anyone else who needs more than one of the locks will be
         * trying to lock them in increasing order; we don't want to
         * release the other process until it can get all the locks it
         * needs. (2) This avoids O(N^2) behavior inside LWLockRelease.
         */
        LWLockRelease(BufMappingPartitionLockByIndex(j));

Notice how the comment is moved from the loop to the block that only
unlocks a single partition. That clearly makes no sense, as the comment
is about the CONS_SNAP loop.
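
If the two branches are kept, I'd expect something along these lines
instead (just a sketch), with the comment staying attached to the loop it
actually describes:

    if (snaptype == CONS_SNAP)
    {
        /*
         * And release locks.  We do this in reverse order for two
         * reasons: (1) Anyone else who needs more than one of the locks
         * will be trying to lock them in increasing order; we don't want
         * to release the other process until it can get all the locks it
         * needs.  (2) This avoids O(N^2) behavior inside LWLockRelease.
         */
        for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
            LWLockRelease(BufMappingPartitionLockByIndex(i));
    }
    else if (snaptype == SEMICONS_SNAP)
    {
        /* only one partition lock is held, so release just that one */
        LWLockRelease(BufMappingPartitionLockByIndex(j));
    }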

Now that I read the comment, I also wonder whether the semiconsistent
case (locking one partition at a time) undermines the reasoning behind
the reverse-order unlocking - the guarantee that another process waiting
for several locks can acquire them all at once clearly no longer holds,
as pg_buffercache will likely grab the next partition lock right away.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
