Re: Relation extension scalability

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-03-29 19:52:47
Message-ID: 20150329195247.GA4878@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2015-03-29 15:21:44 -0400, Tom Lane wrote:
> > There's two things that seem to make sense to me:
>
> > First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> > and introduce a bufmgr function specifically for extension.
>
> I think that removing P_NEW is likely to require a fair amount of
> refactoring of calling code, so I'm not thrilled with doing that.
> On the other hand, perhaps all that code would have to be touched
> anyway to modify the scope over which the extension lock is held.

It's not *that* many locations that need to extend relations. In my
playing around it seemed to me they all would need to be modified
anyway; if we want to remove/reduce the scope of extension locks to deal
with the fact that somebody else could have started to use the buffer.

> > Secondly I think we could maybe remove the requirement of needing an
> > extension lock alltogether. It's primarily required because we're
> > worried that somebody else can come along, read the page, and initialize
> > it before us. ISTM that could be resolved by *not* writing any data via
> > smgrextend()/mdextend().
>
> I'm afraid this would break stuff rather thoroughly, in particular
> handling of out-of-disk-space cases.

Hm. Not a bad point.

> And I really don't see how you get
> consistent behavior at all for multiple concurrent callers if there's no
> locking.

What I was thinking is something like this:

while (true)
{
targetBuffer = AcquireFromFSMEquivalent();

if (targetBuffer == InvalidBuffer)
targetBuffer = ExtendRelation();

LockBuffer(targetBuffer, BUFFER_LOCK_EXCLUSIVE);

if (BufferHasEnoughSpace(targetBuffer))
break;

LockBuffer(BUFFER_LOCK_UNLOCK);
}

where ExtendRelation() would basically work like
while (true)
{
targetBlock = (lseek(fd, BLCKSZ, SEEK_END) - BLCKSZ)/8192;
buffer = ReadBuffer(rel, targetBlock);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(buffer);
if (PageIsNew(page))
{
PageInit(page);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
FlushBuffer(buffer);
break;
}
else
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
continue;
}
}

Obviously it's a tad more complex than that pseudocode, but I think that
basically should work. Except that it, as you can say, lead to some
oddities with out of space handling. I think it should actually be ok,
it might just be confusing to the user.

I think we might be able to address those issues by not using
lseek(SEEK_SET) but instead
fcntl(fd, F_SETFL, O_APPEND, 1);
write(fd, pre-init-block, BLCKSZ);
fcntl(fd, F_SETFL, O_APPEND, 0);
newblock = (lseek(SEEK_CUR) - BLCKSZ)/BLCKSZ;

by using O_APPEND and a pre-initialized block we can be sure to write a
block at the end, that's valid, and shouldn't run afould of any out of
space issues that we don't already have.

Unfortunately I'm not sure whether fcntl for O_APPEND is portable :(

> One idea that might help is to change smgrextend's API so that it doesn't
> need a buffer to write from, but just has an API of "add a prezeroed block
> on-disk and tell me the number of the block you added". On the other
> hand, that would then require reading in the block after allocating a
> buffer to hold it (I don't think you can safely assume otherwise) so the
> added read step might eat any savings.

Yea, I was thinking that as well. We simply could skip the reading step
by setting up the contents in the buffer manager without a read in this
case...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-03-29 20:07:49 Re: Relation extension scalability
Previous Message James Cloos 2015-03-29 19:51:11 Re: Rounding to even for numeric data type