Re: extending relations more efficiently

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: extending relations more efficiently
Date: 2012-05-01 15:42:45
Message-ID: 201205011742.46203.andres@anarazel.de
Lists: pgsql-hackers

On Tuesday, May 01, 2012 05:06:11 PM Robert Haas wrote:
> On Tue, May 1, 2012 at 10:31 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> efficient than our current method - I'm guessing that it actually
> >> writes the updated metadata back to disk, where write() does not (this
> >> makes one wonder how safe it is to count on write to have the behavior
> >> we need here in the first place).
> >
> > Currently the write() doesn't need to be crashsafe because it will be
> > repeated on crash-recovery and a checkpoint will fsync the file.
>
> That's not what I'm worried about. If the write() succeeds and then a
> subsequent close() on the filehandle reports an ENOSPC condition that
> means the write didn't really write after all, I am concerned that we
> might not handle that cleanly.
Hm. While write() might not write its state to disk, I don't think that
implies that the *in memory* state is inconsistent.
POSIX doesn't allow ENOSPC for close() as far as I can see.
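
To illustrate what I mean about the error being reported at write() time
rather than at close() - a minimal sketch (not the actual md.c code; the
function name and constants are made up) of extending a file by one
zero-filled block:

#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Sketch only: extend the file by one zeroed block at 'offset'. */
static int
extend_block(int fd, off_t offset)
{
    char        zbuf[BLCKSZ];
    ssize_t     nwritten;

    memset(zbuf, 0, sizeof(zbuf));

    nwritten = pwrite(fd, zbuf, BLCKSZ, offset);
    if (nwritten < 0)
        return -1;              /* errno is already set, e.g. ENOSPC */
    if (nwritten != BLCKSZ)
    {
        /* short write: the kernel ran out of space partway through */
        errno = ENOSPC;
        return -1;
    }
    /* the data sits in the page cache; a later checkpoint fsyncs it */
    return 0;
}

The point being that a full filesystem shows up in the write() return value
here, not later at close().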

> > I don't really see why it would need to compare in the 8kb case. What
> > reason would there be to further extend in such small increments?
> In previous discussions, the concern has been that holding the
> relation extension lock across a multi-block extension would cause
> latency spikes for both the process doing the extensions and any other
> concurrent processes that need the lock. Obviously if it were
> possible to extend by 64kB in the same time it takes to extend by 8kB
> that would be awesome, but if it takes eight times longer then things
> don't look so good.
Yes, sure.

> > There is the question of whether this should be done in the background,
> > though, so that the relation extension lock is never hit in anything
> > time-critical...
> Yeah, although I'm fuzzy on how and whether that can be made to work,
> which is not to say that it can't.
The biggest problem I see is knowing when to trigger the extension of which
file without scanning files all the time.

Using a limited-size shm queue of the {reltblspc, relfilenode} of to-be-
extended files + a latch is the first thing I can think of. Every time a
backend initializes a page with offset % EXTEND_SIZE == 0 it adds that table
to the queue. The background writer extends the file by EXTEND_SIZE * 2 if
necessary. If the queue overflows, all files are checked. Or the backends
extend themselves again...
EXTEND_SIZE should probably scale with the table size up to 64MB or so...
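
Very roughly, something like the sketch below (names like ExtendQueue and
request_preextension are made up; locking details, the consumer side in the
background writer, and overflow recovery are elided):

#include "postgres.h"
#include "storage/latch.h"
#include "storage/spin.h"

#define EXTEND_QUEUE_SIZE 64

typedef struct ExtendQueueEntry
{
    Oid         spcNode;        /* tablespace */
    Oid         relNode;        /* relfilenode */
} ExtendQueueEntry;

typedef struct ExtendQueue
{
    slock_t     mutex;
    int         nentries;
    bool        overflowed;     /* full: bgwriter falls back to scanning */
    ExtendQueueEntry entries[EXTEND_QUEUE_SIZE];
} ExtendQueue;

/*
 * Called by a backend after initializing a page whose block number is a
 * multiple of EXTEND_SIZE: remember the relation and wake the background
 * writer, which then extends it by EXTEND_SIZE * 2 blocks if necessary.
 */
static void
request_preextension(ExtendQueue *queue, Latch *bgwriter_latch,
                     Oid spcNode, Oid relNode)
{
    SpinLockAcquire(&queue->mutex);
    if (queue->nentries < EXTEND_QUEUE_SIZE)
    {
        queue->entries[queue->nentries].spcNode = spcNode;
        queue->entries[queue->nentries].relNode = relNode;
        queue->nentries++;
    }
    else
        queue->overflowed = true;
    SpinLockRelease(&queue->mutex);

    SetLatch(bgwriter_latch);
}

On overflow the flag just tells the background writer to fall back to
scanning, which matches the "all files are checked" fallback above.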

> It might also be interesting to provide a mechanism to pre-extend a
> relation to a certain number of blocks, though if we did that we'd
> have to make sure that autovac got the memo not to truncate those
> pages away again.
Hm. I have to say I don't really see a big need to do this if the size of the
preallocation is adaptive to the file size. Sounds like it would add too much
complication for little benefit.

Andres
