From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: extending relations more efficiently |
Date: | 2012-05-01 15:42:45 |
Message-ID: | 201205011742.46203.andres@anarazel.de |
Lists: | pgsql-hackers |
On Tuesday, May 01, 2012 05:06:11 PM Robert Haas wrote:
> On Tue, May 1, 2012 at 10:31 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> efficient than our current method - I'm guessing that it actually
> >> writes the updated metadata back to disk, where write() does not (this
> >> makes one wonder how safe it is to count on write to have the behavior
> >> we need here in the first place).
> >
> > Currently the write() doesn't need to be crashsafe because it will be
> > repeated on crash-recovery and a checkpoint will fsync the file.
>
> That's not what I'm worried about. If the write() succeeds and then a
> subsequent close() on the filehandle reports an ENOSPC condition that
> means the write didn't really write after all, I am concerned that we
> might not handle that cleanly.
Hm. While write() might not write its state to disk, I don't think that
implies that the *in-memory* state is inconsistent.
POSIX doesn't allow ENOSPC for close() as far as I can see.
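Whichever way close() behaves on a given filesystem, a caller can check the result of every step rather than only write(). A minimal sketch, assuming plain POSIX calls; `write_all` and `sync_and_close` are hypothetical helper names, not PostgreSQL functions:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes, retrying on short writes and EINTR.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
write_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL);

    while (len > 0)
    {
        ssize_t n = write(fd, buf, len);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything: retry */
            return -1;
        }
        if (n == 0)
        {
            /* Assumption of this sketch: treat "no progress" as out of space. */
            errno = ENOSPC;
            return -1;
        }
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/*
 * Hypothetical helper: report an error from either fsync() or close(),
 * preserving the errno of the first call that failed.
 */
static int
sync_and_close(int fd)
{
    if (fsync(fd) != 0)
    {
        int saved_errno = errno;

        (void) close(fd);
        errno = saved_errno;
        return -1;
    }
    return close(fd);
}
```

With this shape, an error surfacing late (at fsync() or close()) is still seen by the caller instead of being silently dropped.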
> > I don't really see why it would need to compare in the 8kb case. What
> > reason would there be to further extend in such small increments?
> In previous discussions, the concern has been that holding the
> relation extension lock across a multi-block extension would cause
> latency spikes for both the process doing the extensions and any other
> concurrent processes that need the lock. Obviously if it were
> possible to extend by 64kB in the same time it takes to extend by 8kB
> that would be awesome, but if it takes eight times longer then things
> don't look so good.
Yes, sure.
> > There is the question whether this should be done in the background
> > though, so the relation extension lock is never hit in anything
> > time-critical...
> Yeah, although I'm fuzzy on how and whether that can be made to work,
> which is not to say that it can't.
The biggest problem I see is knowing when to trigger the extension of which
file without scanning files all the time.
Using some limited-size shm queue of {reltblspc, relfilenode} entries for
to-be-extended files, plus a latch, is the first thing I can think of. Every
time a backend initializes a page with offset % EXTEND_SIZE == 0, it adds that
table to the queue. The background writer extends the file by EXTEND_SIZE * 2
if necessary. If the queue overflows, all files are checked. Or the backends
extend the files themselves again...
EXTEND_SIZE should probably scale with the table size up to 64MB or so...
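One way such scaling could look, purely as an illustration: the 1/16 ratio and the function name are made up here; only the 8kB block size and the 64MB cap come from this thread.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ            8192u                     /* block size in bytes */
#define MAX_EXTEND_BYTES  (64u * 1024u * 1024u)     /* cap: 64MB */

/*
 * Hypothetical policy: extend by 1/16 of the current relation size,
 * rounded down to whole blocks, at least one block, at most 64MB.
 */
static uint64_t
extend_size_for(uint64_t rel_bytes)
{
    uint64_t step = rel_bytes / 16;

    step -= step % BLCKSZ;      /* round down to a block boundary */
    if (step < BLCKSZ)
        step = BLCKSZ;
    if (step > MAX_EXTEND_BYTES)
        step = MAX_EXTEND_BYTES;

    assert(step % BLCKSZ == 0);
    return step;
}
```

Under this assumed policy a small table still grows one block at a time, a 16MB table grows in 1MB steps, and anything at or above 1GB hits the 64MB cap.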
> It might also be interesting to provide a mechanism to pre-extend a
> relation to a certain number of blocks, though if we did that we'd
> have to make sure that autovac got the memo not to truncate those
> pages away again.
Hm. I have to say I don't really see a big need to do this if the size of
preallocation is adaptive to the file size. Sounds like it would add too much
complication for little benefit.
Andres