Re: Relation extension scalability

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-03-30 00:47:09
Message-ID: 20150330004709.GC4878@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-03-29 20:02:06 -0400, Robert Haas wrote:
> On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > As a quick recap, relation extension basically works like:
> > 1) We lock the relation for extension
> > 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> > 3) smgrnblocks() is used to find the new target block
> > 4) We search for a victim buffer (via BufferAlloc()) to put the new
> > block into
> > 5) If dirty the victim buffer is cleaned
> > 6) The relation is extended using smgrextend()
> > 7) The page is initialized
> >
> > The problems come from 4) and 5) potentially each taking a fair
> > while. If the working set mostly fits into shared_buffers 4) can
> > require iterating over all shared buffers several times to find a
> > victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
> > dirty limits 5) can take a couple seconds.
>
> Interesting. I had always assumed the bottleneck was waiting for the
> filesystem to extend the relation.

That might be the case sometimes, but it's not what I've actually
observed so far. I think most modern filesystems doing preallocation
resolved this to some degree.
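
To illustrate why step 4) in the recap can mean several passes over
shared buffers, here is a toy clock sweep. It is only a sketch: ToyBuf,
toy_find_victim() and TOY_MAX_USAGE_COUNT are made-up names, and the
real victim search also deals with a freelist, locking and atomics that
are left out here. The one thing it assumes from PostgreSQL is that
usage counts are capped at 5 and decremented by the sweep.

```c
#include <stddef.h>

/* Assumption: mirrors the cap of 5 that PostgreSQL puts on usage counts. */
#define TOY_MAX_USAGE_COUNT 5

/* Hypothetical, heavily simplified buffer header. */
typedef struct ToyBuf
{
    int usage_count;    /* bumped on access, decremented by the sweep */
    int pinned;         /* pinned buffers can never be victims */
} ToyBuf;

/*
 * Advance the clock hand until an unpinned buffer with usage_count == 0
 * is found.  Returns its index, or -1 if no victim turned up after
 * enough passes to drain every usage count (which, with counts within
 * the cap, means everything is pinned).  *inspected reports how many
 * buffer headers were looked at.
 */
static int
toy_find_victim(ToyBuf *bufs, int nbufs, int *hand, long *inspected)
{
    long limit = (long) nbufs * (TOY_MAX_USAGE_COUNT + 1);

    *inspected = 0;
    while (*inspected < limit)
    {
        int idx = *hand;
        ToyBuf *buf = &bufs[idx];

        *hand = (*hand + 1) % nbufs;
        (*inspected)++;

        if (buf->pinned)
            continue;           /* in use, skip it */
        if (buf->usage_count == 0)
            return idx;         /* cold and unpinned: our victim */
        buf->usage_count--;     /* give it another chance next pass */
    }
    return -1;
}
```

With 8 unpinned buffers that all start at usage count 5, the first call
inspects 41 headers (five full passes plus one) before it returns
buffer 0. That is the "working set mostly fits into shared_buffers"
case, scaled down.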

> > Secondly I think we could maybe remove the requirement of needing an
> > extension lock altogether. It's primarily required because we're
> > worried that somebody else can come along, read the page, and initialize
> > it before us. ISTM that could be resolved by *not* writing any data via
> > smgrextend()/mdextend(). If we instead only do the write once we've read
> > in & locked the page exclusively there's no need for the extension
> > lock. We probably still should write out the new page to the OS
> > immediately once we've initialized it; to avoid creating sparse files.
> >
> > The other reason we need the extension lock is that code like
> > lazy_scan_heap() and btvacuumscan() tries to avoid initializing
> > pages that are about to be initialized by the extending backend. I think
> > we should just remove that code and deal with the problem by retrying in
> > the extending backend; that's why I think moving extension to a
> > different file might be helpful.
>
> I thought the primary reason we did this is because we wanted to
> write-and-fsync the block so that, if we're out of disk space, any
> attendant failure will happen before we put data into the block.

Well, we only write and register an fsync. Afaics we don't actually
perform the fsync at that point. I don't think having to do the
fsync() necessarily precludes removing the extension lock.

> Once we've initialized the block, a subsequent failure to write or
> fsync it will be hard to recover from;

At the very least the buffer shouldn't become dirty before we
have successfully written it once, right? It seems quite doable to achieve that
without the lock though. We'll have to do the write without going
through the buffer manager, but that seems doable.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
