Re: Relation extension scalability

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation extension scalability
Date: 2015-03-30 00:02:06
Message-ID: CA+Tgmoa8nddwpMA4Emn7sVoNrQ883mn8ZJBiqXa8dm3puKn=qQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
> block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized
>
> The problems come from 4) and 5) potentially each taking a fair
> while. If the working set mostly fits into shared_buffers, 4) can
> require iterating over all shared buffers several times to find a
> victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
> dirty limits, 5) can take a couple of seconds.

Interesting. I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock altogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and initialize
> it before us. ISTM that could be resolved by *not* writing any data via
> smgrextend()/mdextend(). If we instead only do the write once we've read
> in & locked the page exclusively there's no need for the extension
> lock. We probably still should write out the new page to the OS
> immediately once we've initialized it; to avoid creating sparse files.
>
> The other reason we need the extension lock is code like
> lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
> pages that are about to be initialized by the extending backend. I think
> we should just remove that code and deal with the problem by retrying in
> the extending backend; that's why I think moving extension to a
> different file might be helpful.

I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block. Once
we've initialized the block, a subsequent failure to write or fsync it
will be hard to recover from; basically, we won't be able to
checkpoint any more. If we discover the problem while the block is
still all-zeroes, the transaction that uncovers the problem errors
out, but the system as a whole is still OK.

Or at least, that's what I think. Maybe I'm misunderstanding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
