Re: Yet another fast GiST build

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Darafei Komяpa Praliaskouski <me(at)komzpa(dot)net>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Yet another fast GiST build
Date: 2020-09-21 14:20:19
Message-ID: 932846.1600698019@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> On 21/09/2020 02:06, Tom Lane wrote:
>> Another interesting point is that all the other index AMs seem to WAL-log
>> the new page before the smgrextend call, whereas this code is doing it
>> in the other order. I strongly doubt that both patterns are equally
>> correct. Could be that the other AMs are in the wrong though.

> My thinking was that it's better to call smgrextend() first, so that if
> you run out of disk space, you get the error before WAL-logging it. That
> reduces the chance that WAL replay will run out of disk space. A lot of
> things are different during WAL replay, so it's quite likely that WAL
> replay runs out of disk space anyway if you're living on the edge, but
> still.

Yeah. access/transam/README points out that such failures need to be
planned for, and explains what we do for heap pages;

1. Adding a disk page to an existing table.

This action isn't WAL-logged at all. We extend a table by writing a page
of zeroes at its end. We must actually do this write so that we are sure
the filesystem has allocated the space. If the write fails we can just
error out normally. Once the space is known allocated, we can initialize
and fill the page via one or more normal WAL-logged actions. Because it's
possible that we crash between extending the file and writing out the WAL
entries, we have to treat discovery of an all-zeroes page in a table or
index as being a non-error condition. In such cases we can just reclaim
the space for re-use.

So GIST seems to be acting according to that design. (Someday we need
to update this para to acknowledge that not all filesystems behave as
it's assuming.)

> I didn't notice that the other callers are doing it the other way round,
> though. I think they need to, so that they can stamp the page with the
> LSN of the WAL record. But GiST build is special in that regard, because
> it stamps all pages with GistBuildLSN.

Kind of unpleasant; that means they risk what the README points out:

In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery. The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.

I'm not sufficiently motivated to go and change it right now, though.

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Namrata Bhave 2020-09-21 14:20:29 RE: Binaries on s390x arch
Previous Message Andrey M. Borodin 2020-09-21 14:19:32 Re: Yet another fast GiST build