Re: Patch to improve reliability of postgresql on linux nfs

From: Florian Pflug <fgp(at)phlo(dot)org>
To: George Barnett <gbarnett(at)atlassian(dot)com>
Cc: Bernd Helmle <mailings(at)oopsware(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch to improve reliability of postgresql on linux nfs
Date: 2011-09-13 11:07:42
Message-ID: AD4A13A5-8778-4D94-BBB5-AB6C13BF752A@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

[CC'ing to the list again - I assume you omitted pgsql-hackers from the
recipient list by accident]

On Sep13, 2011, at 03:00 , George Barnett wrote:
> On 12/09/2011, at 11:39 PM, Florian Pflug wrote:
>> Also, non-interruptible IO primitives are by no means "right". At best, they're
>> a compromise between complexity and functionality for I/O devices with rather
>> short (and bounded) communication timeouts - because in that case, processes are
>> only blocked un-interruptibly for a short while.
>
> Just to expand on that - I'm now in the situation where I can run my nfs mounts
> 'nointr' and postgres will work, but that means if I lose a storage unit I have
> a number of stuck processes, effectively meaning I need to reboot all my frontend
> servers before I can fail over to backup nfs stores.
>
> However, if I run the mounts with intr, then if a storage unit fails, I can fail
> over to a backup node (taking a minor loss of data hit I'm willing to accept) but
> postgres breaks under a moderate insert load.
>
> With the patch I supplied though, I'm able to have most of my cake and eat it.
>
> I'd be very interested in moving this forward - is there something I can change
> in the patch to make it more acceptable for a merge?

Here are a few comments

Tom already remarked that if we do that for write()s, we ought to do it for read()s
also which I agree with. All other primitives like lseek, close, ... should be taken
care of by SA_RESTART, but I'd be a good idea to verify that.

Also, I don't think that POSIX mandates that errno be reset to 0 if a function returns
successfully, making that "returnCode == 0 && errno == 0" check pretty dubious. I'm not
sure of this was what Tom was getting at with his remark about the ENOSPC handling being
wrong in the retry case.

And I also think that if we do this, we might as well handle EINTR correctly, even if
our use of SA_RESTART should prevent us from ever seeing that. The rules surrounding
EINTR and SA_RESTART for read/write are quite subtle...

If we retry, shouldn't be do CHECK_FOR_INTERRUPTS? Otherwise, processes waiting for
a vanished NFS server would be killable only with SIGKILL, not SIGTERM or SIGINT.
But I'm not sure if it's safe to put that into a generic function like pg_write_nointr.

Finally, WriteAll() seems like a poor name for that function. How about pg_write_nointr()?

Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
(but obviously without the ENOSPC handling)

int pg_write_nointr(int fd, const void *bytes, Size amount)
{
int written = 0;

while (amount > 0)
{
int ret;

ret = write(fd, bytes, amount);
if ((ret < 0) && (errno == EINTR))
{
/* interrupted by signal before first byte was written. Retry */
/* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
CHECK_FOR_INTERRUPTS();
continue;
}
else if (ret < 0)
{
/* error occurred. Abort */
return -1;
}
else if (ret == 0)
{
/* out of disk space. Abort */
return written;
}

/* made progress */

/* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
CHECK_FOR_INTERRUPTS();

written += ret;
amount -= ret;
bytes = (const char *) bytes + ret;
}
}

best regards,
Florian Pflug

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Florian Pflug 2011-09-13 11:30:34 Re: Patch to improve reliability of postgresql on linux nfs
Previous Message Jeff Davis 2011-09-13 08:41:36 Re: Range Types