Re: Streaming replication and a disk full in primary

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication and a disk full in primary
Date: 2010-04-12 10:41:58
Message-ID: 4BC2F8F6.8070506@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> doc/src/sgml/config.sgml
> - archival or to recover from a checkpoint. If standby_keep_segments
> + archival or to recover from a checkpoint. If
> <varname>standby_keep_segments</>
>
> The word "standby_keep_segments" always needs the <varname> tag, I think.

Thanks, fixed.

> We should remove the document "25.2.5.2. Monitoring"?

I updated it to no longer claim that the primary can run out of disk
space because of a hung WAL sender. The information about calculating
the lag between primary and standby still seems valuable, so I didn't
remove the whole section.
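
As an illustration of the lag calculation that section describes, here is a minimal sketch in C. It assumes the two WAL locations are available as "hi/lo" hexadecimal strings, the form printed by pg_current_xlog_location() on the primary and pg_last_xlog_receive_location() on the standby. The helper names parse_wal_location and wal_lag_bytes are hypothetical, and treating a location as a flat 64-bit byte position is a simplification that may be slightly off for releases whose log files do not span a full 2^32 bytes.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse a WAL location of the form "hi/lo" (both hexadecimal) into a single
 * 64-bit byte position. Returns 0 on success, -1 on malformed input.
 * Assumption: the position is simply (hi << 32) | lo.
 */
static int parse_wal_location(const char *text, uint64_t *pos)
{
    uint32_t hi, lo;
    char trailing;

    /* Exactly two fields must match; a third means trailing garbage. */
    if (sscanf(text, "%" SCNx32 "/%" SCNx32 "%c", &hi, &lo, &trailing) != 2)
        return -1;
    *pos = ((uint64_t) hi << 32) | lo;
    return 0;
}

/*
 * Compute how many bytes the standby is behind the primary.
 * Returns 0 on success and stores the difference in *lag, -1 on bad input.
 */
static int wal_lag_bytes(const char *primary, const char *standby, int64_t *lag)
{
    uint64_t p, s;

    if (parse_wal_location(primary, &p) != 0 ||
        parse_wal_location(standby, &s) != 0)
        return -1;
    *lag = (int64_t) (p - s);
    return 0;
}
```

Calling wal_lag_bytes("1/0", "0/FF000000", &lag) under these assumptions yields a lag of 16777216 bytes, which is one 16 MB segment.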

> Why is standby_keep_segments used even if max_wal_senders is zero?
> In that case, ISTM we don't need to keep any WAL files in pg_xlog
> for the standby.

True. I don't think we should second-guess the admin on that, though.
Perhaps he only set max_wal_senders=0 temporarily, and will be
disappointed if the logs are no longer there when he sets it back to
non-zero and restarts the server.

> When XLogRead() reads two WAL files and only the older of them is recycled
> during being read, it might fail in checking whether the read data is valid.
> This is because the variable "recptr" can advance to the newer WAL file
> before the check.

Thanks, fixed.
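
To make the problem concrete, here is a small model of the check in C. It is not the patch's code: segment_of, read_still_valid and the "last removed segment" value are hypothetical stand-ins, and 16 MB is assumed as the segment size. The point is that the validity check has to look at the segment where the read started, because that is the oldest segment touched; a pointer that has already advanced into the newer file would hide the recycling of the older one.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEG_SIZE (16u * 1024 * 1024)    /* assumed WAL segment size */

/* Segment number that contains the given byte position. */
static uint64_t segment_of(uint64_t pos)
{
    return pos / SEG_SIZE;
}

/*
 * After a read has finished, decide whether the data can be trusted.
 * check_pos must be the position where the read STARTED: that is the oldest
 * segment the read touched, so it is the first one to be recycled.
 * last_removed_seg is -1 if nothing has been recycled yet.
 */
static bool read_still_valid(uint64_t check_pos, int64_t last_removed_seg)
{
    return (int64_t) segment_of(check_pos) > last_removed_seg;
}
```

For a read that starts 100 bytes before the end of segment 4 and runs 200 bytes, passing the start position correctly reports the read as invalid once segment 4 has been recycled, while passing the advanced end position (in segment 5) would wrongly report it as valid.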

> When walreceiver has gotten stuck for some reason, walsender would be
> unable to pass through the send() system call, and also get stuck.
> In the patch, such a walsender cannot exit forever because it cannot
> call XLogRead(). So I think that the bgwriter needs to send the
> exit-signal to such a too lagged walsender. Thought?

Any backend can get stuck like that.

> The shmem of latest recycled WAL file is updated before checking whether
> it's already been archived. If archiving is not working for some reason,
> the WAL file which that shmem indicates might not actually have been
> recycled yet. In this case, the standby cannot obtain the WAL file from
> the primary because it's been marked as "latest recycled", and from the
> archive because it's not been archived yet. This seems to be a big problem.
> How about moving the update of the shmem to after calling XLogArchiveCheckDone()
> in RemoveOldXlogFiles()?

Good point. It's particularly important considering that if a segment
hasn't been archived yet, it's not available to the standby from the
archive either. I changed that.
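
The ordering can be sketched as follows. This is a simplified model, not the actual RemoveOldXlogFiles() code: SharedState, archive_done and remove_old_segments are hypothetical names, and archiving is assumed to proceed strictly in segment order. The "latest removed" marker is published only for segments that have passed the archive check, so a segment that is still waiting to be archived is never advertised as gone.

```c
#include <stdint.h>

/* Hypothetical stand-in for the shared-memory marker. */
typedef struct
{
    int64_t last_removed_seg;   /* -1 until something has been recycled */
} SharedState;

/* Assumed in-order archiving: everything up to archived_through is done. */
static int archive_done(int64_t seg, int64_t archived_through)
{
    return seg <= archived_through;
}

/*
 * Walk the candidate segments first..last and remove those that are safe.
 * Returns the number of segments removed.
 */
static int remove_old_segments(SharedState *shared, int64_t first,
                               int64_t last, int64_t archived_through)
{
    int removed = 0;

    for (int64_t seg = first; seg <= last; seg++)
    {
        /* Not archived yet: keep the file and leave the marker alone. */
        if (!archive_done(seg, archived_through))
            continue;

        /* Publish the marker only after the archive check has passed. */
        if (seg > shared->last_removed_seg)
            shared->last_removed_seg = seg;

        removed++;              /* real code would recycle the file here */
    }
    return removed;
}
```

With candidates 1 through 4 and archiving complete only through segment 2, the marker ends at 2 and segments 3 and 4 remain available on the primary.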

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
