Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Date: 2022-01-15 08:34:12
Message-ID: CALj2ACV+acrnWUdwSNUXNzXLNA+kFkfuT8t=wiMoPhBWKrWUeA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 14, 2022 at 1:08 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2021-12-31 18:12:37 +0530, Bharath Rupireddy wrote:
> > Currently the server is erroring out when unable to remove/parse a
> > logical rewrite file in CheckPointLogicalRewriteHeap wasting the
> > amount of work the checkpoint has done and preventing the checkpoint
> > from finishing.
>
> This seems like it'd make failures to remove the files practically
> invisible. Which'd have it's own set of problems?
>
> What motivated proposing this change?

We had an issue where crash recovery generated many mapping files and
the end-of-recovery checkpoint was taking a long time. We had to
intervene manually and delete some of the mapping files (although that
may not sound sensible) to make the end-of-recovery checkpoint faster.
Because the manual deletion raced with the checkpoint's own deletion,
an unlink failed. The resulting error crashed the server, which then
entered recovery again, wasting all of the earlier recovery work.

In summary, with the changes proposed in this thread for
CheckPointLogicalRewriteHeap (emitting LOG-only messages for unlink
failures and continuing with the other files), together with the
existing code in CheckPointSnapBuild, I'm sure we will avoid wasting
the recovery work that has already been done when unlink fails for any
reason.
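
To make the intended behavior concrete, here is a minimal standalone
sketch of the "log and continue" pattern. It is not the patch and not
PostgreSQL code: the function name remove_files_lenient, the prefix
argument, and the stderr messages (standing in for LOG-level server
messages) are all made up for illustration.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): remove every entry in "dir"
 * whose name starts with "prefix".  A failed unlink() is reported on
 * stderr, standing in for a LOG-level message, and the loop continues
 * with the next entry instead of aborting.
 *
 * Returns the number of entries that could not be removed, or -1 if the
 * directory itself could not be opened.
 */
int
remove_files_lenient(const char *dir, const char *prefix)
{
    DIR        *d = opendir(dir);
    struct dirent *de;
    size_t      prefix_len = strlen(prefix);
    int         failures = 0;

    if (d == NULL)
    {
        fprintf(stderr, "LOG: could not open directory \"%s\": %s\n",
                dir, strerror(errno));
        return -1;
    }

    while ((de = readdir(d)) != NULL)
    {
        char        path[4096];
        int         n;

        /* Skip entries that do not carry the requested prefix. */
        if (strncmp(de->d_name, prefix, prefix_len) != 0)
            continue;

        n = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (n < 0 || (size_t) n >= sizeof(path))
        {
            fprintf(stderr, "LOG: path too long for \"%s\"\n", de->d_name);
            failures++;
            continue;
        }

        if (unlink(path) != 0)
        {
            /*
             * For example ENOENT, if someone else removed the file between
             * readdir() and unlink().  Report it and keep going.
             */
            fprintf(stderr, "LOG: could not remove file \"%s\": %s\n",
                    path, strerror(errno));
            failures++;
        }
    }

    closedir(d);
    return failures;
}
```

The return value lets a caller notice that some files were left behind
without treating that as fatal, so the failures stay visible while the
rest of the work still completes.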

Regards,
Bharath Rupireddy.
