Re: Remove_temp_files_after_crash and significant recovery/startup time

From: Jeremy Schneider <schnjere(at)amazon(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, "McCoy, Shawn" <shamccoy(at)amazon(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Remove_temp_files_after_crash and significant recovery/startup time
Date: 2021-09-15 21:24:35
Message-ID: 7514f739-41f1-63d6-a0ed-673448ace1b5@amazon.com
Lists: pgsql-hackers

On 9/10/21 14:57, Tomas Vondra wrote:
> On 9/10/21 10:58 PM, McCoy, Shawn wrote:
>> I noticed that the new parameter remove_temp_files_after_crash is
>> currently set to a default value of "true" in the version 14 release.
>> It seems this was discussed in this thread [1], and it doesn't look to
>> me like there's been a lot of stress testing of this feature.
>
> Not sure what we could learn from a stress test? IMHO it's fairly
> natural that if there are many temporary files and/or if deleting a file
> is expensive on a given filesystem, the cleanup may take time.

The thing that comes to mind for me is just getting a sense of what the
curve looks like for number of files versus startup time. If I can find
some time then I'll poke around and share numbers.
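
If I do get to it, I'd probably start with something rough like the
sketch below (assumptions only: a disposable local cluster, psql and
pg_ctl on PATH, PGDATA set, and a tiny work_mem being enough to make
the sorts spill to pgsql_tmp):

    # Fan out a bunch of sessions whose sorts spill to disk
    for i in $(seq 1 100); do
      psql -c "SET work_mem = '64kB';
               SELECT count(*)
               FROM (SELECT g FROM generate_series(1, 5000000) g
                     ORDER BY g) s;" &
    done
    sleep 30   # give the sorts time to spill

    # SIGKILL one backend; the postmaster treats that as a crash and
    # restarts, which is the path where the temp-file cleanup runs
    kill -9 "$(psql -At -c "SELECT pid FROM pg_stat_activity
                            WHERE backend_type = 'client backend'
                              AND pid <> pg_backend_pid()
                            LIMIT 1")"

    # Recovery time can then be read off the log timestamps, from the
    # crash message to 'database system is ready to accept connections'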

I remember a while ago working with a PostgreSQL user who had a major
outage crisis on their primary production database. They were having
some minor issues, and they decided to do a "quick" restart to see if it
would clear things out. The restart ended up taking something like a day
or two, and the business was down the whole time. Working with them to
figure out what was happening, we found that their very-DDL-heavy
workload had combined with a stuck checkpointer process: no checkpoints
had completed for over a week. The only choices were waiting for the WAL
to replay or taking data loss; we couldn't even get out of pain with a
restore from backup, since a restore still required replaying all the
same WAL.

There are certain core features in a database that you really need to be
as reliable and robust as possible. IMO, for critical production
databases, quick-as-possible-restarts are one of those.

>> In our fleet there have been cases where we have seen hundreds of
>> thousands of temp files generated.  I found a case where we helped a
>> customer that had a little over 2.2 million temp files.  Single
>> threaded cleanup of these takes a significant amount of time and
>> delays recovery. In RDS, we mitigated this by moving the pgsql_tmp
>> directory aside, start the engine and then separately remove the old
>> temp files.
>>
>> After noticing the current plans to default this GUC to "on" in v14,
>> just thought I'd raise the question of whether this should get a
>> little more discussion or testing with higher numbers of temp files?
>>
>
> I doubt we can learn anything new from such testing.
>
> Of course, we can discuss the default for the GUC. I see it as a trade
> off between risk of running out of disk space and increased recovery
> time, and perhaps the decision to prioritize lower risk of running out
> of disk space was not the right one ...
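
(For reference, the knob under discussion, as it ships in 14 -- it's a
SIGHUP-context GUC, so outside of a stuck startup it can also be
changed with a plain reload:)

    # postgresql.conf -- v14 default
    remove_temp_files_after_crash = on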

I'm doing a little asking around with colleagues. I'm having trouble
finding cases where people went end-to-end and figured out exactly what
in the workload was causing the high number of temp files. However,
there seem to be a fair number of incidents involving temp file counts
in the hundreds of thousands.

One thing that seems possible is that in some of these cases, the temp
files were accumulating across many engine crashes - those cases would
not be an issue once you started cleaning up on every restart. However,
I suspect there are still cases where high connection counts plus some
erratic workload characteristic or bug cause accumulation without
multiple crashes. If I learn more, I'll pass it along.
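
Incidentally, on 12 and later, an easy way to spot that kind of
accumulation is to count the files from SQL (default tablespace only,
and it needs superuser or the pg_monitor role):

    psql -c "SELECT count(*) FROM pg_ls_tmpdir();"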

Frankly, if the GUC defaults to off, then we're a lot less likely to
find out if there /are/ issues. Kinda like LLVM and parallel query... at
some point you just have to turn it on... even if you're not 100% sure
where all the sharp edges are yet... PostgreSQL meme: "I test in someone
else's production"

All of that said, FWIW, if a restart is taking too long then a user can
always turn the GUC off and cancel/retry the startup. So this is not the
same as a stuck checkpointer, because there's simple recourse.
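
For anyone who hits it, that recourse looks roughly like this (a sketch
only, assuming a default data-directory layout; paths are illustrative):

    # Option 1: stop the stuck startup, flip the GUC off in
    # postgresql.conf (the server isn't up yet, so ALTER SYSTEM isn't
    # an option), and start again
    pg_ctl -D "$PGDATA" stop -m immediate
    echo "remove_temp_files_after_crash = off" >> "$PGDATA/postgresql.conf"
    pg_ctl -D "$PGDATA" start

    # Option 2: the mitigation Shawn described -- move pgsql_tmp aside,
    # start, and delete the old files out-of-band (the server recreates
    # the directory when it next needs a temp file)
    mv "$PGDATA/base/pgsql_tmp" "$PGDATA/base/pgsql_tmp.old"
    pg_ctl -D "$PGDATA" start
    rm -rf "$PGDATA/base/pgsql_tmp.old" &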

For my part, I appreciate this discussion. I missed it if these points
were debated when the feature was first committed, and I can see
arguments both ways. It's not without precedent to have a new feature
turned off by default for its first major release. But we're talking
about a corner-case situation, and it's not like users are without
recourse.

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
