Remove_temp_files_after_crash and significant recovery/startup time

From: "McCoy, Shawn" <shamccoy(at)amazon(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Remove_temp_files_after_crash and significant recovery/startup time
Date: 2021-09-10 20:58:20
Message-ID: E7573D54-A8C9-40A8-89D7-0596A36ED124@amazon.com
Lists: pgsql-hackers

I noticed that the new parameter remove_temp_files_after_crash is currently set to a default value of "true" in the version 14 release. It seems this was discussed in this thread [1], and it doesn't look to me like there's been a lot of stress testing of this feature.

In our fleet there have been cases where we have seen hundreds of thousands of temp files generated. I found a case where we helped a customer that had a little over 2.2 million temp files. Single-threaded cleanup of these takes a significant amount of time and delays recovery. In RDS, we mitigated this by moving the pgsql_tmp directory aside, starting the engine, and then removing the old temp files separately.
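
For illustration, the mitigation described above could be sketched roughly as follows. This is not the actual RDS tooling: the function names are hypothetical, and the sketch assumes the server is stopped while the directory is moved, that the temp files live in a pgsql_tmp directory directly under the path passed in (for the default tablespace that would be base/ under the data directory), and that the renamed directory stays on the same filesystem so the move is a single rename.

```python
import os
import shutil
import time


def move_temp_dir_aside(parent_dir):
    """Rename parent_dir/pgsql_tmp to a unique sibling name.

    Hypothetical helper. Returns the new path, or None if there is no
    pgsql_tmp directory to move. Assumes the server is not running.
    """
    tmp_dir = os.path.join(parent_dir, "pgsql_tmp")
    if not os.path.isdir(tmp_dir):
        return None
    aside = os.path.join(parent_dir, "pgsql_tmp.old.%d" % time.time_ns())
    # One rename on the same filesystem; the cost does not depend on
    # how many temp files the directory contains.
    os.rename(tmp_dir, aside)
    return aside


def remove_aside_dir(aside):
    """Delete the moved-aside directory; meant to run after startup."""
    shutil.rmtree(aside, ignore_errors=True)
```

The point of the split is that only the rename sits on the startup path; the slow per-file deletion runs afterwards, outside recovery.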

After noticing the current plans to default this GUC to "on" in v14, I thought I'd raise the question of whether this should get a little more discussion, or more testing with higher numbers of temp files.

Regards,
Shawn McCoy
Database Engineer
Amazon Web Services

[1] https://www.postgresql.org/message-id/CAH503wDKdYzyq7U-QJqGn%3DGm6XmoK%2B6_6xTJ-Yn5WSvoHLY1Ww%40mail.gmail.com
