RE: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode

From: Reid Thompson <Reid(dot)Thompson(at)omnicell(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>, Reid Thompson <Reid(dot)Thompson(at)omnicell(dot)com>
Subject: RE: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode
Date: 2021-04-19 16:54:16
Message-ID: SJ0PR11MB484893B3096252DC73E0F8B09E499@SJ0PR11MB4848.namprd11.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Responses inline below
Thanks,
reid

-----Original Message-----
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Sent: Monday, April 19, 2021 9:54 AM
To: Reid Thompson <Reid(dot)Thompson(at)omnicell(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode

Reid Thompson <Reid(dot)Thompson(at)omnicell(dot)com> writes:
> Hi I'm looking for some guidance related to the subject line issue.

Is this repeatable? If so you've found a bug, I think.

Alvaro has noted https://www.postgresql.org/message-id/20210419155928.GA3253%40alvherre.pgsql that the mentioned bug fix on the error reporting should correct this ( as the hash join will abort with the fix rather than not recognizing that it needs to, and then attempt a later read)
Somewhat -- I think it's dependent on what else may be writing tmp files ( i.e. it's happened twice in the past few days with successful runs in the interim and I believe that this query has been in place for a long time prior to the first occurrence)

> 1. That the error message has been updated ( i.e. SUCCESS is not
> the proper value)

Yeah, what this really indicates is an incomplete read (file shorter than expected). Since 11.8, we've improved the error reporting for that, but that wouldn't in itself fix whatever the underlying problem is.

See message noted above from Alvaro.

> 2. That the error is due to running out of temporary space either disk space or maybe temp_buffers?

That could be the proximate cause, although then there would be a bug that the original write failure wasn't detected. But it seems about as likely that there's just some inconsistency between what the temp file writing code wrote and what the reading code expects to read.

Is this a parallelized hash join by any chance? That's new in v11 if memory serves, so it'd be interesting to see if disabling enable_parallel_hash changes anything.

Explain on the host does not indicate parallelization anywhere - it does indicate that we can narrow the result set by optimizing our query which we are implementing now. Hopefully this will get us to a scheduled upgrade timeframe.

Anyway, I'd counsel updating to current (11.11), and then if you can still reproduce the problem, try to reduce it to a self-contained test case.

regards, tom lane

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Allie Crawford 2021-04-19 21:09:13 archive_commnad parameter question
Previous Message Reid Thompson 2021-04-19 16:26:39 RE: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode