Re: Result Cache node shows per-worker info even for workers not launched

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Result Cache node shows per-worker info even for workers not launched
Date: 2021-04-28 08:24:17
Message-ID: CAApHDvoGF5cViQhkLOAuBY8un4_=wTAp9wyfD9m4geCqwR4aHg@mail.gmail.com
Lists: pgsql-hackers

On Wed, 28 Apr 2021 at 00:39, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> If planned parallel workers do not get launched, the Result Cache plan
> node shows all-0 stats for each of those workers:
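For anyone skimming the thread, the symptom looks something like the
following. This is an illustrative sketch rather than the exact output
from the report, assuming a parallel plan where two planned workers
never got launched:

    ->  Result Cache (actual rows=1 loops=1000)
          Cache Key: t1.twenty
          Hits: 980  Misses: 20  Evictions: 0  Overflows: 0  Memory Usage: 5kB
          Worker 0:  Hits: 0  Misses: 0  Evictions: 0  Overflows: 0  Memory Usage: 0kB
          Worker 1:  Hits: 0  Misses: 0  Evictions: 0  Overflows: 0  Memory Usage: 0kB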

Thanks for reporting this and for the patch.

You're right that there is a problem here. I did in fact have code to
skip workers that didn't have any cache misses right up until v18 of
the patch [1], but I removed it because I was getting some test
instability in the partition_prune regression tests, which the CFbot
machines highlighted. I mentioned that in the final paragraph of [2].
I didn't name the exact test there, but I was talking about the test
in partition_prune.sql.

By the time it came to b6002a796, I did end up changing the
partition_prune tests. However, I had to back that version out again
because of problems on buildfarm animals running force_parallel_mode
= regress. By the time I re-committed Result Cache in 9eacee2e6, I
had changed the partition_prune tests to SET enable_resultcache = 0
before running that parallel test. I'd basically decided that the
test was never going to be stable on the buildfarm.

The problem there was that the results would vary depending on
whether the parallel workers managed to do any work before the main
process had done it all. Because the tests are pretty small scale,
many machines wouldn't get their worker(s) in gear and running before
the main process had finished the test. That was why I removed the
cache_misses == 0 test from explain.c: I'd thought I could make the
output stable by always printing the cache stats line for each
worker, even if it didn't assist during execution.

So, given that I removed the parallel test from partition_prune.sql
and don't have any EXPLAIN ANALYZE output for parallel tests in
resultcache.sql, it should be safe enough to put that cache_misses ==
0 test back into explain.c.
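In shape, the check is just an early "continue" in explain.c's
per-worker loop. Here's a minimal standalone sketch of the idea; the
struct and function names (WorkerCacheStats, show_worker_stats) are
simplified stand-ins for the real instrumentation code, not the
actual patch:

    #include <stdio.h>
    #include <stdint.h>

    typedef struct WorkerCacheStats
    {
        uint64_t cache_hits;
        uint64_t cache_misses;
    } WorkerCacheStats;

    static void
    show_worker_stats(const WorkerCacheStats *stats, int num_workers)
    {
        for (int n = 0; n < num_workers; n++)
        {
            /*
             * Skip workers that did no work.  There's no need to check
             * cache hits too, as a miss must always occur before a hit
             * for any given cache key.
             */
            if (stats[n].cache_misses == 0)
                continue;

            printf("Worker %d:  Hits: %llu  Misses: %llu\n",
                   n,
                   (unsigned long long) stats[n].cache_hits,
                   (unsigned long long) stats[n].cache_misses);
        }
    }

    int
    main(void)
    {
        /* Two planned workers; only worker 0 was actually launched. */
        WorkerCacheStats stats[] = {{980, 20}, {0, 0}};

        show_worker_stats(stats, 2);    /* prints worker 0's line only */
        return 0;
    }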

I've attached a patch to do this. The explain.c part is pretty
similar to your patch; I just took my original code and comment.

The patch also removes the SET force_parallel_mode = off in
resultcache.sql; that's no longer needed now that the check is back
in explain.c. I also removed the changes I made to the
explain_parallel_append function in partition_prune.sql. I shouldn't
have included those in 9eacee2e6.

I plan to push this in the next 24 hours or so.

David

[1] https://postgr.es/m/CAApHDvoOmTtNPoF-+Q1dAOMa8vWFsFbyQb_A0iUKTS5nf2DuLw@mail.gmail.com
[2] https://postgr.es/m/CAApHDvrz4f+i1wu-8hyqJ=pxYDroGA5Okgo5rWPOj47RZ6QTmQ@mail.gmail.com

Attachment Content-Type Size
resultcache_explain_fix.patch application/octet-stream 3.8 KB
