Re: BUG #15577: Query returns different results when executed multiple times

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Bartosz Polnik <bartoszpolnik(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15577: Query returns different results when executed multiple times
Date: 2019-01-09 16:34:53
Message-ID: CAKJS1f8zFL8539hpy_gBV3JOcNRqhb1wVVhuVkp_H_E_qJU_iw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, 9 Jan 2019 at 17:01, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> On Wed, Jan 9, 2019 at 2:52 PM Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > Now the number of results from the query in the script is not stable,
> > varying between 235 or 246 results for me. Not sure what's going on
> > yet.
>
> When there are multiple matches found by the Index Scan, sometimes
> only the first one makes it through to the Nested Loop, in the leader.
> It's not because of single_match being inappropriately set as I
> speculated earlier (though that'd behave similarly I guess) but
> because the quals are somehow broken in between invocations so that
> ExecScan() filters the rest of the matches out, and then fixed again
> on the next rescan. This happens whenever the leader takes a break to
> read tuple from parallel workers in between executing the Nested Loop
> node, which it often does after reading the first of several tuples
> resulting from a single rescan.
>
> What exactly is being corrupted and how, I don't yet know, and I need
> to leave this here for today, but that's what I've got so far.

From what I can see it seems to be the parameter for the index scan
that gets the wrong value.

I changed int4eq to become:

Datum
int4eq(PG_FUNCTION_ARGS)
{
int32 arg1 = PG_GETARG_INT32(0);
int32 arg2 = PG_GETARG_INT32(1);

if (arg1 == 100112 || arg2 == 100112)
elog(NOTICE, "%d %d", arg1, arg2);
PG_RETURN_BOOL(arg1 == arg2);
}

When the query returns the correct number of rows I only see NOTICES with:

NOTICE: 100112 100112

but when the wrong results are returned I see a few such as:

NOTICE: 100112 57772

If I put a conditional breakpoint on the elog with the condition "arg2
!= 100112" I see that the arg2 value is coming from the nested loop's
parameter.

If I modify the query to remove the redundant tba.target_id = tc.id
from the WHERE clause, then I can't trigger the issue. i.e.

AND (
(tba.target_id = tc.id AND tba.group_type = 'A')
OR tba.source_id = tc.id
);

becomes:

AND (
tba.group_type = 'A'
OR tba.source_id = tc.id
);

So the parameter does not need to have its value checked again during
the qual check. (If Bartosz is looking for a quick fix then he should
be able to alter the query to that)

The index scan should have already ensured that tba.target_id = tc.id.
So the nested loop parameter must have had the correct value at the
start of the index scan, but somehow it must have changed sometime
after the first tuple was returned from the index scan and before we
asked the index for the next tuple (hence why it looked like a
semi-join's result) I've just not managed to find out how the
parameter value gets changed. The value does not seem to have been
stomped on as each time I see the incorrect parameter value, it's set
to something else that occurs in the table.

At the moment I'm a bit suspicious of the following code in ParallelQueryMain().

paramexec_space = dsa_get_address(area, fpes->param_exec);
RestoreParamExecParams(paramexec_space, queryDesc->estate);

On first look, it seems that the exec params are in shared memory? ...
but I need more time to figure out how this all fits together.

I've attached the updated script per Thomas' alterations to make it break.

The breakage is very old. I found it existed when parallel joins were
added, but not necessarily blaming that commit.

45be99f8cd Support parallel joins, and make related improvements. -- BAD
a7de3dc5c3 Support multi-stage aggregation. -- GOOD

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
mockup_bug15577_v2.sql text/plain 4.5 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Thomas Munro 2019-01-09 20:20:42 Re: BUG #15577: Query returns different results when executed multiple times
Previous Message Andres Freund 2019-01-09 16:07:28 Re: BUG #15582: ALTER TABLE/INDEX ... SET TABLESPACE does not free disk space