On Tue, Feb 20, 2018 at 7:56 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Tue, Feb 20, 2018 at 7:05 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> On Tue, Feb 20, 2018 at 6:54 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> The important part then happens in pg_dump. Note
>>>
>>> /*
>>> * To create binary-compatible heap files, we have to ensure the same
>>> * physical column order, including dropped columns, as in the
>>> * original. Therefore, we create dropped columns above and drop them
>>> * here, also updating their attlen/attalign values so that the
>>> * dropped column can be skipped properly. (We do not bother with
>>> * restoring the original attbyval setting.) Also, inheritance
>>> * relationships are set up by doing ALTER TABLE INHERIT rather than
>>> * using an INHERITS clause --- the latter would possibly mess up the
>>> * column order. That also means we have to take care about setting
>>> * attislocal correctly, plus fix up any inherited CHECK constraints.
>>> * Analogously, we set up typed tables using ALTER TABLE / OF here.
>>> */
>>> if (dopt->binary_upgrade &&
>>> (tbinfo->relkind == RELKIND_RELATION ||
>>> tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
>>> tbinfo->relkind == RELKIND_PARTITIONED_TABLE))
>>> {
>>> ...
>>> appendPQExpBufferStr(q, "\n-- For binary upgrade, set heap's relfrozenxid and relminmxid\n");
>>> appendPQExpBuffer(q, "UPDATE pg_catalog.pg_class\n"
>>> "SET relfrozenxid = '%u', relminmxid = '%u'\n"
>>> "WHERE oid = ",
>>> tbinfo->frozenxid, tbinfo->minmxid);
>>> appendStringLiteralAH(q, fmtId(tbinfo->dobj.name), fout);
>>> appendPQExpBufferStr(q, "::pg_catalog.regclass;\n");
>>>
>>> note that the above if clause doesn't include materialized tables. Which
>>> sems to explain this bug? Could you check that just updating the above
>>> if to include matviews fixes the bug for you?
>>
>> The pg_dump --binary-only test does produce the necessary SQL to set
>> relfrozenxid after that change, so it looks like it would fix it.
>>
>> To be able to fully confirm it, though, I'll have to build a mimal
>> case to reproduce the issue, because the snapshot I used it not usable
>> again (can't re-upgrade), and launching another snapshot takes a lot
>> of time. Basically, I'll get back to you with a confirmation, but it
>> does look good.
>
> Well, the attached script reproduces the issue.
>
> Make a scratch dir somewhere, cd into it, and run it. It will download
> 9.6.7 and 10.2, build them, create a test db in 9.6.7, modify it a bit
> here and there to get to the desired state, pg_upgrade it to 10.2 and
> vacuum.
>
> Might even be a good idea to create a regression test with that, but
> AFAIK there's nothing remotely like that in the current tests.
>
> The patch you propose makes the test succeed. It may be true that
> there might be other similar issues lurking in that code, but it looks
> like it fixes this particular issue.
Oops, the previous script has timing issues, the attached one really
works - just added some sleeps ;-)