snapper vs. HEAD

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgbf(at)twiska(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: snapper vs. HEAD
Date: 2020-03-29 03:50:32
Message-ID: 31255.1585453832@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Buildfarm member snapper has been crashing in the core regression tests
since commit 17a28b0364 (well, there's a bit of a range of uncertainty
there, but 17a28b0364 looks to be the only such commit that could have
affected code in gistget.c where the crash is). Curiously, its sibling
skate is *not* failing, despite being on the same machine and compiler.

I looked into this by dint of setting up a similar environment in a
qemu VM. I might not have reproduced things exactly, but I managed
to get the same kind of crash at approximately the same place, and
what it looks like to me is a compiler bug. It's iterating
gistindex_keytest's loop over search keys

ScanKey key = scan->keyData;
int keySize = scan->numberOfKeys;

while (keySize > 0)
{
...
key++;
keySize--;
}

one time too many, and accessing a garbage ScanKey value off the end of
the keyData array, leading to a function call into never-never land.

I'm no expert on Sparc assembly code, but it looks like the compiler
forgot the ",a" (annul) modifier here:

.loc 1 181 0
andcc %o7, 64, %g0
be,pt %icc, .L134 <----
addcc %l5, -1, %l5
.loc 1 183 0
lduh [%i4+16], %o7
add %i4, %o7, %o7
lduh [%o7+12], %o7
andcc %o7, 1, %g0
bne,pt %icc, .L141
ld [%fp-32], %g2
.loc 1 163 0
ba,pt %xcc, .L134
addcc %l5, -1, %l5

causing %l5 (which contains the keySize value) to be decremented
an extra time in the case where that branch is not taken and
we fall through as far as the "ba" at the end. Even that would
not be fatal, perhaps, except that the compiler also decided to
optimize the "keySize > 0" test to "keySize != 0", for ghod only
knows what reason (surely it's not any faster on a RISC machine),
so that arriving at .L134 with %l5 containing -1 allows the loop
to be iterated again. Kaboom.

It's unclear how 17a28b0364 would have affected this, but there is
an elog() call elsewhere in the same function, so maybe the new
coding for that changed register assignments or some other
phase-of-the-moon effect.

I doubt that anyone's going to take much interest in fixing this
old compiler version, so my recommendation is to back off the
optimization level on snapper to -O1, and probably on skate as
well because there's no obvious reason why the same compiler bug
might not bite skate at some point. I was able to get through
the core regression tests on my qemu VM after recompiling
gistget.c at -O1 (with other flags the same as snapper is using).

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-03-29 05:23:09 Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)
Previous Message Noah Misch 2020-03-29 03:40:10 Re: backup manifests