Re: Failure in contrib test _int on loach

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Failure in contrib test _int on loach
Date: 2019-04-09 16:11:06
Message-ID: 8984d4a8-313b-6b37-3bbc-81048bc8608e@postgrespro.ru
Lists: pgsql-hackers

05.04.2019 19:41, Anastasia Lubennikova writes:
>
> 05.04.2019 18:01, Tom Lane writes:
>> Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> writes:
>>> On Fri, Apr 5, 2019 at 2:02 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
>>> wrote:
>>>> This is a strange failure:
>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2019-04-05%2005%3A15%3A00
>>>>
>>>> [ wrong answers from queries using a GIST index ]
>>> There are a couple of other recent instances of this failure, on
>>> francolin and whelk.
>> Yeah.  Given three failures in a couple of days, we can reasonably
>> guess that the problem was introduced within a day or two prior to
>> the first one.  Looking at what's touched GIST in that time frame,
>> suspicion has to fall heavily on
>> 9155580fd5fc2a0cbb23376dfca7cd21f59c2c7b.
>>
>> If I had to bet, I'd bet that there's something wrong with the
>> machinations described in the commit message:
>>     For GiST, the LSN-NSN interlock makes this a little tricky. All pages must
>>     be marked with a valid (i.e. non-zero) LSN, so that the parent-child
>>     LSN-NSN interlock works correctly. We now use magic value 1 for that during
>>     index build. Change the fake LSN counter to begin from 1000, so that 1 is
>>     safely smaller than any real or fake LSN. 2 would've been enough for our
>>     purposes, but let's reserve a bigger range, in case we need more special
>>     values in the future.
>>
>> I'll go add this as an open issue.
>>
>>             regards, tom lane
>>
>
> Hi,
> I've already noticed the same failure in our company buildfarm and
> started investigating.
>
> You are right, the "Generate less WAL during GiST, GIN and
> SP-GiST index build." patch is to blame.
> Because GistBuildLSN is used, some pages are not linked correctly,
> so an index scan cannot find some entries, while a seqscan finds them.
>
> Attached is a patch with a test that reproduces the bug on every run
> rather than randomly.
> Now I'm trying to find a way to fix the issue.

The problem was caused by incorrect detection of the page into which the
new tuple should be inserted after a split.
If gistinserttuple() of the tuple formed by gistgetadjusted() had to
split the page, we must go back to the parent and
descend again to the child that is a better fit for the new tuple.

Previously this was handled by the code block with the following comment:

* Concurrent split detected. There's no guarantee that the
* downlink for this page is consistent with the tuple we're
* inserting anymore, so go back to parent and rechoose the best
* child.

After GistBuildLSN was introduced, this code path became unreachable.
To fix it, I added a new flag to detect such splits during index build.
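
To illustrate why the LSN-based check stops firing, here is a minimal toy
model in C. It is not the attached patch and not the actual PostgreSQL code:
DescentState, BUILD_LSN, split_during_build and both function names are
hypothetical, and the check is reduced to a bare "parent LSN < child NSN"
comparison. It assumes that during index build every page is stamped with
the same magic value 1, as the quoted commit message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

/* Stand-in for the magic LSN value 1 used during index build (assumption). */
#define BUILD_LSN ((Lsn) 1)

/* Hypothetical, simplified view of what the insert code knows on descent. */
typedef struct
{
    Lsn  parent_lsn;          /* parent's LSN remembered when we read it */
    Lsn  child_nsn;           /* NSN stamped on the child by its last split */
    bool split_during_build;  /* hypothetical explicit "page was split" flag */
} DescentState;

/* Simplified form of the old check: relies only on the LSN/NSN comparison. */
static bool
split_detected_lsn_only(const DescentState *s)
{
    assert(s != NULL);
    return s->parent_lsn < s->child_nsn;
}

/* Sketch of the idea behind the fix: also honour an explicit flag. */
static bool
split_detected_with_flag(const DescentState *s)
{
    assert(s != NULL);
    return s->split_during_build || s->parent_lsn < s->child_nsn;
}
```

With parent_lsn and child_nsn both equal to BUILD_LSN, the comparison
1 < 1 is false, so the LSN-only check never reports the split; the flag
restores detection without depending on LSN ordering.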

The patches with the test and fix are attached.

Many thanks to Teodor Sigaev, who helped to find the bug.

--

Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
gist_optimized_wal_intarray_fix_v1.patch text/x-patch 2.1 KB
gist_optimized_wal_intarray_test_v1.patch text/x-patch 3.6 KB
