Re: Potential G2-item cycles under serializable isolation

From: Kyle Kingsbury <aphyr(at)jepsen(dot)io>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Potential G2-item cycles under serializable isolation
Date: 2020-06-03 23:26:38
Message-ID: CAMotZ_wLtqrS_t09Q1b3ofjwV-7gr9728mV9aaN5murde7ykog@mail.gmail.com
Lists: pgsql-bugs

Oh, this is interesting! I can say that I'm running on a 24-way Xeon with
128 GB of RAM, so running out of system memory doesn't immediately seem like
a bottleneck--I'd suspect my config runs slower by dint of disks (older
SSD), filesystem settings, or maybe Postgres tuning (this is with the stock
Debian config files).

"No process wrote x" is very surprising and means the database basically
fabricated a value out of thin air--either something is very broken in
postgres or (more likely, I think) there's some state left over from a
prior run--I messaged you on Hangouts about this, but you might have to
clear the tables by hand between runs. That would only be needed if I forgot
to push my latest commit, which clears the tables in append.clj's setup!
function. If that code is there, something deeper is going wrong.
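
For clearing tables by hand between runs, something like the following might
do. This is only a rough sketch: clear_tables is a made-up helper name, the
table names are whatever the workload actually created, and psql is assumed
to pick up its connection settings from the usual PGHOST/PGUSER/PGDATABASE
environment variables.

```shell
# Hypothetical helper: truncate each named table via psql.
# Table names are supplied by the caller; none are assumed here.
clear_tables() {
  for table in "$@"; do
    # ON_ERROR_STOP makes psql exit non-zero if the statement fails.
    psql -v ON_ERROR_STOP=1 -c "TRUNCATE TABLE ${table};" || return 1
  done
}
```

Usage would be along the lines of `clear_tables some_table another_table`,
with the real table names substituted in.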

As for Jepsen's memory consumption during the analysis: there's probably
stuff I can optimize there, but it's never been an issue before--most
distributed DBs are only pushing ~100 txns/sec, not 10k, so our histories
never get this big. I know this is gonna sound weird, but slowing down
Postgres might actually help with reproducing this bug. Another possible
path is to run more tests (--test-count 100) with shorter time limits
(--time-limit 20). Or maybe inject (Thread/sleep) calls into the
transactions themselves, like in apply-mop!. Not sure!
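
The "more tests, shorter time limits" idea could look roughly like this.
It's a sketch, not a tested invocation: run_short_tests is a made-up wrapper,
and it assumes the usual Jepsen `test` subcommand. Only the --test-count and
--time-limit flags come from the suggestion above; any node, workload, or
rate options your setup needs would have to be added.

```shell
# Hypothetical wrapper: many short runs instead of one long one.
# $1 = number of test runs (default 100), $2 = seconds per run (default 20).
run_short_tests() {
  lein run test --test-count "${1:-100}" --time-limit "${2:-20}"
}
```

Shorter runs keep each history small, which should also keep the analysis
phase from eating as much memory.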

If you're having trouble sorting through results, lein run serve in the
stolon/ directory will give you a little web server for browsing the store/
directory. Might come in handy!

--Kyle

On Wed, Jun 3, 2020, 19:11 Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Wed, Jun 3, 2020 at 2:35 PM Kyle Kingsbury <aphyr(at)jepsen(dot)io> wrote:
> > It looks like you're seeing a much higher txn success rate than I
> > am--possibly due to your tuning? Might be worth adjusting --rate and/or
> > --concurrency upwards
>
> I can see what I assume is the same problem (a failure/table flip and
> a huge graph) with "--concurrency 150 -r 10000", and with autovacuum
> disabled on the Postgres side (this is the same relatively tuned
> Postgres configuration that I used when Jepsen passed for me). It's
> difficult to run the tests, so it's hard to isolate without it taking
> a long time.
>
> BTW, the tests are kind of flappy. The Linux OOM killer just killed
> Java after 20 minutes or so, for example. I assume that this is to be
> expected with the settings cranked up like this -- the analysis will
> take longer and use more memory, too. Any tips on limiting that? Is
> there any reason to think that running the same test twice will affect
> the outcome of the second test?
>
> I also see this sometimes, even though I thought I fixed it earlier --
> it seems to happen at random:
>
> Caused by: java.lang.AssertionError: Assert failed: No transaction wrote 8363 2
> t2
>
> The fact that Kyle saw such a high number of failed transactions,
> which are difficult to reproduce here, seems to suggest that the issue
> is related to running out of shared memory for predicate locks and/or
> bloat (which tends to have the side effect of increasing the need for
> predicate locks). I continue to suspect that this is related to an
> edge case with predicate locks. It could be related to running out of
> predicate locks -- maybe an issue with the lock escalation? That would
> tend to increase the number of failures by quite a lot.
>
> --
> Peter Geoghegan
>
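
One way to look into the predicate-lock theory in the quoted message is to
check the configured limit and count the SIRead locks held while a test is
running. This is a sketch only: show_predicate_lock_state is a made-up helper
name, and psql is again assumed to get its connection settings from the
environment. The setting and the pg_locks view are standard Postgres.

```shell
# Hypothetical helper: print the predicate-lock limit and current usage.
show_predicate_lock_state() {
  # Per-transaction sizing of the shared predicate lock table.
  psql -c "SHOW max_pred_locks_per_transaction;"
  # Predicate locks taken under SERIALIZABLE appear with mode SIReadLock.
  psql -c "SELECT count(*) AS siread_locks FROM pg_locks WHERE mode = 'SIReadLock';"
}
```

Running it a few times during a test would show whether the lock count is
anywhere near the point where locks get promoted to coarser granularity.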
