Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From: Matt Smiley <msmiley(at)gitlab(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Nikolay Samokhvalov <nik(at)postgres(dot)ai>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Date: 2023-08-07 19:26:59
Message-ID: CA+eRB3qhb1Nt5hDct3iO=vtry1kuMgF4uzOns9GOatSLc3vuSw@mail.gmail.com
Lists: pgsql-hackers

Thank you Tomas! I really appreciate your willingness to dig in here and
help us out! The rest of my replies are inline below.

On Thu, Aug 3, 2023 at 1:39 PM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> The analysis in the linked gitlab issue is pretty amazing. I wasn't
> planning to argue against the findings anyway, but plenty of data
> supporting the conclusions is good.
>

Thank you! I totally agree, having supporting data is so helpful.

> I'm not an expert on locking, so some of the stuff I say may be
> trivially obvious - it's just me thinking about ...
>

Absolutely makes sense to check assumptions, etc. Thanks for being open!
For what it's worth, I've also been working with Postgres for many years,
and I love that it keeps teaching me new things, this topic being just the
latest.

> I wonder what's the rough configuration of those systems, though. Both
> the hardware and PostgreSQL side. How many cores / connections, etc.?
>

Each of the postgres hosts had 96 vCPUs and at peak handled roughly 80
concurrently active connections.

For purposes of reproducing the pathology, I think a single postgres
instance will suffice. We will need a high enough query rate to push the
bottleneck to lock_manager lwlock contention. The simplest way to do that
is probably to give it a small dataset that fits easily in cache and run
several concurrent client connections doing cheap single-row queries, each
in its own transaction, against a target table that has many indexes, many
partitions, or both.
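
To make that concrete, here is a rough sketch of the kind of setup I have
in mind (the table and index names are made up and the numbers are
arbitrary, not taken from the linked issue). The idea is that planning a
non-prepared single-row SELECT locks the table plus all of its indexes, so
with 20+ indexes each query needs more than 16 relation locks and
overflows the fastpath slots:

-- Tiny table that fits in cache; names are hypothetical.
CREATE TABLE fastpath_repro (
    id  int PRIMARY KEY,
    val int
);
INSERT INTO fastpath_repro SELECT g, g FROM generate_series(1, 1000) AS g;

-- Add 20 expression indexes so one query must lock the table plus 20+
-- indexes, i.e. more than the 16 fastpath slots per backend.
DO $$
BEGIN
    FOR i IN 1..20 LOOP
        EXECUTE format(
            'CREATE INDEX fastpath_repro_idx_%s ON fastpath_repro ((val + %s))',
            i, i);
    END LOOP;
END
$$;

-- The cheap single-row query each client runs in its own transaction:
SELECT * FROM fastpath_repro WHERE id = 42;

Driving that query from a few dozen concurrent connections (e.g. pgbench
with a custom script file) should be enough to start showing lock_manager
waits in pg_stat_activity.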

For context, here's a brief summary of the production environment where we
first observed this pathology:
The writable primary postgres instance has several streaming replicas, used
for read-only portions of the workload. All of them run on equivalent
hardware. Most of the research focuses on the streaming replica postgres
instances, although the same pathology has been observed in the writable
primary node as well. The general topology is thousands of client
connections fanning down into several pgbouncer instances per Postgres
instance. From each Postgres instance's perspective, its workload
generally has a daily peak of roughly 80 concurrently active backends
supporting a throughput of 75K transactions per second, where most
transactions
run a single query.

> Yes, I agree with that. Partitioning makes this issue worse, I guess.
> Schemas with indexes on every column are disturbingly common these days
> too, which hits the issue too ...
>

Agreed.

> Those are pretty great pieces of information. I wonder if some of the
> measurements may be affecting the observation (by consuming too much
> CPU, making the contention worse), but overall it seems convincing.
>

Yes, definitely overhead is a concern, glad you asked!

Here are my notes on the overhead for each bpftrace script:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678

Here is a summary of where that overhead comes from:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365310956

Here are more generic benchmark results for uprobe overhead:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1383

Briefly, we generally expect the instrumentation overhead to be roughly
1-2 microseconds per execution of the instrumented instruction. It depends
partly on what we're doing in the instrumentation, but most of that
overhead is just the interrupt handling to transfer control flow to/from
the BPF code.

> Would it be difficult to sample just a small fraction of the calls? Say,
> 1%, to get good histograms/estimates with acceptable CPU usage.
>

That would be great, but since the overhead comes mostly from the control
transfer, it wouldn't help to put sampling logic in the tracer itself. The
main way to mitigate that overhead is to choose instrumentation points
where the call rate is tolerably low. That's why the only scripts I ran
for more than a few seconds at a time were the "low overhead" ones that
instrument only the stalled calls to LWLockAcquire.

> In any case, it's a great source of information to reproduce the issue
> and evaluate possible fixes.
>

Thanks, that's my hope!

> After looking at the code etc. I think the main trade-off here is going
> to be the cost of searching the fpRelId array. At the moment it's
> searched linearly, which is cheap for 16 locks. But at some point it'll
> become as expensive as updating the slowpath, and the question is when.
>
> I wonder if we could switch to a more elaborate strategy if the number
> of locks is high enough. Say, a hash table, or some hybrid approach.
>

Interesting idea! I was hoping a linear search would stay cheap enough,
but you're right, it's going to become too inefficient at some point. It
might make sense to start with just blackbox timing or throughput
measurements, because directly measuring that search duration may not be
cheap. To observe durations via BPF, we have to instrument two points
(e.g. function entry and return, or more generally the instructions before
and after the critical section we're observing). For code called as
frequently as LWLockAcquire, that overhead would be prohibitively
expensive, so we might need to measure it natively with counters for each
histogram bucket we care about. Just thinking ahead; we don't need to deal
with this yet, I guess.

> I understand. I have two concerns:
>
> 1) How would the users know they need to tune this / determine what's
> the right value, and what's the right value for their system.
>
> 2) Having to deal with misconfigured systems as people tend to blindly
> tune everything to 100x the default, because more is better :-(
>

These are very valid concerns. Thanks for articulating them!

For point 1:
The advice might be to increase the number of fastpath lock slots per
backend only if sampling pg_stat_activity frequently shows "lock_manager"
wait_events affecting a significant percentage of your non-idle backends.
And increase it as little as possible, given the extra overhead incurred
when checking locks. For calibration purposes, I polled pg_locks
periodically, counting the number of slowpath locks whose lock mode was
weak enough to have qualified for fastpath if enough fastpath slots had
been available. See the graph and SQL at the bottom of this note:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142
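
The actual queries are in the linked note; just to sketch the shape of the
two checks I'm describing (the catalog columns are real, but the exact
grouping and thresholds here are illustrative, not copied from that note):

-- 1) Sample how many non-idle backends are currently waiting on the lock
--    manager LWLock (the wait_event is "lock_manager" before v13 and
--    "LockManager" in v13+).
SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE state <> 'idle'
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;

-- 2) Count slowpath relation locks per backend whose mode is weak enough
--    that they could have been fastpath, had slots been available (and no
--    conflicting strong lock forced the slow path).
SELECT pid, count(*) AS weak_mode_slowpath_locks
FROM pg_locks
WHERE locktype = 'relation'
  AND NOT fastpath
  AND mode IN ('AccessShareLock', 'RowShareLock', 'RowExclusiveLock')
GROUP BY pid
ORDER BY weak_mode_slowpath_locks DESC;

Roughly speaking, backends that consistently show more than a handful of
such weak-mode slowpath locks are the ones that would benefit from extra
fastpath slots.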

For point 2:
Maybe we should put a ceiling on the allowed value. Maybe 64 or 128 or
256? It probably should depend on the cost of searching the fpRelId array,
as you described earlier.
