False sharing for PgBackendStatus, made worse by in-core query_id handling

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: False sharing for PgBackendStatus, made worse by in-core query_id handling
Date: 2023-06-27 01:34:58
Message-ID: 20230627013458.axge7iylw7llyvww@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Twitter Thomas thoroughly nerdsniped me [1]. As part of that I ran a
concurrent readonly pgbench workload and analyzed cacheline "contention" using
perf c2c.

One of the cacheline conflicts, by far not the most common, but significant,
is one I hadn't noticed in the past. The cacheline accesses are by
pgstat_report_query_id() and pgstat_report_activity().

The reason for the conflict is simple - we don't ensure any aligment for
PgBackendStatus. Thus the end of one backend's PgBackendStatus will be in the
same cacheline as another backend's start of PgBackendStatus.

Historically that didn't show up too much, because the end of PgBackendStatus
happened to contain less-frequently changing data since at least 9.6, namely
int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];

which effectively avoided any relevant false sharing.

But in 4f0b0966c866 a new trailing element was added to PgBackendStatus:

/* query identifier, optionally computed using post_parse_analyze_hook */
uint64 st_query_id;

which is very frequently set, due to the following in ExecutorStart:
/*
* In some cases (e.g. an EXECUTE statement) a query execution will skip
* parse analysis, which means that the query_id won't be reported. Note
* that it's harmless to report the query_id multiple times, as the call
* will be ignored if the top level query_id has already been reported.
*/
pgstat_report_query_id(queryDesc->plannedstmt->queryId, false);

The benchmarks I ran used -c 48 -j 48 clients on my two socket workstation, 2x
10/20 cores/threads.

With a default pgbench -S workload, the regression is barely visible - the
context switches between pgbench and backend use too many resources. But a
pipelined pgbench -S shows a 1-2% regression and server-side query execution
of a simple statement [2] regresses by ~5.5%.

Note that this is with compute_query_id = auto, without any extensions
loaded.

The fix for this is quite simple, something like:
#ifdef pg_attribute_aligned
pg_attribute_aligned(PG_CACHE_LINE_SIZE)
#endif

at the start of PgBackendStatus.

Unfortunately we can't fix that in the backbranches, as it obviously is an ABI
violation.

Leaving the performance issue aside for a moment, I'm somewhat confused by the
maintenance of PgBackendStatus->st_query_id:

1) Why are there pgstat_report_query_id() calls in parse_analyze_*()? We aren't
executing the statements at that point?

2) pgstat_report_query_id() doesn't overwrite a non-zero query_id unless force
is passed in. Force is only passed in exec_simple_query(). query_id is also
reset when pgstat_report_activity(STATE_RUNNING) is called.

I think this means that e.g. bgworkers issuing queries will often get stuck
on the first query_id used, unless they call pgstat_report_activity()?

Greetings,

Andres Freund

[1] https://twitter.com/MengTangmu/status/1673439083518115840
[2] DO $$ BEGIN FOR i IN 1..10000 LOOP EXECUTE 'SELECT'; END LOOP;END;$$;

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-06-27 02:05:46 ReadRecentBuffer() doesn't scale well
Previous Message Hayato Kuroda (Fujitsu) 2023-06-27 01:09:42 RE: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL