Disable parallel query by default

From: "Scott Mead" <scott(at)meads(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Disable parallel query by default
Date: 2025-05-13 20:36:26
Message-ID: a5916f83-de79-4a40-933a-fb0d9ba2f5a0@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Hackers,

Over the last 24 months, I've noticed a pattern amongst users with unexpected plan flips landing on parallel plans.

77cd477 (9.6 beta) defaulted parallel query on (max_parallel_degree = 2), it's been nine years and I'd like to open the discussion to see what our thoughts are. Especially since it seems that the decision was made for 9.6 beta testing, and never really revisited.

I'll open by proposing that we prevent the planner from automatically selecting parallel plans by default, opting instead to allow users to set their max_parallel_workers_per_gather as needed. IOW: lets make the default max_parallel_workers_per_gather=0 for V18 forward.

Just to be clear, my concern isn't with parallel query in general, the issue we see is when high-frequency, low-latency queries start executing with parallelism on their own (i.e. the traditional plan flip with a twist). Given that max_parallel_workers_per_gather is dynamic and easily configured per session (or even per query with something like the pg_hint_plan extension), dissuading the planner from opting in to parallelism by default will contain the fallout that we see when plans flip to parallel execution.

What is the fallout? When a high-volume, low-latency query flips to parallel execution on a busy system, we end up in a situation where the database is effectively DDOSing itself with a very high rate of connection establish and tear-down requests. Even if the query ends up being faster (it generally does not), the CPU requirements for the same workload rapidly double or worse, with most of it being spent in the OS (context switch, fork(), destroy()). When looking at the database, you'll see a high load average, and high wait for CPU with very little actual work being done within the database.

For an example of scale, we have seen users with low connection rates (<= 5 / minute) suddenly spike to between 2000 and 3000 connect requests per minute until the system grinds to a halt.

I'm looking forward to the upcoming monitoring in e7a9496 (Add two attributes to pg_stat_database for parallel workers activity), it will be easier to empirically prove that parallel query is being used. I don't think the patch goes far enough though, we really need the ability to pinpoint the query and the specific variables used that triggered the parallel plan. When we tell a user that parallel query is in-use and suspected, it is almost always met with "no, we don't use that feature". Users do not even realize that it's happening and quickly ask for a list of all queries that have ever undergone parallel execution. It's pretty much impossible to give an accurate list of these because there is no instrumentation available (even in the new patch) to get to the per-query level.

When a user says "I'm not using parallel query" we have to walk through circumstantial evidence of its use. I typically combine IPC:BgWorkerShutDown, IPC:ParallelFinish, IO:DataFileRead (this helps nail it for sequential scans) with a high rate of connection establishment. When you look at all of these together, it still hard to see that parallelism is the cause, but when we disable automated plan selection, system stability returns.

The recommendation that I give to users is pretty straightforward: "Disable automatic parallel query, enable it for queries where you find substantial savings and can control the rate of execution." I always tell users that if they're using parallel query for anything that should execute in less than 5 minutes, they're probably pushing on the wrong tuning strategy as the load induced by the parallel query infrastructure is likely going to negate the savings that they're getting.

I'm curious to hear what others think of this proposal. I've dealt with so many of these over the last 24 months, most of them causing strife along the way, that I'm interested in what others think.

--
Scott Mead
Amazon Web Services
scott(at)meads(dot)us

Note: When testing the attached patch, there are failures in misc_sanity.out and misc_functions.out (replication origin name is too long). I assume these are unrelated to my attached patch.

Attachment Content-Type Size
v1-0001-Disable-parallel-query-by-default.patch application/octet-stream 3.1 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mihail Nikalayeu 2025-05-13 20:44:24 Re: Small fixes needed by high-availability tools
Previous Message Florents Tselai 2025-05-13 20:24:11 Re: PATCH: jsonpath string methods: lower, upper, initcap, l/r/btrim, replace, split_part