Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Gunther <raj(at)gusw(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.
Date: 2019-02-24 07:03:58
Message-ID: 20190224070358.GJ28750@telsasoft.com
Lists: pgsql-performance

Some ideas:

You could ALTER TABLE SET (fillfactor=50) to try to maximize use of HOT updates
during UPDATEs, which avoid index churn (check pg_stat_user_indexes, and
n_tup_hot_upd in pg_stat_user_tables).
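
For example (a minimal sketch; the table name "queue" and the value 50 are
only illustrative and would need tuning for the real workload):

  ALTER TABLE queue SET (fillfactor = 50);
  -- Only pages written from now on honor the new fillfactor;
  -- existing pages are not repacked.
  -- The HOT-update ratio can be watched in pg_stat_user_tables:
  SELECT relname, n_tup_upd, n_tup_hot_upd
    FROM pg_stat_user_tables
   WHERE relname = 'queue';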

You could also ALTER TABLE SET autovacuum parameters for more aggressive vacuuming.
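
For instance (illustrative settings only, table name assumed; the thresholds
would need tuning):

  ALTER TABLE queue SET (
      autovacuum_vacuum_scale_factor = 0.0,
      autovacuum_vacuum_threshold    = 1000,
      autovacuum_vacuum_cost_delay   = 0
  );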

You could recreate indices using the CONCURRENTLY trick
(CREATE INDEX CONCURRENTLY new; DROP old; ALTER .. RENAME;)
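
Roughly, with index and table names assumed (DROP INDEX CONCURRENTLY avoids
holding the exclusive lock a plain DROP INDEX would take; note that neither
CONCURRENTLY statement can run inside a transaction block):

  CREATE UNIQUE INDEX CONCURRENTLY queue_idx_pending_new
      ON queue (jobId) WHERE pending;
  DROP INDEX CONCURRENTLY queue_idx_pending;
  ALTER INDEX queue_idx_pending_new RENAME TO queue_idx_pending;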

On Sat, Feb 23, 2019 at 04:05:51PM -0500, Gunther wrote:
> I am using an SQL queue for distributing work to massively parallel workers.
> Workers come in servers with 12 parallel threads. One of those worker sets
> handles 7 transactions per second. If I add a second one, for 24 parallel
> workers, it scales to 14/s. I can even add a third, for 36 parallel workers,
> to reach 21/s. If I try a fourth set, 48 workers, I end up in trouble. But
> that isn't even my main problem; rather, it is that in a short time the
> performance deteriorates, and it looks like that is because the queue index
> deteriorates and needs a REINDEX.
>
> The queue table is essentially this:
>
> CREATE TABLE Queue (
>     jobId bigint,
>     action text,
>     pending boolean,
>     result text
> );
>
> The dequeue operation is essentially this:
>
> BEGIN
>
> SELECT jobId, action
>   FROM Queue
>   WHERE pending
> FOR UPDATE SKIP LOCKED
>
> which is a wonderful concept with the SKIP LOCKED.
>
> Then I perform the action and finally:
>
> UPDATE Queue
> SET pending = false,
> result = ?
> WHERE jobId = ?
>
> COMMIT
>
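
(A minimal sketch of how that cycle is often written, claiming a single row
per worker; the ORDER BY and LIMIT 1 here are additions for illustration, not
part of the original post:)

  BEGIN;
  SELECT jobId, action
    FROM Queue
   WHERE pending
   ORDER BY jobId
   LIMIT 1
   FOR UPDATE SKIP LOCKED;
  -- ... perform the action for that jobId, then:
  UPDATE Queue
     SET pending = false, result = ?
   WHERE jobId = ?;
  COMMIT;
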
> I thought that, to keep my index tight, I would define it like this:
>
> CREATE UNIQUE INDEX Queue_idx_pending ON Queue(jobId) WHERE pending;
>
> so that only pending jobs are in that index.
>
> When a job is done, follow-up work is often inserted into the Queue as
> pending, thus adding to that index.
>
> Below is the performance chart.
>
> The blue line at the bottom is the db server.
>
>
> You can see the orange line is the first worker server with 12 threads. It
> settled into a steady state of 7/s, ran with 90% CPU for some 30 min, and
> then the database CPU% started climbing. I tried to rebuild the indexes on
> the queue and got stuck there: the REINDEX wanted an exclusive lock, no
> jobs were processing, but the exclusive lock could not be obtained for a
> long time. So I shut down the worker server. With the database quiet I
> could fix the messed-up indexes and restarted again. Soon I added a second
> worker server (green line) starting around 19:15. Once settled in they were
> pulling 14/s together, but you can see that in just 15 min the db server
> CPU % climbed again to over 40% and the performance of the workers dropped,
> their load falling to 30%. Now at around 19:30 I stopped them all,
> REINDEXed the queue table and then started 3 worker servers simultaneously.
> They settled in at 21/s, but in just 10 min the deterioration happened
> again. Again I stopped them all, REINDEXed, and now started 4 worker
> servers (48 threads). This time 5 min was not enough to see them ever
> settling into a decent 28/s transaction rate; I guess they might have
> reached that for a minute or two, only for the index to deteriorate again.
> I did another stop, then started only 2 servers, and again the index soon
> deteriorated.
>
> Clearly that index is deteriorating quickly, in about 10,000 transactions.
>
> BTW: when I said 7/s, it is in reality about 4 times as many transactions,
> because of the follow-up jobs that also get added to this queue. So 10,000
> transactions may be 30 or 40 k transactions before the index deteriorates.
>
> Do you have any suggestion how I can avoid that index deterioration problem
> smoothly?
>
> I figured I might just pause all workers briefly to schedule the REINDEX
> Queue command, but the problem with this is that, while the transaction
> volume is large, some jobs may take minutes to process, and in that case we
> need to wait minutes to quiesce the database, with 47 workers sitting idle
> while waiting for the 48th to finish so that the index can be rebuilt!
>
> Of course I tried to resolve the issue with vacuumdb --analyze (just in case
> the autovacuum doesn't act in time) and that doesn't help. Vacuumdb --full
> --analyze would probably help but can't work because it requires an
> exclusive table lock.
>
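
(Note: a plain VACUUM of just this table takes no exclusive lock; only VACUUM
FULL does. A minimal sketch, table name assumed:)

  VACUUM (VERBOSE, ANALYZE) queue;
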
> I tried to just create a new index of the same definition:
>
> CREATE UNIQUE INDEX Queue_idx2_pending ON Queue(jobId) WHERE pending;
> DROP INDEX Queue_idx_pending;
> ANALYZE Queue;
>
> but with that I got completely stuck with two indexes, unable to remove
> either of them because of those locking issues. And REINDEX gives me a
> deadlock error outright.
>
> I am looking for a way to manage that index so that it does not deteriorate.
>
> May be if I was not defining it with
>
> ... WHERE pending;
>
> then it would only grow, but never shrink. Maybe that helps somehow? I
> doubt it though. Adding to an index also causes deterioration, and most of
> the rows would be irrelevant because they would be past work. It would be
> nicer if there were another smooth way.
