Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)

From: Adam Sjøgren <asjo(at)koldfront(dot)dk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
Date: 2020-08-20 08:23:03
Message-ID: 87lfi9pvyw.fsf@tullinup.koldfront.dk
Lists: pgsql-general

Tom writes:

> Frequently, stack traces taken without debug symbols are pretty useless,
> but you seem to have got lucky.

For once, hah!

> It seems clear that the unexpected wait is here:
>
>> (gdb) bt
>> #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>> #1 do_futex_wait (sem=sem(at)entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
>> #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
>> #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
>> #4 0x000055f60d36c252 in PGSemaphoreLock ()
>> #5 0x000055f60d3e35cc in LWLockAcquire ()
>> #6 0x000055f60d3d14ee in ProcArrayAdd ()
>> #7 0x000055f60d3e01a3 in InitProcessPhase2 ()
>> #8 0x000055f60d51c99b in InitPostgres ()
>> #9 0x000055f60d3f2baf in PostgresMain ()
>> #10 0x000055f60d102bf8 in ?? ()
>> #11 0x000055f60d37e492 in PostmasterMain ()
>> #12 0x000055f60d103fa5 in main ()
>
> which implies that the problem is unexpectedly high contention for the
> ProcArrayLock. It's unsurprising that incoming processes would need
> that lock, since they need to add themselves to the ProcArray. What's
> not so apparent is why there is so much activity as to block them
> from doing so for a significant amount of time. We can guess that
> the "worker processes" you mentioned are ganging up on that lock
> because they need to compute query snapshots, but that still doesn't
> seem like enough of an explanation.
>
> Can you show us exactly what the "worker processes" are doing while
> this is going on?

One thing I should have mentioned, but forgot, is that the database is
configured to do logical replication to another machine - could that
have an effect on the lock contention?
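
If it helps to correlate pids: the logical replication walsenders should
be listed by pg_stat_replication, along these lines (a minimal sketch):

  SELECT pid, application_name, state, sync_state
    FROM pg_stat_replication;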

A colleague pointed out that in the pg_stat_activity/pg_locks output, the
replication processes are waiting on ProcArrayLock:

· https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz
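
For reference, backends waiting on that LWLock can be spotted directly in
pg_stat_activity with something like this (a sketch, assuming PostgreSQL
10 or later for the backend_type and wait_event columns):

  SELECT pid, backend_type, wait_event_type, wait_event, state, query
    FROM pg_stat_activity
   WHERE wait_event_type = 'LWLock'
     AND wait_event = 'ProcArrayLock';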

I can try turning off replication the next time I see the problem.

The workers basically loop over these steps (sketched below):

* SELECT a job from a job-table, UPDATE it
* Take an advisory lock on the relevant row in the calculations table
* Do some work, at least a couple of SELECTs from various tables
* UPDATE or INSERT a row into the calculations table, release the lock
* INSERT a row in an exitstatus table
* DELETE the job
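
In (heavily simplified) SQL, one iteration looks roughly like this; the
table and column names are invented for the sketch, and :job_id, :calc_id,
:result and :status stand for values carried over from earlier steps:

  -- Claim a job (schema is hypothetical)
  SELECT id, calc_id FROM jobs WHERE state = 'pending' LIMIT 1;
  UPDATE jobs SET state = 'running' WHERE id = :job_id;

  -- Serialize work on the relevant calculations row
  SELECT pg_advisory_lock(:calc_id);

  -- ... the actual work: a couple of SELECTs from various tables ...

  -- Store the result ("UPDATE or INSERT" written as an upsert here)
  INSERT INTO calculations (id, result)
       VALUES (:calc_id, :result)
  ON CONFLICT (id) DO UPDATE SET result = EXCLUDED.result;

  SELECT pg_advisory_unlock(:calc_id);

  INSERT INTO exitstatus (job_id, status) VALUES (:job_id, :status);
  DELETE FROM jobs WHERE id = :job_id;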

The calculations can take anywhere from sub-second to several minutes, with
the majority at the shorter end. While this is happening there are millions
of jobs in the job-table.

When the problem occurs, the worker logfiles indicate that they are
moving along, still working.

To give a sense of scale, here is the number of transactions per second,
plotted:

· https://koldfront.dk/misc/postgresql/postgres_transactions_ALL-pinpoint_1597799383-1597907383.png

(I think the hole in the graph is from Munin timing out while collecting
the data; the number of transactions dropped because I stopped most of the
workers.)
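
As far as I understand, that Munin graph is derived from the cumulative
transaction counters in pg_stat_database - i.e. the per-second rate is the
change over time of counters like these (a sketch):

  -- Cumulative counters since stats reset, summed over all databases
  SELECT sum(xact_commit) AS commits, sum(xact_rollback) AS rollbacks
    FROM pg_stat_database;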

Best regards,

Adam

--
"Examination and mastering of a new highly Adam Sjøgren
intellectual equipment was a hard labour." asjo(at)koldfront(dot)dk
