From: Adam Sjøgren <asjo(at)koldfront(dot)dk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
Date: 2020-08-20 08:23:03
Message-ID: 87lfi9pvyw.fsf@tullinup.koldfront.dk
Lists: pgsql-general
Tom writes:
> Frequently, stack traces taken without debug symbols are pretty useless,
> but you seem to have got lucky.
For once, hah!
> It seems clear that the unexpected wait is here:
>
>> (gdb) bt
>> #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>> #1 do_futex_wait (sem=sem(at)entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
>> #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
>> #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
>> #4 0x000055f60d36c252 in PGSemaphoreLock ()
>> #5 0x000055f60d3e35cc in LWLockAcquire ()
>> #6 0x000055f60d3d14ee in ProcArrayAdd ()
>> #7 0x000055f60d3e01a3 in InitProcessPhase2 ()
>> #8 0x000055f60d51c99b in InitPostgres ()
>> #9 0x000055f60d3f2baf in PostgresMain ()
>> #10 0x000055f60d102bf8 in ?? ()
>> #11 0x000055f60d37e492 in PostmasterMain ()
>> #12 0x000055f60d103fa5 in main ()
>
> which implies that the problem is unexpectedly high contention for the
> ProcArrayLock. It's unsurprising that incoming processes would need
> that lock, since they need to add themselves to the ProcArray. What's
> not so apparent is why there is so much activity as to block them
> from doing so for a significant amount of time. We can guess that
> the "worker processes" you mentioned are ganging up on that lock
> because they need to compute query snapshots, but that still doesn't
> seem like enough of an explanation.
>
> Can you show us exactly what the "worker processes" are doing while
> this is going on?
One thing I should have mentioned, but forgot, is that the database is
configured to do logical replication to another machine - could that
have an effect on the lock contention?
A colleague pointed out that in the pg_locks output, the replication
processes are waiting on the ProcArray lock:
· https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz
I can try turning off replication the next time I see the problem.
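For reference, the waiting backends can be spotted directly with a query along these lines (the wait_event columns assume PostgreSQL 9.6 or newer; backend_type assumes 10 or newer):

```sql
-- Backends currently blocked on the ProcArrayLock LWLock
SELECT pid, state, backend_type, query
  FROM pg_stat_activity
 WHERE wait_event_type = 'LWLock'
   AND wait_event = 'ProcArrayLock';
```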
The workers basically loop on:
* SELECT a job from a job-table, UPDATE it
* Take an advisory lock on the relevant row in the calculations table
* Do some work, at least a couple of SELECTs from various tables
* UPDATE or INSERT a row into the calculations table, release the lock
* INSERT a row in an exitstatus table
* DELETE the job
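In SQL terms one worker iteration looks roughly like this sketch; the table and column names (jobs, calculations, exitstatus, etc.) are simplified stand-ins, and the advisory-lock key is assumed to be the calculations row id:

```sql
-- Schematic worker iteration (names are illustrative, not the real schema)
BEGIN;
SELECT id FROM jobs WHERE claimed_by IS NULL LIMIT 1 FOR UPDATE;
UPDATE jobs SET claimed_by = 'worker-1' WHERE id = :job_id;   -- claim the job
COMMIT;

SELECT pg_advisory_lock(:calc_id);      -- lock the relevant calculations row
-- ... a couple of SELECTs from various tables, then the actual calculation ...
INSERT INTO calculations (id, result) VALUES (:calc_id, :result)
    ON CONFLICT (id) DO UPDATE SET result = EXCLUDED.result;
SELECT pg_advisory_unlock(:calc_id);    -- release the advisory lock

INSERT INTO exitstatus (job_id, status) VALUES (:job_id, 0);
DELETE FROM jobs WHERE id = :job_id;
```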
The calculations can take anywhere from sub-second to several minutes, with the majority at the shorter end. While this is happening there are millions of jobs in the job-table.
When the problem occurs, the worker logfiles indicate that they are
moving along, still working.
To give a sense of scale, here are the number of transactions per second
plotted:
· https://koldfront.dk/misc/postgresql/postgres_transactions_ALL-pinpoint_1597799383-1597907383.png
(I think the hole in the graph is when Munin timed out getting the
information, and the number of transactions dropped as I stopped most of
the workers.)
Best regards,
Adam
--
"Examination and mastering of a new highly Adam Sjøgren
intellectual equipment was a hard labour." asjo(at)koldfront(dot)dk