Re: RFC: replace pg_stat_activity.waiting with something more descriptive

From: Vladimir Borodin <root(at)simply(dot)name>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Александр Коротков <aekorotkov(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: replace pg_stat_activity.waiting with something more descriptive
Date: 2015-09-18 08:08:11
Message-ID: EEE78E40-0E48-411A-9F90-CF9339DA9698@simply.name
Lists: pgsql-hackers


> On 16 Sep 2015, at 20:52, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
>> Yes, the major question is cost. But I think we should validate our thoughts
>> by experiments, given that there are more possible synchronization protocols.
>> Ildus posted an implementation of a double-buffering approach that showed
>> quite low cost.
>
> I'm not sure exactly which email you are referring to, but I don't
> believe that anyone has done experiments that are anywhere near
> comprehensive enough to convince ourselves that this won't be a
> problem. If a particular benchmark doesn't show an issue, that can
> just mean that the benchmark isn't hitting the case where there is a
> problem. For example, EDB has had customers who have severe
> contention apparently on the buffer content lwlocks, resulting in big
> slowdowns. You don't see that in, say, a pgbench run. But for people
> who have certain kinds of queries, it's really bad. Those sort of
> loads, where the lwlock system really gets stressed, are cases where
> adding overhead seems likely to pinch.

Alexander and Ildus gave us two patches for REL9_4_STABLE - one with sampling every N milliseconds and one with measuring timings. We have tested both of them on two kinds of our workloads, besides pgbench runs. One workload is OLTP where all the data fits in shared buffers, synchronous_commit is off, and the bottleneck is ProcArrayLock (all backtraces look something like the following):

[Thread debugging using libthread_db enabled]
0x00007f10e01d4f27 in semop () from /lib64/libc.so.6
#0 0x00007f10e01d4f27 in semop () from /lib64/libc.so.6
#1 0x000000000061fe27 in PGSemaphoreLock (sema=0x7f11f2d4f430, interruptOK=0 '\000') at pg_sema.c:421
#2 0x00000000006769ba in LWLockAcquireCommon (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:626
#3 LWLockAcquire (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:467
#4 0x0000000000667862 in ProcArrayEndTransaction (proc=0x7f11f2d4f420, latestXid=182562881) at procarray.c:404
#5 0x00000000004b579b in CommitTransaction () at xact.c:1957
#6 0x00000000004b6ae5 in CommitTransactionCommand () at xact.c:2727
#7 0x00000000006819d9 in finish_xact_command () at postgres.c:2437
#8 0x0000000000684f05 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x21e1a70 "xivadb", username=<value optimized out>) at postgres.c:4270
#9 0x0000000000632d7d in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4155
#10 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:3829
#11 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1597
#12 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1244
#13 0x00000000005cadb8 in main (argc=3, argv=0x21e0aa0) at main.c:228

The other is a workload where all the data fits in RAM but does not fit in shared buffers, and the bottleneck is mostly BufFreelistLock, with noticeable contention around buffer partition locks and individual buffer locks. Taking several backtraces of several backends with GDB and doing some bash magic gives the following:

root(at)pgload01g ~ # grep '^#4 ' /tmp/bt | awk '{print $2, $4, $NF}' | sort | uniq -c | sort -rn
126 0x000000000065db61 BufferAlloc bufmgr.c:591
67 0x000000000065e03a BufferAlloc bufmgr.c:760
43 0x00000000005c8c3b pq_getbyte pqcomm.c:899
39 0x000000000065dd93 BufferAlloc bufmgr.c:765
6 0x00000000004b52bb RecordTransactionCommit xact.c:1194
4 0x000000000065da0e ReadBuffer_common bufmgr.c:476
1 ReadBuffer_common relpersistence=112 bufmgr.c:340
1 exec_eval_expr expr=0x166e908, pl_exec.c:4796
1 0x00007f78b8cb217b ?? /usr/pgsql-9.4/lib/pg_stat_statements.so
1 0x00000000005d4cbb _copyList copyfuncs.c:3849
root(at)pgload01g ~ #

For both scenarios on Linux we got approximately the same results - the version with timings was faster than the version with sampling (sampling was done every 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps, the version with timings gave ~14500 tps, and the version with sampling gave ~13800 tps. In all cases the processor was 100% utilized. Comparing vanilla PostgreSQL and the version with timings on a constant workload (12000 tps) gave the following query latencies:

percentile   vanilla, ms   timings, ms
99%          79.0          97.0 (+18.0)
98%          64.0          76.0 (+12.0)
95%          38.0          47.0 (+9.0)
90%          16.0          21.0 (+5.0)
85%           7.0          11.0 (+4.0)
80%           5.0           7.0 (+2.0)
75%           4.0           5.0 (+1.0)
50%           2.0           3.0 (+1.0)

And in that test the version with timings consumed about 7% more CPU. Do these look like results from a workload where the lwlock system is really stressed?

And when the data does not fit in RAM, you really don’t see much difference between all three versions, because the contention moves from the lwlock system to I/O (even with the newest NVMe SSDs), or at least is divided between lwlocks and other wait events.

>
>> Yes, but some competing products also provide comprehensive waits
>> monitoring. That makes me think it should be possible for us too.
>
> I agree, but keep in mind that some of those products may use
> techniques to reduce the overhead that we don't have available. I
> have a strong suspicion that one of those products in particular has
> done something clever to make measuring the time cheap on all
> platforms. Whatever that clever thing is, we haven't done it. So
> that matters.

I don’t know about all the products, but Oracle didn’t do anything clever - they simply use gettimeofday on Linux; strace proves that.
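
To give an idea of the numbers, here is a minimal standalone sketch (not from any of the posted patches) that estimates the per-call cost of gettimeofday() on Linux; pg_test_timing, which ships with PostgreSQL, does a more careful version of the same measurement:

#include <stdio.h>
#include <sys/time.h>

#define NCALLS 10000000

int
main(void)
{
    struct timeval start, end, tv;
    long        i;
    double      elapsed_us;

    gettimeofday(&start, NULL);
    for (i = 0; i < NCALLS; i++)
        gettimeofday(&tv, NULL);    /* the call being measured */
    gettimeofday(&end, NULL);

    elapsed_us = (end.tv_sec - start.tv_sec) * 1e6 +
                 (end.tv_usec - start.tv_usec);
    printf("%.1f ns per gettimeofday() call\n",
           elapsed_us * 1000.0 / NCALLS);
    return 0;
}

With the tsc clocksource this typically reports a few tens of nanoseconds per call; with hpet or acpi_pm it can easily be an order of magnitude slower, which is exactly the kind of platform dependency discussed here.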

>
>> I think the reason for hooks could be not only disagreements about design,
>> but platform-dependent issues too.
>> The next step after we have a view with current wait events will be gathering
>> some statistics about them. We can contrast at least two approaches here:
>> 1) Periodic sampling of current wait events.
>> 2) Measuring each wait event's duration. We could collect statistics locally
>> for a short period and update a shared memory structure periodically (using
>> some synchronization protocol).
>>
>> In the previous attempt to gather lwlock statistics, you predicted that
>> sampling could have a significant overhead [1]. In contrast, on many systems
>> time measurements are cheap. We have implemented both approaches, and it
>> turns out that sampling every 1 millisecond produces higher overhead than
>> individual duration measurements for each wait event. We can share another
>> version of waits monitoring based on sampling to make these results
>> reproducible for everybody. However, cheap time measurements are not
>> available on every platform. For instance, ISTM that on Windows time
>> measurements are too expensive [2].
>>
>> That makes me think that we need a pluggable solution, at least for
>> statistics: direct measurement of event durations for the majority of
>> systems, and sampling for the others as the least harm.
>
> To me, those seem like arguments for making it configurable, but not
> necessarily for having hooks.

We were discussing it with Bruce at pgday.ru <http://pgday.ru/>, and AFAIK Bruce said that it would be enough to be able to turn timing measurements off with a GUC, so as not to have significant overhead on some platforms. Isn’t that a good approach?
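
To show what I mean, here is a rough sketch of the GUC-guarded approach. Names like track_wait_timing and WaitEventEntry are hypothetical, just for illustration - they are not from the posted patches. Only the timing part is skipped when the GUC is off, while the cheap event counting is kept:

#include "postgres.h"
#include "portability/instr_time.h"

/* hypothetical per-event counters, e.g. kept in backend-local memory */
typedef struct WaitEventEntry
{
    uint64      count;
    uint64      total_time_us;
} WaitEventEntry;

bool        track_wait_timing = false;  /* hypothetical boolean GUC */

static inline instr_time
wait_event_begin(void)
{
    instr_time  start;

    INSTR_TIME_SET_ZERO(start);
    if (track_wait_timing)
        INSTR_TIME_SET_CURRENT(start);  /* gettimeofday()/clock_gettime() */
    return start;
}

static inline void
wait_event_end(WaitEventEntry *e, instr_time start)
{
    e->count++;                         /* counting stays cheap either way */
    if (track_wait_timing)
    {
        instr_time  duration;

        INSTR_TIME_SET_CURRENT(duration);
        INSTR_TIME_SUBTRACT(duration, start);
        e->total_time_us += INSTR_TIME_GET_MICROSEC(duration);
    }
}

That way platforms with expensive timers could simply run with track_wait_timing = off and still get wait event counts.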

BTW, we have also tested the version with waits monitoring turned off, both with pgbench and on the two workloads mentioned earlier; we haven’t seen any overhead in any of those tests.

>
>>> I think it's reasonable to consider reporting this data in the PGPROC
>>> using a 4-byte integer rather than reporting it through a single byte
>>> in the backend status structure. I believe that addresses the
>>> concerns about reporting from auxiliary processes, and it also allows
>>> a little more data to be reported. For anything in excess of that, I
>>> think we should think rather harder. Most likely, such additional
>>> detail should be reported only for certain types of wait events, or on
>>> a delay, or something like that, so that the core mechanism remains
>>> really, really fast.
>>
>> That sounds reasonable. There are many pending questions, but it seems like
>> a step forward to me.
>
> Great, let's do it. I think we should probably do the work to
> separate the non-individual lwlocks into tranches first, though.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
May the force be with you…
https://simply.name/ru
