Re: Re: [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions

From: "Mengxing Liu" <liu-mx15(at)mails(dot)tsinghua(dot)edu(dot)cn>
To: kgrittn <kgrittn(at)gmail(dot)com>, "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions
Date: 2017-06-06 16:16:43
Message-ID: 68629d67.22ff.15c7e31ad9e.Coremail.liu-mx15@mails.tsinghua.edu.cn
Lists: pgsql-hackers

Hi, Kevin and Alvaro.

I think disk I/O is not the bottleneck in our experiment, but the global lock is.

For disk I/O, there are two pieces of evidence:
1) The total throughput is no more than 10 Ktps, and only half of the transactions are updates. An update transaction modifies 20 tuples, and each tuple is about 100 bytes,
so the data written to disk should be less than 10 MB per second. Even if we take the write-ahead log into consideration (just double it), the total should be less than 20 MB/s.
I replaced the ramdisk with an SSD, and "iostat" shows the same result, while our SSD's sequential write speed is more than 700 MB/s.
2) I changed the isolation level from "serializable" to "read committed". As the isolation requirement becomes looser, throughput increases; but in that case CPU utilization is nearly 100%. (It is about 50% in serializable mode.)
Therefore, disk I/O is not the bottleneck.

For the lock:
Reading the source code in predicate.c, I found that many functions take a global lock, SerializableXactHashLock, so only one process can be working in those sections at any time!
Since the problem of the CPU not being fully used appeared only after I changed the isolation level to "serializable", this global lock is likely the bottleneck.
Unfortunately, "perf" seems unable to record time spent waiting for locks.
So I measured it by hand: specifically, I call "gettimeofday" just before acquiring the lock and just after releasing it.
In this way, I found that CheckForSerializableConflictIn/CheckForSerializableConflictOut take more than 10% of the running time, which is far more than what perf reported in the last email.

If my analysis is right, that sounds like good news, since we have found where the problem is.
Kevin has mentioned that the lock is used to protect the linked list. So as the next step I want to replace the linked list with a hash table; after that I will carefully try to remove this lock.
But note that this changes our goal: the O(N^2) tracking is not the bottleneck, the global lock is.

By the way, using "gettimeofday" to profile is really ugly.
"perf lock" can only record kernel mutexes and requires kernel support, so I didn't use it.
Do you have any good ideas about profiling time spent waiting for locks?

> -----Original Messages-----
> From: "Mengxing Liu" <liu-mx15(at)mails(dot)tsinghua(dot)edu(dot)cn>
> Sent Time: 2017-06-05 00:27:51 (Monday)
> To: "Kevin Grittner" <kgrittn(at)gmail(dot)com>
> Cc: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
> Subject: Re: Re: [HACKERS] Re: [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions
>
>
>
>
> > -----Original Messages-----
> > From: "Kevin Grittner" <kgrittn(at)gmail(dot)com>
>
> > > I tried 30 cores. But the CPU utilization is about 45%~70%.
> > > How can we distinguish where the problem is? Is disk I/O or Lock?
> >
> > A simple way is to run `vmstat 1` for a bit during the test. Can
> > you post a portion of the output of that here? If you can configure
> > the WAL directory to a separate mount point (e.g., use the --waldir
> > option of initdb), a snippet of `iostat 1` output would be even
> > better.
>
> "vmstat 1" output is as follows. Because I used only 30 cores (1/4 of all), CPU user time should be about 12*4 = 48.
> There seems to be no process blocked by IO.
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 28 0 0 981177024 315036 70843760 0 0 0 9 0 0 1 0 99 0 0
> 21 1 0 981178176 315036 70843784 0 0 0 0 25482 329020 12 3 85 0 0
> 18 1 0 981179200 315036 70843792 0 0 0 0 26569 323596 12 3 85 0 0
> 17 0 0 981175424 315036 70843808 0 0 0 0 25374 322992 12 4 85 0 0
> 12 0 0 981174208 315036 70843824 0 0 0 0 24775 321577 12 3 85 0 0
> 8 0 0 981179328 315036 70845336 0 0 0 0 13115 199020 6 2 92 0 0
> 13 0 0 981179200 315036 70845792 0 0 0 0 22893 301373 11 3 87 0 0
> 11 0 0 981179712 315036 70845808 0 0 0 0 26933 325728 12 4 84 0 0
> 30 0 0 981178304 315036 70845824 0 0 0 0 23691 315821 11 4 85 0 0
> 12 1 0 981177600 315036 70845832 0 0 0 0 29485 320166 12 4 84 0 0
> 32 0 0 981180032 315036 70845848 0 0 0 0 25946 316724 12 4 84 0 0
> 21 0 0 981176384 315036 70845864 0 0 0 0 24227 321938 12 4 84 0 0
> 21 0 0 981178880 315036 70845880 0 0 0 0 25174 326943 13 4 83 0 0
>
> I used a ramdisk to speed up disk IO; therefore, iostat cannot give useful information.
>
> > I think the best thing may be if you can generate a CPU flame graph
> > of the worst case you can make for these lists:
> > http://www.brendangregg.com/flamegraphs.html IMO, such a graph
> > highlights the nature of the problem better than anything else.
> >
>
> The flame graph is attached. I used 'perf' to generate it; only the CPUs running the PG server were profiled.
> I'm not familiar with other parts of PG. Can you find anything unusual in the graph?
>
>
> --
> Mengxing Liu

--
Mengxing Liu
