Re: lseek/read/write overhead becomes visible at scale ..

From: Tobias Oberstein <tobias(dot)oberstein(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: lseek/read/write overhead becomes visible at scale ..
Date: 2017-01-24 17:57:47
Message-ID: 422b4e6c-b7f0-90e0-6f70-389b2d50a848@gmail.com
Lists: pgsql-hackers

Hi,

Am 24.01.2017 um 18:41 schrieb Andres Freund:
> Hi,
>
> On 2017-01-24 18:37:14 +0100, Tobias Oberstein wrote:
>>> assume that it'd get more than swamped with doing actual work, and with
>>> buffering the frequently accessed stuff in memory.
>>>
>>>
>>>> What I am trying to say is: the syscall overhead of doing lseek/read/write
>>>> instead of pread/pwrite do become visible and hurt at a certain point.
>>>
>>> Sure - but the question is whether it's measurable when you do actual
>>> work.
>>
>> The syscall overhead is visible in production too .. I watched PG using perf
>> live, and lseeks regularly appear at the top of the list.
>
> Could you show such perf profiles? That'll help us.

oberstet(at)bvr-sql18:~$ psql -U postgres -d adr
psql (9.5.4)
Type "help" for help.

adr=# select * from svc_sqlbalancer.f_perf_syscalls();
NOTICE: starting Linux perf syscalls sampling - be patient, this can
take some time ..
NOTICE: sudo /usr/bin/perf stat -e "syscalls:sys_enter_*"
-x ";" -a sleep 30 2>&1
pid | syscall | cnt | cnt_per_sec
-----+---------------------------------------+---------+-------------
| syscalls:sys_enter_lseek | 4091584 | 136386
| syscalls:sys_enter_newfstat | 2054988 | 68500
| syscalls:sys_enter_read | 767990 | 25600
| syscalls:sys_enter_close | 503803 | 16793
| syscalls:sys_enter_newstat | 434080 | 14469
| syscalls:sys_enter_open | 380382 | 12679
| syscalls:sys_enter_mmap | 301491 | 10050
| syscalls:sys_enter_munmap | 182313 | 6077
| syscalls:sys_enter_getdents | 162443 | 5415
| syscalls:sys_enter_rt_sigaction | 158947 | 5298
| syscalls:sys_enter_openat | 85325 | 2844
| syscalls:sys_enter_readlink | 77439 | 2581
| syscalls:sys_enter_rt_sigprocmask | 60929 | 2031
| syscalls:sys_enter_mprotect | 58372 | 1946
| syscalls:sys_enter_futex | 49726 | 1658
| syscalls:sys_enter_access | 40845 | 1362
| syscalls:sys_enter_write | 39513 | 1317
| syscalls:sys_enter_brk | 33656 | 1122
| syscalls:sys_enter_epoll_wait | 23776 | 793
| syscalls:sys_enter_ioctl | 19764 | 659
| syscalls:sys_enter_wait4 | 17371 | 579
| syscalls:sys_enter_newlstat | 13008 | 434
| syscalls:sys_enter_exit_group | 10135 | 338
| syscalls:sys_enter_recvfrom | 8595 | 286
| syscalls:sys_enter_sendto | 8448 | 282
| syscalls:sys_enter_poll | 7200 | 240
| syscalls:sys_enter_lgetxattr | 6477 | 216
| syscalls:sys_enter_dup2 | 5790 | 193

<snip>

Note: there isn't a lot of load currently (this is from production).
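Just to spell out where those lseeks come from: every positioned read done
as lseek + read costs two kernel entries, while pread does the same work in
one. A minimal sketch in C (illustrative only, not the actual PostgreSQL fd
layer code):

#include <unistd.h>

/* two syscalls per read: position, then read */
ssize_t read_with_lseek(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, len);
}

/* one syscall per read: position and read combined */
ssize_t read_with_pread(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}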

>>> I'm much less against this change than Tom, but doing artificial syscall
>>> microbenchmarks seems unlikely to make a big case for using it in
>>
>> This isn't a syscall benchmark, but FIO.
>
> There's not really a difference between those, when you use fio to
> benchmark seek vs pseek.

Sorry, I don't understand what you are talking about.
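
What I ran was a storage benchmark with fio, not a hand-rolled syscall loop.
If the goal is to isolate just the extra lseek, a job file along these lines
would do it (illustrative sketch, not the exact job I used): the sync engine
issues lseek(2) + read(2) per I/O, psync a single pread(2).

[global]
; hypothetical scratch file and sizes, adjust to the machine under test
filename=/tmp/fio-lseek-vs-pread
size=1G
bs=8k
rw=randread
direct=1
runtime=30
time_based

; one lseek(2) + one read(2) per I/O
[lseek-read]
ioengine=sync

; a single pread(2) per I/O; stonewall runs it after the first job finishes
[pread]
stonewall
ioengine=psync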

>>> postgres, where it's part of vastly more expensive operations (like
>>> actually reading data afterwards, exclusive locks, ...).
>>
>> PG is very CPU hungry, yes.
>
> Indeed - working on it ;)
>
>
>> But there are quite some system related effects
>> too .. eg we've managed to get down the system load with huge pages (big
>> improvement).
>
> Glad to hear it.

With 3 TB of RAM, huge pages are absolutely essential (otherwise, the system
bogs down in TLB and related overhead).
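
For anyone wanting to reproduce it, the relevant knobs are roughly these
(illustrative values, not our exact production settings):

# sysctl (e.g. /etc/sysctl.conf): reserve enough 2 MB huge pages to back
# shared_buffers; size the number to your shared_buffers setting
vm.nr_hugepages = 262144

# postgresql.conf: with "on", the server refuses to start if the huge page
# allocation fails instead of silently falling back to normal pages
huge_pages = on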

>>> I'd welcome seeing profiles of that - I'm working quite heavily on
>>> speeding up analytics workloads for pg.
>>
>> Here:
>>
>> https://github.com/oberstet/scratchbox/raw/master/cruncher/adr_stats/ADR-PostgreSQL-READ-Statistics.pdf
>>
>> https://github.com/oberstet/scratchbox/tree/master/cruncher/adr_stats
>
> Thanks, unfortunately those appear to mostly have io / cache hit ratio
> related stats?

Yep, this was just to prove that we are really running a DWH workload at
scale ;)

Cheers,
/Tobias

>
> Greetings,
>
> Andres Freund
>
