Re: lseek/read/write overhead becomes visible at scale ..

From: Tobias Oberstein <tobias(dot)oberstein(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: lseek/read/write overhead becomes visible at scale ..
Date: 2017-01-24 18:25:52
Message-ID: a55b21d1-7c99-2c66-d661-ef5288f29e30@gmail.com
Lists: pgsql-hackers

Hi,

>>  pid | syscall                     |     cnt | cnt_per_sec
>> -----+-----------------------------+---------+-------------
>>      | syscalls:sys_enter_lseek    | 4091584 |      136386
>>      | syscalls:sys_enter_newfstat | 2054988 |       68500
>>      | syscalls:sys_enter_read     |  767990 |       25600
>>      | syscalls:sys_enter_close    |  503803 |       16793
>>      | syscalls:sys_enter_newstat  |  434080 |       14469
>>      | syscalls:sys_enter_open     |  380382 |       12679
>>
>> Note: there isn't a lot of load currently (this is from production).
>
> That doesn't really mean that much - sure it shows that lseek is
> frequent, but it doesn't tell you how much impact this has to the

Above is on a mostly idle system ("idle" for our loads) .. when things
get hot, lseek calls can reach into the millions/sec.

Doing 5 million syscalls per sec comes with overhead no matter how
lightweight the syscall is, doesn't it?

Using pread instead of lseek+read halves the syscalls.
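To make the syscall count concrete, here is a minimal C sketch of the two ways to read one block at a given offset. The function names (read_block_seek, read_block_pread, make_demo_file) are made up for illustration and are not taken from the PG source tree; only lseek(2), read(2) and pread(2) are real.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Two syscalls per block: reposition the descriptor, then read. */
static ssize_t read_block_seek(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
}

/* One syscall per block: the offset is passed along with the read,
 * and the descriptor's file position is left untouched. */
static ssize_t read_block_pread(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}

/* Hypothetical demo helper: an unlinked temp file holding "0123456789".
 * Aborts on failure, which is acceptable for a demo only. */
static int make_demo_file(void)
{
    char path[] = "/tmp/pread_demo_XXXXXX";
    int fd = mkstemp(path);
    assert(fd >= 0);
    unlink(path);
    ssize_t written = write(fd, "0123456789", 10);
    assert(written == 10);
    (void) written;
    return fd;
}
```

Both functions return the same bytes for the same offset. The difference is that the pread variant enters the kernel once per block and does not depend on, or change, the current file position.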

I really don't understand what you are fighting here ..

> overall workload. For that you'd need a generic (i.e. not syscall
> tracepoint, but cpu cycle) perf profile, and look in the call graph (via
> perf report --children) how much of that is below the lseek syscall.

I see. I might find time to extend our helper function f_perf_syscalls.

>>>>> I'm much less against this change than Tom, but doing artificial syscall
>>>>> microbenchmark seems unlikely to make a big case for using it in
>>>>
>>>> This isn't a syscall benchmark, but FIO.
>>>
>>> There's not really a difference between those, when you use fio to
>>> benchmark seek vs pseek.
>>
>> Sorry, I don't understand what you are talking about.
>
> Fio as you appear to have used is a microbenchmark benchmarking
> individual syscalls.

I am benchmarking IOPS, and while doing so, it becomes apparent that at
these scales it does matter _how_ IO is done.

The most efficient way is libaio. I get 9.7 million IOPS with low
CPU load. Using any synchronous IO engine is slower and produces higher
load.

I do understand that switching to libaio isn't going to fly for PG
(completely different approach). But doing pread instead of lseek+read
seems simple enough. But then, I don't know about the PG codebase ..

Among the synchronous methods of doing IO, psync is much better than sync.

pvsync, pvsync2 and pvsync2 + hipri (busy polling, no interrupts) are
better, but the gain is smaller, and all of them are inferior to libaio.
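For context, fio's pvsync engine is documented as using preadv(2)/pwritev(2), the vectored variants of pread/pwrite. The following is only a rough sketch of what a vectored positional read looks like, not of how fio implements the engine; read_two_blocks and make_vec_demo_file are hypothetical names used for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fill two separate buffers from consecutive file ranges starting at
 * `offset`, using a single preadv(2) call. */
static ssize_t read_two_blocks(int fd, void *first, void *second,
                               size_t block_len, off_t offset)
{
    struct iovec iov[2];

    iov[0].iov_base = first;
    iov[0].iov_len = block_len;
    iov[1].iov_base = second;
    iov[1].iov_len = block_len;

    return preadv(fd, iov, 2, offset);
}

/* Hypothetical demo helper: an unlinked temp file holding "0123456789".
 * Aborts on failure, which is acceptable for a demo only. */
static int make_vec_demo_file(void)
{
    char path[] = "/tmp/preadv_demo_XXXXXX";
    int fd = mkstemp(path);
    assert(fd >= 0);
    unlink(path);
    ssize_t written = write(fd, "0123456789", 10);
    assert(written == 10);
    (void) written;
    return fd;
}
```

Like pread, preadv takes the offset as an argument, so no separate lseek is needed. The extra saving over pread only appears when several buffers can be filled per call, which may be one reason the gain over psync is smaller.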

>>> Glad to hear it.
>>
>> With 3TB RAM, huge pages are absolutely essential (otherwise, the system bogs
>> down in TLB etc overhead).
>
> I was one of the people working on adding hugepage support to pg, that's
> why I was glad ;)

Ahh;) Sorry, wasn't aware. This is really invaluable. Thanks for that!

Cheers,
/Tobias
