Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-05 07:50:50
Message-ID: 51D67ADA.4020205@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/07/05 0:35), Joshua D. Drake wrote:
> On 07/04/2013 06:05 AM, Andres Freund wrote:
>>>> Presumably the smaller segsize is better because we don't
>>>> completely stall the system by submitting up to 1GB of io at once. So,
>>>> if we were to do it in 32MB chunks and then do a final fsync()
>>>> afterwards we might get most of the benefits.
>>> Yes, I try to test this setting './configure --with-segsize=0.03125' tonight.
>>> I will send you this test result tomorrow.
>>
>
> I did testing on this a few years ago, I tried with 2MB segments over 16MB
> thinking similarly to you. It failed miserably, performance completely tanked.
Just as you say, the test result was miserable... Too small a segsize is bad for
performance. Putting the segments in separate directories might improve it, but
the large number of FDs with open() and close() calls seems to be bad. However,
I think that this approach has the potential to improve IO performance, so we
need to try testing some other methods.
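
For readers following the thread, the "32MB chunks and then a final fsync()" idea quoted above can be sketched roughly as below. This is only an illustration, not the PostgreSQL checkpointer code and not the patch under test: the function name `write_with_chunked_sync` and the choice to call `os.fsync` after each chunk are assumptions made for the example.

```python
import os

# 32MB, the chunk size suggested in the quoted text (an assumption here).
CHUNK_BYTES = 32 * 1024 * 1024


def write_with_chunked_sync(path, data, chunk_bytes=CHUNK_BYTES):
    """Write data to path, flushing after every chunk_bytes bytes.

    Returns the number of fsync calls issued, including the final one.
    Hypothetical helper for illustration only.
    """
    syncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        for start in range(0, len(view), chunk_bytes):
            chunk = view[start:start + chunk_bytes]
            # os.write may write fewer bytes than requested, so loop.
            while chunk:
                written = os.write(fd, chunk)
                chunk = chunk[written:]
            # Flush this chunk so dirty data never piles up to the full size.
            os.fsync(fd)
            syncs += 1
        # Final fsync after all chunks, as in the quoted suggestion.
        os.fsync(fd)
        syncs += 1
    finally:
        os.close(fd)
    return syncs
```

The point of the sketch is only that the amount of IO submitted between flushes is bounded by the chunk size rather than by the segment size, without creating more files.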

* Performance result in DBT-2 (WH340)
                                |   NOTPM    90%tile  Average    Maximum
--------------------------------+---------------------------------------
original_0.7 (baseline)         | 3474.62  18.348328    5.739  36.977713
fsync + write                   | 3586.85  14.459486    4.960  27.266958
fsync + write + segsize=0.25    | 3661.17    8.28816    4.117   17.23191
fsync + write + segsize=0.03125 | 3309.99  10.851245    6.759  19.500598
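
To make the segsize values in the table concrete, here is a small arithmetic sketch. It assumes the value is read as a fraction of a gigabyte, as it is used in this thread, and the 1GB relation size is a hypothetical figure chosen for illustration, not the size of the DBT-2 database.

```python
import math


def segsize_to_mb(segsize_gb):
    """Convert a segment size given in gigabytes to megabytes."""
    return segsize_gb * 1024


def segment_files(relation_mb, segsize_gb):
    """Number of segment files needed for a relation of relation_mb megabytes."""
    return math.ceil(relation_mb / segsize_to_mb(segsize_gb))


# Hypothetical 1GB relation, for illustration only.
for segsize in (1, 0.25, 0.03125):
    print(segsize, segsize_to_mb(segsize), segment_files(1024, segsize))
```

Under these assumptions, segsize=0.25 means 256MB segments (4 files per 1GB of relation data) and segsize=0.03125 means 32MB segments (32 files per 1GB), which is where the extra open() and close() traffic comes from.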

(2013/07/04 22:05), Andres Freund wrote:
> 1) it breaks pg_upgrade. Which means many of the bigger users won't be
> able to migrate to this and most packagers would carry the old
> segsize around forever.
> Even if we could get pg_upgrade to split files accordingly link mode
> would still be broken.
I think that pg_upgrade is one of the contrib modules, not part of the main
implementation of Postgres. So a contrib module should not stand in the way of
improving the main implementation. Pg_upgrade users might share this opinion.

> 2) It drastically increases the amount of file handles neccessary and by
> extension increases the amount of open/close calls. Those aren't all
> that cheap. And it increases metadata traffic since mtime/atime are
> kept for more files. Also, file creation is rather expensive since it
> requires metadata transaction on the filesystem level.
My test result seemed to show this problem. But my test did not use separate
directories in base/. I'm not sure which way is best. If you have time to create
a patch, please send it to us, and I will test it in DBT-2.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
