From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-30 23:10:38
Message-ID: 20140830231038.GC25523@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote:
> On 08/30/2014 09:45 PM, Andres Freund wrote:
> >On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
> >>Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> >>>On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> >>>>A possible compromise is to sort a limited number of
> >>>>buffers -- say, collect a few thousand dirty buffers, then sort, dump
> >>>>and fsync them; repeat as needed.
> >>
> >>>Yea, that's what I suggested nearby. But I don't really like it, because
> >>>it robs us of the chance to fsync() a relfilenode immediately after
> >>>having synced all its buffers.
> >>
> >>Uh, how so exactly? You could still do that. Yeah, you might fsync a rel
> >>once per sort-group and not just once per checkpoint, but it's not clear
> >>that that's a loss as long as the group size isn't tiny.
> >
> >Because it wouldn't have the benefit of syncing the minimal amount of
> >data anymore. If lots of other relfilenodes have been synced in
> >between, the amount of newly dirtied pages in the OS's buffer cache
> >(written by backends, bgwriter) for an individual relfilenode is much
> >higher.
>
> I wonder how much of the benefit from sorting comes from sorting the pages
> within each file, and how much just from grouping all the writes of each
> file together. In other words, how much difference is there between full
> sorting and simply fsyncing between each file, as in the crude patch I
> posted earlier?

I haven't implemented fsync()ing between files so far. From the I/O
stats I'm seeing, the performance improvement comes from the OS being
able to write data back in bigger chunks, which seems entirely
reasonable. If the database and the write load are big enough that
writeback is triggered repeatedly during one checkpoint, the OS's
buffer cache will have lots of non-sequential data to flush, leading to
much smaller I/Os, more seeks and deeper queues (=> latency increases).
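
For concreteness, the scheme I'm experimenting with looks roughly like
this (a heavily simplified sketch, not the actual patch; DirtyTag,
write_buffer() and fsync_relfilenode() are invented names):

#include <stdlib.h>

/* Hypothetical: which shared buffer is dirty, and where its page lives. */
typedef struct DirtyTag
{
    unsigned int relfilenode;   /* file the page belongs to */
    unsigned int blocknum;      /* block offset within that file */
    int          buf_id;        /* shared buffer holding the page */
} DirtyTag;

extern void write_buffer(int buf_id);                    /* invented */
extern void fsync_relfilenode(unsigned int relfilenode); /* invented */

static int
tag_cmp(const void *a, const void *b)
{
    const DirtyTag *ta = a;
    const DirtyTag *tb = b;

    if (ta->relfilenode != tb->relfilenode)
        return ta->relfilenode < tb->relfilenode ? -1 : 1;
    if (ta->blocknum != tb->blocknum)
        return ta->blocknum < tb->blocknum ? -1 : 1;
    return 0;
}

static void
checkpoint_write_sorted(DirtyTag *tags, size_t ntags)
{
    /* Sort by (file, block) so each file is written sequentially. */
    qsort(tags, ntags, sizeof(DirtyTag), tag_cmp);

    for (size_t i = 0; i < ntags; i++)
    {
        write_buffer(tags[i].buf_id);

        /*
         * Once the last page of a relfilenode is written, fsync it
         * immediately, while the pages just written are still the bulk
         * of its dirty data in the OS cache.
         */
        if (i + 1 == ntags || tags[i + 1].relfilenode != tags[i].relfilenode)
            fsync_relfilenode(tags[i].relfilenode);
    }
}

Sorting in limited-size batches, as suggested upthread, gives up
exactly this property: a relfilenode can appear in several batches, so
its fsync no longer covers just the pages written immediately before
it.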

> If we're going to fsync between each file, there's no need to sort all the
> buffers at once. It's enough to pick one file as the target - like in my
> crude patch - and sort only the buffers for that file. Then fsync that file
> and move on to the next file. That requires scanning the buffers multiple
> times, but I think that's OK.

I really can't see that working out, because it means scanning all of
shared_buffers once per file. Production instances of postgres with
large shared_buffers settings (say 96GB in one case) have tens of
thousands of relations (~34500 in the same case). And that's a database
with a relatively simple schema; I've seen much worse.
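
Back-of-envelope (illustrative, not measured): 96GB of shared_buffers
is ~12.6 million 8KB buffers, so scanning the buffer headers once per
relation for ~34500 relations is on the order of 4 * 10^11 header
inspections per checkpoint.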

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
