Re: _mdfd_getseg can be expensive

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: _mdfd_getseg can be expensive
Date: 2014-10-31 22:32:10
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2014-03-31 12:10:01 +0200, Andres Freund wrote:
> I recently have seen some perf profiles in which _mdfd_getseg() was in
> the top #3 when VACUUMing large (~200GB) relations. Called by mdread(),
> mdwrite(). Looking at it's implementation, I am not surprised. It
> iterates over all segment entries a relations has; for every read or
> write. That's not painful for smaller relations, but at a couple of
> hundred GB it starts to be annoying. Especially if kernel readahead has
> already read in all data from disk.
> I don't have a good idea what to do about this yet, but it seems like
> something that should be fixed mid-term.
> The best I can come up is is caching the last mdvec used, but that's
> fairly ugly. Alternatively it might be a good idea to not store MdfdVec
> as a linked list, but as a densely allocated array.

I've seen this a couple times more since. On larger relations it gets
even more pronounced. When sequentially scanning a 2TB relation,
_mdfd_getseg() gets up to 80% proportionate CPU time towards the end of
the scan.

I wrote the attached patch that get rids of that essentially quadratic
behaviour, by replacing the mdfd chain/singly linked list with an
array. Since we seldomly grow files by a whole segment I can't see the
slightly bigger memory reallocations matter significantly. In pretty
much every other case the array is bound to be a winner.

Does anybody have fundamental arguments against that idea?

With some additional work we could save a bit more memory by getting rid
of the mdfd_segno as it's essentially redundant - but that's not
entirely trivial and I'm unsure if it's worth it.

I've also attached a second patch that makes PageIsVerified() noticeably
faster when the page is new. That's helpful and related because it makes
it easier to test the correctness of the md.c rewrite by faking full 1GB
segments. It's also pretty simple, so there's imo little reason not to
do it.


Andres Freund

The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633

Andres Freund
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Faster-PageIsVerified-for-the-all-zeroes-case.patch text/x-patch 1.4 KB
0002-Much-faster-O-1-mfdvec.patch text/x-patch 17.1 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2014-10-31 22:46:41 Re: tracking commit timestamps
Previous Message Tom Lane 2014-10-31 22:07:41 Re: Let's drop two obsolete features which are bear-traps for novices