From: José Luis Tallón <jltallon(at)adv-solutions(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PageRepairFragmentation performance
Date: 2014-11-18 18:37:32
Message-ID: 546B91EC.7070500@adv-solutions.net
Lists: pgsql-hackers
On 11/18/2014 07:03 PM, Heikki Linnakangas wrote:
> When profiling replay the WAL generated by pgbench, I noticed the
> PageRepairFragmentation consumes a large fraction of the CPU time:
> [snip]
>
> 1. Replace the qsort with something cheaper. The itemid arrays being
> sorted are small, a few hundred item at most, usually even smaller. In
> this pgbench test case I used, the typical size is about 60. With a
> small array a plain insertion sort is cheaper than the generic
> qsort(), because it can avoid the function overhead etc. involved with
> generic qsort. Or we could use something smarter, like a radix sort,
> knowing that we're sorting small integers. Or we could implement an
> inlineable version of qsort and use that.
IIRC, the theoretical complexity of quicksort and radix sort should be
approximately the same for 256-1024 items: O(n*log(n)) vs O(d*n), where
d is ~log2(n), or just 16 in this case. However, lexicographical
("bitstring-wise") ordering might not be what we are aiming for here.
AFAIK, an inlined quicksort should be about the best-performing sort
available (with most of the improvement coming from staying within the
I-cache).
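For concreteness, the insertion sort suggested in point 1 could look
something like the sketch below. The itemIdSortData field names are
assumed for illustration, not taken from the actual patch; the sort is
descending by tuple offset, which is the order compaction wants.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sort entry (field names assumed, not the patch's). */
typedef struct
{
    uint16_t offsetindex;   /* index into the page's line pointer array */
    uint16_t itemoff;       /* tuple's current offset within the page */
    uint16_t alignedlen;    /* MAXALIGN'd tuple length */
} itemIdSortData;

/*
 * Plain insertion sort, descending by itemoff.  For the ~60-entry
 * arrays seen in the pgbench case this avoids the per-comparison
 * function-call overhead of generic qsort().
 */
static void
itemoffsort(itemIdSortData *items, size_t nitems)
{
    for (size_t i = 1; i < nitems; i++)
    {
        itemIdSortData tmp = items[i];
        size_t j = i;

        while (j > 0 && items[j - 1].itemoff < tmp.itemoff)
        {
            items[j] = items[j - 1];
            j--;
        }
        items[j] = tmp;
    }
}
```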
> 2. Instead of sorting the array and using memmove in-place, we could
> copy all the tuples to a temporary buffer in arbitrary order, and
> finally copy the temporary copy back to the buffer. That requires two
> memory copies per tuple, instead of one memmove, but memcpy() is
> pretty darn fast. It would be a loss when there are only a few large
> tuples on the page, so that avoiding the sort doesn't help, or when
> the tuples are mostly already in the correct places, so that most of
> the memmove()s are no-ops. But with a lot of small tuples, it would be
> a win, and it would be simple.
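A sketch of that second approach, with the page layout heavily
simplified (the LinePtr struct and field names here are illustrative,
not PostgreSQL's actual ItemIdData, and MAXALIGN is ignored): copy each
live tuple into a scratch buffer packed from the end, update its line
pointer, then copy the packed region back in one go.

```c
#include <stdint.h>
#include <string.h>

#define PAGESIZE 8192

/* Simplified line pointer: offset and length of one tuple. */
typedef struct { uint16_t off; uint16_t len; } LinePtr;

/*
 * Compact tuples toward the end of the page without sorting: two
 * memcpy()s per tuple instead of one memmove(), but no qsort().
 * Note that tuples end up in line-pointer order, not their previous
 * physical order.  Returns the new start of the tuple area.
 */
static uint16_t
compact_tuples(char *page, LinePtr *linp, int nline)
{
    static char scratch[PAGESIZE];
    uint16_t upper = PAGESIZE;

    for (int i = 0; i < nline; i++)
    {
        if (linp[i].len == 0)       /* unused slot, nothing to move */
            continue;
        upper -= linp[i].len;
        memcpy(scratch + upper, page + linp[i].off, linp[i].len);
        linp[i].off = upper;
    }
    /* One bulk copy back for the whole packed tuple area. */
    memcpy(page + upper, scratch + upper, PAGESIZE - upper);
    return upper;
}
```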
Memmove *should* be no slower than memcpy... provided both are actually
translated by the compiler into intrinsics rather than function calls
--- as seems to be the case here (cf. __memmove_ssse3_back).
A simple "if" to eliminate the no-op memmoves might do the trick as
well.
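That no-op check is essentially a one-liner; a minimal sketch
(hypothetical helper, not the patch's code):

```c
#include <stdint.h>
#include <string.h>

/*
 * Move a tuple to its new offset, skipping the memmove() entirely when
 * the tuple is already where it belongs.  memmove() with src == dst is
 * legal but still pays the call overhead.
 */
static void
move_tuple(char *page, uint16_t newoff, uint16_t oldoff, uint16_t len)
{
    if (newoff != oldoff)
        memmove(page + newoff, page + oldoff, len);
}
```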
Just my two (euro) cents, though
> The second option would change behaviour slightly, as the tuples would
> be placed on the page in different physical order than before. It
> wouldn't be visible to users, though.
>
> I spent some time hacking approach 1, and replaced the qsort() call
> with a bucket sort. I'm not sure if a bucket sort is optimal, or
> better than a specialized quicksort implementation, but it seemed simple.
>
> With the testcase I've been using - replaying about 2GB of WAL
> generated by pgbench - this reduces the replay time from about 55 s to
> 45 s.
Not bad at all... though I suspect most of the gain comes from staying
within the I-cache, as opposed to regular qsort.
The smaller itemIdSortData structure surely helps a bit, too :)
> Thoughts? Attached is the patch I put together. It's actually two
> patches: the first is just refactoring, putting the common code
> between PageRepairFragmentation, PageIndexMultiDelete, and
> PageIndexDeleteNoCompact into a function. The second replaces the qsort().
Definitely worthwhile, even if just for the refactoring. The speed-up
sounds very good, too.
Thanks,
/ J.L.