Re: WAL CPU overhead/optimization (was Master-slave visibility order)

From:	Ants Aasma <ants(at)cybertec(dot)at>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL CPU overhead/optimization (was Master-slave visibility order)
Date:	2013-08-29 23:53:54
Message-ID:	CA+CSw_tz5-ErTgj6SWghiTVEOx9r63=bVnTB9WEgEjqqwb68nQ@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Aug 30, 2013 at 1:30 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-08-30 01:10:40 +0300, Ants Aasma wrote:
>> On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
>> > per CPU overhead actually minimally increased (at least in my tests), it
>> > just scales noticeably better than before.
>>
>> Interesting. Do you have any insight what is behind the CPU overhead?
>> Maybe the solution is to make WAL insertion cheap enough to not
>> matter. That won't be easy, but neither are the alternatives.
>
> Funnily by far the biggest thing I have seen in benchmarks is the CRC32
> computation. I plan to brush up my ~3 year old CRC32 reimplementation
> patch sometime, but afair you had a much better one?
>
> I have some doubts about weakening the hash function by also using FNV
> or similar here, so I'd first like to try how much of a difference a
> better CRC32 implementation can make with the current XLogInsert()
> implementation.

The CRC32 implementations mostly differ in the number of table lookups
that are done in parallel. PostgreSQL does 1 lookup, IIRC the zlib
implementation does 4, and Intel has a paper that recommends going up to
8. The tradeoff is that each level requires a 4KB lookup table - for
small records the additional cache misses will probably kill any
speedup.
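
To make the slicing idea concrete, here is a minimal Python sketch of
slice-by-1 versus slice-by-4 for the standard zlib-compatible CRC-32
(reflected polynomial 0xEDB88320). It is an illustration only, not
PostgreSQL's or zlib's actual code; the names make_tables, crc32_slice1
and crc32_slice4 are made up for this example. The point is that the
slice-by-4 loop consumes four input bytes per iteration using four
independent table lookups, at the cost of three extra tables.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial, the one zlib uses


def make_tables(n):
    """Build n lookup tables of 256 entries each (hypothetical helper).

    tables[0] is the classic byte-at-a-time table; tables[k][i] is the
    CRC contribution of byte i followed by k zero bytes.
    """
    t0 = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ POLY if c & 1 else c >> 1
        t0.append(c)
    tables = [t0]
    for _ in range(1, n):
        prev = tables[-1]
        tables.append([(prev[i] >> 8) ^ t0[prev[i] & 0xFF] for i in range(256)])
    return tables


T = make_tables(4)


def crc32_slice1(data, crc=0):
    """One table lookup per input byte."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ T[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF


def crc32_slice4(data, crc=0):
    """Four independent table lookups per 4 input bytes."""
    crc ^= 0xFFFFFFFF
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        # Fold the next 4 bytes (little-endian) into the running CRC.
        crc ^= int.from_bytes(data[i:i + 4], "little")
        crc = (T[3][crc & 0xFF]
               ^ T[2][(crc >> 8) & 0xFF]
               ^ T[1][(crc >> 16) & 0xFF]
               ^ T[0][crc >> 24])
    # Handle the 0-3 trailing bytes one at a time.
    for b in data[n:]:
        crc = (crc >> 8) ^ T[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF


# Both variants should agree with zlib on arbitrary input.
sample = bytes(range(256)) * 3 + b"tail"
assert crc32_slice1(sample) == crc32_slice4(sample) == zlib.crc32(sample)
```

In C the four lookups in the inner loop have no data dependency on each
other, which is where the speedup comes from; Python only shows the
structure, not the performance.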

A quick overview of the hot-cache, large-buffer performance of a few
interesting options:
crc32 slice-by-1: 0.148 bytes/cycle
crc32 slice-by-4: 0.392 bytes/cycle
crc32 slice-by-8: 0.654 bytes/cycle
crc32c instruction pipelined by 3: 6.8 bytes/cycle (number from Intel's paper)
FNV 1 byte at a time version: 0.333 bytes/cycle
md5: 0.159 bytes/cycle
Murmur3A: 1.019 bytes/cycle
CityHash64: 4.246 bytes/cycle
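
For a rough feel of what these figures mean, the following Python
sketch converts them into cycles per buffer and into speedup relative
to slice-by-1. The throughput numbers are copied from the list above;
the 8192-byte buffer size is an assumption picked for illustration, and
the helper names are hypothetical.

```python
# Throughput in bytes/cycle, copied from the list above.
BYTES_PER_CYCLE = {
    "crc32 slice-by-1": 0.148,
    "crc32 slice-by-4": 0.392,
    "crc32 slice-by-8": 0.654,
    "crc32c instruction pipelined by 3": 6.8,
    "FNV 1 byte at a time": 0.333,
    "md5": 0.159,
    "Murmur3A": 1.019,
    "CityHash64": 4.246,
}

BUFFER_BYTES = 8192  # assumed buffer size, for illustration only


def cycles_for(name, nbytes=BUFFER_BYTES):
    """Estimated cycles to checksum nbytes at the listed throughput."""
    return nbytes / BYTES_PER_CYCLE[name]


def speedup_vs_baseline(name, baseline="crc32 slice-by-1"):
    """Throughput ratio of the named option over the baseline."""
    return BYTES_PER_CYCLE[name] / BYTES_PER_CYCLE[baseline]


for name in BYTES_PER_CYCLE:
    print(f"{name:36s} {cycles_for(name):9.0f} cycles "
          f"{speedup_vs_baseline(name):6.2f}x")
```

These are hot-cache numbers, so the estimate ignores the extra cache
misses that larger lookup tables cause on small records.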

CityHash64 actually looks pretty good; there are no known hash quality
issues. Compared to CRC, the only weakening is that single-bit errors
are not guaranteed to be 100% detected. There's also the issue that
only a 64-bit implementation exists, but I'm sure this can be resolved
(if necessary, we can just use Murmur3 on 32-bit).
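
The single-bit guarantee for CRC can be checked empirically. The sketch
below flips every bit of a buffer in turn and counts how many flips
leave zlib's CRC-32 unchanged; for a CRC that count is always zero,
whereas a general-purpose hash only makes such a collision very
unlikely. The function name is hypothetical, and zlib's CRC-32 stands
in here for the WAL checksum.

```python
import zlib


def undetected_single_bit_flips(payload):
    """Count single-bit corruptions of payload that CRC-32 fails to detect."""
    reference = zlib.crc32(payload)
    buf = bytearray(payload)
    missed = 0
    for bit in range(len(buf) * 8):
        mask = 1 << (bit % 8)
        buf[bit // 8] ^= mask          # corrupt one bit
        if zlib.crc32(bytes(buf)) == reference:
            missed += 1
        buf[bit // 8] ^= mask          # restore the original bit
    return missed


record = b"example WAL record payload " * 8
assert undetected_single_bit_flips(record) == 0
```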

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

In response to

WAL CPU overhead/optimization (was Master-slave visibility order) at 2013-08-29 22:30:04 from Andres Freund

Responses

Re: WAL CPU overhead/optimization (was Master-slave visibility order) at 2013-08-30 00:02:43 from Andres Freund