| From: | Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com> |
|---|---|
| To: | pgsql-hackers(at)postgresql(dot)org |
| Cc: | Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
| Subject: | Re: What exactly is our CRC algorithm? |
| Date: | 2014-11-19 15:58:11 |
| Message-ID: | 20141119155811.GA32492@toroid.org |
| Lists: | pgsql-hackers |
At 2014-11-11 16:56:00 +0530, ams(at)2ndQuadrant(dot)com wrote:
>
> I'm working on this (first speeding up the default calculation using
> slice-by-N, then adding support for the SSE4.2 CRC instruction on
> top).
I've done the first part in the attached patch, and I'm working on the
second (especially the bits to issue CPUID at startup and decide which
implementation to use).
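A minimal sketch of what such a startup decision could look like, not the code from the patch in progress. Every name here (`crc32c`, `crc32c_sw`, `crc32c_hw`, `cpu_has_sse42`, `crc32c_choose`) is hypothetical. It assumes GCC or clang, and it assumes the CRC-32C (Castagnoli) polynomial throughout, because that is the polynomial the SSE4.2 instruction implements; that may not match the polynomial the existing CRC code uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              /* GCC/clang wrapper around the CPUID instruction */
#define HAVE_X86 1
#endif

typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *buf, size_t len);

/* Portable fallback: bit-at-a-time CRC-32C, reflected polynomial 0x82F63B78. */
static uint32_t
crc32c_sw(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    crc = ~crc;
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78u : 0);
    }
    return ~crc;
}

#ifdef HAVE_X86
/* Hardware path, one byte per instruction; compiled for SSE4.2 only here. */
static __attribute__((target("sse4.2"))) uint32_t
crc32c_hw(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    crc = ~crc;
    while (len--)
        crc = __builtin_ia32_crc32qi(crc, *p++);
    return ~crc;
}

/* CPUID leaf 1 reports SSE4.2 support in ECX bit 20. */
static int
cpu_has_sse42(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx & (1u << 20)) != 0;
}
#endif

static crc32c_fn crc32c_impl = NULL;

/* Pick an implementation once; later calls go through the pointer. */
static void
crc32c_choose(void)
{
    crc32c_impl = crc32c_sw;
#ifdef HAVE_X86
    if (cpu_has_sse42())
        crc32c_impl = crc32c_hw;
#endif
}

/* Pass 0 as the starting crc; pass a previous result to continue a checksum. */
uint32_t
crc32c(uint32_t crc, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (crc32c_impl == NULL)
        crc32c_choose();
    return crc32c_impl(crc, buf, len);
}
```

Calling `crc32c(0, "123456789", 9)` should return the standard CRC-32C check value 0xE3069283 on either path. Caching the choice in a function pointer keeps the CPUID check out of the hot loop; a real implementation would also want to feed the instruction 4 or 8 bytes at a time.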
As a benchmark, I ran pg_xlogdump --stats against 11GB of WAL data (674
segments) generated by running a total of 2M pgbench transactions on a
db initialised with scale factor 25. The tests were run on my i5-3230
CPU, and the code in each case was compiled with "-O3 -msse4.2" (and
without --enable-debug). The profile was dominated by the CRC
calculation in ValidXLogRecord.
With HEAD's CRC code:
bin/pg_xlogdump --stats wal/000000010000000000000001 29.81s user 3.56s system 77% cpu 43.274 total
bin/pg_xlogdump --stats wal/000000010000000000000001 29.59s user 3.85s system 75% cpu 44.227 total
With slice-by-4 (a minor variant of the attached patch; the results are
included only for curiosity's sake, but I can post the code if needed):
bin/pg_xlogdump --stats wal/000000010000000000000001 13.52s user 3.82s system 48% cpu 35.808 total
bin/pg_xlogdump --stats wal/000000010000000000000001 13.34s user 3.96s system 48% cpu 35.834 total
With slice-by-8 (i.e. the attached patch):
bin/pg_xlogdump --stats wal/000000010000000000000001 7.88s user 3.96s system 34% cpu 34.414 total
bin/pg_xlogdump --stats wal/000000010000000000000001 7.85s user 4.10s system 34% cpu 35.001 total
(Note the progressive reduction in user time from ~29s to ~8s.)
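For readers unfamiliar with the technique, here is a rough sketch of slice-by-8 next to the usual one-byte-at-a-time table lookup. It is not the code in slice8.diff. The function and table names are hypothetical, and the CRC-32C polynomial is an assumption chosen for illustration. The idea is to consume eight input bytes per loop iteration using eight 256-entry tables, so the lookups within an iteration do not depend on one another.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u     /* reflected CRC-32C polynomial (assumed) */

static uint32_t crc_table[8][256];
static int      crc_tables_ready = 0;   /* lazy init; not thread-safe */

/* crc_table[t][i] is the CRC state after byte i followed by t zero bytes. */
static void
crc_init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t crc = i;

        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1) ? CRC32C_POLY : 0);
        crc_table[0][i] = crc;
    }
    for (int t = 1; t < 8; t++)
        for (uint32_t i = 0; i < 256; i++)
            crc_table[t][i] = (crc_table[t - 1][i] >> 8) ^
                crc_table[0][crc_table[t - 1][i] & 0xFF];
    crc_tables_ready = 1;
}

/* Baseline: one table lookup per input byte. */
uint32_t
crc32c_bytewise(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL || len == 0);
    if (!crc_tables_ready)
        crc_init_tables();
    crc = ~crc;
    while (len--)
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *p++) & 0xFF];
    return ~crc;
}

/* Slice-by-8: eight independent lookups per 8 input bytes. */
uint32_t
crc32c_slice8(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL || len == 0);
    if (!crc_tables_ready)
        crc_init_tables();
    crc = ~crc;
    while (len >= 8)
    {
        /* Assemble bytes explicitly so the result is endian-independent. */
        uint32_t lo = crc ^ ((uint32_t) p[0] | (uint32_t) p[1] << 8 |
                             (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24);
        uint32_t hi = (uint32_t) p[4] | (uint32_t) p[5] << 8 |
                      (uint32_t) p[6] << 16 | (uint32_t) p[7] << 24;

        crc = crc_table[7][lo & 0xFF] ^ crc_table[6][(lo >> 8) & 0xFF] ^
              crc_table[5][(lo >> 16) & 0xFF] ^ crc_table[4][lo >> 24] ^
              crc_table[3][hi & 0xFF] ^ crc_table[2][(hi >> 8) & 0xFF] ^
              crc_table[1][(hi >> 16) & 0xFF] ^ crc_table[0][hi >> 24];
        p += 8;
        len -= 8;
    }
    while (len--)               /* leftover tail, one byte at a time */
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *p++) & 0xFF];
    return ~crc;
}
```

Both functions take 0 as the starting value and should agree on any input; for example, each returns 0xE3069283 for the nine bytes of "123456789". The tables take 8 KB, which is the cache-footprint cost paid for the shorter dependency chain.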
Finally, just for comparison, here's what happens when we use the
hardware instruction via gcc's __builtin_ia32_crc32xx intrinsics
(i.e. the additional patch I'm working on):
bin/pg_xlogdump --stats wal/000000010000000000000001 3.33s user 4.79s system 23% cpu 34.832 total
There are a number of potential micro-optimisations, but I wanted to
submit the obvious thing first and explore more possibilities later.
-- Abhijit
| Attachment | Content-Type | Size |
|---|---|---|
| slice8.diff | text/x-diff | 32.6 KB |