Re: Auto-vectorization speeds up multiplication of large-precision numerics

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Auto-vectorization speeds up multiplication of large-precision numerics
Date: 2020-09-07 20:49:20
Message-ID: 1709987.1599511760@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> I experimented with a few different ideas such as adding restrict
> decoration to the pointers, and eventually found that what works
> is to write the loop termination condition as "i2 < limit"
> rather than "i2 <= limit". It took me a long time to think of
> trying that, because it seemed ridiculously stupid. But it works.

I've done more testing and confirmed that both gcc and clang can
vectorize the improved loop on aarch64 as well as x86_64. (clang's
results can be confusing because -ftree-vectorize doesn't seem to
have any effect: its vectorizer is on by default. But if you use
-fno-vectorize it'll go back to the old, slower code.)
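
(If you want to watch this locally, clang's -Rpass=loop-vectorize remark
flag and gcc's -fopt-info-vec are handy; roughly

    clang -O2 -Rpass=loop-vectorize -c numeric.c    # vectorized by default
    clang -O2 -fno-vectorize -c numeric.c           # scalar again
    gcc -O2 -ftree-vectorize -fopt-info-vec -c numeric.c

modulo the usual -I/-D noise from the real build.)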

The only buildfarm effect I've noticed is that locust and
prairiedog, which are using nearly the same ancient gcc version,
complain

c1: warning: -ftree-vectorize enables strict aliasing. -fno-strict-aliasing is ignored when Auto Vectorization is used.

which is expected (they say the same for checksum.c), but then
there are a bunch of

warning: dereferencing type-punned pointer will break strict-aliasing rules

which seems worrisome. (This sort of thing is the reason I'm
hesitant to apply higher optimization levels across the board.)
Both animals pass the regression tests anyway, but if any other
compilers treat -ftree-vectorize as an excuse to apply stricter
optimization assumptions, we could be in for trouble.

I looked closer and saw that all of those warnings are about
init_var(), and this change makes them go away:

-#define init_var(v) MemSetAligned(v, 0, sizeof(NumericVar))
+#define init_var(v) memset(v, 0, sizeof(NumericVar))

I'm a little inclined to commit that as future-proofing. It's
essentially reversing out a micro-optimization I made in d72f6c750.
I doubt I had hard evidence that it made any noticeable difference;
and even if it did back then, modern compilers probably prefer the
memset approach.
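
For reference, the pattern that draws the warning looks roughly like
this (a condensed sketch, not the real MemSetAligned macro):

    #include <string.h>

    typedef struct NumericVar
    {
        int     ndigits;        /* fields condensed for illustration */
        int     weight;
        void   *buf;
    } NumericVar;

    static void
    zero_var(NumericVar *v)
    {
        /*
         * MemSetAligned-style word-at-a-time zeroing: the cast creates
         * a type-punned pointer, which old gcc flags under the strict
         * aliasing rules that -ftree-vectorize switches on.
         */
        long   *p = (long *) v;
        size_t  i;

        for (i = 0; i < sizeof(NumericVar) / sizeof(long); i++)
            p[i] = 0;           /* warning fires here */

        /* memset is type-blind, so this form draws no warning */
        memset(v, 0, sizeof(NumericVar));
    }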

regards, tom lane
