RE:[HACKERS] Deadlock in XLogInsert at AIX

From: "REIX, Tony" <tony(dot)reix(at)atos(dot)net>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Bernd Helmle <mailings(at)oopsware(dot)de>, "OLIVA, PASCAL" <pascal(dot)oliva(at)atos(dot)net>, "EMPEREUR-MOT, SYLVIE" <sylvie(dot)empereur-mot(at)atos(dot)net>
Subject: RE:[HACKERS] Deadlock in XLogInsert at AIX
Date: 2018-01-16 08:25:51
Message-ID: B37989F2852398498001550C29155BE5184AC1F0@FRCRPVV9EX3MSX.ww931.my-it-solutions.net
Lists: pgsql-hackers

Hi Michael,

My team and my company (ATOS/Bull) are involved in improving the quality of PostgreSQL on AIX.

We have AIX 6.1, 7.1, and 7.2 Power8 systems, with several logical/physical processors.
And I plan to have a more powerful (more processors) machine for running PostgreSQL stress tests.
A DB-expert colleague has started to write some new, not-too-complex stress tests that we'd like to submit to the PostgreSQL project later.
For now, using the latest versions of XLC 12 (12.1.0.19) and 13 (13.1.3.4 with a patch), we have only one remaining random failure on AIX 6.1 and 7.2 (in the src/bin/pgbench/t/001_pgbench.pl test), for PostgreSQL 9.6.6 and 10.1. On AIX 7.1, we have one more remaining failure that may be due to some other dependent software; we are investigating.
XLC 13.1.3.4 shows an issue with -O2, and I have a workaround that fixes it in ./src/backend/parser/gram.c. We have opened a PMR (defect) against XLC.
Note that our tests are now executed without the PG_FORCE_DISABLE_INLINE trick in src/include/port/aix.h that suppresses the inlining of routines on AIX. I think that the issues shown by older versions of XLC have now disappeared (or, at least, many of them have).
I've been able to compare PostgreSQL compiled with XLC vs GCC 7.1 and, using the timing output provided by the PostgreSQL tests, XLC seems to provide at least 8% more speed. We also plan to run professional performance tests in order to compare PostgreSQL 10.1 on AIX vs Linux/Power. I saw some 2017 performance slides, made with older versions of PostgreSQL and XLC, that show bad PostgreSQL performance on AIX vs Linux/Power, and I cannot believe it. We plan to investigate this.

Though I have very little PostgreSQL expertise (I am also currently porting GCC Go to AIX), we can help, at least by compiling/testing/investigating/stressing in AIX environments different from the ones (32/64bit, XLC/GCC) you have in your BuildFarm.
Let me know how we can help.

Regards,

Tony Reix

ATOS / Bull SAS
ATOS Expert
IBM Coop Architect & Technical Leader
Office : +33 (0) 4 76 29 72 67
1 rue de Provence - 38432 Échirolles - France
www.atos.net

________________________________________
From: Michael Paquier [michael(dot)paquier(at)gmail(dot)com]
Sent: Tuesday, January 16, 2018 08:12
To: Noah Misch
Cc: Heikki Linnakangas; Konstantin Knizhnik; PostgreSQL Hackers; Bernd Helmle
Subject: Re: [HACKERS] Deadlock in XLogInsert at AIX

On Fri, Feb 03, 2017 at 12:26:50AM +0000, Noah Misch wrote:
> On Wed, Feb 01, 2017 at 02:39:25PM +0200, Heikki Linnakangas wrote:
>> @@ -73,11 +73,19 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
>>  static inline uint32
>>  pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
>>  {
>> +    uint32 ret;
>> +
>>      /*
>> -     * __fetch_and_add() emits a leading "sync" and trailing "isync", thereby
>> -     * providing sequential consistency. This is undocumented.
>> +     * Use __sync() before and __isync() after, like in compare-exchange
>> +     * above.
>>       */
>> -    return __fetch_and_add((volatile int *)&ptr->value, add_);
>> +    __sync();
>> +
>> +    ret = __fetch_and_add((volatile int *)&ptr->value, add_);
>> +
>> +    __isync();
>> +
>> +    return ret;
>>  }
>
> Since this emits double syncs with older xlc, I recommend instead replacing
> the whole thing with inline asm. As I opined in the last message of the
> thread you linked above, the intrinsics provide little value as abstractions
> if one checks the generated code to deduce how to use them. Now that the
> generated code is xlc-version-dependent, the port is better off with
> compiler-independent asm like we have for ppc in s_lock.h.
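
For illustration, the inline-asm approach suggested above might look roughly like the sketch below. This is not the patch from the thread: the type demo_atomic_uint32 and the function demo_fetch_add_u32 are hypothetical stand-ins for PostgreSQL's pg_atomic_uint32 and pg_atomic_fetch_add_u32_impl, the asm uses GCC-style extended syntax, and the non-PowerPC branch assumes a GCC- or Clang-compatible compiler so that the sketch can be built elsewhere.

```c
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's pg_atomic_uint32. */
typedef struct
{
    volatile uint32_t value;
} demo_atomic_uint32;

/*
 * Atomically add add_ to ptr->value and return the previous value.
 * Hypothetical name; a sketch, not PostgreSQL's implementation.
 */
static inline uint32_t
demo_fetch_add_u32(demo_atomic_uint32 *ptr, int32_t add_)
{
#if defined(__powerpc__) || defined(__ppc__) || defined(__PPC__)
    uint32_t tmp;   /* new value to store */
    uint32_t res;   /* old value, returned to the caller */

    __asm__ __volatile__(
        "   sync             \n"  /* barrier before the update */
        "1: lwarx   %1,0,%4  \n"  /* load old value and take a reservation */
        "   add     %0,%1,%3 \n"  /* compute old value + add_ */
        "   stwcx.  %0,0,%4  \n"  /* store only if the reservation still holds */
        "   bne-    1b       \n"  /* reservation lost: retry from lwarx */
        "   isync            \n"  /* barrier after the update */
        : "=&r"(tmp), "=&r"(res), "+m"(ptr->value)
        : "r"(add_), "r"(&ptr->value)
        : "memory", "cc");

    return res;
#else
    /* Assumption: GCC/Clang builtin, used only so the sketch builds off PowerPC. */
    return __atomic_fetch_add(&ptr->value, (uint32_t) add_, __ATOMIC_SEQ_CST);
#endif
}
```

Because the barriers are spelled out in the asm itself, the emitted instruction sequence no longer depends on what a given compiler version generates for an intrinsic.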

Could it be cleaner to just use __xlc_ver__ to avoid double syncs on
past versions? I think that it would make the code more understandable
than just listing the instructions directly. As there have been other
bug reports from Tony Reix, who has been working on AIX with XLC 13.1, and
as this thread got lost in the wild, I have added an entry in the next
CF:
https://commitfest.postgresql.org/17/1484/

As Heikki is not around these days, Noah, could you provide a new
version of the patch? This bug has been around for some time now, so it
would be nice to move on. I think I could have written patches myself,
but I don't have an AIX machine at hand, and certainly not one with XLC 13.1.
--
Michael
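
For illustration, the version-gated alternative raised in the message above (skipping the explicit barriers on older xlc releases whose __fetch_and_add() already emits them) might be sketched as follows. Several things here are assumptions: the sketch tests XL C's __xlC__ version macro rather than the __xlc_ver__ spelling used in the message, the cutoff DEMO_XLC_FIRST_WITHOUT_IMPLICIT_SYNC is a placeholder because the thread does not say at which release the generated code changed, and the names demo_atomic_uint32 and demo_fetch_add_u32 are hypothetical. The non-xlc branch assumes a GCC- or Clang-compatible compiler so that the sketch can be built elsewhere.

```c
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's pg_atomic_uint32. */
typedef struct
{
    volatile uint32_t value;
} demo_atomic_uint32;

/*
 * PLACEHOLDER cutoff in __xlC__ encoding (0xVVRR). The real release at
 * which __fetch_and_add() stopped emitting sync/isync is not given in
 * the thread and would have to be determined by inspecting generated code.
 */
#define DEMO_XLC_FIRST_WITHOUT_IMPLICIT_SYNC 0x0d01

#if defined(__xlC__) && (__xlC__ >= DEMO_XLC_FIRST_WITHOUT_IMPLICIT_SYNC)
#define DEMO_NEED_EXPLICIT_BARRIERS 1
#else
#define DEMO_NEED_EXPLICIT_BARRIERS 0
#endif

/* Atomically add add_ to ptr->value and return the previous value. */
static inline uint32_t
demo_fetch_add_u32(demo_atomic_uint32 *ptr, int32_t add_)
{
#if defined(__xlC__)
    uint32_t ret;

#if DEMO_NEED_EXPLICIT_BARRIERS
    __sync();       /* newer xlc: intrinsic has no leading sync */
#endif
    ret = (uint32_t) __fetch_and_add((volatile int *) &ptr->value, add_);
#if DEMO_NEED_EXPLICIT_BARRIERS
    __isync();      /* newer xlc: intrinsic has no trailing isync */
#endif
    return ret;
#else
    /* Assumption: GCC/Clang builtin, used only so the sketch builds without xlc. */
    return __atomic_fetch_add(&ptr->value, (uint32_t) add_, __ATOMIC_SEQ_CST);
#endif
}
```

The trade-off is the one discussed in the thread: this keeps the intrinsic readable, but its correctness still depends on knowing what each compiler version emits.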
