[PATCH] Resolve Parallel Hash Join Performance Issue

From: "Deng, Gang" <gang(dot)deng(at)intel(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH] Resolve Parallel Hash Join Performance Issue
Date: 2020-01-09 08:53:42
Message-ID: 0F44E799048C4849BAE4B91012DB910462E9897A@SHSMSX103.ccr.corp.intel.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,

Attached is a patch to resolve parallel hash join performance issue. This is my first time to contribute patch to PostgreSQL community, I referred one of previous thread as template to report the issue and patch. Please let me know if need more information of the problem and patch.

A. Problem Summary
When we ran query which was executed by hash join operation, we can not achieve good performance improvement with more number of threads. More specifically, when we ran query02 of TPC-DS workload using scale 500 (500GB dataset), execution time using 8 threads was 124.6 sec while time using 28 threads was 103.5 sec. Here is execution time by different number of threads:

number of thread: 1 4 8 16 28
time used(sec): 460.4 211 124.6 101.9 103.5

The test was made on a server with 384GB DRAM, 56 cores/112 HT. Data has been cached into OS page cache, so there was no disk I/O during execution, and there were enough physical CPU cores to support 28 threads to run in parallel.

We investigated this problem with perf c2c (http://man7.org/linux/man-pages/man1/perf-c2c.1.html) tool, confirmed the problem was caused by false sharing cache coherence. And we located the code write cache line is at line 457 of nodeHashjoin.c (pg version 12.0).

B. Patch
change line 457 in ExecHashJoinImpl function of nodeHashJoin.c. (be applicable to both 12.0 and 12.1)
original code:
HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
changed to:
if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
{
HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
}
Compared with original code, modified code can avoid unnecessary write to memory/cache.

C. Test case:

1. Use https://github.com/pivotalguru/TPC-DS to setup TPC-DS benchmark for postgreSQL
2. run below command to ensure query will be executed with expected parallelism:
psql postgres -h localhost -U postgres -c "alter table web_sales set (parallel_workers =28);";psql postgres -h localhost -U postgres -c "alter table catalog_sales set (parallel_workers =28);"
3. run query: psql postgres -h localhost -U postgres -f 102.tpcds.02.sql first to ensure data is loaded into page cache.
4. run query " time psql postgres -h localhost -U postgres -f 102.tpcds.02.sql" again to measure performance.

D. Result
With the modified code, performance of hash join operation can scale better with number of threads. Here is result of query02 after patch. For example, performance improved ~2.5x when run 28 threads.

number of thread: 1 4 8 16 28
time used(sec): 465.1 193.1 97.9 55.9 41

I attached 5 files for more information:

1. query_plan_q2_no_opt_28_thread: query plan using 28 threads without patch
2. query_plan_q2_opt_28_thread: query plan using 28 threads with patch
3. perf_c2c_no_opt.txt: perf c2c output before patch
4. perf_c2c_opt.txt: perf c2c output after patch
5. git diff of the patch

Thanks and Best Regards

Deng, Gang (邓刚)
IAGS-CPDP-CEE PRC Enterprise
Mobile: 13161337000
Office: 010-57511964

Attachment Content-Type Size
git_diff application/octet-stream 705 bytes
perf_c2c_no_opt.txt text/plain 2.5 MB
perf_c2c_opt.txt text/plain 430.7 KB
query_plan_q2_no_opt_28_thread application/octet-stream 6.5 KB
query_plan_q2_opt_28_thread application/octet-stream 6.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message godjan • 2020-01-09 08:55:17 Verify true root on replicas with amcheck
Previous Message Andrew Dunstan 2020-01-09 08:14:33 Re: Allow 'sslkey' and 'sslcert' in postgres_fdw user mappings