From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Atomics for heap_parallelscan_nextpage()
Date: 2017-05-06 13:57:37
Message-ID: CAKJS1f9tgsPhqBcoPjv9_KUPZvTLCZ4jy=B=bhqgaKn7cYzm-w@mail.gmail.com
Lists: pgsql-hackers
Hi,
A while back I did some benchmarking on a big 4-socket machine to
explore a bit around the outer limits of parallel aggregates. I
discovered along the way that, given enough workers and a simple
enough task, seq-scan workers were held up waiting for the lock
to be released in heap_parallelscan_nextpage().
I've since done a little work in this area to try to improve things,
and I ended up posting about it yesterday in [1]. My original patch
used batching to solve the issue: instead of allocating 1 block at a
time, the batching patch allocated a range of 10 blocks for each worker
to process. However, the implementation still needed a bit of work
around reporting sync-scan locations.
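For reference, a rough sketch of the batching idea is below. The helper
name and the batch-size constant are made up for illustration; the field
names follow the existing ParallelHeapScanDesc, and the sync-scan
wrap-around and reporting parts are omitted:

#include "postgres.h"
#include "access/relscan.h"   /* ParallelHeapScanDesc */
#include "storage/spin.h"     /* SpinLockAcquire/Release */

#define PARALLEL_SCAN_BATCH_SIZE 10   /* illustrative batch size */

/*
 * Hand out a batch of up to PARALLEL_SCAN_BATCH_SIZE blocks under the
 * mutex, rather than a single block per call.  Returns the first block
 * of the batch, or InvalidBlockNumber once all blocks are allocated.
 * (Sketch only: assumes the counter starts at 0 and ignores sync-scan
 * start blocks.)
 */
static BlockNumber
heap_parallelscan_nextbatch(ParallelHeapScanDesc parallel_scan,
                            BlockNumber *batch_end)
{
    BlockNumber first;

    SpinLockAcquire(&parallel_scan->phs_mutex);

    if (parallel_scan->phs_cblock >= parallel_scan->phs_nblocks)
    {
        /* every block has already been handed out */
        SpinLockRelease(&parallel_scan->phs_mutex);
        return InvalidBlockNumber;
    }

    first = parallel_scan->phs_cblock;
    parallel_scan->phs_cblock = Min(first + PARALLEL_SCAN_BATCH_SIZE,
                                    parallel_scan->phs_nblocks);
    *batch_end = parallel_scan->phs_cblock;   /* exclusive end of batch */

    SpinLockRelease(&parallel_scan->phs_mutex);

    return first;
}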
Andres mentioned in [2] that it might be worth exploring using atomics
to do the same job. So I went ahead and did that, and came up with the
attached, which is a slight variation on what he mentioned in the
thread.
To keep things a bit simpler and more streamlined, I ended up pulling
the logic for setting the startblock out into another function, which
we only call once, before the first call to
heap_parallelscan_nextpage(). I also ended up removing phs_cblock and
replacing it with a counter that always starts at zero. The actual
block is calculated from that plus the startblock, modulo nblocks.
This makes it quite a bit simpler to detect when we've allocated all
the blocks to the workers, and it also works nicely when wrapping back
to the start of a relation after starting somewhere in the middle due
to piggybacking on a synchronous scan.
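To give a rough idea of the shape of this, here's a minimal sketch of
the per-page allocation, assuming a pg_atomic_uint64 counter in
ParallelHeapScanDesc (the field name phs_nallocated is illustrative,
and sync-scan location reporting is left out; the attached patch is the
real thing):

#include "postgres.h"
#include "access/relscan.h"
#include "port/atomics.h"

/*
 * Sketch of the atomics approach: each worker claims the next
 * zero-based page number with a lock-free fetch-add, then maps it to
 * an actual block by offsetting from the (possibly sync-scan chosen)
 * start block and wrapping at the end of the relation.
 */
static BlockNumber
heap_parallelscan_nextpage_sketch(ParallelHeapScanDesc parallel_scan)
{
    uint64      nallocated;

    /* claim the next zero-based page number atomically */
    nallocated = pg_atomic_fetch_add_u64(&parallel_scan->phs_nallocated, 1);

    /* once every block has been handed out, the scan is done */
    if (nallocated >= parallel_scan->phs_nblocks)
        return InvalidBlockNumber;

    /* translate to a real block, wrapping past the end of the relation */
    return (BlockNumber) ((parallel_scan->phs_startblock + nallocated) %
                          parallel_scan->phs_nblocks);
}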
Performance:
With parallel_workers=71, it looks something like:
Query 1: 881 GB, ~6 billion row TPC-H lineitem table.
tpch=# select count(*) from lineitem;
count
------------
5999989709
(1 row)
-- Master
Time: 123421.283 ms (02:03.421)
Time: 118895.846 ms (01:58.896)
Time: 118632.546 ms (01:58.633)
-- Atomics patch
Time: 74038.813 ms (01:14.039)
Time: 73166.200 ms (01:13.166)
Time: 72492.338 ms (01:12.492)
-- Batching Patch: Batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 76364.215 ms (01:16.364)
Time: 75808.900 ms (01:15.809)
Time: 74927.756 ms (01:14.928)
Query 2: Single int column table with 2 billion rows.
tpch=# select count(*) from a;
count
------------
2000000000
(1 row)
-- Master
Time: 5853.918 ms (00:05.854)
Time: 5925.633 ms (00:05.926)
Time: 5859.223 ms (00:05.859)
-- Atomics patch
Time: 5825.745 ms (00:05.826)
Time: 5849.139 ms (00:05.849)
Time: 5815.818 ms (00:05.816)
-- Batching Patch: Batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 5789.237 ms (00:05.789)
Time: 5837.395 ms (00:05.837)
Time: 5821.492 ms (00:05.821)
I've also attached a text file with the perf report for the lineitem
query. You'll notice that heap_parallelscan_nextpage() is very
visible in master, but not in either of the two patches.
With the 2nd query, heap_parallelscan_nextpage() is fairly
insignificant in master's profile; it only shows up at 0.48%. Likely
this is because more tuples are read from each page, and more
aggregation work gets done before the next page is needed. I'm
uncertain why I previously saw a speed-up for this case in [1].
I've also noticed that both the atomics patch and unpatched master do
something that looks a bit weird with synchronous seq-scans. If the
parallel seq-scan piggybacked on another scan, then subsequent
parallel scans will start at the same non-zero block location, even
when no other concurrent scans exist. I'd have expected this to go
back to block 0 again, but maybe I'm just failing to understand the
reason for reporting the startblock to ss_report_location() at the end
of the scan.
I'll now add this to the first commitfest of pg11. I just wanted to
note that I've done this, so that it's less likely someone else goes
and repeats the same work.
[2] https://www.postgresql.org/message-id/20170505023646.3uhnmf2hbwtm63lc%40alap3.anarazel.de
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
  parallel_next_page_perf.txt      (text/plain, 15.9 KB)
  parallel_nextpage_atomics.patch  (application/octet-stream, 7.7 KB)