Re: Trouble with hashagg spill I/O pattern and costing

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Trouble with hashagg spill I/O pattern and costing
Date: 2020-05-21 00:12:55
Message-ID: 20200521001255.kfaihp3afv6vy6uq@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
>On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>>
>> And if there a way to pre-allocate larger chunks? Presumably we could
>> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x
>> 8kB)
>> instead of just single block. I haven't seen anything like that in
>> tape.c, though ...
>
>It turned out to be simple (at least a POC) so I threw together a
>patch. I just added a 32-element array of block numbers to each tape.
>When we need a new block, we retrieve a block number from that array;
>or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
>
>I reproduced the problem on a smaller scale (330M groups, ~30GB of
>memory on a 16GB box). Work_mem=64MB. The query is a simple distinct.
>
>Unpatched master:
> Sort: 250s
> HashAgg: 310s
>Patched master:
> Sort: 245s
> HashAgg: 262s
>
>That's a nice improvement for such a simple patch. We can tweak the
>number of blocks to preallocate, or do other things like double from a
>small number up to a maximum. Also, a proper patch would probably
>release the blocks back as free when the tape was rewound.
>
>As long as the number of block numbers to preallocate is not too large,
>I don't think we need to change the API. It seems fine for sort to do
>the same thing, even though there's not any benefit.
>

I gave it a try on the machine with temp tablespace on SSD, and I can
confirm it improves performance. I've tried with different work_mem
values and I've also increased the number of pre-allocated blocks to 64
and 128 blocks, and the numbers look like this:

master

sort hash
----------------------------
4MB 335 1331
128MB 220 1208

patched (32)

sort hash
----------------------------
4MB 344 685
128MB 217 641

patched (64)

sort hash
----------------------------
4MB 329 545
128MB 214 493

patched (128)

sort hash
----------------------------
4MB 331 478
128MB 222 434

I agree that's pretty nice. I wonder how far would we need to go before
reaching a plateau. I'll try this on the other machine with temporary
tablespace on SATA, but that'll take longer.

The I/O pattern changed significantly - it's not visible on the charts,
so I'm not attaching them. But the statistics of block sizes and "gaps"
are pretty clear.

size of I/O requests
--------------------

a) master

type | bytes | count | pct
------+---------+---------+--------
RA | 8192 | 2905948 | 95.83
RA | 24576 | 63470 | 2.09
RA | 16384 | 40155 | 1.32
W | 8192 | 149295 | 52.85
W | 16384 | 51781 | 18.33
W | 24576 | 22247 | 7.88
W | 1310720 | 15493 | 5.48
W | 32768 | 11856 | 4.20

b) patched, 32 blocks

type | bytes | count | pct
------+---------+--------+--------
RA | 131072 | 247686 | 41.75
RA | 8192 | 95746 | 16.14
RA | 16384 | 82314 | 13.87
RA | 32768 | 82146 | 13.85
RA | 65536 | 82126 | 13.84
W | 1310720 | 16815 | 52.19
W | 262144 | 3628 | 11.26
W | 524288 | 2791 | 8.66

c) patched, 64 blocks

type | bytes | count | pct
------+---------+--------+--------
RA | 131072 | 213556 | 56.18
RA | 8192 | 47663 | 12.54
RA | 16384 | 39358 | 10.35
RA | 32768 | 39308 | 10.34
RA | 65536 | 39304 | 10.34
W | 1310720 | 18132 | 65.27
W | 524288 | 3722 | 13.40
W | 262144 | 581 | 2.09
W | 1048576 | 405 | 1.46
W | 8192 | 284 | 1.02

d) patched, 128 blocks

type | bytes | count | pct
------+---------+--------+--------
RA | 131072 | 200816 | 70.93
RA | 8192 | 23640 | 8.35
RA | 16384 | 19324 | 6.83
RA | 32768 | 19279 | 6.81
RA | 65536 | 19273 | 6.81
W | 1310720 | 18000 | 65.91
W | 524288 | 2074 | 7.59
W | 1048576 | 660 | 2.42
W | 8192 | 409 | 1.50
W | 786432 | 354 | 1.30

Clearly, the I/O requests are much larger - both reads and writes
shifted from 8kB to much larger ones, and the larger the number of
blocks the more significant the shift is. This means the workload is
getting more "sequential" and the write combining / read-ahead becomes
more effective.

deltas between I/O requests
---------------------------

I'll only show reads to save space, it's about the same for writes.

a) master

type | block_delta | count | pct
------+-------------+--------+-------
RA | 256 | 569237 | 18.77
RA | 240 | 475182 | 15.67
RA | 272 | 437260 | 14.42
RA | 224 | 328604 | 10.84
RA | 288 | 293628 | 9.68
RA | 208 | 199530 | 6.58
RA | 304 | 181695 | 5.99
RA | 192 | 109472 | 3.61
RA | 320 | 105211 | 3.47
RA | 336 | 57423 | 1.89

b) patched, 32 blocks

type | block_delta | count | pct
------+-------------+--------+-------
RA | 256 | 165071 | 27.82
RA | 32 | 82129 | 13.84
RA | 64 | 82122 | 13.84
RA | 128 | 82077 | 13.83
RA | 16 | 82042 | 13.83
RA | 7440 | 45168 | 7.61
RA | 7952 | 9838 | 1.66

c) patched, 64 blocks

type | block_delta | count | pct
------+-------------+--------+-------
RA | 256 | 173737 | 45.70
RA | 32 | 39301 | 10.34
RA | 64 | 39299 | 10.34
RA | 128 | 39291 | 10.34
RA | 16 | 39250 | 10.32
RA | 15120 | 21202 | 5.58
RA | 15376 | 4448 | 1.17

d) patched, 128 blocks

type | block_delta | count | pct
------+-------------+--------+-------
RA | 256 | 180955 | 63.91
RA | 32 | 19274 | 6.81
RA | 64 | 19273 | 6.81
RA | 128 | 19264 | 6.80
RA | 16 | 19203 | 6.78
RA | 30480 | 9835 | 3.47

The way I understand it, this needs to be interpreted together with
block size stats - in a perfectly sequential workload the two stats
would match. For master that's clearly not the case - the most common
read request size is 8kB, but the most common delta is 128kB (256
sectors, which is the read-ahead for the SSD device). The patched
results are much closer, mostly thanks to switching to 128kB reads.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2020-05-21 00:17:31 Re: Operator class parameters and sgml docs
Previous Message Thomas Munro 2020-05-21 00:02:36 Re: Parallel Seq Scan vs kernel read ahead