From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-08-14 23:38:38
Message-ID: 53ED487E.1060705@fuzzy.cz
Lists: pgsql-hackers

On 14.8.2014 21:47, Tomas Vondra wrote:
> On 14.8.2014 18:54, Jeff Davis wrote:
>> On Thu, 2014-08-14 at 16:17 +0200, Tomas Vondra wrote:
>>> Either it belongs to the current batch (and either it's in the hash
>>> table, or you add it there), or it's not - in that case write it to a
>>> temp file.
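
(To be concrete, the per-tuple decision I'm describing is just this - a
toy sketch with integer keys, count(*) as the only aggregate, and
made-up names; illustration only, not actual executor code:)

/*
 * Toy sketch of the per-tuple decision in a memory-bounded hashagg:
 * batch membership is taken from the group hash; the current batch is
 * aggregated in memory, everything else is written to a temp file.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NBATCHES    8        /* number of batches (power of two) */
#define HTAB_SIZE   1024     /* in-memory groups, current batch only */

typedef struct { int64_t key; int64_t count; int used; } Group;

static Group    htab[HTAB_SIZE];       /* trivial open-addressing table */
static FILE    *batch_file[NBATCHES];  /* one temp file per batch */
static int      curbatch = 0;

static void
init_batches(void)
{
    for (int i = 0; i < NBATCHES; i++)
        batch_file[i] = tmpfile();     /* spill target for that batch */
}

static uint32_t
hash_key(int64_t key)
{
    return (uint32_t) ((uint64_t) key * 2654435761u);  /* multiplicative hash */
}

static void
process_key(int64_t key)
{
    uint32_t    hash = hash_key(key);
    int         batchno = hash % NBATCHES;

    if (batchno == curbatch)
    {
        /* current batch: find the group in the hash table, or add it */
        uint32_t    pos = hash % HTAB_SIZE;

        while (htab[pos].used && htab[pos].key != key)
            pos = (pos + 1) % HTAB_SIZE;    /* linear probing; assumes
                                             * the table never fills up */

        if (!htab[pos].used)
        {
            htab[pos].used = 1;
            htab[pos].key = key;
        }
        htab[pos].count++;                  /* advance the aggregate */
    }
    else
    {
        /* not the current batch: write it to that batch's temp file */
        fwrite(&key, sizeof(key), 1, batch_file[batchno]);
    }
}
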
>>
>> I think the part you left out is that you need two files per batch: one
>> for the dumped-out partially-computed state values, and one for the
>> tuples.
>>
>> In other words, you haven't really discussed the step where you reunite
>> the tuples with that partially-computed state.
>
> No, that's not how the serialize/deserialize should work. The aggregate
> needs to store the state as-is, so that after deserializing you get back
> pretty much the same thing.
>
> For example, for 'median' the state is the list of all the values
> received so far, and when serializing it you have to write all the
> values out. After deserializing it, you will get the same list of values.
>
> Some aggregates may use complex data structures that need a more
> elaborate serialization.
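
(For illustration, a minimal sketch of what such a serialize/deserialize
pair could look like for an array-of-values state - the struct and
function names are made up, this is not the actual aggregate API:)

/*
 * Hypothetical array-of-int64 aggregate state (e.g. for a median-like
 * aggregate). The state is written out as-is: the count, then the raw
 * values. Error checking omitted for brevity.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct ArrayState
{
    int64_t     nvalues;    /* number of values accumulated so far */
    int64_t    *values;     /* the values themselves */
} ArrayState;

static void
state_serialize(const ArrayState *state, FILE *file)
{
    fwrite(&state->nvalues, sizeof(int64_t), 1, file);
    fwrite(state->values, sizeof(int64_t), state->nvalues, file);
}

static ArrayState *
state_deserialize(FILE *file)
{
    ArrayState *state = malloc(sizeof(ArrayState));

    fread(&state->nvalues, sizeof(int64_t), 1, file);
    state->values = malloc(state->nvalues * sizeof(int64_t));
    fread(state->values, sizeof(int64_t), state->nvalues, file);

    return state;   /* equivalent to the state that was written out */
}
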
>
>>> For sure, it's not for free - it may write to quite a few files. Is it
>>> more expensive than what you propose? I'm not sure about that. With
>>> your batching scheme, you'll end up with a lower number of large
>>> batches, and you'll need to read and split them, possibly repeatedly.
>>> The batching scheme from hashjoin minimizes this.
>>
>> My approach only has fewer batches if it elects to have fewer batches,
>> which might happen for two reasons:
>> 1. A cardinality misestimate. This certainly could happen, but we do
>> have useful numbers to work from (we know the number of tuples and
>> distincts that we've read so far), so it's far from a blind guess.
>> 2. We're concerned about the random I/O from way too many partitions.
>
> OK. We can't really do much with the cardinality estimate.
>
> As for the random I/O concerns, I did a quick test to see how this
> behaves. I used an HP ProLiant DL380 G5 (i.e. a quite old machine, from
> 2006-09 if I'm not mistaken): 16GB RAM, RAID10 on 6 x 10k SAS drives,
> 512MB write cache. So quite a lousy machine by today's standards.
>
> I used a simple C program (attached) that creates N files, and writes
> into them in a round-robin fashion until a particular file size is
> reached. I opted for 64GB total size, 1kB writes.
>
> ./iotest filecount filesize writesize
>
> Filesize is in MB, writesize is in bytes. So, for example, this writes
> 64 files of 1GB each, using 512B writes:
>
> ./iotest 64 1024 512
>
> Measured is the duration before/after fsync (in seconds):
>
> files | file size (MB) | before fsync (s) | after fsync (s)
> -----------------------------------------------------------
>    32 |           2048 |           290.16 |          294.33
>    64 |           1024 |           264.68 |          267.60
>   128 |            512 |           278.68 |          283.44
>   256 |            256 |           332.11 |          338.45
>  1024 |             64 |           419.91 |          425.48
>  2048 |             32 |           450.37 |          455.20
>
> So while there is a difference, I don't think it's the 'random I/O wall'
> as usually observed on rotational drives. Also, this is a 2.6.32 kernel,
> and my suspicion is that the behaviour would be better with a newer one.
>
> I also have an SSD in that machine (Intel S3700), so I did the same test
> with these results:
>
> files | file size (MB) | before fsync (s) | after fsync (s)
> -----------------------------------------------------------
>    32 |           2048 |           445.05 |          464.73
>    64 |           1024 |           447.32 |          466.56
>   128 |            512 |           446.63 |          465.90
>   256 |            256 |           446.64 |          466.19
>  1024 |             64 |           511.85 |          523.24
>  2048 |             32 |           579.92 |          590.76
>
> So yes, the number of files matters, but I don't think the effect is
> strong enough to draw a clear line on how many batches we allow,
> especially considering how old this machine is (on 3.x kernels, we
> usually see much better performance under I/O-intensive conditions).
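
(For reference, the attached iotest.c does roughly the following - this
is a simplified sketch, with argument and error checking mostly omitted;
the attached file is the authoritative version:)

/*
 * Create 'filecount' files and write into them in a round-robin
 * fashion, 'writesize' bytes at a time, until each file reaches
 * 'filesize' MB; report elapsed time before and after fsync.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    if (argc != 4)
    {
        fprintf(stderr, "usage: %s filecount filesize writesize\n", argv[0]);
        return 1;
    }

    int      filecount = atoi(argv[1]);
    long     filesize = atol(argv[2]) * 1024L * 1024L;  /* MB -> bytes */
    int      writesize = atoi(argv[3]);
    char    *buffer = calloc(writesize, 1);
    FILE   **files = malloc(filecount * sizeof(FILE *));
    long     written = 0;
    time_t   start = time(NULL);

    for (int i = 0; i < filecount; i++)
    {
        char    name[64];

        snprintf(name, sizeof(name), "iotest.%d", i);
        files[i] = fopen(name, "w");
    }

    /* round-robin writes until each file reaches the requested size */
    while (written < filesize)
    {
        for (int i = 0; i < filecount; i++)
            fwrite(buffer, 1, writesize, files[i]);
        written += writesize;
    }

    printf("before fsync: %ld s\n", (long) (time(NULL) - start));

    for (int i = 0; i < filecount; i++)
    {
        fflush(files[i]);
        fsync(fileno(files[i]));
        fclose(files[i]);
    }

    printf("after fsync: %ld s\n", (long) (time(NULL) - start));

    return 0;
}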

And just for fun, I did the same test on a workstation with 8GB of RAM,
an S3700 SSD, an i5-2500 CPU and kernel 3.12 - that is, more modern
hardware / kernel / ... than the machine above.

For a test writing 32GB of data (4x the RAM), I got these results:

files | file size (MB) | before fsync (s) | after fsync (s)
-----------------------------------------------------------
   32 |           1024 |           171.27 |          175.96
   64 |            512 |           165.57 |          170.12
  128 |            256 |           165.29 |          169.95
  256 |            128 |           164.69 |          169.62
  512 |             64 |           163.98 |          168.90
 1024 |             32 |           165.44 |          170.50
 2048 |             16 |           165.97 |          171.35
 4096 |              8 |           166.55 |          173.26

So, no sign of a slowdown at all in this case. I don't have a rotational
disk in the machine at the moment, so I can't repeat the test there, but
I don't expect the impact to be much worse than on the old machine.

I'm not sure whether this proves we should not worry about the number of
batches at all - old kernels / machines will be with us for some time.
However, I'm not a fan of artificially limiting the implementation
because of decade-old machines either.

Tomas
