Re: Parallel Seq Scan

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, John Gorman <johngorman2(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-01-29 16:34:08
Message-ID: CAMkU=1zq17cb-FgXnuRpzTwAL2aoaG8Gqs=AodUh1BTmfH5X9Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 27, 2015 at 11:08 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> On 01/28/2015 04:16 AM, Robert Haas wrote:
>
>> On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
>> wrote:
>>
>>> Now, when you did what I understand to be the same test on the same
>>> machine, you got times ranging from 9.1 seconds to 35.4 seconds.
>>> Clearly, there is some difference between our test setups. Moreover,
>>> I'm kind of suspicious about whether your results are actually
>>> physically possible. Even in the best case where you somehow had the
>>> maximum possible amount of data - 64 GB on a 64 GB machine - cached,
>>> leaving no space for cache duplication between PG and the OS and no
>>> space for the operating system or postgres itself - the table is 120
>>> GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB
>>> from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
>>> there could be some speedup from issuing I/O requests in parallel
>>> instead of serially, but that is a 15x speedup over dd, so I am a
>>> little suspicious that there is some problem with the test setup,
>>> especially because I cannot reproduce the results.
>>>
>>
>> So I thought about this a little more, and I realized after some
>> poking around that hydra's disk subsystem is actually six disks
>> configured in a software RAID5[1]. So one advantage of the
>> chunk-by-chunk approach you are proposing is that you might be able to
>> get all of the disks chugging away at once, because the data is
>> presumably striped across all of them. Reading one block at a time,
>> you'll never have more than 1 or 2 disks going, but if you do
>> sequential reads from a bunch of different places in the relation, you
>> might manage to get all 6. So that's something to think about.
>>
>> One could imagine an algorithm like this: as long as there are more
>> 1GB segments remaining than there are workers, each worker tries to
>> chug through a separate 1GB segment. When there are not enough 1GB
>> segments remaining for that to work, then they start ganging up on the
>> same segments. That way, you get the benefit of spreading out the I/O
>> across multiple files (and thus hopefully multiple members of the RAID
>> group) when the data is coming from disk, but you can still keep
>> everyone busy until the end, which will be important when the data is
>> all in-memory and you're just limited by CPU bandwidth.
>>
>
> OTOH, spreading the I/O across multiple files is not a good thing, if you
> don't have a RAID setup like that. With a single spindle, you'll just
> induce more seeks.
>
> Perhaps the OS is smart enough to read in large-enough chunks that the
> occasional seek doesn't hurt much. But then again, why isn't the OS smart
> enough to read in large-enough chunks to take advantage of the RAID even
> when you read just a single file?

In my experience, the RAID is smart enough to take advantage of that.
If the RAID controller detects a sequential access pattern, it
initiates a read-ahead on each disk to pre-position the data it will need
(or at least, the behavior I observe is as if it did that). But maybe if
the sequential read is really a bunch of "random" reads from different
processes which just happen to add up to sequential access, that confuses
the algorithm?
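
For illustration, here is a minimal standalone sketch of the claiming
policy Robert describes above (hypothetical names, pthreads, and a tiny
segment/block count in place of PostgreSQL's shared memory and background
workers): each worker claims a whole 1GB segment while more segments than
workers remain, and after that the workers gang up and share blocks within
the remaining segments.

/*
 * Standalone sketch (not PostgreSQL code) of the proposed policy:
 * while more 1GB segments remain than there are workers, each worker
 * claims a whole segment for itself; once fewer segments than workers
 * remain, workers share the blocks of the remaining segments.
 */
#include <pthread.h>
#include <stdio.h>

#define NUM_SEGMENTS   8    /* 1GB segments in the relation (assumed) */
#define NUM_WORKERS    3
#define BLOCKS_PER_SEG 4    /* tiny for illustration; really 131072 8kB blocks */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_segment = 0;            /* next unclaimed whole segment */
static int shared_block[NUM_SEGMENTS];  /* per-segment shared block counter */

/*
 * Claim the next unit of work: a whole segment if enough remain,
 * otherwise a single block inside one of the remaining segments.
 * Returns 0 when the scan is finished.
 */
static int claim_work(int *segment, int *block)
{
    int have_work = 0;

    pthread_mutex_lock(&lock);
    if (NUM_SEGMENTS - next_segment > NUM_WORKERS)
    {
        /* Plenty of segments left: take a whole one (block == -1). */
        *segment = next_segment++;
        *block = -1;
        have_work = 1;
    }
    else
    {
        /* Tail of the scan: share blocks of the remaining segments. */
        for (int s = next_segment; s < NUM_SEGMENTS; s++)
        {
            if (shared_block[s] < BLOCKS_PER_SEG)
            {
                *segment = s;
                *block = shared_block[s]++;
                have_work = 1;
                break;
            }
        }
    }
    pthread_mutex_unlock(&lock);
    return have_work;
}

static void *worker(void *arg)
{
    int id = *(int *) arg;
    int seg, blk;

    while (claim_work(&seg, &blk))
    {
        if (blk < 0)
            printf("worker %d: scanning all of segment %d\n", id, seg);
        else
            printf("worker %d: scanning block %d of shared segment %d\n",
                   id, blk, seg);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_WORKERS];
    int ids[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++)
    {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}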

Cheers,

Jeff
