Re: Parallel Seq Scan vs kernel read ahead

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan vs kernel read ahead
Date: 2020-07-21 23:55:17
Message-ID: CAApHDvoDzAzXEp+Ay2CfT3U=ZcD5NLD7K9_Y936bnHjzs5jkHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Kirk,

Thank you for doing some testing on this. It's very useful to get some
samples from other hardware / filesystem / os combinations.

On Tue, 21 Jul 2020 at 21:38, k(dot)jamison(at)fujitsu(dot)com
<k(dot)jamison(at)fujitsu(dot)com> wrote:
> Query Planner I/O Timings (ms):
> | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
> |--------|-------------------|------------------|--------------------|-------------------|
> | 0 | "1,130.78" | "1,250.82" | "1,698.05" | "1,733.44" |

> | Worker | Buffers |
> |--------|--------------------------------------------------|
> | 0 | shared read=442478 dirtied=442478 written=442446 |

I'm thinking the scale of this test might be a bit too small for the
machine you're using to test. When you see "shared read" in the
EXPLAIN (ANALYZE, BUFFERS) output, it does not necessarily mean that
the page had to be read from disk. We use buffered I/O, so the page
could just have been fetched from the kernel's cache.

If we do some maths here on the timing. It took 1130.78 milliseconds
to read 442478 pages, which, assuming the standard page size of 8192
bytes, that's 3457 MB in 1130.78 milliseconds, or 3057 MB/sec. Is
that a realistic throughput for this machine in terms of I/O? Or do
you think that some of these pages might be coming from the Kernel's
cache?

I understand that Amit wrote:

On Fri, 17 Jul 2020 at 21:18, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I think recreating the database and restarting the server after each
> run might help in getting consistent results. Also, you might want to
> take median of three runs.

Please also remember, if you're recreating the database after having
restarted the machine that creating the table will likely end up
caching some of the pages either in shared buffers or the Kernel's
cache. It would be better to leave the database intact and just reboot
the machine. I didn't really like that option with my tests so I just
increased the size of the table beyond any size that my machines could
have cached. With the 16GB RAM Windows laptop, I used a 100GB table
and with the 64GB workstation, I used an 800GB table. I think my test
using SELECT * FROM t WHERE a < 0; with a table that has a padding
column is likely going to be a more accurate test. Providing there is
no rows with a < 0 in the table then the executor will spend almost
all of the time in nodeSeqscan.c trying to find a row with a < 0.
There's no additional overhead of aggregation doing the count(*).
Having the additional padding column means that we read more data per
evaluation of the a < 0 expression. Also, having a single column
table is not that realistic.

I'm pretty keen to see this machine running something closer to the
test I mentioned in [1] but the benchmark query I mentioned in [2]
with the "t" table being at least twice the size of RAM in the
machine. Larger would be better though. With such a scaled test, I
don't think there's much need to reboot the machine in between. Just
run a single query first to warm up the cache before timing anything.
Having the table a few times larger than RAM will mean that we can be
certain that the disk was actually used during the test. The more data
we can be certain came from disk the more we can trust that the
results are meaningful.

Thanks again for testing this.

David

[1] https://www.postgresql.org/message-id/CAApHDvrfJfYH51_WY-iQqPw8yGR4fDoTxAQKqn+Sa7NTKEVWtg@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAApHDvo+LEGKMcavOiPYK8NEbgP-LrXns2TJ1n_XNRJVE9X+Cw@mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-07-22 00:31:12 Re: Which SET TYPE don't actually require a rewrite
Previous Message PASCAL CROZET 2020-07-21 23:49:08 RE: Additional Chapter for Tutorial