Re: Slow count(*) again...

From: Mladen Gogala <mladen(dot)gogala(at)vmsinfo(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: Neil Whelchel <neil(dot)whelchel(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Slow count(*) again...
Date: 2010-10-10 17:14:22
Message-ID: 4CB1F46E.80205@vmsinfo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Craig Ringer wrote:
> On 10/10/2010 9:54 AM, Mladen Gogala wrote:
>
>
>> Unfortunately, the problem is in the rather primitive way PostgreSQL
>> does I/O. It didn't change in 9.0 so there is nothing you could gain by
>> upgrading. If you execute strace -o /tmp/pg.out -e read <PID of the
>> sequential scan process> and inspect the file /tmp/pg.out when the query
>> finishes, you will notice a gazillion of read requests, all of them 8192
>> bytes in size. That means that PostgreSQL is reading the table block by
>> block, without any merging of the requests.
>>
>
> I'd be really interested in any measurements you've done to determine
> the cost of this over doing reads in larger chunks. If they're properly
> detailed and thought out, the -hackers list is likely to be interested
> as well.
>
I can provide measurements, but from Oracle RDBMS. Postgres doesn't
allow tuning of that aspect, so no measurement can be done. Would the
numbers from Oracle RDBMS be acceptable?

> The Linux kernel, at least, does request merging (and splitting, and
> merging, and more splitting) along the request path, and I'd personally
> expect that most of the cost of 8k requests would be in the increased
> number of system calls, buffer copies, etc required. Measurements
> demonstrating or contradicting this would be good to see.
>

Even the cost of hundreds of thousands of context switches is far from
negligible. What kind of measurements do you expect me to do with the
database which doesn't support tweaking of that aspect of its operation?

> It's worth being aware that there are memory costs to doing larger
> reads, especially when you have many backends each of which want to
> allocate a larger buffer for reading.

Oh, it's not only larger memory, the buffer management would have to be
changed too, to prevent process doing a sequential scan from inundating
the shared buffers. Alternatively, the blocks would have to be written
into the private memory and immediately thrown away after that. However,
the experience with Oracle tells me that this is well worth it. Here are
the numbers:

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> show parameter db_file_multi

NAME TYPE VALUE
------------------------------------ -----------
------------------------------
db_file_multiblock_read_count integer 16
SQL> alter session set db_file_multiblock_read_count=1;

Session altered.
SQL> select count(*) from ni_occurrence;

COUNT(*)
----------
402062638

Elapsed: 00:08:20.88
SQL> alter session set db_file_multiblock_read_count=128;

Session altered.

Elapsed: 00:00:00.50
SQL> select count(*) from ni_occurrence;

COUNT(*)
----------
402062638

Elapsed: 00:02:17.58

In other words, when I batched the sequential scan to do 128 blocks I/O,
it was 4 times faster than when I did the single block I/O.
Does that provide enough of an evidence and, if not, why not?

> If you can use a chunk of
> shared_buffers as the direct destination for the read that's OK, but
> otherwise you're up for (1mb-8kb)*num_backends extra memory use on I/O
> buffers that could otherwise be used as shared_buffers or OS cache.
>
> Async I/O, too, has costs.
>

There is a common platitude that says that there is no such thing as
free lunch. However, both Oracle RDBMS and IBM DB2 use asynchronous I/O,
probably because they're unaware of the danger. Let me now give you a
full table scan of a much smaller table located in a Postgres database:

news=> select count(*) from internet_web_sites;
count
---------
1290133
(1 row)

Time: 12838.958 ms

Oracle counts 400 million records in 2 minutes and Postgres 9.01 takes
12.8 seconds to count 1.2 million records? Do you see the disparity?

Both databases, Oracle and Postgres, are utilizing the same 3Par SAN
device, the machines housing both databases are comparable HP 64 bit
Linux machines, both running 64 bit version of Red Hat 5.5. Respective
table sizes are here:

SQL> select bytes/1048576 as MB from user_segments
2 where segment_name='NI_OCCURRENCE';

MB
----------
35329

news=> select pg_size_pretty(pg_table_size('moreover.internet_web_sites'));
pg_size_pretty
----------------
216 MB
(1 row)

So, I really pushed Oracle much harder than I pushed Postgres.

> > PostgreSQL is in
>
>> dire need of something similar and it wouldn't even be that hard to
>> implement.
>>
>
> I'd really like to see both those assertions backed with data or patches ;-)
>

With the database that doesn't allow tuning of that aspect, it's the
self-defeating proposition. However, I did my best to give you the numbers.

> Personally, I know just enough about how PG's I/O path works to suspect
> that "not that hard to implement" is probably a little ...
> over-optimistic. Sure, it's not that hard to implement in a new program
> with no wired-in architectural and design choices; that doesn't mean
> it's easy to retrofit onto existing code, especially a bunch of
> co-operating processes with their own buffer management.
>
>
It maybe so, but slow sequential scan is still the largest single
performance problem of PostgreSQL. The frequency with which that topic
appears on the mailing lists should serve as a good evidence for that. I
did my best to prove my case. Again, requiring "hard numbers" when
using the database which doesn't allow tweaking of the I/O size is self
defeating proposition. The other databases, like DB2 and Oracle both
allow tweaking of that aspect of its operation, Oracle even on the per
session basis. If you still claim that it wouldn't make the difference,
the onus to prove it is on you.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vaibhav Kaushal 2010-10-10 17:21:54 Re: Which file does the SELECT?
Previous Message Tom Lane 2010-10-10 15:59:18 Re: Which file does the SELECT?

Browse pgsql-performance by date

  From Date Subject
Next Message Reid Thompson 2010-10-10 18:33:04 Re: Slow count(*) again...
Previous Message Craig Ringer 2010-10-10 15:30:09 Re: Slow count(*) again...