Re: Slow count(*) again...

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mladen Gogala <mladen(dot)gogala(at)vmsinfo(dot)com>
Cc: "david(at)lang(dot)hm" <david(at)lang(dot)hm>, Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, Vitalii Tymchyshyn <tivv00(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Slow count(*) again...
Date: 2010-10-12 17:07:58
Message-ID: 27107.1286903278@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

So I spent a bit of quality time with oprofile this morning, and found
once again that there's no substitute for having actual data before
theorizing.

Test case software: current Git HEAD (plus one code change explained
below), compiled with --enable-debug to support oprofile, cassert off;
no other special configure options. Running on current Fedora 13 (gcc
4.4.4 in particular). All postgresql.conf options are out-of-the-box.

Test case hardware: recently purchased mid-grade desktop, dual Xeon
E5503 processors (Nehalem cores, 2GHz), 4GB DDR3-800 RAM, no-name
SATA disk.

Test query: "select count(*) from t" where t has 4 nonnull integer
columns and 81920000 rows, occupying 3459MB. I chose that size
specifically to fit into available RAM, so that on repeated executions
no physical I/O will occur.
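
(For anyone wanting to reproduce something comparable, a table of roughly
that shape can be thrown together like so; the exact column names and
values are immaterial, this is just a sketch:

    create table t as
      select i as c1, i as c2, i as c3, i as c4
      from generate_series(1, 81920000) i;
    vacuum analyze t;

The vacuum matters for the all-visible behavior discussed further down.)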

On this setup I find that "select count(*)" runs in about 7.5sec when
the data is fully cached in RAM, for a scanning speed of 460MB/sec.
This is well in excess of what the machine's disk hardware can do:
bonnie++ rates the machine's disk read speed at 152MB/sec. So in theory
PG should be able to completely saturate the disk when processing a table
bigger than RAM. In reality, if I've just flushed the cache, the test case
runs in about 28sec, which works out to a scan rate of 123MB/sec. I expect
if I'd bothered to tune the kernel readahead parameters as outlined
earlier in this thread, I could get to 150MB/sec.

Now of course this disk setup is far from industrial strength, but the
processor isn't what you'd put in a serious database server either (in
particular, its available memory bandwidth is well behind the curve).
Also, the table is pretty narrow (only 16 payload bytes per row); since the
CPU cost here is mostly per-row rather than per-byte, a wider test table
should show the achievable scan rate scaling pretty much linearly with
table width. So I don't see much support here at all
for the notion that we scan slower than available disk bandwidth.

Further details from poking at it with oprofile: in the fully-cached
case the CPU time is about 80% Postgres and 20% kernel. That kernel
time is of course all to do with moving pages from kernel disk buffers
into Postgres shared memory. Although I've not bothered to increase
shared_buffers from the default 32MB, it wouldn't matter on this benchmark
unless I were able to make shared_buffers hold the entire table ... and
even then I'd only save 20%.
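
For the record, the relevant sizes are easy enough to check from SQL (purely
illustrative; nothing above depends on it):

    show shared_buffers;                           -- 32MB out of the box
    select pg_size_pretty(pg_relation_size('t'));  -- about 3459 MB here

so shared_buffers would have to grow by roughly two orders of magnitude
before the kernel-to-shared-buffers copying could go away entirely.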

oprofile further shows that (with stock Postgres sources) the userspace
runtime breaks down like this:

samples % symbol name
141267 13.0810 heapgettup_pagemode
85947 7.9585 advance_aggregates
83031 7.6885 ExecProject
78975 7.3129 advance_transition_function
75060 6.9504 heapgetpage
73540 6.8096 ExecClearTuple
69355 6.4221 ExecProcNode
59288 5.4899 heap_getnext
57745 5.3470 ExecScan
55618 5.1501 HeapTupleSatisfiesMVCC
47057 4.3574 MemoryContextReset
41904 3.8802 ExecStoreTuple
37146 3.4396 SeqNext
32206 2.9822 ExecAgg
22135 2.0496 XidInMVCCSnapshot
21142 1.9577 int8inc
19280 1.7853 AllocSetReset
18211 1.6863 hash_search_with_hash_value
16285 1.5079 TransactionIdPrecedes

I also looked at the source-line-level breakdown, though that's too bulky
to post here. The most interesting fact here is that tuple visibility
testing (MVCC) overhead is simply nonexistent: it'd be in heapgetpage()
if it were being done, which it isn't because all the pages of the table
have the PageIsAllVisible bit set. In a previous run where those bits
weren't set but the per-tuple hint bits were, visibility testing still
only ate a percent or two of the runtime. So the theory some people have
espoused in this thread that visibility testing is the bottleneck doesn't
hold water either. If you go back and look at previous pgsql-hackers
discussions about that, what people have been worried about is not the CPU
cost of visibility testing but the need for indexscan queries to visit
the heap for no other purpose than to check the visibility flags. In a
seqscan it's not going to matter.
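
(For anyone trying to reproduce the all-visible case: a plain VACUUM of an
otherwise-idle table is what sets those page bits, for instance

    vacuum t;                  -- marks each page PageIsAllVisible
    select count(*) from t;    -- seqscan can then skip per-tuple checks

while merely having seqscanned the table once beforehand gets you the
hint-bits-only case mentioned above.)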

I looked a bit more closely at the heapgettup_pagemode timing. The
lines shown by opannotate as more than 0.1 percent of the runtime are

22545 2.2074 :{ /* heapgettup_pagemode total: 153737 15.0528 */
5685 0.5566 : bool backward = ScanDirectionIsBackward(dir);
5789 0.5668 : if (!scan->rs_inited)
5693 0.5574 : lineindex = scan->rs_cindex + 1;
11429 1.1190 : dp = (Page) BufferGetPage(scan->rs_cbuf);
5693 0.5574 : linesleft = lines - lineindex;
5766 0.5646 : while (linesleft > 0)
5129 0.5022 : lineoff = scan->rs_vistuples[lineindex];
44461 4.3533 : tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
11135 1.0903 : tuple->t_len = ItemIdGetLength(lpp);
5692 0.5573 : if (key != NULL)
5773 0.5653 : HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
5674 0.5556 : scan->rs_cindex = lineindex;
11406 1.1168 :}

There doesn't seem to be a whole lot of room for improvement there.
Maybe we could shave a couple percent with some tenser coding (I'm
wondering why HeapKeyTest is being reached, in particular, when there's
no WHERE clause). But any local changes here will be marginal at best.

One thing I did find is that the time spent in ExecProject/ExecClearTuple,
amounting to nearly 15% of the runtime, is just for evaluating the
arguments of the aggregate ... and count(*) hasn't got any arguments.
So a patch like this improves the run speed by about 15%:

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a7dafeb..051e70c 100644
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** advance_aggregates(AggState *aggstate, A
*** 480,486 ****
          TupleTableSlot *slot;

          /* Evaluate the current input expressions for this aggregate */
!         slot = ExecProject(peraggstate->evalproj, NULL);

          if (peraggstate->numSortCols > 0)
          {
--- 480,489 ----
          TupleTableSlot *slot;

          /* Evaluate the current input expressions for this aggregate */
!         if (peraggstate->evalproj)
!             slot = ExecProject(peraggstate->evalproj, NULL);
!         else
!             slot = peraggstate->evalslot;

          if (peraggstate->numSortCols > 0)
          {
*************** ExecInitAgg(Agg *node, EState *estate, i
*** 1728,1738 ****
          peraggstate->evalslot = ExecInitExtraTupleSlot(estate);
          ExecSetSlotDescriptor(peraggstate->evalslot, peraggstate->evaldesc);

!         /* Set up projection info for evaluation */
!         peraggstate->evalproj = ExecBuildProjectionInfo(aggrefstate->args,
!                                                         aggstate->tmpcontext,
!                                                         peraggstate->evalslot,
!                                                         NULL);

          /*
           * If we're doing either DISTINCT or ORDER BY, then we have a list of
--- 1731,1744 ----
          peraggstate->evalslot = ExecInitExtraTupleSlot(estate);
          ExecSetSlotDescriptor(peraggstate->evalslot, peraggstate->evaldesc);

!         /* Set up projection info for evaluation, if agg has any args */
!         if (aggrefstate->args)
!             peraggstate->evalproj = ExecBuildProjectionInfo(aggrefstate->args,
!                                                             aggstate->tmpcontext,
!                                                             peraggstate->evalslot,
!                                                             NULL);
!         else
!             peraggstate->evalproj = NULL;

          /*
           * If we're doing either DISTINCT or ORDER BY, then we have a list of

bringing the oprofile results to

samples % symbol name
181660 17.9017 heapgettup_pagemode
138049 13.6040 advance_transition_function
102865 10.1368 advance_aggregates
80948 7.9770 ExecProcNode
79943 7.8780 heap_getnext
73384 7.2316 ExecScan
60607 5.9725 MemoryContextReset
53889 5.3105 ExecStoreTuple
46666 4.5987 SeqNext
40535 3.9945 ExecAgg
33481 3.2994 int8inc
32202 3.1733 heapgetpage
26068 2.5689 AllocSetReset
18493 1.8224 hash_search_with_hash_value
8679 0.8553 LWLockAcquire
6615 0.6519 ExecSeqScan
6583 0.6487 LWLockRelease
3928 0.3871 hash_any
3715 0.3661 ReadBuffer_common

(note that this, not the stock code, is what corresponds to the 7.5sec
runtime I quoted above --- it's about 8.5sec without that change).

At this point what we've got is 25% of the runtime in nodeAgg.c overhead,
and it's difficult to see how to get any real improvement without tackling
that. Rather than apply the patch shown above, I'm tempted to think about
hard-wiring COUNT(*) as a special case in nodeAgg.c such that we don't go
through advance_aggregates/advance_transition_function at all, but just
increment a counter directly. However, that would very clearly be
optimizing COUNT(*) and nothing else. Given the opinions expressed
elsewhere in this thread that heavy reliance on COUNT(*) represents
bad application design, I'm not sure that such a patch would meet with
general approval.

Actually the patch shown above is optimizing COUNT(*) and nothing else,
too, since it's hard to conceive of any other zero-argument aggregate.

Anyway, if anyone is hot to make COUNT(*) faster, that's where to look.
I don't think any of the previous discussion in this thread is on-point
at all, except for the parts where people suggested avoiding it.
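
(For completeness, the "avoiding it" that people usually mean is settling
for the planner's estimate when an exact answer isn't required, along the
lines of

    select reltuples::bigint as approx_rowcount
    from pg_class where relname = 't';

which is kept up by VACUUM/ANALYZE and costs essentially nothing, but is of
course only approximate.)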

regards, tom lane
