Re: gprof SELECT COUNT(*) results

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 10:23:54
Message-ID: 1132914234.4347.169.camel@localhost.localdomain
Lists: pgsql-hackers

On Thu, 2005-11-24 at 23:48 -0500, Tom Lane wrote:

> > What's more, we can see that for each row, a LWLock pair is invoked. So on
> > a more aggressive thought, can we change it to page level?
>
> Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
> stuff so that we do the visibility tests for all the tuples on a given
> page at once, taking the buffer content lock just once, and saving aside
> the valid tuple IDs to return later. This should definitely be faster
> when all the tuples actually get fetched.

And me also. Looks good from here. It would be a very localised
optimization, so fairly easy to implement.
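To illustrate the shape of it, here is a minimal sketch of the page-at-a-time idea (hypothetical names and a simplified page struct, not the actual heapam code): take the content lock once, test visibility for every tuple on the page, and save the visible offsets aside for later fetches, so no further lock round trips are needed for that page.

```c
/* Sketch only: models one lock acquisition per page instead of one per
 * tuple.  Page, visible[] and the lock functions are illustrative
 * stand-ins for the real buffer page, visibility checks and LWLocks. */
#include <stdbool.h>
#include <stddef.h>

#define TUPLES_PER_PAGE 64

typedef struct Page
{
    bool visible[TUPLES_PER_PAGE];  /* stand-in for tuple visibility */
    int  ntuples;
} Page;

static int lock_acquisitions = 0;   /* counts lock round trips */

static void lock_page(Page *p)   { (void) p; lock_acquisitions++; }
static void unlock_page(Page *p) { (void) p; }

/* Scan the whole page under a single lock, saving visible offsets
 * into offsets[]; returns how many were found. */
static int
collect_visible(Page *p, int offsets[TUPLES_PER_PAGE])
{
    int nvis = 0;

    lock_page(p);
    for (int off = 0; off < p->ntuples; off++)
        if (p->visible[off])
            offsets[nvis++] = off;
    unlock_page(p);

    return nvis;
}
```

The caller then walks the saved offsets at its leisure; the lock counter makes the saving explicit: one acquisition per page, however many tuples it holds.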

Even on your lower gprof numbers this would be 15% faster stand-alone.
Take into account the significant reduction in LWLock requests and it
might go much faster still on a busy system. The gain for other users
would be very good also.

> It might be a bit slower for
> a LIMIT query, but I'm not sure if we care that much.

Agreed. It would be fairly easy to make this happen only for full
SeqScans, and the great majority of those don't have a LIMIT of less
than one block.

> The only other
> objection I can think of is that if there are any broken tuples on a
> page, this approach would likely make it impossible to fetch any of the
> non-broken ones :-(

I was thinking of the brute force approach: take a complete copy of the
block when we read the first tuple off it. That way any weirdness
on-block is avoided until we logically attempt to read that tuple. It
also allows us to palloc the right amount of space the first time.
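A rough sketch of what I mean, with a simplified layout (BLCKSZ here is just the usual 8K default, and the struct is a stand-in, not the real page header): under the one lock, memcpy the whole block into backend-local memory, and every later tuple read comes from the copy, so neither the lock nor the buffer pin needs to be held afterwards.

```c
/* Sketch only: copy the shared page wholesale into local memory so
 * subsequent reads never touch the shared buffer again.  In the real
 * backend the allocation would be palloc, not malloc. */
#include <string.h>
#include <stdlib.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

typedef struct LocalPage
{
    char data[BLCKSZ];
} LocalPage;

/* Copy a shared page into a freshly allocated local buffer; caller
 * frees it when logically finished with the block. */
static LocalPage *
copy_block(const char shared_page[BLCKSZ])
{
    LocalPage *copy = malloc(sizeof(LocalPage));

    if (copy != NULL)
        memcpy(copy->data, shared_page, BLCKSZ);
    return copy;
}
```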

This could be an interesting win: no other DBMS can take this approach.
I think it has further advantages also:

1. Copying the tuples locally will also mean that the data will probably
be in the L2 cache rather than main memory, so this could make things
faster still.

2. For large scans, the buffer for that block doesn't need to be pinned
after the first touch, so the buffer could be cleared and replaced with
another before we have logically finished reading the block. That will
improve fluidity in shared_buffers and improve efficiency.

Are you implementing this, or will you be? If not, I'll implement it -
it seems an important BI optimization.

Best Regards, Simon Riggs
