Re: [HACKERS] Slow count(*) again...

From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: david(at)lang(dot)hm, Vitalii Tymchyshyn <tivv00(at)gmail(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Mladen Gogala <mladen(dot)gogala(at)vmsinfo(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [HACKERS] Slow count(*) again...
Date: 2011-02-04 14:52:20
Message-ID: 20110204145220.GC1261@aart.is.rice.edu
Lists: pgsql-hackers pgsql-performance

On Thu, Feb 03, 2011 at 09:05:52PM -0500, Robert Haas wrote:
> On Thu, Feb 3, 2011 at 8:37 PM, <david(at)lang(dot)hm> wrote:
> > On Thu, 3 Feb 2011, Robert Haas wrote:
> >
> >> On Thu, Feb 3, 2011 at 7:39 PM, <david(at)lang(dot)hm> wrote:
> >>>>
> >>>> Yeah, but you'll be passing the entire table through this separate
> >>>> process that may only need to see 1% of it or less on a large table.
> >>>> If you want to write the code and prove it's better than what we have
> >>>> now, or some other approach that someone else may implement in the
> >>>> meantime, hey, this is an open source project, and I like improvements
> >>>> as much as the next guy. But my prediction for what it's worth is
> >>>> that the results will suck. :-)
> >>>
> >>> I will point out that 1% of a very large table can still be a lot of disk
> >>> I/O that is avoided (especially if it's random I/O that's avoided)
> >>
> >> Sure, but I think that trying to avoid it will be costly in other ways
> >> - you'll be streaming a huge volume of data through some auxiliary
> >> process, which will have to apply some algorithm that's very different
> >> from the one we use today. The reality is that I think there's little
> >> evidence that the way we do ANALYZE now is too expensive. It's
> >> typically very cheap and works very well. It's a bit annoying when it
> >> fires off in the middle of a giant data load, so we might need to
> >> change the time of it a little, but if there's a problem with the
> >> operation itself being too costly, this is the first I'm hearing of
> >> it. We've actually worked *really* hard to make it cheap.
> >
> > I could be misunderstanding things here, but my understanding is that it's
> > 'cheap' in that it has little impact on the database while it is running.
>
> I mean that it's cheap in that it usually takes very little time to complete.
>
> > the issue here is that the workflow is
> >
> > load data
> > analyze
> > start work
> >
> > so the cost of analyze in this workflow is not "1% impact on query speed for
> > the next X time", it's "the database can't be used for the next X time while
> > we wait for analyze to finish running"
>
> OK.
>
> > I don't understand why the algorithm would have to be so different from
> > what's done today; surely the analyze thread could easily be tweaked to
> > ignore the rest of the data (assuming we don't have the thread sending the
> > data to analyze do the filtering)
>
> If you want to randomly pick 10,000 rows out of all the rows that are
> going to be inserted in the table without knowing in advance how many
> there will be, how do you do that? Maybe there's an algorithm, but
> it's not obvious to me. But mostly, I question how expensive it is to
> have a second process looking at the entire table contents vs. going
> back and rereading a sample of rows at the end. I can't remember
> anyone ever complaining "ANALYZE took too long to run". I only
> remember complaints of the form "I had to remember to manually run it
> and I wish it had just happened by itself".
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
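
For what it's worth, picking 10,000 rows uniformly at random from a
stream whose length you don't know in advance is a solved problem:
reservoir sampling, Vitter's "Algorithm R". A minimal standalone sketch
in C (not ANALYZE code; all names and constants are invented for
illustration):

/*
 * Reservoir sampling ("Algorithm R"): keep SAMPLE_SIZE rows chosen
 * uniformly at random from a stream of unknown length, in one pass
 * and O(SAMPLE_SIZE) memory.
 */
#include <stdio.h>
#include <stdlib.h>

#define SAMPLE_SIZE 4		/* an ANALYZE-style run would use ~10000 */

int
main(void)
{
	long	reservoir[SAMPLE_SIZE];
	long	seen = 0;		/* rows seen so far; final count unknown */

	srand(42);

	/* stand-in for the rows streaming past during a load */
	for (long row = 0; row < 100000; row++)
	{
		if (seen < SAMPLE_SIZE)
			reservoir[seen] = row;	/* fill the reservoir first */
		else
		{
			/*
			 * Row number seen+1 survives with probability
			 * SAMPLE_SIZE/(seen+1), evicting a random entry.
			 * (rand() % n has a tiny modulo bias; fine here.)
			 */
			long	j = rand() % (seen + 1);

			if (j < SAMPLE_SIZE)
				reservoir[j] = row;
		}
		seen++;
	}

	for (int i = 0; i < SAMPLE_SIZE; i++)
		printf("sampled row %ld\n", reservoir[i]);
	return 0;
}

Each row is looked at exactly once on its way in, memory stays fixed,
and the sample is uniform regardless of the final row count, so nothing
needs to be reread at the end of the load.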

Probably doomed to be shot down, but since you are effectively inline
with the load, you could sample assuming a range of possible table row
counts. Start at the table size where random (index) lookups become
faster than a sequential scan, and then repeat at appropriate multiples
(100x, 100*100x, ...); whichever assumed size turns out to bracket the
actual row count should yield appropriate statistics. I have not
actually worked through the details, but it would certainly let you
process far, far fewer rows than the entire table.
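
To make that concrete, here is a rough standalone sketch in C; every
name, size, and rate below is invented for illustration, and a real
version would stash the sampled rows somewhere rather than just count
them:

/*
 * Run one cheap Bernoulli sampler per assumed table size
 * (BASE, BASE*100, BASE*100*100, ...). When the load finishes,
 * only the tier whose assumption brackets the actual row count
 * is kept.
 */
#include <stdio.h>
#include <stdlib.h>

#define NTIERS	3		/* how many size assumptions to cover */
#define BASE	10000L		/* hypothetical seqscan/index crossover size */
#define TARGET	100		/* desired sample size per tier */

int
main(void)
{
	long	assumed[NTIERS];	/* assumed table row counts */
	double	rate[NTIERS];		/* per-row sampling probability */
	long	kept[NTIERS] = {0};
	long	nrows = 0;

	srand(42);

	for (int t = 0; t < NTIERS; t++)
	{
		assumed[t] = BASE;
		for (int i = 0; i < t; i++)
			assumed[t] *= 100;
		/* aim for ~TARGET sampled rows if the assumption is right */
		rate[t] = (double) TARGET / (double) assumed[t];
	}

	/* stand-in for the rows streaming past during the load */
	for (long row = 0; row < 1000000; row++, nrows++)
	{
		for (int t = 0; t < NTIERS; t++)
			if ((double) rand() / RAND_MAX < rate[t])
				kept[t]++;	/* a real sampler would keep the row */
	}

	for (int t = 0; t < NTIERS; t++)
		printf("tier %d (assumed ~%ld rows): kept %ld of %ld\n",
		       t, assumed[t], kept[t], nrows);
	return 0;
}

Tiers whose size assumption turns out to be wrong are simply discarded
when the load finishes, and since each row is only glanced at as it
streams past, none of this requires a second pass over the table.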

Regards,
Ken
