From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: david(at)lang(dot)hm, Vitalii Tymchyshyn <tivv00(at)gmail(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Mladen Gogala <mladen(dot)gogala(at)vmsinfo(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [HACKERS] Slow count(*) again...
Date: 2011-02-04 14:52:20
Message-ID: 20110204145220.GC1261@aart.is.rice.edu
Lists: pgsql-hackers pgsql-performance
On Thu, Feb 03, 2011 at 09:05:52PM -0500, Robert Haas wrote:
> On Thu, Feb 3, 2011 at 8:37 PM, <david(at)lang(dot)hm> wrote:
> > On Thu, 3 Feb 2011, Robert Haas wrote:
> >
> >> On Thu, Feb 3, 2011 at 7:39 PM, <david(at)lang(dot)hm> wrote:
> >>>>
> >>>> Yeah, but you'll be passing the entire table through this separate
> >>>> process that may only need to see 1% of it or less on a large table.
> >>>> If you want to write the code and prove it's better than what we have
> >>>> now, or some other approach that someone else may implement in the
> >>>> meantime, hey, this is an open source project, and I like improvements
> >>>> as much as the next guy. But my prediction for what it's worth is
> >>>> that the results will suck. :-)
> >>>
> >>> I will point out that 1% of a very large table can still be a lot of disk
> >>> I/O that is avoided (especially if it's random I/O that's avoided)
> >>
> >> Sure, but I think that trying to avoid it will be costly in other ways
> >> - you'll be streaming a huge volume of data through some auxiliary
> >> process, which will have to apply some algorithm that's very different
> >> from the one we use today. The reality is that I think there's little
> >> evidence that the way we do ANALYZE now is too expensive. It's
> >> typically very cheap and works very well. It's a bit annoying when it
> >> fires off in the middle of a giant data load, so we might need to
> >> change the timing of it a little, but if there's a problem with the
> >> operation itself being too costly, this is the first I'm hearing of
> >> it. We've actually worked *really* hard to make it cheap.
> >
> > I could be misunderstanding things here, but my understanding is that it's
> > 'cheap' in that it has little impact on the database while it is running.
>
> I mean that it's cheap in that it usually takes very little time to complete.
>
> > the issue here is that the workflow is
> >
> > load data
> > analyze
> > start work
> >
> > so the cost of analyze in this workflow is not "1% impact on query speed for
> > the next X time", it's "the database can't be used for the next X time while
> > we wait for analyze to finish running"
>
> OK.
>
> > I don't understand why the algorithm would have to be so different from
> > what's done today; surely the analyze thread could easily be tweaked to
> > ignore the rest of the data (assuming we don't have the thread sending the
> > data to analyze do the filtering)
>
> If you want to randomly pick 10,000 rows out of all the rows that are
> going to be inserted in the table without knowing in advance how many
> there will be, how do you do that? Maybe there's an algorithm, but
> it's not obvious to me. But mostly, I question how expensive it is to
> have a second process looking at the entire table contents vs. going
> back and rereading a sample of rows at the end. I can't remember
> anyone ever complaining "ANALYZE took too long to run". I only
> remember complaints of the form "I had to remember to manually run it
> and I wish it had just happened by itself".
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
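For what it's worth, there is a standard one-pass answer to the "pick
10,000 random rows without knowing the row count in advance" question:
reservoir sampling (Vitter's Algorithm R). I believe ANALYZE itself
already uses a Vitter-style reservoir sampler when it rereads the table.
A minimal sketch, with integers on stdin standing in for rows and the
sample size shrunk for illustration (not PostgreSQL code):

/*
 * reservoir.c -- a minimal sketch of reservoir sampling (Algorithm R).
 * Integers on stdin stand in for rows; SAMPLE_SIZE is tiny here just
 * for illustration.  The modulo bias in random() is ignored for brevity.
 */
#include <stdio.h>
#include <stdlib.h>

#define SAMPLE_SIZE 5

int
main(void)
{
    long reservoir[SAMPLE_SIZE];
    long nseen = 0;
    long value;

    srandom(42);

    while (scanf("%ld", &value) == 1)
    {
        if (nseen < SAMPLE_SIZE)
            reservoir[nseen] = value;       /* fill the reservoir first */
        else
        {
            /* keep the new row with probability SAMPLE_SIZE/(nseen+1) */
            long j = random() % (nseen + 1);

            if (j < SAMPLE_SIZE)
                reservoir[j] = value;
        }
        nseen++;
    }

    /* every row seen had an equal chance of ending up in the sample */
    for (long i = 0; i < SAMPLE_SIZE && i < nseen; i++)
        printf("%ld\n", reservoir[i]);
    return 0;
}

That answers the "is there an algorithm" part; each row costs one random
draw and occasionally a copy, so the open question is the one you raise,
namely whether even a cheap pass over every row beats rereading a sample
at the end.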
Probably doomed to be shot down, but since you are effectively inline,
you could sample assuming a range of table row counts. Start at the
table size where random (index) lookups are faster than a sequential
scan, and then sample at appropriate multiples of that size (100x,
100*100x, ...); from those samples you should be able to generate
appropriate statistics. I have not actually looked at how that would
happen, but it would certainly allow you to process far, far fewer rows
than the entire table.
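A rough sketch of that multi-scale idea, under assumptions of my own
(three scales, a tiny per-scale capacity, Bernoulli sampling, integers
on stdin standing in for rows; again, not PostgreSQL code):

/*
 * multiscale.c -- a rough sketch of sampling at several assumed table
 * sizes at once.  Each scale keeps a Bernoulli sample at its own rate;
 * when the load finishes and the true row count is known, keep the
 * sample whose rate fits and throw the rest away.  Capping at CAPACITY
 * biases toward early rows; a real version would resample on overflow.
 */
#include <stdio.h>
#include <stdlib.h>

#define NSCALES  3
#define CAPACITY 4                      /* tiny, for illustration */

static const double rates[NSCALES] = { 1.0, 1.0 / 100, 1.0 / 10000 };
static long samples[NSCALES][CAPACITY];
static int  nkept[NSCALES];

int
main(void)
{
    long nrows = 0;
    long value;

    srand48(42);

    while (scanf("%ld", &value) == 1)
    {
        nrows++;
        for (int i = 0; i < NSCALES; i++)
            if (drand48() < rates[i] && nkept[i] < CAPACITY)
                samples[i][nkept[i]++] = value;
    }

    /* pick the coarsest scale whose expected sample still fills up */
    int best = 0;
    for (int i = NSCALES - 1; i >= 0; i--)
        if (rates[i] * nrows >= CAPACITY)
        {
            best = i;
            break;
        }

    printf("rows=%ld scale=%d rate=%g kept=%d\n",
           nrows, best, rates[best], nkept[best]);
    return 0;
}

Either way each incoming row costs only a random draw; the expensive
statistics work happens only on the small kept sample.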
Regards,
Ken