Re: [HACKERS] Slow count(*) again...

From: david(at)lang(dot)hm
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Vitalii Tymchyshyn <tivv00(at)gmail(dot)com>, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Mladen Gogala <mladen(dot)gogala(at)vmsinfo(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [HACKERS] Slow count(*) again...
Date: 2011-02-03 20:54:02
Message-ID: alpine.DEB.2.00.1102031243190.10088@asgard.lang.hm
Lists: pgsql-hackers, pgsql-performance
On Thu, 3 Feb 2011, Robert Haas wrote:

> On Thu, Feb 3, 2011 at 5:11 AM,  <david(at)lang(dot)hm> wrote:
>> If I am understanding things correctly, a full Analyze is going over all the
>> data in the table to figure out patterns.
>
> No.  It's going over a small, fixed-size sample which depends on
> default_statistics_target but NOT on the table size.  It's really
> important to come up with a solution that's not susceptible to running
> ANALYZE over and over again, in many cases unnecessarily.
>
>> If this is the case, wouldn't it make sense in the situation where you are
>> loading an entire table from scratch to run the Analyze as you are
>> processing the data? If you don't want to slow down the main thread that's
>> inserting the data, you could copy the data to a second thread and do the
>> analysis while it's still in RAM rather than having to read it off of disk
>> afterwords.
>
> Well that's basically what autoanalyze is going to do anyway, if the
> table is small enough to fit in shared_buffers.  And it's actually
> usually BAD if it starts running while you're doing a large bulk load,
> because it competes for I/O bandwidth and the buffer cache and slows
> things down.  Especially when you're bulk loading for a long time and
> it tries to run over and over.  I'd really like to suppress all those
> asynchronous ANALYZE operations and instead do ONE synchronous one at
> the end, when we try to use the data.

If the table is too large to fit in RAM, then the Analyze will compete
for I/O, and the user will have to wait.

What I'm proposing is that as the records are created, the process doing
the creation makes copies of the records (either all of them, or only
some if not all are needed for the analysis, possibly by sharing memory
with the analysis process). This would be synchronous with the load, not
asynchronous.

This would take zero I/O bandwidth. It would take up some RAM, memory
bandwidth, and CPU time, but a load of a large table like this is I/O
constrained.

It would not make sense for this to be the default, but as an option it
should save a significant amount of time.

I am making the assumption that an Analyze run only has to go over the
data once (a sequential scan of the table if it's >> RAM, for example) and
gathers stats as it goes.

With the current code, this is a completely separate process that knows
nothing about the load. If you kick it off when you start the load, it
makes a pass over the table (competing for I/O) and finishes; you continue
to update the table, so it makes another pass, and so on. As you say, this
is a bad thing to do. I am saying to have an option that ties the two
together, essentially making the data feed into the Analyze run a fork of
the data coming out of the insert run on its way to disk. The Analyze run
then doesn't do any I/O and isn't going to complete until the insert is
complete, at which time it will have seen one copy of the entire table.
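
To make the idea concrete, here is a minimal sketch in Python, not PostgreSQL code. It assumes the statistics only need a fixed-size sample (as Robert describes above) and uses reservoir sampling (Algorithm R) as a stand-in for whatever the analysis would really do. The names ReservoirSampler, load_with_sampling, and write_row are hypothetical and exist only for illustration.

```python
import random
from typing import Any, Callable, Iterable, List, Optional


class ReservoirSampler:
    """Keep a uniform random sample of fixed size from a stream (Algorithm R)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self.capacity = capacity
        self.seen = 0                  # rows offered so far
        self.sample: List[Any] = []    # current sample, at most `capacity` rows
        self._rng = random.Random(seed)

    def offer(self, row: Any) -> None:
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(row)
            return
        # Keep the new row with probability capacity / seen,
        # evicting a uniformly chosen existing entry.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.sample[slot] = row


def load_with_sampling(
    rows: Iterable[Any],
    write_row: Callable[[Any], None],  # hypothetical stand-in for the disk write
    sampler: ReservoirSampler,
) -> int:
    """Write every row and hand the same in-memory row to the sampler."""
    for row in rows:
        write_row(row)      # the I/O-bound part of the load
        sampler.offer(row)  # no extra I/O: the row is still in memory
    return sampler.seen


# Usage: a list stands in for the table on disk.
table: List[int] = []
sampler = ReservoirSampler(capacity=100, seed=0)
loaded = load_with_sampling(range(10_000), table.append, sampler)
```

When the load finishes, the sampler has seen every row exactly once and holds a fixed-size sample without the table ever being read back from disk. The cost is the extra memory and CPU time per row mentioned above.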

David Lang
