Re: count(*) performance improvement ideas

From: "Stephen Denne" <Stephen(dot)Denne(at)datamail(dot)co(dot)nz>
To: "Mark Mielke" <mark(at)mark(dot)mielke(dot)cc>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: count(*) performance improvement ideas
Date: 2008-03-19 21:54:31
Message-ID: F0238EBA67824444BC1CB4700960CB4804EAC8F5@dmpeints002.isotach.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Mark Mielke wrote
> This returns to the question of whether count of the whole table is useful, or whether
> count of a GROUP BY or WHERE is useful.
> If GROUP BY or WHERE is useful, then trigger on UPDATE becomes necessary.

True... for the example I gave I should have had an update trigger on my table.
I neglected it as I based the example of a more complex multi-table example from my application, where I know that the value I'm grouping by doesn't change.

> What is the direction here? Is it count of the whole table only? (<-- not interesting to me)
> Or count of more practical real life examples, which I completely agree with Greg,
> that this gets into the materialized view realm, and becomes very interesting.
> In my current db project, I never count all of the rows in a table. However, I do use count(*) with GROUP BY and WHERE.

I'm trying to figure out how close I can come to a useful efficient materialized view with current production builds of postgresql,
and identifying areas where changes to postgresql could make it easier.

Currently I can see three areas of concern:

1) Turning it on
My query to initially populate the materialized view table takes 6 hours on my (limited hardware) development system, which highlights the problem of when do you turn on the triggers.
An outage is one way to guarantee that there are neither missing details nor double counted details.
Would turning on the triggers and then running my initial population query in the same transaction work?

2) Table bloat
I'm trying to group together a large number of +1 or -1 deltas into a single delta per transaction.
This creates as many dead rows as there are updates.
This is also a problem with Pavan's suggestion of maintaining a counter table.
They can all be HOT updates in 8.3, but they still all create new row versions.

Tom says "Collapsing a transaction's changes
into a single row would require keeping some intra-transaction state,
which is do-able at the C-code level but not from plpgsql."

So can I hope that this problem is solvable as a contrib module that will work with at least 8.3?
I'd still want to write plpgsql trigger function myself, as I know the aggregation rules, and call contributed functions to integrate with the collapsing of the transaction's changes into a single row. (Expect I'd need a small number of rows per transaction, as I need to create delta rows for each value of my grouping field involved in the transaction.

3) How to incorporate the deltas.
With my technique, if the transaction rolls back the delta record becomes a dead row, if it commits, the delta is then visible to transaction started after this time.
I need to have a regular process run to sum and remove the deltas, rewriting the summary rows.
I'd like to be able to write an after-commit trigger that fires after changes in dependent tables that I identify, which can add the transaction's deltas to the summary table. I would want it to effectively be a new, small transaction.

I think that if these three areas are addressed, then before even considering writing code to automatically convert any given view into a materialized view.

2) and 3) could perhaps be implemented with a per transaction map from my_custom_key to a prepared statement and a list of parameter values.
Provide access to the values, making them updateable within the transaction. Have the statement automatically executed on commit.

Pavan also refers to deferred triggers, which has got me thinking about another possible solution:

Instead of inserting a delta row, that will be updated a lot of times, create an on commit drop temp table named after the txid and the grouping value (which is an integer in my case, perhaps hash it if you're grouping by something that doesn't easily convert to part of a table name),
create an after insert initially deferred constraint to call a function which will move the 'at commit' values of the rows in the temp table to the summary table.
The temp table will only have one row inserted, updated many times, then on commit the trigger is fired once, and the temp table is dropped.

Does anyone think this will or won't work for some reason?

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer focus, and courage. This email with any attachments is confidential and may be subject to legal privilege. If it is not intended for you please advise by reply immediately, destroy it and do not copy, disclose or use it in any way.
__________________________________________________________________
This email has been scanned by the DMZGlobal Business Quality
Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__________________________________________________________________

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kurt Roeckx 2008-03-19 21:55:34 Re: Re: [COMMITTERS] pgsql: Don't need -Wno-error anymore, because flex is no longer
Previous Message patrick 2008-03-19 20:11:22 Re: tsearch2 in postgresql 8.3.1 - invalid byte sequence for encoding "UTF8": 0xc3