Re: How to check a table content efficiently? With LIMIT and OFFSET?

From: Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
To: Stefan Keller <sfkeller(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, pgsql-general List <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to check a table content efficiently? With LIMIT and OFFSET?
Date: 2011-05-30 06:45:50
Message-ID: DB49D134-7296-48D2-9D34-99726A2DCCBF@solfertje.student.utwente.nl
Lists: pgsql-general

On 29 May 2011, at 23:55, Stefan Keller wrote:

> Hi Alban
>
> On 2011/5/29 Alban Hertroys wrote:
>> On 29 May 2011, at 19:45, Stefan Keller wrote:
>>
>>> But I'm hesitating to use ANALYZE for two reasons:
>>> 1. It's very slow: it repeatedly takes 59000 ms on my machine.
>>
>> ANALYZE on a single table takes 59s?!? That's _really_ long. How big is that table (you did mention it has about 180k rows, but that's not much at all) and how many indexes are on it? Are you sure you're not overburdening your hardware in some way?
>>
>> Or are you in fact talking about a different command? For example, ANALYZE (without specifying a table) or VACUUM ANALYZE <table>?
>
> You are right: I used ANALYZE (without specifying a table). But this
> still takes about 1 to 3 sec which is about 100 times slower than
>
> SELECT (count(*) = 1) FROM (SELECT osm_id FROM planet_osm_point LIMIT
> 1 OFFSET 100000) tmp;
> or

Still 1 to 3 seconds? It should be faster than a full table-scan, as it only takes samples across the table. Perhaps you have many indexes, or some of an uncommon type? Or is your system I/O-bound, perhaps? ANALYZE does need to write its statistics to a system table at some point, at which an I/O-bound system won't perform very well, of course.

> SELECT reltuples FROM pg_class WHERE relname = 'planet_osm_point';

That will only work if you want to know the total number of rows in the table. If you need to know how many rows match a specific WHERE-clause, this falls on its behind. For cases like that you can use the output of EXPLAIN <query>, but that won't be very accurate, since it relies on statistical information about the value distribution and such of your data (which is gathered by ANALYZE).
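For example, a sketch of reading the planner's estimate (the amenity column and its value are made up here; substitute a column your table actually has):

EXPLAIN SELECT * FROM planet_osm_point WHERE amenity = 'restaurant';
-- the "rows=..." figure in the plan's top line is the planner's
-- estimate for that WHERE-clause, derived from ANALYZE statistics

It's only as good as the statistics are fresh, but it never touches the table data itself.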

If you really only need the total number of records and you're indeed inserting/deleting in batches, then it's probably best to create statement-level INSERT/DELETE triggers (both could call the same function). I've done this in the past using a row-level trigger, but my data came in live. For batches of data it's probably more efficient to call a statement-level trigger once per query than a row-level one for each row. Off the top of my head, I think you can obtain the number of modified rows from GET DIAGNOSTICS.
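For reference, a minimal sketch of the row-level variant I mentioned (the rowcount table and function names are made up):

CREATE TABLE rowcount (tbl name PRIMARY KEY, cnt bigint NOT NULL);
INSERT INTO rowcount
    VALUES ('planet_osm_point', (SELECT count(*) FROM planet_osm_point));

CREATE FUNCTION count_rows() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE rowcount SET cnt = cnt + 1 WHERE tbl = TG_TABLE_NAME;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE rowcount SET cnt = cnt - 1 WHERE tbl = TG_TABLE_NAME;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER planet_osm_point_count
    AFTER INSERT OR DELETE ON planet_osm_point
    FOR EACH ROW EXECUTE PROCEDURE count_rows();

After that, SELECT cnt FROM rowcount WHERE tbl = 'planet_osm_point' gives an exact count without scanning the table. The statement-level version would be the same idea with FOR EACH STATEMENT, updating the counter by the batch size instead of by 1.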

>>> 2. There's an autovacuum background process which already does the
>>> job, doesn't it?
>>
>> Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse immediately on that table.
>
> My table is a read-only table after all.
> That's another reason why I'm reluctant using ANALYZE <table>.

You probably won't need to run it as often as every time you need to know the number of rows in it. If the data doesn't change, then the row-count in the statistics won't either.
You probably do want to run this after a batch-INSERT/DELETE, or your row-counts will be inaccurate until auto-vacuum comes along.
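So after a batch load, something as simple as this keeps the estimate current, and is much cheaper than a database-wide ANALYZE:

ANALYZE planet_osm_point;
SELECT reltuples::bigint AS estimate
  FROM pg_class WHERE relname = 'planet_osm_point';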

>> Also, on this mailing-list people don't appreciate it if you top-post. It makes the context
>> hard to decipher and sometimes even makes it difficult to give an accurate answer
>> because the information people want to refer to is far separated from the bit where
>> they're trying to reply to something you said/asked. Remember, people aren't here for your sake.
>
> Thank you for the hint, which I didn't know:
> Is this really still part of this elderly USENET netiquette here?

There's nothing elderly about it. If you're communicating with a large number of people at once, the requirements change. You don't want to make it difficult for people to follow a thread they possibly didn't follow earlier, or they won't bother to answer, or they'll only pick up the latest bit of the thread. In both cases the chances are quite significant that their answers will be irrelevant, provided they even do reply.

In the end it has little to do with style and much more with common sense. The format used here (as well as in USENET) is more suitable for mailing lists.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

