Re: What is the right way to deal with a table with rows that are not in a random order?

From: Douglas Alan <darkwater42(at)gmail(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: What is the right way to deal with a table with rows that are not in a random order?
Date: 2009-05-29 00:45:12
Message-ID: ce6334d00905281745m253e629at595a05a41ec4c217@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:

> Douglas Alan wrote:

>> I'm worried that turning off seqscan would distort other queries.
>> (Remember, I don't have control of the individual queries.  The
>> user of the application can specify all sorts of queries, and
>> there's an ORM in the middle.)

> You are aware you can turn off seq scans for just the current
> connection, right?

Yes, of course.  I thought that the suggestion was to change this
setting in production, not just for diagnostic purposes.  As I
previously reported, I've already changed some settings for diagnostic
purposes and also reported what values I had to set them to to force
an index scan for the query in question.

> No one's saying to do it all the time.  They're saying to do it and
> then run explain analyze on your query, then post the results of
> both let us have a look.

Okay -- no problem:

   set enable_seqscan = on;
   explain analyze select * from maindb_astobject
   where survey_id = 2
   limit 1000;

   "Limit  (cost=0.00..48.03 rows=1000 width=78) (actual
time=84837.835..265938.258 rows=1000 loops=1)"
   "  ->  Seq Scan on maindb_astobject  (cost=0.00..3538556.10
rows=73675167 width=78) (actual time=84837.825..265932.121 rows=1000
loops=1)"
   "        Filter: (survey_id = 2)"
   "Total runtime: 265942.416 ms"

   set enable_seqscan = off;
   explain analyze select * from maindb_astobject
   where survey_id = 2
   limit 1000;

   "Limit  (cost=0.00..67.37 rows=1000 width=78) (actual
time=172.248..225.219 rows=1000 loops=1)"
   "  ->  Index Scan using maindb_astobject_survey_id on
maindb_astobject  (cost=0.00..4963500.87 rows=73675167 width=78)
(actual time=172.240..221.078 rows=1000 loops=1)"
   "        Index Cond: (survey_id = 2)"
   "Total runtime: 227.412 ms"

>> Slow query:
>>
>> psql> select * from maindb_astobject
>>      where survey_id = 2
>>      limit 20 offset 1000;

> You'll notice that in your previous posts you never mentioned offset,
> which greatly affects the plan chosen.

It makes little difference for this query.  The plan is the same with
or without the offset.  The purpose of the offset was just to more or
less simulate a "limit 1020" without getting deluged by 1,020 results.

> Is survey_id indexed?

Yes, of course.

>> P.P.S. Many other queries were very slow due to Postgres wanting to use hash
>> joins and merge joins rather than nested index scans.

> Then it's quite possible you have a problem with misestimation of
> values in your db.  Since we have no explain ANALYZE output, we cannot
> judge if this is the case.

Okay, well if you can also tell me how to fix the hash join / merge
join problem we are seeing without resorting to the current hack I am
using to fix it, I will be very thankful!

>> The nested index
>> scans were, in fact, much faster.  I just turned off hash joins and merge
>> joins:  problem solved.

>One problem solved, another one created is more likely.

Such as?  I haven't yet seen a nested index scan perform poorly yet.

Though if there is some database parameter that is set incorrectly,
and this is causing the hash join / merge join problem, then I can see
that having that parameter not be set correctly could be a source of
many future woes, so I certainly would appreciate any insight into
that.

|>ouglas

P.S. Here are the stats on the column. It appears that my recollection
of 25% of the table matching was a bit off. It's actually 98.5%!  That
might explain more why Postgres wants to do a sequential scan. The
problem is that still means that it has to scan a million rows
sequentially before it finds a single matching row, as the matching
rows are at the end of the database:

select * from pg_stats
where tablename='maindb_astobject' and attname = 'survey_id';

schemaname | tablename | attname | null_frac | avg_width |
n_distinct | most_common_vals | most_common_freqs
| histogram_bounds | correlation
------------+------------------+-----------+-----------+-----------+------------+------------------+--------------------------------------------------+------------------+-------------
public | maindb_astobject | survey_id | 0 | 4 |
5 | {2,4,10,3,5} |
{0.985347,0.00966,0.00286667,0.0019,0.000226667} | |
0.998872

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2009-05-29 02:21:00 Re: Debugging pgsql function date error
Previous Message Rory Campbell-Lange 2009-05-29 00:36:10 Debugging pgsql function date error