sequence query performance issues

From: "Peter Koczan" <pjkoczan(at)gmail(dot)com>
To: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: sequence query performance issues
Date: 2007-09-27 22:04:35
Message-ID: 4544e0330709271504g364fd042u161966fa448934f6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,

I have a weird performance issue with a query I'm testing. Basically,
I'm trying to port a function that generates user uids, and since
postgres offers a sequence generator function, I figure I'd take
advantage of that. Basically, I generate our uid range, filter out
those which are in use, and randomly pick however many I need.
However, when I run it it takes forever (>10 minutes and I get nothing
so I cancelled the query) and cpu usage on the server is maxed out.

Here's my query (I'll post the explain output later so as not to
obscure my question):
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select uid from people) order by random() limit 1;

I thought that nulls were a problem, so I tried:
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select coalesce(uid,0) from people) order by random()
limit 1;
And that finished in less than a second.

I then tried:
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select coalesce(uid,0) from people where uid is not
null) order by random() limit 1;
And we're back to taking forever.

So I have 2 questions:

- Is there a better query for this purpose? Mine works when coalesced,
but it seems a little brute-force and the random() sorting, while
kinda nice, is slow.

- Is this in any way expected? I know that nulls sometimes cause
problems, but why is it taking forever even when trying to filter
those out?

Thanks.

Peter

The gory details:
- There is an btree index on people(uid), and there are ~6300 rows, of
which ~1300 have null uids.

- EXPLAIN output (I couldn't get EXPLAIN ANALYZE output from the first
two queries since they took too long):
=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select uid from people) order by random() limit 1;
QUERY PLAN
------------------------------------------------------------------------------------------
Limit (cost=40025.57..40025.60 rows=10 width=4)
-> Sort (cost=40025.57..40026.82 rows=500 width=4)
Sort Key: random()
-> Function Scan on generate_series a
(cost=693.16..40003.16 rows=500 width=4)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=693.16..756.03 rows=6287 width=2)
-> Seq Scan on people (cost=0.00..686.87
rows=6287 width=2)
(8 rows)

=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select uid from people where uid is not null)
order by random() limit 1;
QUERY PLAN
------------------------------------------------------------------------------------------
Limit (cost=31486.71..31486.73 rows=10 width=4)
-> Sort (cost=31486.71..31487.96 rows=500 width=4)
Sort Key: random()
-> Function Scan on generate_series a
(cost=691.79..31464.29 rows=500 width=4)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=691.79..741.00 rows=4921 width=2)
-> Seq Scan on people (cost=0.00..686.87
rows=4921 width=2)
Filter: (uid IS NOT NULL)
(9 rows)

=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select coalesce(uid, 0) from people) order by
random() limit 1;
QUERY PLAN
----------------------------------------------------------------------------------------
Limit (cost=756.97..756.99 rows=10 width=4)
-> Sort (cost=756.97..758.22 rows=500 width=4)
Sort Key: random()
-> Function Scan on generate_series a (cost=718.30..734.55
rows=500 width=4)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on people (cost=0.00..702.59 rows=6287 width=2)
(7 rows)

=> explain analyze select a.uid from generate_series(1000, 32767) as
a(uid) where a.uid not in (select coalesce(uid, 0) from people) order
by random() limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=756.97..756.99 rows=10 width=4) (actual
time=370.444..370.554 rows=10 loops=1)
-> Sort (cost=756.97..758.22 rows=500 width=4) (actual
time=370.434..370.472 rows=10 loops=1)
Sort Key: random()
-> Function Scan on generate_series a (cost=718.30..734.55
rows=500 width=4) (actual time=70.018..199.540 rows=26808 loops=1)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on people (cost=0.00..702.59 rows=6287
width=2) (actual time=0.023..29.167 rows=6294 loops=1)
Total runtime: 372.224 ms
(8 rows)

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Richard Huxton 2007-09-28 08:06:11 Re: sequence query performance issues
Previous Message Ron Mayer 2007-09-27 18:07:48 Re: Searching for the cause of a bad plan