Re: Indexes on functions

From: "Michael Richards" <michael(at)fastmail(dot)ca>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Indexes on functions
Date: 2001-10-18 01:52:22
Message-ID: 3BCE35D6.00000B.93192@frodo.searchcanada.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

I'm going to write a function in C that parses XML. Is there any way
to index the output of this function? I've got 10,000,000 rows that
contain XML data and I need to efficiently find the ones that contain
the proper keys. I tried pulling the values I want from the data and
putting it in its own table but with an average of 20 items that just
grows to an unmanageable size.

-Michael
_________________________________________________________________
http://fastmail.ca/ - Fast Free Web Email for Canadians
>From pgsql-sql-owner(at)postgresql(dot)org Thu Oct 18 09:40:33 2001
Received: from fw-ext.econet.hu (fw.econet.hu [213.16.96.5])
by postgresql.org (8.11.3/8.11.4) with ESMTP id f9HENGP37202
for <pgsql-sql(at)postgresql(dot)org>; Wed, 17 Oct 2001 10:23:16 -0400 (EDT)
(envelope-from col(at)mportal(dot)hu)
Received: from mail.econet.hu ([172.17.1.17])
by fw-ext.econet.hu with esmtp (Exim 3.12 #1 (Debian))
id 15trgk-000242-00; Wed, 17 Oct 2001 16:28:54 +0200
Received: from host152.localnet3 ([192.168.12.152] helo=mportal.hu)
by mail.econet.hu with esmtp (Exim 3.12 #1 (Debian))
id 15trb4-0003hG-00; Wed, 17 Oct 2001 16:23:02 +0200
Message-ID: <3BCD9582(dot)1070804(at)mportal(dot)hu>
Date: Wed, 17 Oct 2001 16:28:18 +0200
From: CoL <col(at)mportal(dot)hu>
User-Agent: Mozilla/5.0 (Windows; U; Win95; en-US; rv:0.9.2) Gecko/20010726 Netscape6/6.1
X-Accept-Language: en-us
MIME-Version: 1.0
To: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
CC: pgsql-sql(at)postgresql(dot)org
Subject: Re: index problem
References: <Pine(dot)BSF(dot)4(dot)21(dot)0110161548400(dot)18471-100000(at)megazone23(dot)bigpanda(dot)com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Scanner: exiscan *15trb4-0003hG-00*YiTaXybQ75Y* http://duncanthrax.net/exiscan/
X-Archive-Number: 200110/249
X-Sequence-Number: 5204

Hi,

Stephan Szabo wrote:

> On Tue, 16 Oct 2001, CoL wrote:
>
>
>>---------------------------
>>The 2 table query, where prog_data has ~8800 rowsn and index on prog_id:
>>bash-2.04$ time echo "explain select distinct
>>prog_id,prog_ftype,prog_fcasthour,prog_date from prog_dgy_xy,prog_data
>>where pxygy_pid=prog_id " | psql -Uuser db
>>NOTICE: QUERY PLAN:
>>
>>Unique (cost=7432549.69..7680455.07 rows=2479054 width=32)
>> -> Sort (cost=7432549.69..7432549.69 rows=24790538 width=32)
>> -> Merge Join (cost=148864.65..161189.33 rows=24790538 width=32)
>> -> Index Scan using prog_data_pkey on prog_data
>>(cost=0.00..701.12 rows=8872 width=28)
>> -> Sort (cost=148864.65..148864.65 rows=921013 width=4)
>> -> Seq Scan on prog_dgy_xy (cost=0.00..30145.13
>>rows=921013 width=4)
>>
>
> I'm guessing that the approximately 25 million row estimate on the join
> has to be wrong as well given that prog_data.prog_id should be unique.
>
> Hmm, does the explain change if you vacuum analyze the other table
> (prog_data)? If not, what does explain show if you do a
> set enable_seqscan='off';
> before it?

The result:
db=>set enable_seqscan='off';
db=>explain select distinct prog_id,prog_ftype,prog_fcasthour,prog_date
from prog_dgy_xy,prog_data where pxygy_pid=prog_id;
NOTICE: QUERY PLAN:

Unique (cost=7606982.10..7854887.48 rows=2479054 width=32)
-> Sort (cost=7606982.10..7606982.10 rows=24790538 width=32)
-> Merge Join (cost=0.00..335621.73 rows=24790538 width=32)
-> Index Scan using progdgyxy_idx2 on prog_dgy_xy
(cost=0.00..323297.05 rows=921013 width=4)
-> Index Scan using prog_data_pkey on prog_data
(cost=0.00..701.12 rows=8872 width=28)

It "seems" index is used, but the same result :(((, and bigger execution
time: real 3m41.830s

What is in tables?
prog_data contains unique id and other info.
prog_dgy_xy contains that id with x,y coordinates (so many ids from
prog_data with unique x,y)
#prog_data:
#prog_id, prog_ftype, prog_fcasthour, prog_date
#1
'type' 6 2001-10-14 12:00:00
#2
'type' 12 2001-10-14 12:00:00
#prog_dgy_xy:
#pxygy_pid, pxygy_x, pxygy_y
#1
0.1 0.1
#1 0.1 0.15

How can this query takes real 0m1.755s for mysql, [17 sec for
oracle], and 2-3 minutes!! for postgres?

And why:
POSTGRES:
set enable_seqscan ='off'; select count(*) from prog_dgy_xy where
pxygy_pid<13161;
count
--------
900029
real 2m34.340s
explain:
Aggregate (cost=327896.89..327896.89 rows=1 width=0)
-> Index Scan using progdgyxy_idx2 on prog_dgy_xy
(cost=0.00..325594.54 rows=920940 width=0)

MYSQL:
select count(pxygy_pid) from PROG_DGY_XY where pxygy_pid<13161
count(pxygy_pid)
900029
real 0m27.878s
explain:
table type possible_keys key key_len ref rows Extra
PROG_DGY_XY range progdgyxy_idx1,progdgyxy_idx2 progdgyxy_idx2
4 NULL 906856 where used; Using index

The same time difference in case of: = or >, however explain says, cause
seq scan is off, the index is used.
I did vacuum, and vacuum analyze too before.

PS: I think i have to make a site for that, cause there are many
questions :), and weird things.
I love postgres but this makes me "hm?". Today i'll make these test
under 7.1.2.
thx
CoL

Browse pgsql-sql by date

  From Date Subject
Next Message Henshall, Stuart - WCP 2001-10-18 13:28:31 Re: Deleting obsolete values
Previous Message Oleg Lebedev 2001-10-17 22:32:57 SQL parser and/or optimizer