Re: INDEX suggestion needed

From: Thomas Beutin <tyrone(at)laokoon(dot)IN-Berlin(dot)DE>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: INDEX suggestion needed
Date: 2002-12-13 15:41:38
Message-ID: 20021213164138.C17113@laokoon.bug.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

at first thanks to all people for help!

On Thu, Dec 12, 2002 at 10:00:48PM +0100, Manfred Koizar wrote:
> On Thu, 12 Dec 2002 20:13:24 +0100, Thomas Beutin
> <tyrone(at)laokoon(dot)IN-Berlin(dot)DE> wrote:
[...]

> >tb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-06-01' AND visit <= '2002-12-11');
>
> This selects (almost) all rows. An index cannot help.
>
> >i got the following index/sequence scans by date ranges:
> >(visit >= '2002-06-01' AND visit <= '2002-06-30') index scan
>
> 0 rows satisfy this condition, index scan is ok
>
> >(visit >= '2002-06-01' AND visit <= '2002-07-31') index scan
>
> Ca. 10000 rows, 3% of the whole table, index scan ok
Ahh, now i get a clear look into index usability ;-)

> >(visit >= '2002-06-01' AND visit <= '2002-08-31') sequence scan
>
> 32%, sequence scan is expected to be faster, unless tuples are almost
> perfectly ordered by visit or most of the table (~ 8000 pages) fits
> into the cache. How much physical memory is installed? What are your
> shared_buffers and effective_cache_size settings? You might want to
> experiment with
> SET enable_seqscan = off;
Ok, this queries are running on my developer workstation (notebook):
550 MHz mobile celeron, 128 MB RAM, 512MB swap space, IDE
running a apache/php, postgres 7.2.3, X11/gnome and mozilla 1.2.1

This are my entries in postgresql.conf:
#max_connections = 32
#shared_buffers = 64 # 2*max_connections, min 16
#effective_cache_size = 1000 # default in 8k pages
so i'm guessing i run the default values. Playing around with
enable_seqscan = on/off shows the planner's right choices.

Actually i cannot connect to the production server for stats, but the
hardware is:
2x iPIII 850MHz, 2GB RAM, 1GB swap, SCSI RAID5 for database
running a apache/php and postgres 7.2.?
Should be more reasonable...

[...]

> >(visit >= '2002-10-01' AND visit <= '2002-10-31') index scan, but long (>5sec)
>
> Ca. 3%; interesting that it takes so long; I'd like to see EXPLAIN
> ANALYZE output for enable_seqscan on and off.
tb=# set enable_seqscan=on;
SET VARIABLE
itb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-10-01' AND visit <= '2002-10-31');
NOTICE: QUERY PLAN:

Aggregate (cost=8788.75..8788.75 rows=1 width=34) (actual time=4663.69..4663.70 rows=1 loops=1)
-> Index Scan using tb5 on stat_pages (cost=0.00..8712.60 rows=30459 width=34) (actual time=0.41..468.35 rows=29937 loops=1)
Total runtime: 4663.99 msec

EXPLAIN

tb=# set enable_seqscan=off;
SET VARIABLE
tb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-10-01' AND visit <= '2002-10-31');
NOTICE: QUERY PLAN:

Aggregate (cost=8788.75..8788.75 rows=1 width=34) (actual time=4788.35..4788.35 rows=1 loops=1)
-> Index Scan using tb5 on stat_pages (cost=0.00..8712.60 rows=30459 width=34) (actual time=0.41..439.44 rows=29937 loops=1)
Total runtime: 4788.65 msec

EXPLAIN

There is no difference in cost.

> >And: The date range in my table is from 2002-07-10 11:36:53+02 up to
> >2002-10-29 23:31:47+01.
>
> Yes, this is approximately reflected by the histogram bounds.
>
> > attname | null_frac | avg_wi | n_distinct | correlation
> >---------+-----------+--------+------------+-------------
> > visit | 0 | 8 | -0.543682 | -0.972118
>
> The negative correlation looks strange. How did you insert your data?
It is a dump from the production system, and the production system gets
the data once a day from webserver logs line by line.

> > m_id | 0 | 2 | 1 | 1
> ^^^
> Only one distinct value in m_id? This explains why your m_id index is
> never used.
Yes, because i copied only a part of the data, it's still a lot for
my developer system ;-)
I will playing around on the production system next week, may be i give
combined indicees (m_id, visit) with enable_seqscan=on/off a try.

> > attname | most_common_vals
> >---------+--------------------------
> > visit | "2002-08-21 10:29:10+02", ...
> > m_id | 35
>
> > attname | histogram_bounds
> >---------+--------------------------
> > visit | "2002-07-25 16:37:12+02"
> > "2002-08-15 12:36:18+02"
> > "2002-08-23 12:36:15+02"
> > "2002-08-29 17:30:54+02"
> > "2002-09-05 12:54:31+02"
> > "2002-09-10 18:03:54+02"
> > "2002-09-16 15:44:56+02"
> > "2002-09-20 14:34:40+02"
> > "2002-09-24 13:59:29+02"
> > "2002-09-29 09:09:31+02"
> > "2002-10-29 23:25:13+01"
> > m_id |
> >
>
> >??? Is this output ok?
>
> Almost. I forgot to ask for most_common_freqs (cut'n'paste error).
Here is the result:
tb=# SELECT attname, null_frac, avg_width, n_distinct, most_common_vals, most_common_freqs, histogram_bounds, correlation FROM pg_stats WHERE tablename = 'stat_pages' AND attname IN ('m_id', 'visit');
attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
---------+-----------+-----------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
visit | 0 | 8 | -0.465972 | {"2002-08-21 10:31:18+02","2002-08-28 15:28:04+02","2002-09-02 08:50:08+02","2002-09-02 13:48:49+02","2002-09-04 13:00:03+02","2002-09-06 18:55:19+02","2002-09-12 15:24:14+02","2002-09-13 13:12:39+02","2002-09-18 12:55:07+02","2002-09-18 15:01:52+02"} | {0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667} | {"2002-07-25 07:04:05+02","2002-08-15 07:52:47+02","2002-08-22 11:00:35+02","2002-08-29 11:59:47+02","2002-09-05 13:56:08+02","2002-09-11 08:08:52+02","2002-09-16 10:48:37+02","2002-09-20 11:50:46+02","2002-09-23 22:25:32+02","2002-09-27 13:01:03+02","2002-10-29 23:31:18+01"} | -0.964541
m_id | 0 | 2 | 1 | {35} | {1} | | 1
(2 rows)

> But I don't expect any value of visit to occur much more than twice,
> so these values should be irrelevant to our estimations.
The probability isn't high, but it could happen to see the same value multiple
times. These are web server log data in a one second grid.

Regards,
-tb
--
Thomas Beutin tb(at)laokoon(dot)IN-Berlin(dot)DE
Beam me up, Scotty. There is no intelligent live down in Redmond.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jan Poslusny 2002-12-13 15:46:35 Re: Copy/foreign key contraints
Previous Message Ken Godee 2002-12-13 15:14:11 Copy/foreign key contraints