Re: Values larger than 1/3 of a buffer page cannot be indexed (hstore)

From: Stefan Keller <sfkeller(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>, oleg(at)sai(dot)msu(dot)su, Ott Michel <m1ott(at)hsr(dot)ch>
Subject: Re: Values larger than 1/3 of a buffer page cannot be indexed (hstore)
Date: 2011-05-01 14:18:05
Message-ID: BANLkTikXVkqQN7iiQ__yo3LEaiC0UBhCHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Tom,

Thanks for the hint! I'm actually doing a GROUP BY with an inquality
search (HAVING...) and still get long lasting queries, see query (2)
below.

I'm doing an equality search with success with the '->' operator on
the same field 'tags' like in this query (1):

-- Count all restaurants in database ("amenity = restaurant''):
select count(*) from osm_all_v
where hstore(tags)->'amenity'='restaurant'

This query 1 is reasonably fast and could be accelerated using this
functional index:
CREATE INDEX planet_osm_point_tags_restaurant
ON planet_osm_point
USING btree (tags)
WHERE (tags -> 'amenity'::text) = 'restaurant'::text;

But look at query 2:

This GROUP BY query gives following results (after a VACUUM FULL ANALYZE):
* 26265ms without any index on field 'tags'
* 26000ms with GIN index on field 'tags'
* 26078ms with HASH index on field 'tags'

To sum up:
* GIN and HASH indexes don't make a difference.
* BTree index fails (as reported) on field 'tags' with ERROR 'index
row size 3120 exceeds maximum 2712' SQL state: 54000).
* GIST index fails too on field 'tags' but with ERROR 'invalid hstore
value found', SQL state: XX000.

Q:
=> The latter indicates an "invalid value" - but does not say what's invalid.
=> I have no clue how to boost query 2...? See below the query and the
explain output.

Yours, Stefan

QUERY 2
=======
-- Return all key-values of semantic type 'enum' without types
numeric, date/time etc.
-- (actually hstore handles all tag/values as of type text):
SELECT MIN(keys.key), hstore(p.tags)->keys.key, count(*)
FROM planet_osm_point p, (
SELECT key, count(*)
FROM (
SELECT (each(tags)).key FROM planet_osm_point
) AS stat
GROUP BY key HAVING count(*) > 1
AND key NOT LIKE 'note:%'
...
AND key NOT IN
('ele','ref','url','website','email','maxspeed','converted_by', ... )
) keys
WHERE hstore(p.tags)->keys.key >''
GROUP BY hstore(p.tags)->keys.key HAVING count(*) > 1
ORDER by 1,3 desc
...
"amenity";"bus_stop";24414
"amenity";"restaurant";5423
"amenity";"bench";5041
"amenity";"parking";4232
"amenity";"fire_hydrant";2363
"amenity";"post_box";1838
"amenity";"fuel";1628
"amenity";"place_of_worship";1615
...

EXPLAIN output:
"Sort (cost=6043108.58..6054613.93 rows=4602142 width=118) (actual
time=331415.987..331417.914 rows=964 loops=1)"
" Sort Key: (min(keys.key)), (count(*))"
" Sort Method: quicksort Memory: 85kB"
" -> GroupAggregate (cost=4173505.61..4401220.42 rows=4602142
width=118) (actual time=330101.581..331409.834 rows=964 loops=1)"
" Filter: (count(*) > 1)"
" -> Sort (cost=4173505.61..4198338.94 rows=9933329
width=118) (actual time=330101.487..330870.237 rows=196227 loops=1)"
" Sort Key: ((p.tags -> keys.key))"
" Sort Method: external merge Disk: 21960kB"
" -> Nested Loop (cost=22534.58..574521.77 rows=9933329
width=118) (actual time=6747.314..327192.811 rows=196227 loops=1)"
" Join Filter: ((p.tags -> keys.key) > ''::text)"
" -> Seq Scan on planet_osm_point p
(cost=0.00..5649.22 rows=182822 width=86) (actual time=0.015..516.191
rows=182822 loops=1)"
" -> Materialize (cost=22534.58..22539.88
rows=163 width=32) (actual time=0.039..0.890 rows=420 loops=182822)"
" -> Subquery Scan on keys
(cost=22534.58..22539.07 rows=163 width=32) (actual
time=6740.620..6743.358 rows=420 loops=1)"
" -> HashAggregate
(cost=22534.58..22537.44 rows=163 width=32) (actual
time=6740.613..6741.651 rows=420 loops=1)"
" Filter: (count(*) > 1)"
" -> Subquery Scan on stat
(cost=0.00..21417.62 rows=148929 width=32) (actual
time=0.093..6230.640 rows=196458 loops=1)"
" Filter: ((stat.key !~~
'name%'::text) AND (stat.key !~~ 'addr:%'::text) AND (stat.key !~~*
'FIXME'::text) AND (stat.key !~~* 'openGeoDB:%'::text) AND (stat.key
!~~ 'note:%'::text) AND (stat.key !~~ '%_ref'::text) AND (stat.key !~~
'%description%'::text) AND (stat.key !~~ 'contact:%'::text) AND
(stat.key !~~ 'operator:%'::text) AND (stat.key !~~ 'uic_%'::text) AND
(stat.key !~~ 'TMC%'::text) AND (stat.key !~~ 'uic_%'::text) AND
(stat.key !~~ 'direction%'::text) AND (stat.key !~~ 'is_in%'::text)
AND (stat.key !~~ 'wikipedia%'::text) AND (stat.key <> ALL
('{ele,ref,url,website,email,maxspeed,converted_by,layer,level,phone,mobility:station_id,information,opening_hours,date,time,collection_times,operator,colour,fee,nat_name,alt_name,population,seats,postal_code,capacity,line,lines,maxheight,reg_name}'::text[])))"
" -> Seq Scan on
planet_osm_point (cost=0.00..6106.28 rows=182822 width=86) (actual
time=0.025..2363.603 rows=575700 loops=1)"
"Total runtime: 331426.018 ms"

2011/5/1 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> Stefan Keller <sfkeller(at)gmail(dot)com> writes:
>> Any ideas on how to index my hstore attribute?
>
> Use a GIST or GIN index.  The only thing that a btree index on hstore
> can do for you is to support equality comparisons on the whole hstore
> value, which is pretty unlikely to be what you're after.
>
>                        regards, tom lane
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2011-05-01 16:35:09 Re: Postgresql, PSN hack and table limits
Previous Message Jasen Betts 2011-05-01 12:27:43 Re: Short-circuit boolean evaluation