Optimising "full outer join where" for muti-row to multi-column view

From: "Phil Endecott" <spam_from_postgresql_general(at)chezphil(dot)org>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Optimising "full outer join where" for muti-row to multi-column view
Date: 2006-12-28 20:29:56
Message-ID: 1167337796666@dmwebmail.belize.chezphil.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Dear PostgreSQL experts,

I have a database that records the EXIF data for a collection of
photos. (EXIF is a method for embedding arbitary name-value data
in a JPEG, and digital cameras typically use it to record things
like exposure information.) My exif table looks something like
this:

photo_id | tag | value
----------+--------------------------+--------------------------
1001 | DateTime | 2006:10:26 11:19:29
1001 | Orientation | top - left
1001 | PixelXDimension | 3888
1001 | PixelYDimension | 2592
1002 | DateTimeOriginal | 2006:10:26 13:34:06
1002 | Orientation | left - bottom
1002 | PixelXDimension | 3888
1002 | PixelYDimension | 2592

photo_id and tag together form the primary key. The data also
includes many tags that I'm not currently interested in. From
this I create a view containing only the tags of interest:

photo_id | orientation | datetime | xsize | ysize
----------+---------------+---------------------+-------+-------
1001 | top - left | 2006:10:30 11:19:29 | 3888 | 2592
1002 | left - bottom | 2006:10:26 13:34:06 | 3888 | 2592

My first attempt did this in the obvious way by joining the
exif table with itself once per column:

create view photo_info_v as
select photo_id,
e1.value as orientation,
e2.value as datetime,
e3.value as xsize,
e4.value as ysize
from exif e1
join exif e2 using(photo_id)
join exif e3 using(photo_id)
join exif e4 using(photo_id)
where e1.tag='Orientation'
and e2.tag='DateTime'
and e3.tag='PixelXDimension'
and e4.tag='PixelYDimension';

This works well for one important query, where I find one photo's
information from the view:

photos=> explain analyse select * from photo_info_v where photo_id=1201;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..12.09 rows=1 width=60) (actual time=1.198..1.290 rows=1 loops=1)
-> Nested Loop (cost=0.00..9.07 rows=1 width=46) (actual time=0.953..1.009 rows=1 loops=1)
-> Nested Loop (cost=0.00..6.04 rows=1 width=32) (actual time=0.693..0.731 rows=1 loops=1)
-> Index Scan using exif_pkey on exif e1 (cost=0.00..3.02 rows=1 width=18) (actual time=0.384..0.394 rows=1 loops=1)
Index Cond: ((photo_id = 1201) AND (tag = 'Orientation'::text))
-> Index Scan using exif_pkey on exif e4 (cost=0.00..3.02 rows=1 width=18) (actual time=0.189..0.205 rows=1 loops=1)
Index Cond: ((1201 = photo_id) AND (tag = 'PixelYDimension'::text))
-> Index Scan using exif_pkey on exif e3 (cost=0.00..3.02 rows=1 width=18) (actual time=0.186..0.194 rows=1 loops=1)
Index Cond: ((1201 = photo_id) AND (tag = 'PixelXDimension'::text))
-> Index Scan using exif_pkey on exif e2 (cost=0.00..3.02 rows=1 width=18) (actual time=0.171..0.195 rows=1 loops=1)
Index Cond: ((1201 = photo_id) AND (tag = 'DateTime'::text))
Total runtime: 3.064 ms

However, I might just want one column from the view:

photos=> explain analyse select orientation from photo_info_v where photo_id=1201;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..12.09 rows=1 width=14) (actual time=1.266..1.353 rows=1 loops=1)
-> Nested Loop (cost=0.00..9.07 rows=1 width=18) (actual time=1.009..1.062 rows=1 loops=1)
-> Nested Loop (cost=0.00..6.04 rows=1 width=18) (actual time=0.752..0.787 rows=1 loops=1)
-> Index Scan using exif_pkey on exif e1 (cost=0.00..3.02 rows=1 width=18) (actual time=0.410..0.418 rows=1 loops=1)
Index Cond: ((photo_id = 1201) AND (tag = 'Orientation'::text))
-> Index Scan using exif_pkey on exif e4 (cost=0.00..3.02 rows=1 width=4) (actual time=0.183..0.199 rows=1 loops=1)
Index Cond: ((1201 = photo_id) AND (tag = 'PixelYDimension'::text))
-> Index Scan using exif_pkey on exif e3 (cost=0.00..3.02 rows=1 width=4) (actual time=0.168..0.176 rows=1 loops=1)
Index Cond: ((1201 = photo_id) AND (tag = 'PixelXDimension'::text))
-> Index Scan using exif_pkey on exif e2 (cost=0.00..3.02 rows=1 width=4) (actual time=0.168..0.191 rows=1 loops=1)
Index Cond: ((1201 = photo_id) AND (tag = 'DateTime'::text))
Total runtime: 3.123 ms

I only wanted the orientation information, which could be found
using this faster query:

photos=> explain analyse select value from exif where photo_id=1201 and tag='Orientation';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Index Scan using exif_pkey on exif (cost=0.00..3.02 rows=1 width=14) (actual time=0.504..0.536 rows=1 loops=1)
Index Cond: ((photo_id = 1201) AND (tag = 'Orientation'::text))
Total runtime: 1.298 ms

Instead it builds up all four columns of the view and discards three
of them. It has to do this, because if data for any one of the four
columns did not exist then the whole row would not exist in the view.
I know that all of the data is always present, but PostgreSQL doesn't
know that, and I don't think there is any way that I can tell it that
it does.

To try and avoid this, I tried using a full outer join in the view
definition. In this case the row would always be present in the view,
even if the data for the other columns were not present. I hoped that
the query would then be optimised to look up only the orientation
information:

create view photo_info_v as
select photo_id,
e1.value as orientation,
e2.value as datetime,
e3.value as xsize,
e4.value as ysize
from exif e1
full outer join exif e2 using(photo_id)
full outer join exif e3 using(photo_id)
full outer join exif e4 using(photo_id)
where e1.tag='Orientation'
and e2.tag='DateTime'
and e3.tag='PixelXDimension'
and e4.tag='PixelYDimension';

photos=> explain analyse select orientation from photo_info_v where photo_id=1201;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=2431.73..3198.49 rows=7 width=14) (actual time=2256.194..2345.563 rows=1 loops=1)
Merge Cond: ("outer".photo_id = "inner"."?column5?")
Join Filter: (COALESCE("inner"."?column5?", "outer".photo_id) = 1201)
-> Index Scan using exif_pkey on exif e4 (cost=0.00..740.62 rows=1229 width=4) (actual time=1.937..462.014 rows=1229 loops=1)
Index Cond: (tag = 'PixelYDimension'::text)
-> Sort (cost=2431.73..2434.99 rows=1302 width=26) (actual time=1766.523..1802.629 rows=1229 loops=1)
Sort Key: COALESCE(COALESCE(e1.photo_id, e2.photo_id), e3.photo_id)
-> Merge Join (cost=1601.15..2364.38 rows=1302 width=26) (actual time=931.551..1604.342 rows=1229 loops=1)
Merge Cond: ("outer".photo_id = "inner"."?column4?")
-> Index Scan using exif_pkey on exif e3 (cost=0.00..740.62 rows=1229 width=4) (actual time=0.240..519.953 rows=1229 loops=1)
Index Cond: (tag = 'PixelXDimension'::text)
-> Sort (cost=1601.15..1604.37 rows=1286 width=22) (actual time=931.092..971.617 rows=1229 loops=1)
Sort Key: COALESCE(e1.photo_id, e2.photo_id)
-> Merge Join (cost=0.00..1534.74 rows=1286 width=22) (actual time=0.467..797.332 rows=1229 loops=1)
Merge Cond: ("outer".photo_id = "inner".photo_id)
-> Index Scan using exif_pkey on exif e1 (cost=0.00..740.62 rows=1229 width=18) (actual time=0.193..271.369 rows=1229 loops=1)
Index Cond: (tag = 'Orientation'::text)
-> Index Scan using exif_pkey on exif e2 (cost=0.00..740.62 rows=1229 width=4) (actual time=0.118..395.892 rows=1229 loops=1)
Index Cond: (tag = 'DateTime'::text)
Total runtime: 2350.601 ms

Ooops! It has got worse not better. The runtime has increased by
three orders of magnitude, because it seems to compute most of the
view before filtering. It boils down to this query:

photos=> select e1.value as orientation
photos-> from exif e1
photos-> full outer join exif e2 using(photo_id)
photos-> full outer join exif e3 using(photo_id)
photos-> full outer join exif e4 using(photo_id)
photos-> where e1.tag='Orientation'
photos-> and e2.tag='DateTime'
photos-> and e3.tag='PixelXDimension'
photos-> and e4.tag='PixelYDimension'
photos-> and photo_id=1201;

I would like it to optimise away the unused joins, leaving:

photos=> explain analyse select e1.value as orientation
photos-> from exif e1
photos-> where e1.tag='Orientation'
photos-> and photo_id=1201;

Is there a good reason why it cannot do this? Remeber that photo_id and
tag are the primary key for exif, so the view cannot have more than one
row per photo_id.

This is not the first problem I have had with non-trivial views, where
PostgreSQL has failed to simplify a query on the view as I had hoped it
would, either because the semantics of SQL mean that it is unable to
(with no way of describing the additional constraints that apply to
that data and could make the simplification possible), or because the
query optimiser doesn't detect a possible optimisation. Back in 7.4
days I resorted to materialised views updated using triggers, which
involved enormously more work than a normal view. Has anything been
done, or is anything planned, to make this easier?

I imagine that the basic problem here, projecting data from multiple
rows into a single row with multiple columns, is a fairly common one.
Are there any tricks to doing this that anyone can share?

I'm using PostgreSQL 8.1.

Many thanks,

Phil.

(You are welcome to CC: me in any replies.)

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Martijn van Oosterhout 2006-12-28 22:03:29 Re: Optimising "full outer join where" for muti-row to multi-column view
Previous Message Dave Page 2006-12-28 20:06:25 Re: Backup Restore