Re: Problems pushing down WHERE-clause to underlying view

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Nicklas Avén <nicklas(dot)aven(at)jordogskog(dot)no>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Problems pushing down WHERE-clause to underlying view
Date: 2019-02-15 16:06:39
Message-ID: 9cd9e5c1-1909-6f36-1473-f3d2a773762e@aklaver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 2/15/19 7:28 AM, Nicklas Avén wrote:
> Hi
>
>
> We have a system with 2 layers of views. It is about forestry.
>
> The first layer contains the logic like grouping volumes in logs
> together to stems or harvesting areas and joining species names to codes
> and things like that.
>
> The second layer just joins this underlying views to a table with user
> ids and machine ids. So, when used by the application there is a where
> clause containing the user id which gives access to the correct data
>
> in the underlying view based on machine id.
>
>
> The underlying view in this case can return approx 22000 rows, grouped
> from approx 8 million logs by harvest date, harvest object and so on.
>
>
> The problem is that it always calculates all those 22000 rows even if
> the user id I use only gives 250 rows.
>
> So, the query uses 4 seconds instead of under 100 ms.

https://www.postgresql.org/docs/10/sql-createview.html

"CREATE VIEW defines a view of a query. The view is not physically
materialized. Instead, the query is run every time the view is
referenced in a query."

Might want to look at materialized view:
https://www.postgresql.org/docs/10/sql-creatematerializedview.html

"CREATE MATERIALIZED VIEW is similar to CREATE TABLE AS, except that it
also remembers the query used to initialize the view, so that it can be
refreshed later upon demand. A materialized view has many of the same
properties as a table, but there is no support for temporary
materialized views or automatic generation of OIDs."

I would also suggest running the EXPLAIN below with ANALYZE so actual
timings are returned. Also try:

SELECT
*
FROM
underlying_view AS b
JOIN
contractor_access AS b
ON
a.machine_key = b.machine_key
WHERE
user_id = 'name(at)email(dot)address'
>
>
> I have tried this on 2 servers and my laptop, PostgreSQL 9.6 and 10 and
> get the same issue.
>
>
> I have tried to pick the query apart to understand what is happening.
>
>
> First, the underlying view looks like this except I removed some fields
> that doesn't affect the case to save some space:
>
> CREATE OR REPLACE VIEW underlying_view AS
>  SELECT
>     l.machine_key,
>     o.object_name,
>     o.sub_object_name,
>     s.species_group_name,
>     p.product_group_name ,
>     l.m3_sub AS volume_m3sub,
>     l.number_of_logs,
>     mi.basemachine_manufacturer,
>     mi.basemachine_model
>    FROM
>     (
>        SELECT
>         hl.contractor_id,
>         hl.machine_key,
>         hl.operator_key,
>         hl.object_key,
>         hl.sub_object_key,
>         date(hl.harvest_date) AS harvest_date, --this is timestamptz
> since we use the time in other places
>         hl.species_group_key,
>         hl.product_key,
>         sum(hl.m3_sub) AS m3_sub,
>         count(*) AS number_of_logs
>     FROM version_union_tables_r02.harvester_logs hl
>     GROUP BY hl.machine_key, hl.contractor_id, hl.operator_key,
> hl.object_key, hl.sub_object_key, (date(hl.harvest_date)),
> hl.species_group_key, hl.product_key
>     ) l
>      LEFT JOIN version_union_tables_r02.machine_info mi ON
> l.machine_key::text = mi.machine_key::text
>      LEFT JOIN version_union_tables_r02.objects o ON
> l.machine_key::text = o.machine_key::text AND l.object_key =
> o.object_key AND l.sub_object_key = o.sub_object_key
>      LEFT JOIN version_union_tables_r02.products p ON
> l.machine_key::text = p.machine_key::text AND l.product_key = p.product_key
>      LEFT JOIN version_union_tables_r02.species s ON
> l.machine_key::text = s.machine_key::text AND l.species_group_key =
> s.species_group_key;
>
>
> In the next level is a view that looks like this:
>
>
> CREATE OR REPLACE VIEW top_level_view AS
>  SELECT *
>    FROM underlying_view a
>      JOIN contractor_access b ON a.machine_key = b.machine_key WHERE
> b.active <> 0;
>
>
> If I query this top_level_view like:
>
> SELECT * FROM top_level_view WHERE user_id = 'name(at)email(dot)address';
>
> I haven't succeeded to avoid the underlying view to compute the full
> dataset.
>
>
> The user_id 'name(at)email(dot)address' returns 1 machine_key from
> contractor_access table that has any hits in the underlying view (4 in
> total but 3 machines are not represented in the underlying view)
>
> We call the machine that we get from contractor_id table 'machine1'
>
> Here is what I have tried to find out when the plan changes:
>
>
> 1) SELECT * FROM underlying_view WHERE machine_key = 'machine1'; --
> returns 250 rows in approx 100 ms, so, works as expected
>
>
> 2) select * from (select 'machine1' machine_key) a, underlying_view b
> where a.machine_key = b.machine_key; --same as above, works as expected
>
>
> 3) select * from (select * from contractor_access where user_id =
> 'name(at)email(dot)address') a,
> underlying_view b where a.machine_key = b.machine_key;      -- Here I am
> hit. this returns the same 250 rows, but in over 4 seconds
>
>
> /*So I thought I should try to force down the machine_key to the
> underlying view with lateral like this*/
>
>
> 4) select * from (select * from contractor_access where user_id =
> 'name(at)email(dot)address') a,
> lateral (select * from underlying_view where machine_key =
> a.machine_key) b;
>
> But this doesn't work either. It returns the same 250 rows in approx 4
> seconds.
>
>
> My question is, is there some trick to force the planner to push down
> the machine_key.
>
> I cannot understand what is fooling the planner.
>
> The table is analyzed, I have tried on several machines, so I do not
> think it is miss leading statistics.
>
>
> I haven't done any configuration more than tried with
> max_parallel_workers_per_gather to 0 since the workers makes it harder
> to understand what is happening.
>
>
> Here is the quer plan on query number 3 above:
>
> EXPLAIN select * from (select * from contractor_access where user_id =
> 'name(at)email(dot)address') a,
> underlying_view b where a.machine_key = b.machine_key;
>
> Aggregate  (cost=543839.03..543839.04 rows=1 width=8)
>   ->  Hash Join  (cost=395402.74..543798.72 rows=16123 width=0)
>         Hash Cond: ((hl.machine_key)::text =
> contractor_access.machine_key)
>         ->  Hash Left Join  (cost=395395.10..533563.59 rows=806147
> width=400)
>               Hash Cond: (((hl.machine_key)::text =
> (s.machine_key)::text) AND (hl.species_group_key = s.species_group_key))
>               ->  Hash Left Join  (cost=395380.73..485122.31
> rows=806147 width=32)
>                     Hash Cond: (((hl.machine_key)::text =
> (p.machine_key)::text) AND (hl.product_key = p.product_key))
>                     ->  Hash Left Join (cost=395320.48..444697.18
> rows=806147 width=36)
>                           Hash Cond: (((hl.machine_key)::text =
> (o.machine_key)::text) AND (hl.object_key = o.object_key) AND
> (hl.sub_object_key = o.sub_object_key))
>                           ->  Hash Left Join (cost=395257.01..417426.05
> rows=806147 width=36)
>                                 Hash Cond: ((hl.machine_key)::text =
> (mi.machine_key)::text)
>                                 Join Filter: (((hl.contractor_id)::text
> = (mi.contractor_id)::text) OR ((hl.contractor_id IS NULL) AND
> (mi.contractor_id IS NULL)))
>                                 ->  HashAggregate
> (cost=395254.66..403316.13 rows=806147 width=86)
>                                       Group Key: hl.machine_key,
> hl.contractor_id, hl.operator_key, hl.object_key, hl.sub_object_key,
> hl.harvest_date, hl.species_group_key, hl.product_key
>                                       ->  Seq Scan on harvester_logs
> hl  (cost=0.00..234025.22 rows=8061472 width=54)
>                                 ->  Hash  (cost=1.60..1.60 rows=60
> width=35)
>                                       ->  Seq Scan on machine_info mi
> (cost=0.00..1.60 rows=60 width=35)
>                           ->  Hash  (cost=33.26..33.26 rows=1726 width=23)
>                                 ->  Seq Scan on objects o
> (cost=0.00..33.26 rows=1726 width=23)
>                     ->  Hash  (cost=37.90..37.90 rows=1490 width=29)
>                           ->  Seq Scan on products p (cost=0.00..37.90
> rows=1490 width=29)
>               ->  Hash  (cost=7.55..7.55 rows=455 width=31)
>                     ->  Seq Scan on species s  (cost=0.00..7.55
> rows=455 width=31)
>         ->  Hash  (cost=7.59..7.59 rows=4 width=21)
>               ->  Seq Scan on contractor_access  (cost=0.00..7.59
> rows=4 width=21)
>                     Filter: (t4e_contractor_id =
> 'name(at)email(dot)address'::text)
>
>
>
> Thanks
>
>
> Nicklas Avén
>
>
>
>
>
>
>
>
>
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2019-02-15 16:08:10 Re: How to setup only one connection for the whole event loop?
Previous Message Adrian Klaver 2019-02-15 15:44:41 Re: PostgreSql Version Compatibility With Apache ActiveMQ