Re: unoptimized nested loops

From: Tim Kelly <gtkelly(at)dialectronics(dot)com>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: unoptimized nested loops
Date: 2022-06-02 10:52:18
Message-ID: 62989662.1010607@dialectronics.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thank you to David, Jeff and Tom for your responses. Tom's response has
made me rethink my question. I may have provided too much information,
in the effort to anticipate suggestions. Let me rephrase:

I have two tables, a parent (named "metadata") and a child (named
"data"). Each table has two fields. One field is shared (the "id," an
alphanumeric field). The other field in each table is basically the
constraint (named "author" in metadata) and the target (named "content"
data). Each table has about 1.25m records/rows. There are about 1500
orphaned child records, but all parent records have a child record.

When I do a search, as the "human in charge (HIC)" I know that the
constraint from the parent table will yield a very small amount of child
table records in which the target needs to be found. A typical search
yields a few hundred to a few thousand parent records, which should take
milliseconds to search for the target. A search of all of the child
records for the target that is then compared against the constrained
parent records to produce a pure intersection is very inefficient.

I found discussions from ten years or more ago about where the order of
query arguments would affect the search optimization. Depending on the
database, the HIC would place a known driving constraint either last or
first in the arguments. This practice has been reasonably abandoned as
the default practice, but I have been unable to find how to override the
current practice of letting the planner do the work, and the planner is
getting it wrong. Something that should only take one or two seconds is
now taking three to four minutes.

So, using an even more egregious scenario I found:

c_db=>select count(id) from metadata where author like '%Tim%';

count
-------
261
(1 row)

Time: 650.753 ms

c_db=>select count(id) from data where data.content like '%some text%'
and data.id in (select id from metadata where metadata.author like '%Tim%');

count
-------
31
(1 row)

Time: 207354.109 ms

Just as a reminder, this is 30 seconds longer than it takes to search
the 1.25m records for the target:

c_db=> select count(id) from data where content like '%some text%';
count
---------
1167810
(1 row)

Time: 180144.251 ms

To address David's suggestion of turning off enable_nestloop, this
resulted in about a 10% improvement in speed. I found no appreciable
difference in time by setting statistics, although there was some
ambiguity in the suggestion. I assume the statistics are supposed to be
set on the content column,

ALTER TABLE data ALTER COLUMN content SET STATISTICS 1000;

To address Jeff's inquiry about planning on the smaller set:

c_db=> explain analyze select count(id) from metadata where author like
'%Tim%';
QUERY PLAN

------------------------------------------------------------------------------------------------------------------
Aggregate (cost=114040.65..114040.66 rows=1 width=11) (actual
time=681.639..681.639 rows=1 loops=1)
-> Seq Scan on metadata (cost=0.00..114040.64 rows=5 width=11)
(actual time=3.053..681.591 rows=261 loops=1)
Filter: ((author)::text ~~ '%Tim%'::text)
Total runtime: 681.685 ms
(4 rows)

Time: 682.142 ms

For completeness

c_db=> explain analyze select count(id) from metadata inner join data on
metadata.id = data.id where author like '%Tim%' and content like '%some
text%';
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=48203.00..48203.01 rows=1 width=11) (actual
time=208239.776..208239.777 rows=1 loops=1)
-> Nested Loop (cost=0.00..48202.99 rows=5 width=11) (actual
time=34102.795..208239.754 rows=31 loops=1)
-> Seq Scan on data (cost=0.00..47132.93 rows=125 width=11)
(actual time=4.714..179369.126 rows=1167810 loops=1)
Filter: (content ~~ '%some text%'::text)
-> Index Scan using metadata_pkey on metadata
(cost=0.00..8.55 rows=1 width=11) (actual time=0.024..0.024 rows=0
loops=1167810)
Index Cond: ((metadata.id)::text = (data.id)::text)
Filter: ((metadata.author)::text ~~ '%Tim%'::text)
Total runtime: 208239.847 ms
(8 rows)

Time: 208247.698 ms

As for the version

$ psql --version
psql (PostgreSQL) 8.4.1
contains support for command-line editing

Let's describe this system as "legacy" and updating is not an option.
If the planner from this version was not optimized compared to more
recent versions, the need for an override is even greater. However, I
am very reluctant to believe the version is at the heart of the problem.
I believe I am missing something and perhaps failing to properly
explain my need.

How do I override the planner and instruct the computer to do what I
say, regardless of the outcome?

Thank you,
tim

(I see I've probably provided too much information again.)

Tom Lane wrote:
> As best I can tell, the issue Tim's unhappy about is not so
> much the use of a nestloop as the lack of use of any index.
> But "string like '%foo%'" is not at all optimizable with a
> btree index. You might be able to get somewhere with a
> pg_trgm GIN or GIST index.
>
> regards, tom lane

In response to

Browse pgsql-general by date

  From Date Subject
Next Message operations i 2022-06-02 11:00:40 Re: How is this possible "publication does not exist"
Previous Message Amit Kapila 2022-06-02 09:51:17 Re: How is this possible "publication does not exist"