Re: directory tree query with big planner variation

From: Michael Stone <mstone+postgres(at)mathom(dot)us>
To: Axel Rau <Axel(dot)Rau(at)Chaos1(dot)DE>
Cc: Michael Stone <mstone+postgres(at)mathom(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: directory tree query with big planner variation
Date: 2006-07-31 13:30:37
Message-ID: 20060731133034.GF2900@mathom.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Mon, Jul 31, 2006 at 01:54:24PM +0200, Axel Rau wrote:
>Am 31.07.2006 um 13:15 schrieb Michael Stone:
>>On Mon, Jul 31, 2006 at 12:48:11PM +0200, Axel Rau wrote:
>>> WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
>>
>>This can't be indexed. You might try something like WHERE P.path
>>LIKE '%(at)%' AND P.path ~ '^%@/[^/]*/$'

Ignore that, I wasn't awake yet.

>Why does it quite well in this case:
>---------------------------------------
>-> Index Scan using path_name_idx on path p (cost=0.00..3.02 rows=1
>width=97) (actual time=15.480..56.935 rows=27 loops=1)
> Index Cond: ((path >= '/Users/axel/Library/
>Preferences/'::text) AND (path < '/Users/axel/Library/
>Preferences0'::text))
> Filter: ((path ~ '^/Users/axel/Library/Preferences/[^/]*/
>$'::text) AND (rtrim("replace"(path, '/Users/axel/Library/
>Preferences/'::text, ''::text), '/'::text) <> ''::text))
>---------------------------------------
>as compared to this case(ignoring the index on path):
>---------------------------------------
>-> Index Scan using path_pkey on path p (cost=0.00..2567.57
>rows=1941 width=97) (actual time=527.805..1521.911 rows=69 loops=1)
> Filter: ((path ~ '^/Users/axel/[^/]*/$'::text) AND (rtrim
>("replace"(path, '/Users/axel/'::text, ''::text), '/'::text) <>
>''::text))
>---------------------------------------
>? With all longer path names, I get the above (good)result.
>Should I put the rtrim/replace on the client side?

That's not the slow part. The slow part is retrieving every single file
for each of the subdirectories in order to determine whether there are
any files in the subdirectories.

>>The schema could be a lot more intelligent here. (E.g., store path
>>seperately from file/directory name, store type (file or directory)
>>seperately, etc.) Without improving the schema I don't think this
>>will ever be a speed demon.

>PATH holds complete pathnames of directories, FILENAME holds
>filenames and pathname components.
>Currently the schema is the lowest common denominator between SQLite,
>MySQL and pg and the bacula people will stay with that (-;).

Nothing I suggested raises the bar for the "lowest common denominator".
If I understand the intend of this SQL, you're pulling all the entries
in a directory in two parts. The first part (files) is fairly
straightforward. The second part (directories) consists of pulling any
file whose parent is a subdirectory of the directory you're looking for
(this is *all* children of the directory, since you have to retrieve
every element that begins with the directory, then discard those that
have an additional / in their name), counting how many of these there
are for each subdirectory, and discarding those results except for a
binary (yes there are children or no there aren't). This is a lot of
useless work to go through, and is going to be slow if you've got a lot
of stuff in a subdirectory. An alternative approach would be, for each
directory, to store all its children (files and subdirectories) along
with a flag indicating which it is. This would allow you to create the
collapsed tree view without walking all the children of a subdirectory.

Assuming you can't make changes to the schema, what about the query?
You've got this:

explain analyze SELECT X.name AS name, COUNT(CH) > 1 AS children
FROM
( SELECT RTRIM( REPLACE( NLPC.path, '%@/', ''),'/') AS name,
FN.name AS CH
FROM
( SELECT P.path,P.pathid
FROM bacula.path P
WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
LEFT OUTER JOIN bacula.file F
ON
NLPC.pathid = F.pathid
LEFT OUTER JOIN bacula.filename FN
ON
F.filenameid = FN.filenameid
GROUP BY NLPC.path, FN.name
UNION
SELECT FN.name AS name, FN.name AS CH
FROM
bacula.path P, bacula.file F, bacula.filename FN
WHERE
P.path = '%@/' AND
P.pathid = F.pathid AND
F.filenameid = FN.filenameid
) AS X
WHERE X.name <> ''
GROUP BY X.name

I'm only looking at the first part, which reduces to
SELECT X.name AS name, COUNT(CH) > 1 AS children FROM
SELECT NLPC.path AS name, FN.name as CH
FROM ( SELECT P.path,P.pathid FROM bacula.path AS NLPC
LEFT OUTER JOIN bacula.file F ON NLPC.pathid=F.pathid
LEFT OUTER JOIN bacula.filename FN ON F.filenameid=FN.filenameid
GROUP BY NLPC.path,FN.name

Why is the filename table even accessed? Would the results be the
same if you did
SELECT NLPC.path AS name, F.fileid AS CH
and drop the LEFT OUTER JOIN bacula.filename altogether?

And then what happens if you try something like
SELECT X.name,X.children
FROM
(SELECT [rtrim]P.path,(SELECT count(*) FROM bacula.file F
WHERE F.pathid = P.pathid
LIMIT 2) > 1
FROM bacula.path P
WHERE P.path ~ '^%@/[^/]*/$'
UNION
SELECT FN.name,0
FROM bacula.path P, bacula.file F, bacula.filename FN
WHERE
P.path = '%@/' AND
P.pathid = F.pathid AND
F.filenameid = FN.filenameid
) AS X
WHERE X.name <> ''
GROUP BY X.name

It's hard to say without knowing what's actually *in* the tables, but
the existing query definately doesn't scale well for what I think it's
trying to do.

Mike Stone

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Mark Lewis 2006-07-31 13:53:21 Re: directory tree query with big planner variation
Previous Message Tom Lane 2006-07-31 12:53:34 Re: sub select performance due to seq scans