Re: directory tree query with big planner variation

From: Mark Lewis <mark.lewis@mir3.com>
To: Michael Stone <mstone+postgres@mathom.us>
Cc: Axel Rau <Axel.Rau@Chaos1.DE>, pgsql-performance@postgresql.org
Subject: Re: directory tree query with big planner variation
Date: 2006-07-31 13:53:21
Message-ID: 1154354001.1634.792.camel@archimedes
Lists: pgsql-performance

It seems like you might be able to avoid the expensive directory lookups
entirely without changing the schema by defining an immutable function
dir_depth(path), which would just count the number of forward slashes.
Then create a functional index on dir_depth(path) and in the query do a
check for directories with a given prefix and the expected dir_depth.
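In SQL, the sketch might look like this (the function and index names
are made up; bacula.path is assumed from the query quoted below):

CREATE FUNCTION dir_depth(path text) RETURNS integer AS $$
    -- depth = number of forward slashes in the path
    SELECT length($1) - length(replace($1, '/', ''));
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX path_depth_idx ON bacula.path (dir_depth(path));

-- Direct children of a directory: prefix match plus expected depth,
-- instead of a regex filter over every row sharing the prefix.
SELECT path
FROM bacula.path
WHERE path LIKE '/Users/axel/%'
  AND dir_depth(path) = dir_depth('/Users/axel/') + 1;

The LIKE prefix can use the existing index on path (in a C locale, or
with text_pattern_ops), and the depth check then comes from the
expression index rather than a regex filter.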

In 8.1 and later, this kind of query may also be able to combine both
indexes with a bitmap index scan (BitmapAnd).

This is just a hunch and I haven't looked too carefully at the
schema/query requirements to see if it's feasible, but it seems like a
reasonable approach.

-- Mark Lewis

On Mon, 2006-07-31 at 09:30 -0400, Michael Stone wrote:
> On Mon, Jul 31, 2006 at 01:54:24PM +0200, Axel Rau wrote:
> >Am 31.07.2006 um 13:15 schrieb Michael Stone:
> >>On Mon, Jul 31, 2006 at 12:48:11PM +0200, Axel Rau wrote:
> >>> WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
> >>
> >>This can't be indexed. You might try something like WHERE P.path
> >>LIKE '%@%' AND P.path ~ '^%@/[^/]*/$'
>
> Ignore that, I wasn't awake yet.
>
> >Why does it work quite well in this case:
> >---------------------------------------
> >-> Index Scan using path_name_idx on path p (cost=0.00..3.02 rows=1 width=97) (actual time=15.480..56.935 rows=27 loops=1)
> >     Index Cond: ((path >= '/Users/axel/Library/Preferences/'::text) AND (path < '/Users/axel/Library/Preferences0'::text))
> >     Filter: ((path ~ '^/Users/axel/Library/Preferences/[^/]*/$'::text) AND (rtrim("replace"(path, '/Users/axel/Library/Preferences/'::text, ''::text), '/'::text) <> ''::text))
> >---------------------------------------
> >as compared to this case(ignoring the index on path):
> >---------------------------------------
> >-> Index Scan using path_pkey on path p (cost=0.00..2567.57 rows=1941 width=97) (actual time=527.805..1521.911 rows=69 loops=1)
> >     Filter: ((path ~ '^/Users/axel/[^/]*/$'::text) AND (rtrim("replace"(path, '/Users/axel/'::text, ''::text), '/'::text) <> ''::text))
> >---------------------------------------
> >? With all longer path names, I get the above (good) result.
> >Should I put the rtrim/replace on the client side?
>
> That's not the slow part. The slow part is retrieving every single file
> for each of the subdirectories in order to determine whether there are
> any files in the subdirectories.
>
> >>The schema could be a lot more intelligent here. (E.g., store path
> >>separately from file/directory name, store type (file or directory)
> >>separately, etc.) Without improving the schema I don't think this
> >>will ever be a speed demon.
>
> >PATH holds complete pathnames of directories, FILENAME holds
> >filenames and pathname components.
> >Currently the schema is the lowest common denominator between SQLite,
> >MySQL and pg and the bacula people will stay with that (-;).
>
> Nothing I suggested raises the bar for the "lowest common denominator".
> If I understand the intent of this SQL, you're pulling all the entries
> in a directory in two parts. The first part (files) is fairly
> straightforward. The second part (directories) consists of pulling any
> file whose parent is a subdirectory of the directory you're looking for
> (this is *all* children of the directory, since you have to retrieve
> every element that begins with the directory, then discard those that
> have an additional / in their name), counting how many of these there
> are for each subdirectory, and discarding those results except for a
> binary (yes there are children or no there aren't). This is a lot of
> useless work to go through, and is going to be slow if you've got a lot
> of stuff in a subdirectory. An alternative approach would be, for each
> directory, to store all its children (files and subdirectories) along
> with a flag indicating which it is. This would allow you to create the
> collapsed tree view without walking all the children of a subdirectory.
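>
> A sketch of what that might look like (table and column names here are
> made up, not Bacula's actual schema):
>
> CREATE TABLE entry (
>     parentid integer NOT NULL,   -- pathid of the containing directory
>     name     text    NOT NULL,   -- file or subdirectory name
>     is_dir   boolean NOT NULL,   -- the type flag
>     PRIMARY KEY (parentid, name)
> );
>
> SELECT name, is_dir FROM entry WHERE parentid = 42;
>
> The collapsed view of one directory then comes from a single index scan
> on the parent, without touching the subdirectories' own contents.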
>
> Assuming you can't make changes to the schema, what about the query?
> You've got this:
>
> explain analyze SELECT X.name AS name, COUNT(CH) > 1 AS children
> FROM
> ( SELECT RTRIM( REPLACE( NLPC.path, '%@/', ''),'/') AS name,
> FN.name AS CH
> FROM
> ( SELECT P.path,P.pathid
> FROM bacula.path P
> WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
> LEFT OUTER JOIN bacula.file F
> ON
> NLPC.pathid = F.pathid
> LEFT OUTER JOIN bacula.filename FN
> ON
> F.filenameid = FN.filenameid
> GROUP BY NLPC.path, FN.name
> UNION
> SELECT FN.name AS name, FN.name AS CH
> FROM
> bacula.path P, bacula.file F, bacula.filename FN
> WHERE
> P.path = '%@/' AND
> P.pathid = F.pathid AND
> F.filenameid = FN.filenameid
> ) AS X
> WHERE X.name <> ''
> GROUP BY X.name
>
> I'm only looking at the first part, which reduces to
> SELECT X.name AS name, COUNT(CH) > 1 AS children
> FROM ( SELECT NLPC.path AS name, FN.name AS CH
>        FROM ( SELECT P.path, P.pathid FROM bacula.path P
>               WHERE P.path ~ '^%@/[^/]*/$' ) AS NLPC
>        LEFT OUTER JOIN bacula.file F ON NLPC.pathid = F.pathid
>        LEFT OUTER JOIN bacula.filename FN ON F.filenameid = FN.filenameid
>        GROUP BY NLPC.path, FN.name ) AS X
> GROUP BY X.name
>
> Why is the filename table even accessed? Would the results be the
> same if you did
> SELECT NLPC.path AS name, F.fileid AS CH
> and drop the LEFT OUTER JOIN bacula.filename altogether?
>
> And then what happens if you try something like
> SELECT X.name,X.children
> FROM
> (SELECT [rtrim]P.path,
>         (SELECT count(*)
>          FROM (SELECT 1 FROM bacula.file F
>                WHERE F.pathid = P.pathid
>                LIMIT 2) AS lim) > 1
> FROM bacula.path P
> WHERE P.path ~ '^%@/[^/]*/$'
> UNION
> SELECT FN.name,0
> FROM bacula.path P, bacula.file F, bacula.filename FN
> WHERE
> P.path = '%@/' AND
> P.pathid = F.pathid AND
> F.filenameid = FN.filenameid
> ) AS X
> WHERE X.name <> ''
> GROUP BY X.name
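>
> Since all you really want is a boolean per directory, an EXISTS test
> should also work and lets the planner stop at the first matching row
> (a sketch only; if the schema keeps a placeholder row per directory it
> would need to be excluded, or count to 2 as above):
>
> SELECT RTRIM(REPLACE(P.path, '%@/', ''), '/') AS name,
>        EXISTS (SELECT 1 FROM bacula.file F
>                WHERE F.pathid = P.pathid) AS children
> FROM bacula.path P
> WHERE P.path ~ '^%@/[^/]*/$'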
>
> It's hard to say without knowing what's actually *in* the tables, but
> the existing query definitely doesn't scale well for what I think it's
> trying to do.
>
> Mike Stone
>
