From: | "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | "'Robert Haas'" <robertmhaas(at)gmail(dot)com>, "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "'PostgreSQL-development'" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: WIP Patch: Use sortedness of CSV foreign tables for query planning |
Date: | 2012-08-07 06:02:22 |
Message-ID: | 002501cd7462$35f41ca0$a1dc55e0$@lab.ntt.co.jp |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
> From: Robert Haas [mailto:robertmhaas(at)gmail(dot)com]
> On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
> >> <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >>> I think file_fdw is useful for managing log files such as PG CSV logs.
Since
> >>> often, such files are sorted by timestamp, I think the patch can improve
> the
> >>> performance of log analysis, though I have to admit my demonstration was
> not
> >>> realistic.
> >
> >> Hmm, I guess I could buy that as a plausible use case.
> >
> > In the particular case of PG log files, I'd bet good money against them
> > being *exactly* sorted by timestamp. Clock skew between backends, or
> > varying amounts of time to construct and send messages, will result in
> > small inconsistencies. This would generally not matter, until the
> > planner relied on the claim of sortedness for something like a mergejoin
> > ... and then it would matter a lot.
>
> Hmm, true.
>
> > In general I'm quite suspicious of the idea of believing that externally
> > supplied data is sorted in exactly the way that PG thinks it should
> > sort. If we implement this you can bet that people will screw up, for
> > instance by using the wrong locale/collation to sort text data.
>
> I think that optimizations like this are going to be essential for
> things like pgsql_fdw (or other_rdms_fdw). Despite the thorny
> semantic issues, we're just not going to be able to get around it.
> There will even be people who want SELECT * FROM ft ORDER BY 1 to
> order by the remote side's notion of ordering rather than ours,
> despite the fact that the remote side has some insane-by-PG-standards
> definition of ordering. People are going to find ways to do that kind
> of thing whether we condone it or not, so we might as well start
> thinking now about how we're going to live with it. But that doesn't
> answer the question of whether or not we ought to support it for
> file_fdw in particular, which seems like a more arguable point.
For file_fdw, I feel inclined to simply implement file_fdw (1) to verify the key
column is sorted in the specified way at the execution phase ie, at the (first)
scan of a data file, only when pathkeys are set, and (2) to abort the
transaction if it detects the data file is not sorted.
Thanks,
Best regards,
Etsuro Fujita
From | Date | Subject | |
---|---|---|---|
Next Message | Craig Ringer | 2012-08-07 07:59:42 | Re: [PATCH] Docs: Make notes on sequences and rollback more obvious |
Previous Message | Alexander Korotkov | 2012-08-07 04:25:45 | Re: Statistics and selectivity estimation for ranges |