Re: Scrub one large table against another

From: Brendan Curran <brendan(dot)curran(at)gmail(dot)com>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: g(at)nasby(dot)net, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Scrub one large table against another
Date: 2006-10-12 18:05:04
Message-ID: 452E83D0.5060309@gmail.com
Lists: pgsql-performance


> Well, IN != EXISTS != JOIN. Exists just stops as soon as it finds a
> record. For some cases, it's equivalent to IN, but not all. IN has to
> de-duplicate its list in some fashion. For small IN lists, you can do
> this with an OR, but at some point you need to switch to an actual
> unique (actually, I suspect the difference in PostgreSQL just depends on
> if you passed values into IN or a subquery). A join on the other hand
> doesn't worry about duplicates at all. There may be some brains in the
> planner that realize if a subquery will return a unique set (ie: you're
> querying on a primary key).
>

I agree, and now that I consider it, it makes sense that IN would force the planner to implement some
form of uniqueness check, possibly leveraging a PK or unique index if one is already available. Maybe
I'll add a note to the online documentation so it's a little more explicit that when you use IN on
data the planner doesn't know to be unique (i.e. post-analyze), you get the baggage of a forced
unique step whether you need it or not. Or perhaps someone who knows the planner internals better
than I do should put some info up about that?
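To make the distinction concrete, here is a rough sketch of how the three forms differ on duplicate
handling (the tables "big" and "incoming" and the "id" column are made up for illustration;
incoming.id may contain duplicates):

    -- IN: the planner has to de-duplicate the subquery result in some fashion.
    SELECT b.*
      FROM big b
     WHERE b.id IN (SELECT i.id FROM incoming i);

    -- EXISTS: stops at the first match per outer row, so no de-duplication pass is needed.
    SELECT b.*
      FROM big b
     WHERE EXISTS (SELECT 1 FROM incoming i WHERE i.id = b.id);

    -- JOIN: no duplicate handling at all; a row in big matched by two rows in
    -- incoming comes back twice unless you add DISTINCT yourself.
    SELECT b.*
      FROM big b
      JOIN incoming i ON i.id = b.id;

The first two return each row of big at most once; only the join can multiply rows.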

>
>> Just one more thing... I have found that maintaining a btree index on a
>> varchar(255) value is extremely expensive on insert/update/delete. It is
>> unfortunately necessary for me to maintain this index for queries and
>> reports so I am transitioning to using an unindexed staging table to
>> import data into before merging it with the larger table. All the docs
>> and posts recommend dropping the index, importing your data, and then
>> creating the index again. This is untenable on a daily / bi-weekly basis.
>> Is there a more elegant solution to this indexing problem?
>
> You might be happier with tsearch than a regular index.

Thanks, I'll look into using tsearch2 as a possibility. From what I've seen so far it would add
quite a bit of complexity (necessary updates after inserts, proprietary query syntax that might
require a large amount of specialization from client apps) but in the end the overhead may be less
than that of maintaining the btree.
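For what it's worth, the staging approach I'm moving to is roughly the following (table and column
names are invented for illustration; the real merge condition is on the indexed varchar(255) column):

    -- Unindexed staging table with the same structure as the main table
    -- (created empty; no indexes, so the bulk load stays cheap).
    CREATE TABLE staging AS SELECT * FROM main_table WHERE false;

    -- Bulk load the daily batch into the staging table.
    COPY staging FROM '/path/to/daily_batch.dat';

    -- Merge into the indexed main table in one pass; the anti-join on the
    -- varchar column keeps already-present rows from being re-inserted.
    INSERT INTO main_table
    SELECT s.*
      FROM staging s
     WHERE NOT EXISTS (SELECT 1 FROM main_table m WHERE m.email = s.email);

    -- Clear the staging table for the next batch.
    TRUNCATE staging;

The main table's btree still has to absorb each newly inserted row, but the load itself no longer
pays index maintenance row by row.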

Thanks and Regards,
B
