| From: | "Dave Held" <dave(dot)held(at)arrayservicesgrp(dot)com> | 
|---|---|
| To: | <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Modifying COPY TO | 
| Date: | 2005-02-25 22:19:09 | 
| Message-ID: | 49E94D0CFCD4DB43AFBA928DDD20C8F902618457@asg002.asg.local | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
I am interested in hacking COPY TO such that one can specify that
rows are copied in a certain index order.  I got as far as 
src/backend/commands/copy.c:CopyTo(), and it looks like I would need
to modify the call to heap_beginscan() so that it uses a key.  However,
I couldn't figure out how to provide one, or if I'm even looking at the
right area.  Ideally, this behavior would be specified with a flag,
perhaps: "WITH INDEX <index_name>" or "WITH PRIMARY KEY"
or something similar.
The motivation for this change is as follows.  I have a fairly large
database (10 million+ records) that mirrors the data in a proprietary
system.  The only access to that data is through exported flat files.
Currently, those flat files are copied directly into a staging area in the 
db via a COPY FROM, the actual tables are truncated, and the
staging data is inserted into the live tables.  Since the data is read-only,
it doesn't matter that it is recreated every day.  However, as you
can imagine, the import process takes quite a while (several hours).
Also, rebuilding the db from scratch every day loses any statistical
information gathered from the execution of queries during the day.
A possibility that I would like to pursue is to keep the staging data
from the previous day, do a COPY TO, import the new data into
another staging table with a COPY FROM, then export the fresh
data with another COPY TO.  Then, I can write a fast C/C++
program to do a line-by-line comparison of each record, isolating
the ones that have changed from the previous day.  I can then
emit those records in a change file that should be relatively small
and easy to update.  Of course, this scheme can only work if
COPY TO emits the records in a reliable order.
Any assistance on this project would be greatly appreciated.  The
best I can see, I'm stuck on line 1053 from copy.c:
scandesc = heap_beginscan(rel, mySnapshot, 0, NULL);
I suspect that I want it to look like this:
scandesc = heap_beginscan(rel, mySnapshot, 1, key);
where 'key' is an appropriately constructed ScanKey.  It looks
like I want to call ScanKeyEntryInitialize(), but I'm not sure what
parameters I need to pass to it to get an index or the primary
key.  I mostly need help building the ScanKey object.  I think I 
can figure out how to hack the custom option, etc.  I should 
mention that I am using the 7.4.7 codebase on Linux 2.4.
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2005-02-25 22:56:50 | Re: Modifying COPY TO | 
| Previous Message | Bruce Momjian | 2005-02-25 22:09:20 | Re: Development schedule |