Modifying COPY TO

From: "Dave Held" <dave(dot)held(at)arrayservicesgrp(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Modifying COPY TO
Date: 2005-02-25 22:19:09
Message-ID: 49E94D0CFCD4DB43AFBA928DDD20C8F902618457@asg002.asg.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I am interested in hacking COPY TO such that one can specify that
rows are copied in a certain index order. I got as far as
src/backend/commands/copy.c:CopyTo(), and it looks like I would need
to modify the call to heap_beginscan() so that it uses a key. However,
I couldn't figure out how to provide one, or if I'm even looking at the
right area. Ideally, this behavior would be specified with a flag,
perhaps: "WITH INDEX <index_name>" or "WITH PRIMARY KEY"
or something similar.

The motivation for this change is as follows. I have a fairly large
database (10 million+ records) that mirrors the data in a proprietary
system. The only access to that data is through exported flat files.
Currently, those flat files are copied directly into a staging area in the
db via a COPY FROM, the actual tables are truncated, and the
staging data is inserted into the live tables. Since the data is read-only,
it doesn't matter that it is recreated every day. However, as you
can imagine, the import process takes quite a while (several hours).
Also, rebuilding the db from scratch every day loses any statistical
information gathered from the execution of queries during the day.

A possibility that I would like to pursue is to keep the staging data
from the previous day, do a COPY TO, import the new data into
another staging table with a COPY FROM, then export the fresh
data with another COPY TO. Then, I can write a fast C/C++
program to do a line-by-line comparison of each record, isolating
the ones that have changed from the previous day. I can then
emit those records in a change file that should be relatively small
and easy to update. Of course, this scheme can only work if
COPY TO emits the records in a reliable order.

Any assistance on this project would be greatly appreciated. The
best I can see, I'm stuck on line 1053 from copy.c:

scandesc = heap_beginscan(rel, mySnapshot, 0, NULL);

I suspect that I want it to look like this:

scandesc = heap_beginscan(rel, mySnapshot, 1, key);

where 'key' is an appropriately constructed ScanKey. It looks
like I want to call ScanKeyEntryInitialize(), but I'm not sure what
parameters I need to pass to it to get an index or the primary
key. I mostly need help building the ScanKey object. I think I
can figure out how to hack the custom option, etc. I should
mention that I am using the 7.4.7 codebase on Linux 2.4.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2005-02-25 22:56:50 Re: Modifying COPY TO
Previous Message Bruce Momjian 2005-02-25 22:09:20 Re: Development schedule