Re: Moving data from one set of tables to another?

From: "Sean Davis" <sdavis2(at)mail(dot)nih(dot)gov>
To: "Howard Eglowstein" <howard(at)yankeescientific(dot)com>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Moving data from one set of tables to another?
Date: 2008-09-19 16:53:37
Message-ID: 264855a00809190953r1480bebch641b5c96eb7cff6b@mail.gmail.com
Lists: pgsql-novice

On Fri, Sep 19, 2008 at 12:04 PM, Howard Eglowstein
<howard(at)yankeescientific(dot)com> wrote:
> Yes, I have been deleting them as I go. I thought about running one pass
> to move the data over and a second one to then delete the records. The
> data in two of the tables is only loosely linked to the data in the first
> by the value in one column, and I was concerned about how to know where
> to pick up if the process stopped and had to be restarted later. Deleting
> the three rows after the database reported successfully writing the three
> new ones seemed like a good idea at the time. I don't want to stop the
> process now, but I'll look at having the program keep track of its
> progress and then go back and delete the old data when it's done.
>
> And yes, I do have a complete backup of the data from before I started
> any of this. I can easily go back to where I was and try again, or tweak
> the process as needed. The database tuning is a problem I think we had
> from before this procedure, and I'll have to look at it again after this
> data is moved around.

You might want to do all of this--inserts and deletes--within a
transaction. Then, if ANY step fails, the entire process can be
rolled back.
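
Roughly, as a sketch only (I'm assuming the data_/new_ table names from
earlier in the thread, a cutoff that matches the two years you want to
move out, and that the timestamp column lives in data_a):

BEGIN;

INSERT INTO new_a SELECT * FROM data_a
  WHERE timestamp < '2008-01-01 00:00:00';
INSERT INTO new_b SELECT data_b.* FROM data_b, data_a
  WHERE data_b.id = data_a.id
    AND data_a.timestamp < '2008-01-01 00:00:00';
INSERT INTO new_c SELECT data_c.* FROM data_c, data_a
  WHERE data_c.id = data_a.id
    AND data_a.timestamp < '2008-01-01 00:00:00';

DELETE FROM data_b WHERE id IN
  (SELECT id FROM data_a WHERE timestamp < '2008-01-01 00:00:00');
DELETE FROM data_c WHERE id IN
  (SELECT id FROM data_a WHERE timestamp < '2008-01-01 00:00:00');
DELETE FROM data_a WHERE timestamp < '2008-01-01 00:00:00';

COMMIT;

Until the COMMIT nothing is visible to other sessions, and an error (or
an explicit ROLLBACK) leaves every table exactly as it was. Note the
order of the deletes: data_b and data_c go first, while the data_a rows
that the subqueries depend on still exist.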

> Carol Walter wrote:
>>
>> Database tuning can really be an issue. I have a development copy and a
>> production copy of most of my databases, on two different machines. The
>> databases used to be tuned the same way, but one of the machines has
>> more processing power and one has more memory. When we retuned the
>> databases to take advantage of each machine's strong points, it made
>> some queries run about four times faster.
>>
>> Carol
>>
>> P.S. If I understand your process, and you're deleting the records as
>> you go, that would make me really nervous. As soon as you start, you no
>> longer have an intact table that has all the data in it. While modern
>> database engines do a lot to protect your data, there is always some
>> quirk that can happen. If you have enough space, you might consider
>> running the deletes only after the new tables are fully populated, as
>> sketched below.
>>
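>> One cheap sanity check before any delete, assuming the new_/data_
>> names used elsewhere in this thread, is to compare row counts:
>>
>> SELECT count(*) FROM new_a;
>> SELECT count(*) FROM data_a WHERE timestamp < '2008-01-01 00:00:00';
>>
>> If the two numbers match, the copy is at least complete.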
>>
>> On Sep 19, 2008, at 11:07 AM, Howard Eglowstein wrote:
>>
>>> There are a lot of issues at work here: the speed of the machine, the
>>> rest of the machine's workload, the database configuration, etc. This
>>> machine is about 3 years old and not as fast as a test machine I have at
>>> my desk. It's also running three web services and accepting new data into
>>> the current year's tables at the rate of one set of rows every few
>>> seconds. The database had no indices at all when I started. I indexed a
>>> few columns, which seemed to help tremendously (a factor of 10 at least),
>>> and perhaps a few more might help.
>>>
>>> Considering that a search across the whole database, with each record
>>> split into 3 rows across the tables, now takes a minute or more, I
>>> suspect there are still organizational issues that could be addressed to
>>> speed up all PG operations. But I'm far more concerned with robustness,
>>> and I'm not too keen on trying many experiments until I get the data
>>> broken up and backed up again.
>>>
>>> I doubt this machine could perform 7 SQL operations on 1.5 million rows
>>> in each of 3 tables in a few seconds or minutes on a good day, with the
>>> wind, rolling downhill. I'd like to be proven wrong, though...
>>>
>>> Howard
>>>
>>> Sean Davis wrote:
>>>>
>>>> On Fri, Sep 19, 2008 at 10:48 AM, Howard Eglowstein
>>>> <howard(at)yankeescientific(dot)com> wrote:
>>>>
>>>>> Absolutely true, and if the data weren't stored on the same machine
>>>>> which is running the client, I would have worked harder to combine
>>>>> statements. In this case, though, the server and the data are on the
>>>>> same machine, and the client application doing the SELECT, INSERT and
>>>>> DELETEs is also on the same machine.
>>>>>
>>>>> I'd like to see how to have done this with combined statements if I
>>>>> ever have to do it again in a different setup, but it is working well
>>>>> now. It's moved about 1/2 million records so far since last night.
>>>>>
>>>>
>>>> So the 150ms was per row? Not to belabor the point, but I have done
>>>> this with tables of tens of millions of rows in the space of seconds
>>>> to minutes (for the entire move, not per row), depending on the exact
>>>> details of the table(s). No overnight involved. The network is one
>>>> issue (which you have avoided by being local), but the overhead of
>>>> encoding and decoding everything for a client round trip is another,
>>>> and a server-side INSERT ... SELECT avoids it entirely. When you have
>>>> some free time, do benchmark, as below; I think the difference could
>>>> be substantial.
>>>>
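>>>> If you benchmark from psql, something like this gives per-statement
>>>> numbers (a sketch; new_a/data_a and the cutoff are the assumed names
>>>> from earlier in the thread):
>>>>
>>>> \timing on
>>>> EXPLAIN ANALYZE INSERT INTO new_a
>>>>   SELECT * FROM data_a WHERE timestamp < '2008-01-01 00:00:00';
>>>>
>>>> Keep in mind that EXPLAIN ANALYZE actually executes the INSERT, so
>>>> wrap it in a transaction and roll back if you only want the timing.
>>>>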
>>>> Sean
>>>>
>>>>
>>>>> Sean Davis wrote:
>>>>>
>>>>>> On Fri, Sep 19, 2008 at 7:42 AM, Howard Eglowstein
>>>>>> <howard(at)yankeescientific(dot)com> wrote:
>>>>>>
>>>>>>
>>>>>>> So you'd agree then that I'll need 7 SQL statements but that I
>>>>>>> could stack the INSERT and the first SELECT if I wanted to? Cool.
>>>>>>> That's what I ended up with in C code, and it's working pretty
>>>>>>> well. I did some indexing on the database and got the whole
>>>>>>> transaction down to about 150ms for the sequence. I guess that's
>>>>>>> as good as it's going to get.
>>>>>>>
>>>>>>>
>>>>>> Keep in mind that the INSERT [...] SELECT [...] is done server-side,
>>>>>> so the data never goes over the wire to the client. This is very
>>>>>> different from doing the select, accumulating the data in the client,
>>>>>> and then doing the insert, and it is likely to be much faster. 150ms
>>>>>> is already pretty fast, but doing as much on the server as possible
>>>>>> is an important principle when looking for efficiency, especially
>>>>>> when data sizes are large.
>>>>>>
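>>>>>> As a minimal sketch (using the table names from your earlier
>>>>>> message), a whole month's worth of rows can move in one statement
>>>>>> without ever leaving the server:
>>>>>>
>>>>>> INSERT INTO new_a
>>>>>>   SELECT * FROM data_a
>>>>>>   WHERE timestamp >= '2008-09-01 00:00:00'
>>>>>>     AND timestamp <= '2008-09-30 23:59:59';
>>>>>>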
>>>>>> Glad to hear that it is working.
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Sean Davis wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Thu, Sep 18, 2008 at 7:28 PM, Howard Eglowstein
>>>>>>>> <howard(at)yankeescientific(dot)com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> What confuses me is that I need to do the one select with all
>>>>>>>>> three tables and then do three inserts, no? The result is that
>>>>>>>>> the 150 fields I get back from the select have to be split into
>>>>>>>>> 3 groups of 50 fields each and then written into three tables.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> You do the insert part of the command three times, once for each
>>>>>>>> new table, so three separate SQL statements. The select remains
>>>>>>>> basically the same for all three, with only the column selection
>>>>>>>> changing (data_a.* when inserting into new_a, data_b.* when
>>>>>>>> inserting into new_b, etc.). Just leave the ids the same as in the
>>>>>>>> first set of tables; in nearly every case there is no need to
>>>>>>>> change them. If you need to add a new ID column, you can do that
>>>>>>>> as a serial column in the new tables, but I would stick to the
>>>>>>>> original IDs if possible.
>>>>>>>>
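>>>>>>>> For new_b, for example, a sketch (assuming, as in your query,
>>>>>>>> that the timestamp column lives in data_a):
>>>>>>>>
>>>>>>>> INSERT INTO new_b
>>>>>>>>   SELECT data_b.*
>>>>>>>>   FROM data_b, data_a
>>>>>>>>   WHERE data_b.id = data_a.id
>>>>>>>>     AND data_a.timestamp >= '2008-09-01 00:00:00'
>>>>>>>>     AND data_a.timestamp <= '2008-09-30 23:59:59';
>>>>>>>>
>>>>>>>> The new_a and new_c statements have the same shape, with only the
>>>>>>>> starred table changing.
>>>>>>>>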
>>>>>>>> Sean
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> What you're suggesting is that there is some statement which
>>>>>>>>> could do the select and the three inserts at once?
>>>>>>>>>
>>>>>>>>> Howard
>>>>>>>>>
>>>>>>>>> Sean Davis wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> You might want to look at insert into ... select ...
>>>>>>>>>>
>>>>>>>>>> You should be able to do this with 1 query per new table (plus
>>>>>>>>>> the deletes, obviously). For a few thousand records, I would
>>>>>>>>>> expect that the entire process might take a few seconds.
>>>>>>>>>>
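>>>>>>>>>> The deletes can be set-based too. Once the copies are in place,
>>>>>>>>>> something like this works (a sketch, assuming the ids stay the
>>>>>>>>>> same in the new tables):
>>>>>>>>>>
>>>>>>>>>> DELETE FROM data_b WHERE id IN (SELECT id FROM new_b);
>>>>>>>>>> DELETE FROM data_c WHERE id IN (SELECT id FROM new_c);
>>>>>>>>>> DELETE FROM data_a WHERE id IN (SELECT id FROM new_a);
>>>>>>>>>>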
>>>>>>>>>> Sean
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 18, 2008 at 6:39 PM, Howard Eglowstein
>>>>>>>>>> <howard(at)yankeescientific(dot)com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Somewhat empty, yes. The single set of 'data_' tables contains
>>>>>>>>>>> 3 years' worth of data. I want to move 2 years' worth out into
>>>>>>>>>>> the 'new_' tables. When I'm done, there will still be 1 year's
>>>>>>>>>>> worth of data left in the original tables.
>>>>>>>>>>>
>>>>>>>>>>> Howard
>>>>>>>>>>>
>>>>>>>>>>> Carol Walter wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> What do you want for your end product? Are the old tables
>>>>>>>>>>>> empty after you put the data into the new tables?
>>>>>>>>>>>>
>>>>>>>>>>>> Carol
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 18, 2008, at 3:02 PM, Howard Eglowstein wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I have three tables called 'data_a', 'data_b' and 'data_c',
>>>>>>>>>>>>> which each have 50 columns. One of the columns in each is
>>>>>>>>>>>>> 'id' and is used to keep track of which data in data_b and
>>>>>>>>>>>>> data_c corresponds to a row in data_a. If I want to get all
>>>>>>>>>>>>> of the data in all 150 fields for this month (for example),
>>>>>>>>>>>>> I can get it with:
>>>>>>>>>>>>>
>>>>>>>>>>>>> SELECT * FROM data_a, data_b, data_c
>>>>>>>>>>>>> WHERE data_a.id = data_b.id
>>>>>>>>>>>>>   AND data_a.id = data_c.id
>>>>>>>>>>>>>   AND timestamp >= '2008-09-01 00:00:00'
>>>>>>>>>>>>>   AND timestamp <= '2008-09-30 23:59:59';
>>>>>>>>>>>>>
>>>>>>>>>>>>> What I need to do is execute this search, which might return
>>>>>>>>>>>>> several thousand rows, and write the same structure into
>>>>>>>>>>>>> 'new_a', 'new_b' and 'new_c'. What I'm doing now in a C
>>>>>>>>>>>>> program is executing the search above. Then I execute:
>>>>>>>>>>>>>
>>>>>>>>>>>>> INSERT INTO new_a (timestamp, field1, field2 ...[imagine 50
>>>>>>>>>>>>> of them]) VALUES ('2008-09-01 00:00:00', 'ABC', 'DEF', ...);
>>>>>>>>>>>>>
>>>>>>>>>>>>> Get the ID that was assigned to this row, since 'id' is a
>>>>>>>>>>>>> serial field and the number is assigned sequentially. Say it
>>>>>>>>>>>>> comes back as '1'.
>>>>>>>>>>>>>
>>>>>>>>>>>>> INSERT INTO new_b (id, field1, field2 ...[imagine 50 of them])
>>>>>>>>>>>>> VALUES ('1', 'ABC', 'DEF', ...);
>>>>>>>>>>>>> INSERT INTO new_c (id, field1, field2 ...[imagine 50 of them])
>>>>>>>>>>>>> VALUES ('1', 'ABC', 'DEF', ...);
>>>>>>>>>>>>>
>>>>>>>>>>>>> That moves a copy of the three rows of data from the three
>>>>>>>>>>>>> tables into the three separate new tables. In the original
>>>>>>>>>>>>> group of tables, the id for these rows was, let's say,
>>>>>>>>>>>>> '1234'. Then I execute:
>>>>>>>>>>>>>
>>>>>>>>>>>>> DELETE FROM data_a where id='1234';
>>>>>>>>>>>>> DELETE FROM data_b where id='1234';
>>>>>>>>>>>>> DELETE FROM data_c where id='1234';
>>>>>>>>>>>>>
>>>>>>>>>>>>> That deletes the old data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This works fine and gives me exactly what I wanted, but is
>>>>>>>>>>>>> there a better way? This is 7 SQL calls, and it takes about
>>>>>>>>>>>>> 3 seconds per moved record on our Linux box.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any thoughts or suggestions would be appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Howard
