Re: Seeking Suggestions for Best Practices: Archiving and Migrating Historical Data in PostgreSQL

From: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
To: Andy Hartman <hartman60home(at)gmail(dot)com>
Cc: Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Re: Seeking Suggestions for Best Practices: Archiving and Migrating Historical Data in PostgreSQL
Date: 2025-05-30 19:31:01
Message-ID: CANzqJaBygDCVnip18DJ_NKhwSewoJ=q7x3hRD3ev5Jj1y0wbQA@mail.gmail.com
Lists: pgsql-admin

That's an unanswerable question, as I would not use Windows. 😁

Seriously though, since it's an image-heavy database full of PDF and TIFF
files, I'd do what I did on Linux when I needed to migrate/upgrade a 6TB
(including indices) db from PG 9.6 to PG 14, which took four hours:
pg_dump -Z1 --jobs=16
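
Spelled out (host names, paths, and the database name below are placeholders,
and note that --jobs requires the directory output format), that's roughly:

pg_dump -h oldhost -U postgres -Fd -Z1 --jobs=16 -f D:\dumps\mydb.dir mydb
pg_restore -h newhost -U postgres --jobs=16 -d mydb D:\dumps\mydb.dir

pg_dump and pg_restore behave the same on Windows, so the OS doesn't really
change the recipe; the restore assumes the target database already exists.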

On Fri, May 30, 2025 at 2:39 PM Andy Hartman <hartman60home(at)gmail(dot)com>
wrote:

> What would you use for backup if PG is hosted on Windows?
>
> On Fri, May 30, 2025 at 2:10 PM Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
> wrote:
>
>> Hmm... that was a few years ago, back when v12 was new. It took about a
>> month (mainly because they didn't want me running exports during "office
>> hours").
>>
>> There were 120 INSERT & SELECT (no UPDATE or DELETE) tables, so I was
>> able to add indices on date columns and create by-month views. (We migrated
>> the dozen or so *relatively* small UPDATE tables on cut-over day. On that
>> same day, I migrated the current month's and the previous month's data in
>> those 120 tables.)
>>
>> I made separate cron jobs to:
>> - export views from Oracle into COPY-style tab-separated flat files,
>> - lz4-compress views that had finished exporting, and
>> - scp files that were finished compressing, to an AWS EC2 VM.
>>
>> These jobs were pipelined, so there was always a job exporting, always a job
>> ready to compress tsv files, and another job ready to scp the lz4 files.
>> When there was nothing for a step to do, the job would sleep for a couple
>> of minutes, then check if there was more work to do.
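>>
>> The glue was nothing fancier than a polling loop per step. A minimal sketch
>> of the compress step (the directory layout and marker-file convention here
>> are invented for illustration):
>>
>> #!/usr/bin/env bash
>> # Watch for finished exports, lz4 them, then mark them ready for the scp job.
>> EXPORT_DIR=/data/export            # where the export job drops *.tsv files
>> while true; do
>>     for f in "$EXPORT_DIR"/*.tsv; do
>>         [ -e "$f" ] || break           # no exports present yet
>>         [ -e "$f.done" ] || continue   # this export is still being written
>>         lz4 -q "$f" "$f.lz4" \
>>             && rm -f "$f" "$f.done" \
>>             && touch "$f.lz4.ready"    # tell the scp job it can take this one
>>     done
>>     sleep 120                          # nothing left to do; nap, then re-check
>> done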
>>
>> On the AWS EC2 VM, a different cron job waited for files to finish
>> transferring, then loaded them into the correct table. Just like with the
>> source host jobs, the "load" job would sleep a bit and then check for more
>> work. I applied the indices manually.
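>>
>> The loader was the same polling pattern, just ending in \copy. Something
>> along these lines (names are invented; the file name encodes the target
>> table):
>>
>> #!/usr/bin/env bash
>> INBOX=/data/inbox
>> while true; do
>>     for f in "$INBOX"/*.tsv.lz4; do
>>         [ -e "$f" ] || break
>>         [ -e "$f.ready" ] || continue                # scp not finished yet
>>         tbl=$(basename "$f" .tsv.lz4 | cut -d. -f1)  # orders.2019-07.tsv.lz4 -> orders
>>         lz4 -d -c -q "$f" | psql -d archivedb -c "\copy $tbl FROM STDIN" \
>>             && rm -f "$f" "$f.ready"
>>     done
>>     sleep 120
>> done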
>>
>> The AWS RDS PG12 database was about 4TB. Snapshots were handled by AWS.
>> If this had been one of my on-prem systems, I'd have used pgbackrest.
>> (pgbackrest is impressively fast: takes good advantage of PG's 1GB file
>> max, and globs "small" files into one big file.)
>>
>> On Fri, May 30, 2025 at 12:15 PM Andy Hartman <hartman60home(at)gmail(dot)com>
>> wrote:
>>
>>> What was the duration, start to finish, of the migration of the 6TB of
>>> data? And then, what do you use for a quick backup once the PG data has
>>> been archived?
>>>
>>> Thanks.
>>>
>>> On Fri, May 30, 2025 at 11:29 AM Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
>>> wrote:
>>>
>>>> On Fri, May 30, 2025 at 3:51 AM Motog Plus <mplus7535(at)gmail(dot)com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> We are currently planning a data archival initiative for our
>>>>> production PostgreSQL databases and would appreciate suggestions or
>>>>> insights from the community regarding best practices and proven approaches.
>>>>>
>>>>> **Scenario:**
>>>>> - We have a few large tables (several hundred million rows) where we
>>>>> want to archive historical data (e.g., older than 1 year).
>>>>> - The archived data should be moved to a separate PostgreSQL database
>>>>> (on a same or different server).
>>>>> - Our goals are: efficient data movement, minimal downtime, and safe
>>>>> deletion from the source after successful archival.
>>>>>
>>>>> - PostgreSQL version: 15.12
>>>>> - Both source and target databases are PostgreSQL.
>>>>>
>>>>> We explored using `COPY TO` and `COPY FROM` with CSV files, uploaded
>>>>> to a SharePoint or similar storage system. However, our infrastructure team
>>>>> raised concerns around the computational load of large CSV processing and
>>>>> potential security implications with file transfers.
>>>>>
>>>>> We’d like to understand:
>>>>> - What approaches have worked well for you in practice?
>>>>>
>>>>
>>>> This is how I migrated 6TB of data from an Oracle database to
>>>> Postgresql, and then implemented quarterly archiving of the PG database:
>>>> - COPY (SELECT * FROM live_table WHERE date_fld in
>>>> some_manageable_date_range) TO STDOUT.
>>>> - Compress
>>>> - scp
>>>> - COPY archive_table FROM STDIN.
>>>> - Index
>>>> - DELETE FROM live_table WHERE date_fld in some_manageable_date_range
>>>> (this last step I only did in the PG archive process).
>>>>
>>>> (Naturally, the Oracle migration used Oracle-specific commands.)
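>>>>
>>>> For one date-range chunk, that amounted to something like the following
>>>> (host names, file names, and the range are placeholders; indexing of the
>>>> archive table was handled separately, after the loads):
>>>>
>>>> #!/usr/bin/env bash
>>>> set -euo pipefail
>>>> RANGE="date_fld >= DATE '2021-01-01' AND date_fld < DATE '2021-04-01'"
>>>>
>>>> # export, compress, ship
>>>> psql -d livedb -c "COPY (SELECT * FROM live_table WHERE $RANGE) TO STDOUT" \
>>>>     | lz4 -q > live_table.2021q1.tsv.lz4
>>>> scp live_table.2021q1.tsv.lz4 archive-host:/incoming/
>>>>
>>>> # load on the archive side
>>>> ssh archive-host "lz4 -d -c -q /incoming/live_table.2021q1.tsv.lz4 | psql -d archivedb -c '\copy archive_table FROM STDIN'"
>>>>
>>>> # only after the load has been verified: delete that chunk from the live table
>>>> psql -d livedb -c "DELETE FROM live_table WHERE $RANGE"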
>>>>
>>>> - Are there specific tools or strategies you’d recommend for ongoing
>>>>> archival?
>>>>>
>>>>
>>>> I write generic bash loops to which you pass an array that contains the
>>>> table name, PK, date column and date range.
>>>>
>>>> Given a list of tables, the first script did the COPY TO STDOUT, lz4, and
>>>> scp. Once that finished successfully, another script dropped the indices on
>>>> the current archive table, ran the COPY FROM STDIN, and re-created the
>>>> indices. A third script did the deletes.
>>>>
>>>> This works even when the live database tables are all connected via
>>>> FK. You just need to carefully order the tables in your script.
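>>>>
>>>> The driver loop was on the order of this (the per-table fields and the
>>>> archive_one_chunk helper are stand-ins for the export/compress/scp steps
>>>> above):
>>>>
>>>> #!/usr/bin/env bash
>>>> set -euo pipefail
>>>> # table|date_column|range_start|range_end -- parents listed before children
>>>> # so the loads respect the FKs; the delete script walks the list in reverse.
>>>> TABLES=(
>>>>   "orders|order_date|2021-01-01|2021-04-01"
>>>>   "order_items|created_at|2021-01-01|2021-04-01"
>>>> )
>>>> for spec in "${TABLES[@]}"; do
>>>>   IFS='|' read -r tbl col lo hi <<<"$spec"
>>>>   # archive_one_chunk is a hypothetical wrapper around the pipeline above
>>>>   archive_one_chunk "$tbl" "$col" "$lo" "$hi" \
>>>>       || { echo "archiving $tbl failed" >&2; exit 1; }
>>>> done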
>>>>
>>>>
>>>>> - Any performance or consistency issues we should watch out for?
>>>>>
>>>>
>>>> My rules for scripting are "bite-sized pieces" and "check those return
>>>> codes!".
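>>>>
>>>> In bash terms that's mostly just habit, nothing clever:
>>>>
>>>> set -euo pipefail     # stop on the first failed step or unset variable
>>>> lz4 -q chunk.tsv chunk.tsv.lz4 || { echo "compress failed" >&2; exit 1; }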
>>>>
>>>>
>>>>> Your insights or any relevant documentation/pointers would be
>>>>> immensely helpful.
>>>>>
>>>>
>>>> Index support above all else. When deleting from a table that has no date
>>>> column of its own and relies on a foreign key link to a table which _does_
>>>> have a date field, don't hesitate to join on that table.
>>>>
>>>> And DELETE of bite-sized chunks is faster than people give it credit
>>>> for.
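>>>>
>>>> For instance (names invented; the child table has no date column of its
>>>> own, so it gets trimmed by joining through its dated parent, one chunk at
>>>> a time until a pass deletes nothing):
>>>>
>>>> #!/usr/bin/env bash
>>>> while :; do
>>>>   n=$(psql -d livedb -qtAX -c "
>>>>         WITH doomed AS (
>>>>             SELECT i.ctid
>>>>             FROM order_items i
>>>>             JOIN orders o ON o.order_id = i.order_id
>>>>             WHERE o.order_date < DATE '2021-01-01'
>>>>             LIMIT 50000
>>>>         ), del AS (
>>>>             DELETE FROM order_items t
>>>>             USING doomed d
>>>>             WHERE t.ctid = d.ctid
>>>>             RETURNING 1
>>>>         )
>>>>         SELECT count(*) FROM del;")
>>>>   [ "$n" -gt 0 ] || break
>>>> done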
>>>>
>>>
>>
>

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!
