Re: Seeking Suggestions for Best Practices: Archiving and Migrating Historical Data in PostgreSQL

From: Motog Plus <mplus7535(at)gmail(dot)com>
To: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>, Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Cc: Andy Hartman <hartman60home(at)gmail(dot)com>
Subject: Re: Seeking Suggestions for Best Practices: Archiving and Migrating Historical Data in PostgreSQL
Date: 2025-06-02 12:55:32
Message-ID: CAL5Gnivs1MKypYvrOGFLyk73KG8wmg1qAint=pkMdTkCGBEXMQ@mail.gmail.com
Lists: pgsql-admin

Hi Everyone,

Thank you all for the helpful suggestions, insights, and follow-up
questions. I truly appreciate the time and effort you’ve taken to share
your experiences and recommendations.

To answer one of the common questions: **yes, we are using partitioned
tables**, primarily based on a timestamp column. This setup is already
helping us manage and isolate historical data more effectively.

The input from this community has been incredibly valuable in helping us
shape our archival approach. We’re currently evaluating a few options based
on your feedback and will proceed with a solution that best balances
efficiency, reliability, and security.

We may reach out again with more specific questions or for further
suggestions once we finalize the approach and start implementation.

Thanks again for your support!

Best regards,
Ramzy

On Sat, May 31, 2025, 01:01 Ron Johnson <ronljohnsonjr(at)gmail(dot)com> wrote:

> That's an unanswerable question, as I would not use Windows. 😁
>
> Seriously though, since it's an image-heavy database full of PDF and TIFF
> files, I'd do what I did on Linux when needing to migrate/upgrade a 6TB
> (including indices) db from PG 9.6 to PG 14, which took four hours:
> pg_dump -Z1 --jobs=16
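>
> (For the record, the parallel jobs need the directory format, so the full
> invocation looks roughly like this; the paths are purely illustrative:
>
>     pg_dump -Fd -Z1 --jobs=16 -f /backups/imagedb.dir imagedb
>     pg_restore --jobs=16 -d imagedb /backups/imagedb.dir
>
> -Z1 keeps the CPU cost down, since PDFs and TIFFs barely compress anyway.)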
>
>
> On Fri, May 30, 2025 at 2:39 PM Andy Hartman <hartman60home(at)gmail(dot)com>
> wrote:
>
>> What would you use for backup if PG is hosted on Windows?
>>
>> On Fri, May 30, 2025 at 2:10 PM Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
>> wrote:
>>
>>> Hmm... that was a few years ago, back when v12 was new. It took about
>>> a month (mainly because they didn't want me running exports during "office
>>> hours").
>>>
>>> There were 120 INSERT & SELECT (no UPDATE or DELETE) tables, so I was
>>> able to add indices on date columns and create by-month views. (We migrated
>>> the dozen or so *relatively* small UPDATE tables on cut-over day. On
>>> that same day, I migrated the current month's and the previous month's data
>>> in those 120 tables.)
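>>>
>>> (The per-table prep was plain SQL; roughly like this, with the names made
>>> up for illustration:
>>>
>>>     CREATE INDEX ix_orders_created ON orders (created_date);
>>>     CREATE VIEW orders_2019_05 AS
>>>         SELECT * FROM orders
>>>          WHERE created_date >= DATE '2019-05-01'
>>>            AND created_date <  DATE '2019-06-01';
>>>
>>> one view per table per month, so every export was a bounded, index-assisted
>>> range scan.)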
>>>
>>> I made separate cron jobs to:
>>> - export views from Oracle into COPY-style tab-separated flat files,
>>> - lz4-compress views that had finished exporting, and
>>> - scp files that were finished compressing, to an AWS EC2 VM.
>>>
>>> These jobs were pipelined, so there was always a job exporting, always a job
>>> ready to compress tsv files, and another job ready to scp the lz4 files.
>>> When there was nothing for a step to do, the job would sleep for a couple
>>> of minutes, then check if there was more work to do.
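>>>
>>> (Each stage was a dumb sleep-and-check loop; a stripped-down sketch of the
>>> compress step, with the paths and marker files invented for illustration:
>>>
>>>     #!/bin/bash
>>>     cd /stage/export || exit 1
>>>     while true; do
>>>         did_work=0
>>>         for f in *.tsv; do
>>>             # the export job drops a .done marker when a file is complete
>>>             [ -e "${f}.done" ] || continue
>>>             lz4 -q "$f" "${f}.lz4" \
>>>                 && rm -f "$f" "${f}.done" \
>>>                 && touch "${f}.lz4.ready"    # hand-off marker for the scp job
>>>             did_work=1
>>>         done
>>>         [ "$did_work" -eq 0 ] && sleep 120   # nothing yet; check again later
>>>     done
>>>
>>> The export and scp jobs followed the same pattern.)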
>>>
>>> On the AWS EC2 VM, a different cron job waited for files to finish
>>> transferring, then loaded them into the correct table. Just like with the
>>> source host jobs, the "load" job would sleep a bit and then check for more
>>> work. I applied the indices manually.
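>>>
>>> (The load itself was just psql and COPY; schematically, with the names
>>> made up:
>>>
>>>     lz4 -dq /stage/incoming/orders_2019_05.tsv.lz4 /stage/incoming/orders_2019_05.tsv
>>>     psql -d archivedb -v ON_ERROR_STOP=1 \
>>>          -c "\copy orders FROM '/stage/incoming/orders_2019_05.tsv'"
>>> )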
>>>
>>> The AWS RDS PG12 database was about 4TB. Snapshots were handled by
>>> AWS. If this had been one of my on-prem systems, I'd have used
>>> pgbackrest. (pgbackrest is impressively fast: takes good advantage of PG's
>>> 1GB file max, and globs "small" files into one big file.)
>>>
>>> On Fri, May 30, 2025 at 12:15 PM Andy Hartman <hartman60home(at)gmail(dot)com>
>>> wrote:
>>>
>>>> What was the duration, start to finish, of migrating the 6 TB of
>>>> data? And what do you use for a quick backup of the archived PG data?
>>>>
>>>> Thanks.
>>>>
>>>> On Fri, May 30, 2025 at 11:29 AM Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
>>>> wrote:
>>>>
>>>>> On Fri, May 30, 2025 at 3:51 AM Motog Plus <mplus7535(at)gmail(dot)com>
>>>>> wrote:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> We are currently planning a data archival initiative for our
>>>>>> production PostgreSQL databases and would appreciate suggestions or
>>>>>> insights from the community regarding best practices and proven approaches.
>>>>>>
>>>>>> **Scenario:**
>>>>>> - We have a few large tables (several hundred million rows) where we
>>>>>> want to archive historical data (e.g., older than 1 year).
>>>>>> - The archived data should be moved to a separate PostgreSQL database
>>>>>> (on a same or different server).
>>>>>> - Our goals are: efficient data movement, minimal downtime, and safe
>>>>>> deletion from the source after successful archival.
>>>>>>
>>>>>> - PostgreSQL version: 15.12
>>>>>> - Both source and target databases are PostgreSQL.
>>>>>>
>>>>>> We explored using `COPY TO` and `COPY FROM` with CSV files, uploaded
>>>>>> to a SharePoint or similar storage system. However, our infrastructure team
>>>>>> raised concerns around the computational load of large CSV processing and
>>>>>> potential security implications with file transfers.
>>>>>>
>>>>>> We’d like to understand:
>>>>>> - What approaches have worked well for you in practice?
>>>>>>
>>>>>
>>>>> This is how I migrated 6TB of data from an Oracle database to
>>>>> Postgresql, and then implemented quarterly archiving of the PG database:
>>>>> - COPY (SELECT * FROM live_table WHERE date_fld in
>>>>> some_manageable_date_range) TO STDOUT.
>>>>> - Compress
>>>>> - scp
>>>>> - COPY archive_table FROM STDIN.
>>>>> - Index
>>>>> - DELETE FROM live_table WHERE date_fld in some_manageable_date_range.
>>>>> (This last step I only did in the PG archive process; see the sketch below.)
>>>>>
>>>>> (Naturally, the Oracle migration used Oracle-specific commands.)
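>>>>>
>>>>> (Schematically, for one table and one quarter, with every database, path,
>>>>> and file name invented for illustration:
>>>>>
>>>>>     # on the source host
>>>>>     RANGE="date_fld >= '2025-01-01' AND date_fld < '2025-04-01'"
>>>>>     psql -d livedb -c "COPY (SELECT * FROM live_table WHERE ${RANGE}) TO STDOUT" > live_table_2025Q1.tsv
>>>>>     lz4 -q live_table_2025Q1.tsv
>>>>>     scp live_table_2025Q1.tsv.lz4 archivehost:/stage/
>>>>>
>>>>>     # on the archive host
>>>>>     lz4 -dq /stage/live_table_2025Q1.tsv.lz4 /stage/live_table_2025Q1.tsv
>>>>>     psql -d archivedb -c "\copy archive_table FROM '/stage/live_table_2025Q1.tsv'"
>>>>>     psql -d archivedb -f recreate_archive_table_indexes.sql
>>>>>
>>>>>     # back on the source, only after the load has been verified
>>>>>     psql -d livedb -c "DELETE FROM live_table WHERE ${RANGE}"
>>>>>
>>>>> Nothing moves on to the next step unless the previous one returned success.)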
>>>>>
>>>>> - Are there specific tools or strategies you’d recommend for ongoing
>>>>>> archival?
>>>>>>
>>>>>
>>>>> I write generic bash loops to which you pass an array that contains
>>>>> the table name, PK, date column and date range.
>>>>>
>>>>> Given a list of tables, it did the COPY TO STDOUT, lz4 and scp. Once that
>>>>> finished successfully, another script dropped the archive indices on the
>>>>> current table, then ran the COPY FROM STDIN and CREATE INDEX statements. A
>>>>> third script did the deletes.
>>>>>
>>>>> This works even when the live database tables are all connected via
>>>>> FK. You just need to carefully order the tables in your script.
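>>>>>
>>>>> (A skeleton of the idea, with the table list, columns, and date range all
>>>>> invented for illustration:
>>>>>
>>>>>     #!/bin/bash
>>>>>     FROM='2025-01-01'; TO='2025-04-01'
>>>>>     # "table|pk|date_column", ordered parents-first so FKs stay satisfied
>>>>>     tables=( 'customers|cust_id|created_at' 'orders|order_id|ordered_at' )
>>>>>     for spec in "${tables[@]}"; do
>>>>>         IFS='|' read -r tbl pk datecol <<< "$spec"
>>>>>         out="${tbl}_${FROM}_${TO}.tsv"
>>>>>         psql -d livedb -c "COPY (SELECT * FROM ${tbl} WHERE ${datecol} >= '${FROM}' AND ${datecol} < '${TO}') TO STDOUT" > "$out" \
>>>>>             && lz4 -q "$out" \
>>>>>             && scp "${out}.lz4" archivehost:/stage/ \
>>>>>             || { echo "FAILED: ${tbl}" >&2; exit 1; }
>>>>>     done
>>>>>
>>>>> The load script walks the same array parents-first; the delete script,
>>>>> which is where the PK gets used for batching, walks it children-first.)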
>>>>>
>>>>>
>>>>>> - Any performance or consistency issues we should watch out for?
>>>>>>
>>>>>
>>>>> My rules for scripting are "bite-sized pieces" and "check those return
>>>>> codes!".
>>>>>
>>>>>
>>>>>> Your insights or any relevant documentation/pointers would be
>>>>>> immensely helpful.
>>>>>>
>>>>>
>>>>> Index support uber alles. When deleting from a table which relies on
>>>>> a foreign key link to a table which _does_ have a date field, don't
>>>>> hesitate to join on that table.
>>>>>
>>>>> And DELETE of bite-sized chunks is faster than people give it credit
>>>>> for.
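>>>>>
>>>>> (For example, something along these lines, batched on the PK so each
>>>>> transaction stays small; the table and column names are made up:
>>>>>
>>>>>     DELETE FROM line_items
>>>>>      WHERE line_item_id IN (SELECT li.line_item_id
>>>>>                               FROM line_items li
>>>>>                               JOIN orders o ON o.order_id = li.order_id
>>>>>                              WHERE o.order_date < DATE '2024-07-01'
>>>>>                              LIMIT 10000);
>>>>>
>>>>> Repeat until it deletes zero rows. An index on line_items.order_id is what
>>>>> keeps each pass cheap.)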
>>>>>
>>>>> --
>>>>> Death to <Redacted>, and butter sauce.
>>>>> Don't boil me, I'm still alive.
>>>>> <Redacted> lobster!
>>>>>
>>>>
>>>
>>> --
>>> Death to <Redacted>, and butter sauce.
>>> Don't boil me, I'm still alive.
>>> <Redacted> lobster!
>>>
>>
>
> --
> Death to <Redacted>, and butter sauce.
> Don't boil me, I'm still alive.
> <Redacted> lobster!
>
