Re: Guidance Needed: Scaling PostgreSQL for 12 TB Data Growth - New Feature Implementation

From: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
To: Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Re: Guidance Needed: Scaling PostgreSQL for 12 TB Data Growth - New Feature Implementation
Date: 2025-06-26 14:54:14
Message-ID: CANzqJaBAK8OG0SwFdtWVhJP9v1Kw6fmH=ZH=z2BfVLYGWmqy7Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-performance

Our application does "manual" horizontal partitioning, with a "reference"
database holding info on which customers are in which child database (where
each of the three children is on a separate VM).

Note: largest DB is 4.8TB, not 6TB. Total size is 11.3TB.

I can't/won't answer your specific questions, but will give some guidance.

Seven years of OLTP in one database is pretty severe. Part of the design
should be an archive process to (quarterly, monthly, semiannually,
whatever) move data from _the_ OLTP database to the "archive" database
(which _can_ be in the same instance, but is preferably on a different
instance). Either your application can do the joining of current and
archived data, or you can use postgres_fdw to combine the current and
archived data.

(Having said that, we have 10 years of image/bytea data in our databases.
Tables are partitioned via inheritance.)

"Small" OLTP databases can have table partitions split by account number or
"true PK" (which means you don't have to add a date field to the synthetic
PK), while giant Reporting and DW tables can be partitioned by date.

Two other reason to regularly archive:
1) small instances backup *AND RESTORE* faster than giant instances.
2) pg_basebackup creates replication faster on a small instance.

On Thu, Jun 26, 2025 at 9:43 AM Motog Plus <mplus7535(at)gmail(dot)com> wrote:

> Thanks Ron, for the feedback and for sharing your experience with
> PostgreSQL handling such large databases – that's very encouraging to hear.
> We are using postgres version 15.12.
>
> You're absolutely right about "typical transaction loads" not being a
> useful term without more context. My apologies for the vagueness. We
> actually have two distinct workloads on separate servers:
>
> OLTP: This is our primary transactional workload and has replication
> setup, pgpool - II
> Reporting/DW: This is for reporting purposes.
>
>
> The growth figures I initially shared (8-9 TB) were a more conservative
> estimate for OLTP.
>
> However, after a more focused rough estimate for our OLTP workload alone,
> we anticipate it could reach 35-40 TB of data over the next 5-7 years.
>
>
> Specifically for our OLTP databases (which I listed in my initial email):
>
> Database C could reach 30-32 TB, with the acc schema within it potentially
> growing to 13-15 TB.
> Database M might reach 5-7 TB.
> Database P could reach 1-2 TB.
>
>
> Given these revised, more detailed projections for the OLTP side, we would
> be extremely grateful for your and the community's guidance on all the
> questions we originally posed, specifically considering these new volume
> expectations for our OLTP workload:
>
>
> 1. Will PostgreSQL be able to handle this much load (35-40 TB, with one DB
> potentially at 30-32 TB and a schema at 13-15 TB) for an OLTP environment?
>
> 2. Should we still consider splitting our database "C" into two DBs (C1
> for "acc" schema and C2 for the rest), given the projected 13-15 TB for acc
> alone?
>
> 3. Should we assign a new DB server to C2, or keep it on the same server,
> particularly now with these larger OLTP volumes?
>
> 4. Will a single DB server be able to handle 30+ TB of OLTP data, or is
> there a particular limit per DB server from a performance point of view for
> OLTP?
>
> 5. What are the best practices, apart from indexing and partitioning, to
> keep in mind for such large-scale OLTP data management?
>
> 6. What hardware configuration (RAM, CPU, storage I/O, storage type like
> NVMe) would you recommend for future OLTP database servers to efficiently
> handle these new projected sizes?
>
> 7. Is a horizontal scaling solution (open source, apart from Citus)
> possible in PostgreSQL for these OLTP volumes, and do you have any pointers
> on that?
>
>
> Thanks again for your time and invaluable guidance.
>
> We truly appreciate the community's expertise.
>
> Regards,
> Ramzy
>
> On Thu, Jun 26, 2025, 18:19 Ron Johnson <ronljohnsonjr(at)gmail(dot)com> wrote:
>
>> PG easily handles our 6TB database, as well as 3 and 5TB databases (all
>> on different VMs), and has done so since at least v8.4.
>>
>> Ours are on single LVM mount points, as are the disks that hold the
>> PgBackRest savesets.
>>
>> "considering typical transaction loads."
>>
>> Pfft...there are no typical transaction loads. Is this db OLTP,
>> Reporting or DW?
>>
>> On Wed, Jun 25, 2025 at 4:35 AM Motog Plus <mplus7535(at)gmail(dot)com> wrote:
>>
>>> Dear PostgreSQL Community,
>>>
>>> We are implementing a new feature in our application that is expected to
>>> generate a significant amount of data, and we are seeking your expert
>>> guidance on how to best handle this growth within our existing PostgreSQL
>>> setup.
>>>
>>>
>>>
>>> Currently, our PostgreSQL instance runs on an EC2 c5.4xlarge Ubuntu
>>> instance with the following specifications:
>>>
>>> - *RAM:* 32 GB
>>> - *Disk:* 1.2 TB
>>> - *vCPUs:* 16
>>>
>>>
>>>
>>> Our database architecture utilizes a primary-standby streaming
>>> replication setup. Application modules (running in Kubernetes pods) connect
>>> to the database through Pgpool-II, using HikariCP for connection pooling.
>>>
>>>
>>>
>>> We have multiple databases on our primary server, with their approximate
>>> current sizes as follows:
>>>
>>> - *C:* 620 GB
>>> - *M:* 225 GB
>>> - *P:* 59 GB
>>> - *K:* 13 MB
>>>
>>>
>>>
>>> The total current size of our databases is around *1 TB*. With the new
>>> feature, we anticipate a substantial increase in data, potentially reaching *10
>>> TB* over the next 5-7 years.
>>>
>>>
>>>
>>> Below is the table for current size and expected growth in size:
>>>
>>>
>>>
>>> *S.No.*
>>>
>>> *DB*
>>>
>>> *Current DB size*
>>>
>>> *Future DB size*
>>>
>>> *Schema Name*
>>>
>>> *Current Schema size*
>>>
>>> *Future Schema size *
>>>
>>> 1
>>>
>>> C
>>>
>>> 1 TB
>>>
>>> 8 TB - 10 TB
>>>
>>> acc
>>>
>>> 297 GB
>>>
>>> 3 TB - 4 TB
>>>
>>> po
>>>
>>> 270 GB
>>>
>>> 2.6 TB - 3.5 TB
>>>
>>> pa
>>>
>>> 27 GB
>>>
>>> 270 GB
>>>
>>> pra
>>>
>>> 13 GB
>>>
>>> 130 GB
>>>
>>> fu
>>>
>>> 13 GB
>>>
>>> 130 GB
>>>
>>> te
>>>
>>> 167 MB
>>>
>>> 2 GB
>>>
>>> pro
>>>
>>> 30 MB
>>>
>>> 300 MB
>>>
>>> 2
>>>
>>> M
>>>
>>> 225 GB
>>>
>>> 2.2 TB - 3 TB
>>>
>>> bi
>>>
>>> 82 GB
>>>
>>> 820 GB
>>>
>>> co
>>>
>>> 80 GB
>>>
>>> 800 GB
>>>
>>> ps
>>>
>>> 17 GB
>>>
>>> 170 GB
>>>
>>> qo
>>>
>>> 16 GB
>>>
>>> 160 GB
>>>
>>> to
>>>
>>> 7 GB
>>>
>>> 70 GB
>>>
>>> in
>>>
>>> 7 GB
>>>
>>> 70 GB
>>>
>>> di
>>>
>>> 6 GB
>>>
>>> 60 GB
>>>
>>> no
>>>
>>> 4 GB
>>>
>>> 40 GB
>>>
>>> do
>>>
>>> 4 GB
>>>
>>> 40 GB
>>>
>>> cl
>>>
>>> 3 GB
>>>
>>> 30 GB
>>>
>>> 3
>>>
>>> P
>>>
>>> 60 GB
>>>
>>> 600 GB
>>>
>>> au
>>>
>>> 45 GB
>>>
>>> 450 GB
>>>
>>> fi
>>>
>>> 8 GB
>>>
>>> 80 GB
>>>
>>> con
>>>
>>> 4 GB
>>>
>>> 40 GB
>>>
>>> ba
>>>
>>> 1 GB
>>>
>>> 10 GB
>>>
>>> li
>>>
>>> 2 MB
>>>
>>> 20 GB
>>>
>>>
>>>
>>>
>>>
>>> We would greatly appreciate your insights on the following points:
>>>
>>> 1. *Scalability for Large Datasets:* Conceptually, PostgreSQL is
>>> known to handle large datasets. However, we'd like to confirm if a single
>>> PostgreSQL instance can realistically and efficiently manage 10-12 TB of
>>> data in a production environment, considering typical transaction loads.
>>> 2. *Database Split Strategy:* Our largest database, "C," currently
>>> occupies 620 GB. It contains multiple schemas. We are considering splitting
>>> database "C" into two new databases: "C1" to exclusively house the "acc"
>>> schema, and "C2" for the remaining schemas. Is this a recommended approach
>>> for managing growth, and what are the potential pros and cons?
>>> 3. *Server Allocation for Split Databases:* If we proceed with
>>> splitting "C" into "C1" and "C2," would it be advisable to assign a new,
>>> separate database server for "C2," or could both "C1" and "C2" reside on
>>> the same database server? What factors should we consider in making this
>>> decision?
>>> 4. *Performance Limits per Database and Database Server:* From a
>>> performance perspective, is there a general "limit" or best practice for
>>> the maximum amount of data a single database server should handle (e.g., 10
>>> TB) and similarly general limit per database? How does this influence the
>>> decision to add more database servers?
>>> 5. *Best Practices for Large-Scale Data Management:* Beyond standard
>>> practices like indexing and partitioning, what other best practices should
>>> we consider implementing to ensure optimal performance and manageability
>>> with such a large dataset? This could include configurations, maintenance
>>> strategies, etc.
>>> 6. *Hardware Configuration Recommendations:* Based on our projected
>>> data growth and desired performance, what hardware configurations (e.g.,
>>> RAM, CPU, storage I/O, storage type like NVMe) would you recommend for
>>> future database servers to efficiently handle 10-12 TB?
>>> 7. *Open-Source Horizontal Scaling Solutions:* Are there any
>>> open-source horizontal scaling solutions for PostgreSQL (other than Citus
>>> Data) that the community recommends or has experience with for managing
>>> extremely large datasets? Any pointers or guidance on this would be highly
>>> valuable.
>>>
>>>
>>>
>>> Thank you in advance for your time and expertise. We look forward to
>>> your valuable insights.
>>>
>>> Thanks & Regards,
>>>
>>> Ramzy
>>>
>>
>>
>> --
>> Death to <Redacted>, and butter sauce.
>> Don't boil me, I'm still alive.
>> <Redacted> lobster!
>>
>

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Achilleas Mantzios 2025-06-30 07:36:13 Fast Logical replication setup, via VM clone , PostgreSQL 16.9
Previous Message Paul Smith* 2025-06-26 14:30:22 Re: Guidance Needed: Scaling PostgreSQL for 12 TB Data Growth - New Feature Implementation

Browse pgsql-performance by date

  From Date Subject
Next Message Greg Sabino Mullane 2025-07-04 13:45:55 Re: Guidance Needed: Scaling PostgreSQL for 12 TB Data Growth - New Feature Implementation
Previous Message Paul Smith* 2025-06-26 14:30:22 Re: Guidance Needed: Scaling PostgreSQL for 12 TB Data Growth - New Feature Implementation