Re: Logical replication, need to reclaim big disk space

From: Achilleas Mantzios <a(dot)mantzios(at)cloud(dot)gatewaynet(dot)com>
To: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication, need to reclaim big disk space
Date: 2025-05-19 18:49:53
Message-ID: ff22f2bb-cead-4bf3-8e19-7fdd8a31c6c2@cloud.gatewaynet.com
Lists: pgsql-general

On 19/5/25 17:38, Moreno Andreo wrote:

>
>
> On 19/05/25 14:41, Achilleas Mantzios wrote:
>>
>>
>> On 5/19/25 09:14, Moreno Andreo wrote:
>>>
>>> On 16/05/25 21:33, Achilleas Mantzios wrote:
>>>> On 16/5/25 18:45, Moreno Andreo wrote:
>>>>
>>>>> Hi,
>>>>>     we are moving away from our old approach to binary data,
>>>>> migrating it from bytea fields in a table to external storage
>>>>> (making the database smaller and related operations faster and
>>>>> easier).
>>>>> In short, we have a job that runs in the background, copies the
>>>>> data from the table to an external file, and then sets the bytea
>>>>> field to NULL.
>>>>> (UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
>>>>>
>>>>> At the end of these operations, the table is less than one tenth
>>>>> of its original size.
>>>>> We have a multi-tenant architecture (100s of schemas with an
>>>>> identical structure, all inheriting from public) and we are
>>>>> performing the task on one table per schema.
>>>>>
>>>> So? Toasted data are kept in separate TOAST tables; unless those
>>>> bytea columns are selected, you won't even touch them. I cannot
>>>> understand what you are trying to achieve here.
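>>>>
>>>> For instance, you can check how much of the table actually lives in
>>>> its TOAST table (a sketch, reusing the "tbl" name from your example):
>>>>
>>>> SELECT pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
>>>> FROM pg_class WHERE relname = 'tbl' AND reltoastrelid <> 0;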
>>>>
>>>> Years ago, when I made the mistake of going for a coffee and
>>>> letting my developers "improvise", the result was a design similar
>>>> to what you are trying to achieve. Years later, I am seriously
>>>> considering moving those data back to PostgreSQL.
>>> The "related operations" I was talking about are backups and
>>> database maintenance when needed, cluster/replica management, etc.
>>> With a smaller database size they would be easier in timing and
>>> effort, right?
>> Ok, but you'll lose replica functionality for those blobs, which
>> means you don't care about them; correct me if I am wrong.
> I'm not saying I don't care about them, quite the opposite: they are
> protected with Object Versioning and soft deletion, which should
> provide good protection against e.g. ransomware, if someone manages
> to get in there (and if that happens, we'll have bigger troubles than
> this).
PostgreSQL has become very popular because of people who care about their data.
>>> We are mostly talking about costs here. To give things their names,
>>> I'm moving bytea contents (85% of total data) to files in Google
>>> Cloud Storage buckets, which cost a fraction of the disks holding
>>> my database (on GCE, to be clear).
>> May I ask the size of the bytea data (uncompressed)?
> Single records vary from 150 kB to 80 MB; the grand total is more
> than 8.5 TB in a circa 10 TB data footprint.
>>> This data is not accessed frequently (just by the owner when he
>>> needs it), so there is no need to keep it on expensive hardware.
>>> Over the years I've read that keeping many big bytea fields in a
>>> database is not recommended, but I might have misunderstood this.
>>
>> Ok, I assume those are unimportant data, but let me ask: what is the
>> longevity or expected lifetime of those? I haven't worked with them,
>> just been reading:
>>
>> https://cloud.google.com/storage/pricing#storage-pricing
>>
>> Would you choose e.g. "Anywhere Cache storage"?
>>
> Absolutely not, this is *not* unimportant data, and we are using
> Standard Storage at $0.02/GB/month plus operations, which, compared
> to $0.17/GB/month for an SSD (or even more for the Hyperdisks we are
> using), is a good price drop.
How about hosting your data on your own storage and spending $0/GB/month?
>>
>>
>>> Another way would have been to move these tables to a different
>>> tablespace on cheaper storage, but that would still have cost 3
>>> times as much as the buckets.
>>>
>> Can you actually mount those Cloud Storage buckets under a supported
>> FS in Linux and just move the tables to tablespaces backed by that
>> storage?
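>>
>> In principle it would be something like the following (the mount
>> point and names are hypothetical, and note that PostgreSQL expects a
>> POSIX filesystem, so a FUSE-backed tablespace is untested, risky
>> territory):
>>
>> CREATE TABLESPACE bucketspace LOCATION '/mnt/gcs_bucket/pgdata';
>> ALTER TABLE myschema.tbl SET TABLESPACE bucketspace;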
>>
> Never tried. I mounted one via FUSE and did some simple operations in
> the past, but I'm not sure it can handle database workloads in terms
> of I/O bandwidth.
>>
>>> Why are you considering getting the data back into database tables?
>> Because now, if we need to migrate from cloud to on-premise, or just
>> upgrade or move the specific server which holds those data, I will
>> have an extra headache. It is also a single point of failure, or at
>> best a source of fragmented technology introduced just for the sake
>> of keeping things out of the DB.
> This is managed as a hierarchical disk structure, so the calling
> server may be literally anywhere; it just needs an account (or a
> service account) to get in there,
and you are locked into a proprietary solution, at their mercy for any
future cost increases.
>>>>
>>>>
>>>>> The problem is: this is generating BIG table bloat, as you may
>>>>> imagine.
>>>>> Running a VACUUM FULL on a formerly 22 GB table on a standalone
>>>>> test server is almost immediate.
>>>>> If I had only one server, I'd process one table at a time with a
>>>>> nightly script, issuing a VACUUM FULL on tables that have already
>>>>> been processed.
>>>>>
>>>>> But I'm in a logical replication architecture (we are using a
>>>>> multi-master system called pgEdge, but I don't think it makes a
>>>>> big difference, since it's based on logical replication), and I'm
>>>>> building a test cluster.
>>>>>
>>>> So you use pgEdge, but you want to lose all the benefits of
>>>> multi-master, since your binary data won't be replicated...
>>> I don't think I need it to be replicated, since this data cannot be
>>> "edited": either it's there or it's been deleted. Buckets have
>>> protections against data deletion and events like ransomware
>>> attacks and such.
>>> Also, multi-master was an absolute requirement one year ago because
>>> of a project we were building, but that has been abandoned and now
>>> simple logical replication would be enough; but let's do one thing
>>> at a time.
>> Multi-master is cool: you can configure your pooler/clients to take
>> advantage of it for a fully load-balanced architecture. But if it's
>> not a strict requirement, you can live without it, as so many of us
>> do, and employ other means of load balancing the reads.
> That's what we are doing, and it's a really cool feature, but I have
> found (maybe because it uses the old pglogical extension) that the
> replication is a bit fragile, especially when dealing with those
> bytea fields: when I ingest big loads, say 25-30 GB or more, it has
> happened that replication broke. Recreating a replica from scratch
> with "normal size" tables is not a big deal, since it can be done
> automatically (they normally fit in shared memory and can be
> transferred by the replicator), but you can imagine the effort and
> the downtime necessary to create a base backup, transfer it to the
> replica, build the DB and restart a 10 TB database (ATM we are
> running a 2-node cluster).
Break this into batches; use modern techniques for robust data loading,
in smaller transactions if you have to.
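
For instance, a keyed-batch sketch (the staging table, column names and
the psql variable are hypothetical):

INSERT INTO tbl (id, blob, ref)
SELECT id, blob, ref FROM staging
WHERE id > :'last_id'
ORDER BY id LIMIT 1000;

Commit, remember the highest id of the batch, and repeat until no rows
are inserted.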
>>>>> I've been instructed to issue VACUUM FULL on both nodes, nightly,
>>>>> but before proceeding I read in the docs that VACUUM FULL can
>>>>> disrupt logical replication, so I'm a bit concerned about how to
>>>>> proceed.
>>>>> Rows are cleared one at a time (one transaction, one row, to
>>>>> confine errors to the record that caused them).
>> Mind sharing the specific doc?
> Of course I can't find it now with a quick search; I'll dig deeper. I
> don't think I dreamed it up :-).
>>>>>
>>>> pgEdge is based on pglogical, the old 2ndQuadrant extension, not
>>>> the native logical replication we have had since PostgreSQL 10.
>>>> But I might be mistaken.
>>> I don't know about this; it keeps running on the latest PG versions
>>> (we are about to upgrade to 17.4, if I'm not wrong), but I'll ask.
>>>>> I read about extensions like pg_squeeze, but I wonder whether
>>>>> they, too, are dangerous for replication.
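>>>>>
>>>>> (For reference, ad-hoc usage appears to be something like
>>>>>
>>>>> SELECT squeeze.squeeze_table('myschema', 'tbl', null);
>>>>>
>>>>> where "myschema" and "tbl" are placeholders; I have not verified
>>>>> the exact signature against the current release.)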
>>>>>
>>>> What's pgEdge's take on that, I mean the bytea thing you are
>>>> trying to achieve here?
>>> They are positive; they are the ones who suggested doing VACUUM
>>> FULL on both nodes... I'm quite new to replication, so I'm looking
>>> for some advice here.
>>
>> As I told you, pgEdge logical replication (the old 2ndQuadrant BDR)
>> != native logical replication. You may look here:
>>
>> https://github.com/pgEdge/spock
>>
>> If multi-master is not a must, you could convert to vanilla
>> PostgreSQL and focus on standard physical and logical replication.
>>
> No, multi-master is cool, but as I said, the project has been
> discontinued and it's not a must anymore. This is the first step,
> actually. We are planning to return to plain PostgreSQL, or Cloud SQL
> for PostgreSQL, using logical replication (which seems the more
> reliable of the two). We created a test case for both options, and
> they seem to be OK for now, even if I still have to do adequate
> stress tests. And when I do the migration, I'd like to migrate the
> plain data only and leave the blobs where they are.

As you wish. But this design has inherent data infrastructure
fragmentation, as you understand.

Personally, I like to let the DB take care of the data while I take
care of the DB, not of a plethora of extra systems that we need to
keep connected and consistent.

>>>>> Thanks for your help.
>>>>> Moreno.-
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>
