Re: Logical replication, need to reclaim big disk space

From: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication, need to reclaim big disk space
Date: 2025-05-20 11:12:42
Message-ID: 38b57615-d0d9-47b0-bc71-50567c03dd90@evolu-s.it
Lists: pgsql-general

On 20/05/25 12:58, Achilleas Mantzios wrote:
>
> On 20/5/25 12:17, Moreno Andreo wrote:
>
>>
>>
>> On 19/05/25 20:49, Achilleas Mantzios wrote:
>>>
>>> On 19/5/25 17:38, Moreno Andreo wrote:
>>>
>>>>
>>>>
>>>> On 19/05/25 14:41, Achilleas Mantzios wrote:
>>>>>
>>>>>
>>>>> On 5/19/25 09:14, Moreno Andreo wrote:
>>>>>>
>>>>>> On 16/05/25 21:33, Achilleas Mantzios wrote:
>>>>>>> On 16/5/25 18:45, Moreno Andreo wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>     we are moving away from our old binary data approach, migrating
>>>>>>>> data from bytea fields in a table to external storage (making the
>>>>>>>> database smaller and related operations faster and smarter).
>>>>>>>> In short, we have a job that runs in the background, copies the data
>>>>>>>> from the table to an external file and then sets the bytea
>>>>>>>> field to NULL.
>>>>>>>> (UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id =
>>>>>>>> <uuid>)
>>>>>>>>
>>>>>>>> At the end of the operation this leaves a table that is
>>>>>>>> less than one tenth of its original size.
>>>>>>>> We have a multi-tenant architecture (100s of schemas with
>>>>>>>> identical architecture, all inheriting from public) and we are
>>>>>>>> performing the task on one table per schema.
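>>>>>>>> For clarity, the DB side of the job is roughly the following per row
>>>>>>>> (the upload to the bucket happens in between, outside the database;
>>>>>>>> table and column names are just the ones from the example above):
>>>>>>>>
>>>>>>>> BEGIN;
>>>>>>>> -- lock the row and read the bytea to be offloaded
>>>>>>>> SELECT blob FROM tbl WHERE id = '<uuid>' FOR UPDATE;
>>>>>>>> -- ...application code writes the blob to 'path/to/file' in the bucket...
>>>>>>>> UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = '<uuid>';
>>>>>>>> COMMIT;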
>>>>>>>>
>>>>>>> So? Toasted data are kept in separate TOAST tables; unless those
>>>>>>> bytea cols are selected, you won't even touch them. I cannot
>>>>>>> understand what you are trying to achieve here.
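>>>>>>>
>>>>>>> A quick way to see how much of a table actually lives in TOAST versus
>>>>>>> the main heap (just an illustrative query, any of your tables will do):
>>>>>>>
>>>>>>> SELECT pg_size_pretty(pg_relation_size('some_schema.tbl')) AS heap,
>>>>>>>        pg_size_pretty(pg_total_relation_size('some_schema.tbl')
>>>>>>>                       - pg_relation_size('some_schema.tbl')) AS toast_and_indexes;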
>>>>>>>
>>>>>>> Years ago, when I made the mistake of going for a coffee and letting my
>>>>>>> developers "improvise", the result was a design similar to what
>>>>>>> you are trying to achieve. Years later, I am seriously
>>>>>>> considering moving those data back to PostgreSQL.
>>>>>> The "related operations" I was talking about are backups and
>>>>>> database maintenance when needed, cluster/replica management,
>>>>>> etc. With a smaller database size they would be easier in timing
>>>>>> and effort, right?
>>>>> Ok, but you'll lose replica functionality for those blobs, which
>>>>> means you don't care about them, correct me if I am wrong.
>>>> I'm not saying I don't care about them, quite the opposite: they are
>>>> protected with Object Versioning and soft deletion, which should
>>>> provide good protection against e.g. ransomware if someone
>>>> manages to get in there (and if that happens, we'll have bigger
>>>> troubles than this).
>>> PostgreSQL has become very popular because of people who care about
>>> their data.
>> Yeah, it has always been famous for its robustness, and that's why I
>> chose PostgreSQL more than 10 years ago. In spite of how a
>> "normal" user treats his PC, we never had corruption (only where the
>> FS/disk was failing, but that's not PG's fault).
>>>>>> We are mostly talking about costs here. To call things by their
>>>>>> name, I'm moving bytea contents (85% of total data) to files
>>>>>> in Google Cloud Storage buckets, which cost a fraction of the
>>>>>> disks holding my database (on GCE, to be clear).
>>>>> May I ask the size of the bytea data (uncompressed)?
>>>> Single records vary from 150 KB to 80 MB; the grand total is more
>>>> than 8.5 TB in a roughly 10 TB data footprint.
>>>>>> This data is not accessed frequently (just by the owner when he
>>>>>> needs it), so there is no need to keep it on expensive hardware.
>>>>>> Over the years I've read that keeping many big bytea
>>>>>> fields in a database is not recommended, but I might have
>>>>>> misunderstood this.
>>>>>
>>>>> Ok, I assume those are unimportant data, but let me ask: what is
>>>>> the longevity / expected lifetime of those? I haven't worked
>>>>> with them, just been reading:
>>>>>
>>>>> https://cloud.google.com/storage/pricing#storage-pricing
>>>>>
>>>>> would you choose e.g. "Anywhere Cache storage"?
>>>>>
>>>> Absolutely not, this is *not* unimportant data. We are using
>>>> Standard Storage, at $0.02/GB/month plus operations, which compared
>>>> to $0.17/GB/month for an SSD (or even more for the Hyperdisks we are
>>>> using) is a good price drop.
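>>>> (Back-of-the-envelope, ignoring operations and egress: ~8.5 TB is
>>>> roughly 8500 GB, so about 8500 x $0.02 = $170/month in Standard
>>>> Storage versus 8500 x $0.17 = $1445/month on SSD.)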
>>> How about hosting your data on your own storage and spending $0/GB/month?
>> If we could host on our own hardware I wouldn't be here talking. Maybe
>> we would have a 10-node full-mesh multimaster architecture with
>> barman backups on 2 separate SANs.
>> But we are a small company that has to balance performance,
>> consistency, security and, last but not least, costs. And margins
>> are tightening.
>>>
>>>>>
>>>>>> Another way would have been to move these tables to a different
>>>>>> tablespace on cheaper storage, but it would still have been 3
>>>>>> times the cost of the buckets.
>>>>>>
>>>>> Can you actually mount those Cloud Storage buckets under a
>>>>> supported FS in Linux and just move the tables to tablespaces backed
>>>>> by that storage?
>>>>>
>>>> Never tried; I mounted buckets via FUSE for some simple operations
>>>> in the past, but I'm not sure it can handle database workloads in
>>>> terms of I/O bandwidth.
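>>>> Just to spell out what the attempt would look like (assuming the
>>>> bucket were already mounted, e.g. via gcsfuse, at a made-up
>>>> /mnt/archive path; tablespace and schema names are made up too):
>>>>
>>>> -- tablespace on the bucket-backed mount point
>>>> CREATE TABLESPACE cheap_storage LOCATION '/mnt/archive/pgdata';
>>>> -- move one of the bytea tables onto it (this rewrites the table)
>>>> ALTER TABLE tenant_schema.tbl SET TABLESPACE cheap_storage;
>>>>
>>>> but as said, I doubt a FUSE mount would hold up to database I/O.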
>>>>>
>>>>>> Why are you considering to get data back to database tables?
>>>>> Because now, if we need to migrate from cloud to on-premise, or
>>>>> just upgrade or move the specific server which holds those data, I
>>>>> will have an extra headache. Also, this is a single point of
>>>>> failure, or at best a source of fragmented technology introduced
>>>>> just for the sake of keeping things out of the DB.
>>>> This is managed as a hierarchical disk structure, so the calling
>>>> server may be literally anywhere; it just needs an account (or a
>>>> service account) to get in there.
>>> And you are locked into a proprietary solution, at their mercy for
>>> any future increases in cost.
>> Since we cannot host on our own hardware, the only thing we can do is
>> keep an eye on costs and migrate (yeah, more work) when it becomes too
>> expensive. Every solution is proprietary if you want to run in the
>> cloud, even the VMs where PostgreSQL is running.
>>>>>>>
>>>>>>>
>>>>>>>> The problem is: this is generating BIG table bloat, as you may
>>>>>>>> imagine.
>>>>>>>> Running a VACUUM FULL on an ex-22GB table on a standalone test
>>>>>>>> server is almost immediate.
>>>>>>>> If I had only one server, I'd process one table at a time with a
>>>>>>>> nightly script and issue a VACUUM FULL on tables that have
>>>>>>>> already been processed.
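>>>>>>>> Per tenant, that nightly pass would be nothing more than (schema
>>>>>>>> and table names are placeholders):
>>>>>>>>
>>>>>>>> -- rewrite the already-offloaded table and refresh its statistics
>>>>>>>> VACUUM (FULL, ANALYZE) tenant_schema.tbl;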
>>>>>>>>
>>>>>>>> But I'm in a logical replication architecture (we are using a
>>>>>>>> multimaster system called pgEdge, but I don't think it will
>>>>>>>> make a big difference, since it's based on logical replication),
>>>>>>>> and I'm building a test cluster.
>>>>>>>>
>>>>>>> So you use pgEdge, but you want to lose all the benefits of
>>>>>>> multi-master, since your binary data won't be replicated...
>>>>>> I don't think I need it to be replicated, since this data cannot
>>>>>> be "edited", so either it's there or it's been deleted. Buckets
>>>>>> have protections against data deletion and events like ransomware
>>>>>> attacks and such.
>>>>>> Also, multi-master was an absolute requirement one year ago
>>>>>> because of a project we were building, but that has been abandoned
>>>>>> and now simple logical replication would be enough; but let's
>>>>>> do one thing at a time.
>>>>> Multi-master is cool, you can configure your pooler/clients to
>>>>> take advantage of it for a fully load-balanced architecture, but if
>>>>> it's not a strict requirement, you can live without it, as so many of
>>>>> us do, and employ other means of load balancing the reads.
>>>> That's what we are doing, and it's a really cool feature, but I've
>>>> found (maybe because it uses the old pglogical extension) that
>>>> the replication is a bit fragile, especially when dealing with
>>>> those bytea fields: when I ingest big loads, say 25-30 GB or more,
>>>> it has happened that replication broke. Recreating a replica from
>>>> scratch with "normal size" tables is not a big deal, since it can
>>>> be done automatically (they normally fit in shared
>>>> memory and can be transferred by the replicator), but you can
>>>> imagine the effort and the downtime needed to
>>>> create a base backup, transfer it to the replica, build the DB and
>>>> restart a 10-TB database (ATM we are running a 2-node cluster).
>>> Break this into batches; use modern techniques for robust data
>>> loading, in smaller transactions if you have to.
>> Normally it's run via COPY commands; I can throttle COPY or break it
>> into batches. At the moment, while the schema is offline, we disconnect
>> replication for the bytea tables, feed them, wait for checkpoints to
>> return to normal and then resume replication on those tables before
>> putting the schema back online. This is safe, even if far from
>> optimized. It's a migration tool, it won't be used forever, just to
>> move customers from their current architecture to the new cloud one.
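>> We do this with pgEdge's own tooling; just for illustration, with
>> native logical replication the same dance would look roughly like this
>> (publication and subscription names are made up):
>>
>> -- on the publisher: stop replicating the bytea table
>> ALTER PUBLICATION my_pub DROP TABLE tenant_schema.tbl;
>> -- ...bulk load with COPY, in batches...
>> ALTER PUBLICATION my_pub ADD TABLE tenant_schema.tbl;
>> -- on the subscriber: pick up the publication change (may re-copy the table)
>> ALTER SUBSCRIPTION my_sub REFRESH PUBLICATION;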
>>>>>>>> I've been instructed to issue VACUUM FULL on both nodes,
>>>>>>>> nightly, but before proceeding I read in the docs that VACUUM FULL
>>>>>>>> can disrupt logical replication, so I'm a bit concerned about how
>>>>>>>> to proceed. Rows are cleared one at a time (one transaction, one
>>>>>>>> row, to confine errors to the record that caused them).
>>>>> Mind sharing the specific doc?
>>>> Obviously I can't find it with a quick search; I'll dig deeper,
>>>> I don't think I dreamed it :-).
>>>>>>>>
>>>>>>> pgEdge is based on the old pglogical, the 2ndQuadrant
>>>>>>> extension, not the native logical replication we have had since
>>>>>>> PostgreSQL 10. But I might be mistaken.
>>>>>> Don't know about that; it keeps running on the latest PG versions (we
>>>>>> are about to upgrade to 17.4, if I'm not wrong), but I'll ask.
>>>>>>>> I read about extensions like pg_squeeze, but I wonder whether they
>>>>>>>> too might be dangerous for replication.
>>>>>>>>
>>>>>>> What's pgEdge's take on that, I mean the bytea thing you are
>>>>>>> trying to achieve here?
>>>>>> They are positive; they are the ones who suggested doing VACUUM FULL
>>>>>> on both nodes... I'm quite new to replication, so I'm looking for
>>>>>> some advice here.
>>>>>
>>>>> As I told you, pgEdge logical replication (old 2ndQuadrant BDR) !=
>>>>> native logical replication. You may look here:
>>>>>
>>>>> https://github.com/pgEdge/spock
>>>>>
>>>>> If multi-master is not a must, you could convert to vanilla
>>>>> PostgreSQL and focus on standard physical and logical replication.
>>>>>
>>>> No, multi-master is cool, but as I said, the project has been
>>>> discontinued and it's not a must anymore. This is the first step,
>>>> actually. We are planning to return to plain PostgreSQL, or
>>>> Cloud SQL for PostgreSQL, using logical replication (which seems the
>>>> more reliable of the two). We created a test case for both
>>>> options, and they seem to be OK for now, even if I still have to do
>>>> adequate stress tests. And when I do the migration, I'd like to
>>>> migrate plain data only and leave the blobs where they are.
>>>
>>> As you wish. But this design has inherent data infrastructure
>>> fragmentation, as you understand.
>>>
>>> Personally I like to let the DB take care of the data, and I take
>>> care of the DB, not a plethora of extra systems that we need to keep
>>> connected and consistent.
>>>
>> We followed this idea when the application (old version) was on
>> customer premises, so backups and operations were simple, and getting
>> into trouble (e.g. a customer deleting a directory from their PC)
>> happened only a very few times, usually when they launched Disk Cleanup
>> on Windows :-)
>>
>> Now we host a full cloud solution, so we got rid of many potential
>> problems generated by the end user, but bumped into others, as you can
>> certainly imagine. We have to keep it consistent, fast and reliable,
>> while keeping an eye on costs.
>> You are right, but the more I worked with this solution, the
>> more I had the impression of dealing with something heavy, hard
>> to maintain because of these rarely-accessed files that make up most of
>> my data. Maybe it's just my impression, maybe I need some expertise
>> in an area that's still quite new for me.
>> At the moment this seems a good compromise between stability and
>> costs. Maybe in a year I'll be in your position (considering
>> bringing everything back), but for now we are moving forward in that
>> direction.
>>
> Makes perfect sense.
>> This being said, back to the original question :-)
>> Would VACUUM FULL be a risky operation? Does it have to be done on all
>> nodes, obviously during a low-traffic, low-access window (at night)?
> VACUUM affects the physical blocks. In a physical streaming
> replication scenario that might (or might not) affect read-only
> queries on the hot standby (depending on usage and settings). Normally
> I cannot see how a VACUUM (plain or FULL) would interact with logical
> replication in any way. But again, since you run the pgEdge-specific
> stack, you have to ask them.
Thanks. This makes me think I misread or misinterpreted something. They
had already suggested that I use VACUUM FULL on both nodes, but that
"thing" I read (or am convinced I read) made me think twice before
breaking everything. Two experts agreeing is quite enough for me.
I will start this evening and see what happens.
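
For reference, something like this should tell me which tables are worth
doing first (rough dead-tuple counts from the statistics views, not an
exact bloat measurement):

SELECT schemaname, relname, n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;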

Thanks for the help and the very interesting discussion.
>>>
>>>>>>>> Thanks for your help.
>>>>>>>> Moreno.-
