| From: | Achilleas Mantzios <a(dot)mantzios(at)cloud(dot)gatewaynet(dot)com> | 
|---|---|
| To: | Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>, pgsql-general(at)lists(dot)postgresql(dot)org | 
| Subject: | Re: Logical replication, need to reclaim big disk space | 
| Date: | 2025-05-20 10:58:59 | 
| Message-ID: | eb87f911-3c1e-41d3-8c1f-f6f92d812231@cloud.gatewaynet.com | 
| Lists: | pgsql-general | 
On 20/5/25 12:17, Moreno Andreo wrote:
>
>
> On 19/05/25 20:49, Achilleas Mantzios wrote:
>>
>> On 19/5/25 17:38, Moreno Andreo wrote:
>>
>>>
>>>
>>> On 19/05/25 14:41, Achilleas Mantzios wrote:
>>>>
>>>>
>>>> On 5/19/25 09:14, Moreno Andreo wrote:
>>>>>
>>>>> On 16/05/25 21:33, Achilleas Mantzios wrote:
>>>>>> On 16/5/25 18:45, Moreno Andreo wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>     we are moving away from our old binary data approach, migrating 
>>>>>>> data from bytea fields in a table to external storage (making the 
>>>>>>> database smaller and related operations faster and smarter).
>>>>>>> In short, we have a job that runs in the background and copies data 
>>>>>>> from the table to an external file, then sets the bytea field 
>>>>>>> to NULL.
>>>>>>> (UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = 
>>>>>>> <uuid>)
>>>>>>>
>>>>>>> This results, at the end of the operation, in a table that's 
>>>>>>> less than one tenth of its original size.
>>>>>>> We have a multi-tenant architecture (100s of schemas with 
>>>>>>> identical structure, all inheriting from public) and we are 
>>>>>>> performing the task on one table per schema.
>>>>>>>
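A minimal sketch of such an offload job, in Python with psycopg2 and google-cloud-storage; the table "s1.documents(id uuid, blob bytea, ref text)" and the bucket name are illustrative placeholders, not the poster's actual schema:

    # Offload job sketch: copy each bytea to a bucket object, then NULL it out.
    import psycopg2
    from google.cloud import storage

    bucket = storage.Client().bucket("example-archive-bucket")  # placeholder name
    conn = psycopg2.connect("dbname=app")

    with conn.cursor() as cur:
        cur.execute("SELECT id, blob FROM s1.documents "
                    "WHERE blob IS NOT NULL LIMIT 100")
        rows = cur.fetchall()

    for doc_id, blob in rows:
        path = f"s1/documents/{doc_id}"
        bucket.blob(path).upload_from_string(bytes(blob))   # write the object first
        with conn.cursor() as cur:                          # then clear the bytea
            cur.execute("UPDATE s1.documents SET blob = NULL, ref = %s "
                        "WHERE id = %s", (path, doc_id))
        conn.commit()   # one row per transaction, as the job described above
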
>>>>>> So? TOASTed data are kept in separate TOAST tables; unless those 
>>>>>> bytea columns are selected, you won't even touch them. I cannot 
>>>>>> understand what you are trying to achieve here.
>>>>>>
>>>>>> Years ago, when I made the mistake of going for a coffee and letting 
>>>>>> my developers "improvise", the result was a design similar to what 
>>>>>> you are trying to achieve. Years later, I am seriously 
>>>>>> considering moving those data back to PostgreSQL.
>>>>> The "related operations" I was talking about are backups and 
>>>>> database maintenance when needed, cluster/replica management, etc. 
>>>>> With a smaller database size they would be easier in timing and 
>>>>> effort, right?
>>>> Ok, but you'll lose replica functionality for those blobs, which 
>>>> means you don't care about them, correct me if I am wrong.
>>> I'm not saying I don't care about them; on the contrary, they are 
>>> protected with Object Versioning and soft deletion, which should 
>>> ensure good protection against e.g. ransomware if someone manages 
>>> to get in there (and if that happens, we'll have bigger troubles 
>>> than this).
>> PostgreSQL has become very popular because of people who care about 
>> their data.
> Yeah, it's always been famous for its robustness, and that's why I 
> chose PostgreSQL more than 10 years ago; and, in spite of how a 
> "normal" user treats his PC, we never had corruption (only where the 
> FS/disk was failing, but that's not PG's fault).
>>>>> We are mostly talking about costs here. To give things their 
>>>>> names, I'm moving bytea contents (85% of total data) to files in 
>>>>> Google Cloud Storage buckets, which cost a fraction of the 
>>>>> disks holding my database (on GCE, to be clear).
>>>> May I ask the size of the bytea data (uncompressed)?
>>> Single records vary from 150 kB to 80 MB; the grand total is more than 
>>> 8.5 TB in a circa 10 TB data footprint.
>>>>> This data is not accessed frequently (just by the owner when he 
>>>>> needs it), so there is no need to keep it on expensive hardware.
>>>>> Over the years I've read that keeping many big bytea 
>>>>> fields in a database is not recommended, but I might have 
>>>>> misunderstood this.
>>>>
>>>> Ok, I assume those are unimportant data, but let me ask: what is 
>>>> the longevity / expected lifetime of those data? I haven't worked 
>>>> with those, just reading:
>>>>
>>>> https://cloud.google.com/storage/pricing#storage-pricing
>>>>
>>>> would you choose e.g. "Anywhere Cache storage"?
>>>>
>>> Absolutely not, this is *not* unimportant data, and we are using 
>>> Standard Storage, at $0.02/GB/month plus operations, which compared 
>>> to $0.17/GB/month for an SSD (or even more for the Hyperdisks we are 
>>> using) is a good price drop.
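With the figures quoted above (storage only, ignoring operation and egress charges), the gap is roughly:

    8,500 GB x $0.02/GB/month ≈ $170/month    (Standard Storage bucket)
    8,500 GB x $0.17/GB/month ≈ $1,445/month  (SSD persistent disk)
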
>> How about hosting your data in your own storage and spending $0/GB/month?
> If we could host on our own hardware I'd not be here talking. Maybe we 
> would have a 10-node full-mesh multimaster architecture with Barman 
> backups on 2 separate SANs.
> But we are a small company that has to balance performance, 
> consistency, security and, last but not least, costs. And margins are 
> tightening.
>>
>>>>
>>>>> Another way would have been to move these tables to a different 
>>>>> tablespace on cheaper storage, but that would still have cost 3 
>>>>> times as much as the buckets.
>>>>>
>>>> Can you actually mount those Cloud Storage buckets under a 
>>>> supported FS in Linux and just move them to tablespaces backed by 
>>>> this storage?
>>>>
>>> Never tried; I mounted one via FUSE and ran some simple operations 
>>> in the past, but I'm not sure it can handle database workloads in terms 
>>> of I/O bandwidth.
>>>>
>>>>> Why are you considering getting the data back into database tables?
>>>> Because if we ever need to migrate from cloud to on-premises, or just 
>>>> upgrade or move the specific server which holds those data, I will 
>>>> have an extra headache. Also, this is a single point of failure, or at 
>>>> best a cause of fragmented technology introduced just for the 
>>>> sake of keeping things out of the DB.
>>> This is managed as a hierarchical disk structure, so the calling 
>>> server may be literally anywhere; it just needs an account (or a 
>>> service account) to get in there,
>> and you are locked into a proprietary solution, at their mercy for 
>> any future increases in cost.
> Since we cannot host on our own hardware, the only option is to keep an eye 
> on costs and migrate (yeah, more work) when it becomes too expensive. 
> Every solution is proprietary if you want to run in the cloud, even the 
> VMs where PostgreSQL is running.
>>>>>>
>>>>>>
>>>>>>> The problem is: this is generating BIG table bloat, as you may 
>>>>>>> imagine.
>>>>>>> Running a VACUUM FULL on an ex-22GB table on a standalone test 
>>>>>>> server is almost immediate.
>>>>>>> If I had only one server, I'd process one table at a time, with a 
>>>>>>> nightly script, and issue a VACUUM FULL on tables that have 
>>>>>>> already been processed.
>>>>>>>
>>>>>>> But I'm in a logical replication architecture (we are using a 
>>>>>>> multimaster system called pgEdge, but I don't think it makes a 
>>>>>>> big difference, since it's based on logical replication), and 
>>>>>>> I'm building a test cluster.
>>>>>>>
>>>>>> So you use pgEdge, but you want to lose all the benefits of 
>>>>>> multi-master, since your binary data won't be replicated...
>>>>> I don't think I need it to be replicated, since this data cannot 
>>>>> be "edited": either it's there or it's been deleted. Buckets 
>>>>> have protections against data deletion and events like ransomware 
>>>>> attacks and such.
>>>>> Also, multi-master was an absolute requirement one year ago because 
>>>>> of a project we were building, but that has been abandoned and now 
>>>>> simple logical replication would be enough; but let's do one thing 
>>>>> at a time.
>>>> Multi-master is cool, and you can configure your pooler / clients to 
>>>> take advantage of it for a fully load-balanced architecture, but if 
>>>> it's not a strict requirement, you can live without it, as so many of 
>>>> us do, and employ other means of load balancing the reads.
>>> That's what we are doing; it's a really cool feature, but I have 
>>> found (maybe because it uses the old pglogical extension) that the 
>>> replication is a bit fragile, especially when dealing with those 
>>> bytea fields: when I ingest big loads, say 25-30 GB or more, it has 
>>> happened that replication broke. Recreating a replica from scratch 
>>> with "normal size" tables is not a big deal, since it can be 
>>> achieved automatically (they normally fit in shared memory and can 
>>> be transferred by the replicator), but you can imagine the effort 
>>> and the downtime necessary to create a base backup, transfer it to 
>>> the replica, build the DB and restart a 10-TB database (ATM we are 
>>> running a 2-node cluster).
>> Break this into batches; use modern techniques for robust data loading, 
>> in smaller transactions if you have to.
> Normally it's run via COPY commands; I can throttle COPY or break it 
> into batches. At the moment, while the schema is offline, we disconnect 
> replication from the bytea tables, feed them, wait for checkpoints to 
> return to normal and then resume replication between tables before 
> putting the schema back online. This is safe, even if far from optimized. 
> It's a migration tool; it won't be used forever, just to move 
> customers from their current architecture to the new cloud one.
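A rough sketch of such batching in Python/psycopg2; the file name, target table and batch size are made up for illustration. One commit per batch keeps each transaction, and thus each unit of replication work, small:

    import io
    import psycopg2

    BATCH_ROWS = 5000   # illustrative batch size
    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()

    def flush(rows):
        if rows:
            cur.copy_expert("COPY s1.documents FROM STDIN WITH (FORMAT csv)",
                            io.StringIO("".join(rows)))
            conn.commit()   # one transaction per batch bounds WAL/replication lag

    rows = []
    with open("tenant_docs.csv") as f:   # placeholder staging file
        for line in f:
            rows.append(line)
            if len(rows) >= BATCH_ROWS:
                flush(rows)
                rows = []
    flush(rows)   # load the final partial batch
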
>>>>>>> I've been instructed to issue VACUUM FULL on both nodes, 
>>>>>>> nightly, but before proceeding I read in the docs that VACUUM FULL 
>>>>>>> can disrupt logical replication, so I'm a bit concerned about how 
>>>>>>> to proceed. Rows are cleared one at a time (one transaction, one 
>>>>>>> row, to confine errors to the record that caused them).
>>>> Mind sharing the specific doc?
>>> Obviously I can't find it with a quick search; I'll dig deeper. I 
>>> don't think I dreamt it :-).
>>>>>>>
>>>>>> pgEdge is based on the old pglogical, the 2ndQuadrant 
>>>>>> extension, not the native logical replication we have had since 
>>>>>> PostgreSQL 10. But I might be mistaken.
>>>>> I don't know about this; it keeps running on the latest PG versions 
>>>>> (we are about to upgrade to 17.4, if I'm not wrong), but I'll ask.
>>>>>>> I read about extensions like pg_squeeze, but I wonder if they 
>>>>>>> too are risky for replication.
>>>>>>>
>>>>>> What's pgEdge's take on that, I mean the bytea thing you are trying 
>>>>>> to achieve here?
>>>>> They are positive about it; they are the ones who suggested doing 
>>>>> VACUUM FULL on both nodes... I'm quite new to replication, so I'm 
>>>>> seeking some advice here.
>>>>
>>>> As I told you, pgEdge logical replication (the old 2ndQuadrant BDR) 
>>>> != native logical replication. You may look here:
>>>>
>>>> https://github.com/pgEdge/spock
>>>>
>>>> If multi-master is not a must, you could convert to vanilla 
>>>> PostgreSQL and focus on standard physical and logical replication.
>>>>
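For reference, the native setup is just a publication on the source and a subscription on the target; a minimal sketch with placeholder connection strings and names:

    import psycopg2

    src = psycopg2.connect("host=node1 dbname=app")
    src.autocommit = True
    src.cursor().execute("CREATE PUBLICATION app_pub FOR ALL TABLES")

    dst = psycopg2.connect("host=node2 dbname=app")
    dst.autocommit = True   # CREATE SUBSCRIPTION cannot run inside a transaction
    dst.cursor().execute("CREATE SUBSCRIPTION app_sub "
                         "CONNECTION 'host=node1 dbname=app' "
                         "PUBLICATION app_pub")
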
>>> No, multimaster is cool, but as I said, the project has been 
>>> discontinued and it's not a must anymore. This is the first step, 
>>> actually. We are planning to return to plain PostgreSQL, or Cloud SQL 
>>> for PostgreSQL, using logical replication (which seems the more 
>>> reliable of the two). We created a test case for both options, 
>>> and they seem to be OK for now, even if I still have to do adequate 
>>> stress tests. And when I do the migration, I'd like to migrate 
>>> plain data only and leave the blobs where they are.
>>
>> As you wish. But this design has inherent data-infrastructure 
>> fragmentation, as you understand.
>>
>> Personally I like to let the DB take care of the data, and I take 
>> care of the DB, not a plethora of extra systems that we need to keep 
>> connected and consistent.
>>
> We followed this idea when the application (old version) was on 
> customer premises, so backups and operations were simple, and getting 
> into trouble (e.g. a customer deleting a directory from their PC) 
> happened only a very few times, usually when they launched disk 
> cleanup on Windows :-)
>
> Now we host a full cloud solution, so we got rid of many potential 
> problems generated by the end user, but bumped into others, as you can 
> certainly imagine. We have to keep it consistent, fast and reliable, 
> while keeping an eye on costs.
> You are right, but the more I work with this solution, the more I 
> have the impression of dealing with something heavy, hard to 
> maintain because of these rarely accessed files that make up most of my 
> data. Maybe it's just my impression; maybe I need more expertise in an 
> area that's still quite new to me.
> At the moment this seems a good compromise between stability and 
> costs. Maybe in a year I'll be in your position (considering moving 
> everything back), but for now we are moving forward this way.
>
Makes perfect sense.
> That being said, the original question :-)
> Would VACUUM FULL be a risky operation? Does it have to be done on all 
> nodes, obviously in a low-traffic, low-access window (at night)?
VACUUM affects the physical blocks. In a physical streaming replication 
scenario that might (or might not) affect read-only queries on 
the hot standby (depending on usage and settings). Normally I cannot see 
how a VACUUM (plain or FULL) would interact with logical replication in 
any way. But again, since you run pgEdge specifically, you have to ask them.
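If pgEdge confirm it is safe on their stack, the nightly job could be as 
simple as looping over the tenant schemas; a sketch, where the per-schema 
table name "documents" is a placeholder:

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    conn.autocommit = True   # VACUUM cannot run inside a transaction block
    cur = conn.cursor()
    cur.execute("SELECT table_schema FROM information_schema.tables "
                "WHERE table_name = 'documents'")
    for (schema,) in cur.fetchall():
        cur.execute(f'VACUUM FULL "{schema}".documents')   # reclaim per tenant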
>>
>>>>>>> Thanks for your help.
>>>>>>> Moreno.-