Re: Logical replication, need to reclaim big disk space

From: Achilleas Mantzios <a(dot)mantzios(at)cloud(dot)gatewaynet(dot)com>
To: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication, need to reclaim big disk space
Date: 2025-05-20 10:58:59
Message-ID: eb87f911-3c1e-41d3-8c1f-f6f92d812231@cloud.gatewaynet.com
Lists: pgsql-general

On 20/5/25 12:17, Moreno Andreo wrote:

>
>
> On 19/05/25 20:49, Achilleas Mantzios wrote:
>>
>> On 19/5/25 17:38, Moreno Andreo wrote:
>>
>>>
>>>
>>> On 19/05/25 14:41, Achilleas Mantzios wrote:
>>>>
>>>>
>>>> On 5/19/25 09:14, Moreno Andreo wrote:
>>>>>
>>>>> On 16/05/25 21:33, Achilleas Mantzios wrote:
>>>>>> On 16/5/25 18:45, Moreno Andreo wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>     we are moving away from our old binary data approach,
>>>>>>> migrating data from bytea fields in a table to external storage
>>>>>>> (making the database smaller and related operations faster and
>>>>>>> smarter).
>>>>>>> In short, we have a job that runs in the background, copies data
>>>>>>> from the table to an external file and then sets the bytea field
>>>>>>> to NULL.
>>>>>>> (UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id =
>>>>>>> <uuid>)
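>>>>>>>
>>>>>>> A batched variant of that statement could look like this sketch
>>>>>>> (batch size and the per-row path derivation are hypothetical):
>>>>>>>
>>>>>>>     WITH batch AS (
>>>>>>>         SELECT id FROM tbl
>>>>>>>         WHERE blob IS NOT NULL
>>>>>>>         ORDER BY id
>>>>>>>         LIMIT 100                  -- keep transactions small
>>>>>>>         FOR UPDATE SKIP LOCKED     -- tolerate concurrent jobs
>>>>>>>     )
>>>>>>>     UPDATE tbl
>>>>>>>        SET blob = NULL,
>>>>>>>            ref  = 'path/to/' || batch.id::text
>>>>>>>       FROM batch
>>>>>>>      WHERE tbl.id = batch.id;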
>>>>>>>
>>>>>>> This results, at the end of the operations, in a table that's
>>>>>>> less than one tenth of its original size.
>>>>>>> We have a multi-tenant architecture (100s of schemas with an
>>>>>>> identical structure, all inheriting from public) and we are
>>>>>>> performing the task on one table per schema.
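>>>>>>>
>>>>>>> (A sketch of how the per-schema fan-out could be driven,
>>>>>>> enumerating the tenant schemas; the exclusion list is
>>>>>>> illustrative:
>>>>>>>
>>>>>>>     SELECT nspname
>>>>>>>     FROM pg_namespace
>>>>>>>     WHERE nspname NOT LIKE 'pg\_%'
>>>>>>>       AND nspname NOT IN ('public', 'information_schema');
>>>>>>> )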
>>>>>>>
>>>>>> So? TOASTed data are kept in separate TOAST tables; unless those
>>>>>> bytea cols are selected, you won't even touch them. I cannot
>>>>>> understand what you are trying to achieve here.
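>>>>>>
>>>>>> For instance, to see how much of the footprint actually lives
>>>>>> outside the main heap (TOAST plus the free-space / visibility
>>>>>> maps), a quick check on your table could be:
>>>>>>
>>>>>>     SELECT pg_size_pretty(pg_relation_size('tbl'))  AS heap_only,
>>>>>>            pg_size_pretty(pg_table_size('tbl')
>>>>>>                           - pg_relation_size('tbl')) AS toast_etc;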
>>>>>>
>>>>>> Years ago, when I made the mistake of going for a coffee and
>>>>>> letting my developers "improvise", the result was a design similar
>>>>>> to what you are trying to achieve. Years later, I am seriously
>>>>>> considering moving those data back to PostgreSQL.
>>>>> The "related operations" I was talking about are backups and
>>>>> database maintenance when needed, cluster/replica management, etc.
>>>>> With a smaller database size they would be easier in timing and
>>>>> effort, right?
>>>> Ok, but you'll lose replica functionality for those blobs, which
>>>> means you don't care about them, correct me if I am wrong.
>>> I'm not saying I don't care about them, quite the opposite: they
>>> are protected with Object Versioning and soft deletion, which
>>> should assure good protection against e.g. ransomware, if someone
>>> manages to get in there (and if this happens, we'll have bigger
>>> troubles than this).
>> PostgreSQL has become very popular because of ppl who care about
>> their data.
> Yeah, it's always been famous for its robustness, and that's why I
> chose PostgreSQL more than 10 years ago, and, in spite of how a
> "normal" user treats his PC, we never had corruption (only where the
> FS/disk was failing, but that's not PG's fault)
>>>>> We are mostly talking about costs here. To give things their
>>>>> names, I'm moving bytea contents (85% of total data) to files in
>>>>> Google Cloud Storage buckets, which cost a fraction of the disks
>>>>> holding my database (on GCE, to be clear).
>>>> May I ask the size of the bytea data (uncompressed)?
>>> single records vary from 150 kB to 80 MB; the grand total is more
>>> than 8.5 TB in a circa-10 TB data footprint
>>>>> This data is not accessed frequently (just by the owner when he
>>>>> needs it), so there's no need to keep it on expensive hardware.
>>>>> I've read over the years that keeping many big bytea fields in a
>>>>> database is not recommended, but I might have misunderstood this.
>>>>
>>>> Ok, I assume those are unimportant data, but let me ask: what is
>>>> the longevity / expected lifetime of those? I haven't worked with
>>>> them, just been reading:
>>>>
>>>> https://cloud.google.com/storage/pricing#storage-pricing
>>>>
>>>> would you choose e.g. "Anywhere Cache storage"?
>>>>
>>> Absolutely not, this is *not* unimportant data, and we are using
>>> Standard Storage, at $0.02/GB/month + operations, which, compared
>>> to the $0.17/GB/month of an SSD (or even more for the Hyperdisks we
>>> are using), is a good price drop.
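>>>
>>> (Back-of-envelope on our ~8.5 TB of blobs: 8,500 GB x $0.02 =
>>> ~$170/month in Standard Storage vs 8,500 GB x $0.17 = ~$1,445/month
>>> on SSD, before operations charges.)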
>> How about hosting your data in your own storage and spending $0/GB/month?
> If we could host on our own hardware I wouldn't be here talking.
> Maybe we would have a 10-node full-mesh multimaster architecture with
> barman backups on 2 separate SANs.
> But we are a small company that has to balance performance,
> consistency, security and, last but not least, costs. And margins are
> tightening.
>>
>>>>
>>>>> Another way would have been to move these tables to a different
>>>>> tablespace on cheaper storage, but it would still have been 3
>>>>> times the buckets' cost.
>>>>>
>>>> can you actually mount those Cloud Storage Buckets under a
>>>> supported FS in Linux and just move them to tablespaces backed by
>>>> this storage?
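>>>>
>>>> I mean something like this sketch (mount point and tablespace name
>>>> purely hypothetical):
>>>>
>>>>     CREATE TABLESPACE cold_blobs LOCATION '/mnt/gcs-bucket/pgdata';
>>>>     ALTER TABLE some_schema.tbl SET TABLESPACE cold_blobs;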
>>>>
>>> Never tried; I mounted one via FUSE and ran some simple operations
>>> in the past, but I'm not sure it can handle database workloads in
>>> terms of I/O bandwidth
>>>>
>>>>> Why are you considering to get data back to database tables?
>>>> Because now, if we need to migrate from cloud to on-premise, or
>>>> just upgrade or move the specific server which holds those data, I
>>>> will have an extra headache. Also, this is a single point of
>>>> failure, or at best a cause of fragmented technology introduced
>>>> just for the sake of keeping things out of the DB.
>>> This is managed as a hierarchical disk structure, so the calling
>>> server may be literally anywhere, it just needs an account (or a
>>> service account) to get in there,
>> and you are locked into a proprietary solution, and at their mercy
>> for any future increases in cost.
> Since we cannot host on our own hardware, the only thing is to keep
> an eye on costs and migrate (yeah, more work) when it becomes too
> expensive. Every solution is proprietary if you want to run in the
> cloud. Even the VMs where PostgreSQL is running.
>>>>>>
>>>>>>
>>>>>>> The problem is: this is generating BIG table bloat, as you may
>>>>>>> imagine.
>>>>>>> Running a VACUUM FULL on an ex-22GB table on a standalone test
>>>>>>> server is almost immediate.
>>>>>>> If I had only one server, I'd process a table at a time, with a
>>>>>>> nightly script, and issue a VACUUM FULL on tables that have
>>>>>>> already been processed.
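>>>>>>>
>>>>>>> (Per table, the nightly pass would be something like the sketch
>>>>>>> below; the bloat check assumes the pgstattuple extension is
>>>>>>> installed and just decides whether a table is worth the trouble:
>>>>>>>
>>>>>>>     SELECT free_percent, dead_tuple_percent
>>>>>>>     FROM pgstattuple('some_schema.tbl');
>>>>>>>
>>>>>>>     VACUUM (FULL, ANALYZE) some_schema.tbl;
>>>>>>> )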
>>>>>>>
>>>>>>> But I'm in a logical replication architecture (we are using a
>>>>>>> multimaster system called pgEdge, but I don't think it makes a
>>>>>>> big difference, since it's based on logical replication), and
>>>>>>> I'm building a test cluster.
>>>>>>>
>>>>>> So you use PgEdge, but you wanna lose all the benefits of
>>>>>> multi-master, since your binary data won't be replicated ...
>>>>> I don't think I need it to be replicated, since this data cannot
>>>>> be "edited", so either it's there or it's been deleted. Buckets
>>>>> have protections for data deletions or events like ransomware
>>>>> attacks and such.
>>>>> Also, multi-master was an absolute requirement one year ago
>>>>> because of a project we were building, but it has been abandoned
>>>>> and now simple logical replication would be enough; but let's do
>>>>> one thing at a time.
>>>> Multi-master is cool, you can configure your pooler / clients to
>>>> take advantage of it for a fully load-balanced architecture, but
>>>> if it's not a strict requirement, you can live without it, as so
>>>> many of us do, and employ other means of load balancing the reads.
>>> That's what we are doing, and it's a really cool feature, but I
>>> have experienced (maybe because it uses the old pglogical
>>> extension) that the replication is a bit fragile, especially when
>>> dealing with those bytea fields: when I ingest big loads, say 25-30
>>> GB or more, it has happened to break replication. Recreating a
>>> replica from scratch with "normal size" tables is not a big deal,
>>> since it can be done automatically (they normally fit in shared
>>> memory and can be transferred by the replicator), but you can
>>> imagine the effort and the downtime necessary to create a base
>>> backup, transfer it to the replica, build the DB and restart a
>>> 10-TB database (ATM we are running a 2-node cluster).
>> Break this into batches; use modern techniques for robust data
>> loading, in smaller transactions if you have to.
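>>
>> E.g. one chunk per transaction, as a sketch (staging file names
>> hypothetical):
>>
>>     BEGIN;
>>     COPY some_schema.tbl FROM '/staging/chunk_0001.copy';
>>     COMMIT;
>>
>> A failure then rolls back only that chunk, and the logical stream
>> sees bounded transaction sizes.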
> Normally it's run via COPY commands; I can throttle COPY or break it
> into batches. At the moment, while the schema is offline, we
> disconnect replication from the bytea tables, feed them, wait for
> checkpoints to return to normal and then resume replication between
> tables before putting the schema back online. This is safe, even if
> far from optimized. It's a migration tool, it won't be used forever,
> just to move customers from their current architecture to the new
> cloud one.
>>>>>>> I've been instructed to issue VACUUM FULL on both nodes,
>>>>>>> nightly, but before proceeding I read in the docs that VACUUM
>>>>>>> FULL can disrupt logical replication, so I'm a bit concerned
>>>>>>> about how to proceed. Rows are cleared one at a time (one
>>>>>>> transaction, one row, to confine errors to the record that
>>>>>>> caused them)
>>>> Mind sharing the specific doc?
>>> Obviously I can't find it in a quick search, I'll dig deeper; I
>>> don't think I dreamt it :-).
>>>>>>>
>>>>>> PgEdge is based on the old pglogical, the old 2ndQuadrant
>>>>>> extension, not the native logical replication we have since
>>>>>> pgsql 10. But I might be mistaken.
>>>>> Don't know about this; it keeps running on the latest pg versions
>>>>> (we are about to upgrade to 17.4, if I'm not wrong), but I'll ask
>>>>>>> I read about extensions like pg_squeeze, but I wonder whether
>>>>>>> they are really safe for replication.
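>>>>>>>
>>>>>>> (If I read its docs right, an ad-hoc run would be something like
>>>>>>>
>>>>>>>     SELECT squeeze.squeeze_table('some_schema', 'tbl');
>>>>>>>
>>>>>>> but I haven't tried it, hence the question.)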
>>>>>>>
>>>>>> What's pgEdge's take on that, I mean on the bytea thing you are
>>>>>> trying to achieve here?
>>>>> They are positive; it's they who suggested doing VACUUM FULL on
>>>>> both nodes... I'm quite new to replication, so I'm seeking some
>>>>> advice here.
>>>>
>>>> As I told you, pgEdge logical replication (old 2ndQuadrant BDR) !=
>>>> native logical replication. You may look here:
>>>>
>>>> https://github.com/pgEdge/spock
>>>>
>>>> If multi-master is not a must you could convert to vanilla
>>>> PostgreSQL and focus on standard physical and logical replication.
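>>>>
>>>> (Native logical replication being, sketch-wise, with hypothetical
>>>> names:
>>>>
>>>>     -- on the provider
>>>>     CREATE PUBLICATION tenants_pub FOR ALL TABLES;
>>>>     -- on the subscriber
>>>>     CREATE SUBSCRIPTION tenants_sub
>>>>         CONNECTION 'host=provider dbname=mydb user=repl'
>>>>         PUBLICATION tenants_pub;
>>>> )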
>>>>
>>> No, multimaster is cool, but as I said, the project has been
>>> discontinued and it's not a must anymore. This is the first step,
>>> actually. We are planning to return to plain PostgreSQL, or CloudSQL
>>> for PostgreSQL, using logical replication (which seems the more
>>> reliable of the two). We created a test case for both options, and
>>> they seem to be OK for now, even if I still have to do adequate
>>> stress tests. And when I do the migration, I'd like to be migrating
>>> plain data only and leave the blobs where they are.
>>
>> as you wish. But this design has inherent data-infrastructure
>> fragmentation, as you understand.
>>
>> Personally I like to let the DB take care of the data, and I take
>> care of the DB, not a plethora of extra systems that we need to keep
>> connected and consistent.
>>
> We followed this idea when the application (old version) was on
> customer premises, so backups and operations were simple, and getting
> in trouble (e.g. a customer deleting a directory from their PC)
> happened only a very few times, usually when they launched disk
> cleanup on Windows :-)
>
> Now we host a full cloud solution, so we got rid of many potential
> problems generated by the end user, but bumped into others, as you
> can certainly imagine. We have to keep it consistent, fast, reliable,
> while keeping an eye on costs.
> You are right, but the more I work with this solution, the more I
> have the impression of dealing with something heavy, hard to
> maintain because of these rarely-accessed files that make up most of
> my data. Maybe it's just my impression, maybe I need some expertise
> in an area that's still quite new for me.
> At the moment that seems a good compromise between stability and
> costs. Maybe in a year I'll be in your position (considering getting
> everything back), but for now we are moving forward that way.
>
Makes perfect sense.
> This being said, back to the original question :-)
> Would VACUUM FULL be a risky operation? Does it have to be done on
> all nodes, obviously in a low-traffic, low-access window (at night)?
VACUUM affects the physical blocks. In a physical streaming replication
scenario that might (or might not) potentially affect read-only queries
on the hot standby (depending on usage and settings). Normally I cannot
see how a VACUUM (plain or FULL) would interact with logical
replication in any way. But again, since you run something
PgEdge-specific, you have to ask them.
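
Either way, after any heavy maintenance I would at least check that the
logical slots keep advancing (restart_lsn / confirmed_flush_lsn should
move as traffic flows), e.g. with the native catalog; whether PgEdge /
Spock exposes extra state on top of this is for them to say:

    SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots;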
>>
>>>>>>> Thanks for your help.
>>>>>>> Moreno.-
