Re: Pluggable Storage - Andres's take

From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Pluggable Storage - Andres's take
Date: 2018-09-10 07:42:00
Message-ID: CAJrrPGfH7awxE5jDwJ8_gU2J9LE56=jGjEEkXRTG37Js2EC7LA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
wrote:

>
> On Tue, Sep 4, 2018 at 10:33 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>> Hi,
>>
>> Thanks for the patches!
>>
>> On 2018-09-03 19:06:27 +1000, Haribabu Kommi wrote:
>> > I found couple of places where the zheap is using some extra logic in
>> > verifying
>> > whether it is zheap AM or not, based on that it used to took some extra
>> > decisions.
>> > I am analyzing all the extra code that is done, whether any callbacks
>> can
>> > handle it
>> > or not? and how? I can come back with more details later.
>>
>> Yea, I think some of them will need to stay (particularly around
>> integrating undo) and some other ones we'll need to abstract.
>>
>
> OK. I will list all the areas that I found with my observation of how to
> abstract or leaving it and then implement around it.
>

The following are the changes where the code specifically checks whether
the relation is a zheap relation:

Overall, I found that this needs new APIs at the following locations:
1. RelationSetNewRelfilenode
2. heap_create_init_fork
3. estimate_rel_size
4. A facility for the handler to provide options (e.g., skip WAL)

_hash_vacuum_one_page:

    xlrec.flags = RelationStorageIsZHeap(heapRel) ?
        XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP : 0;

_bt_delitems_delete:

    xlrec_delete.flags = RelationStorageIsZHeap(heapRel) ?
        XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP : 0;

Storing the handler type, and adding a new API for special handling
whenever one of these new types is detected, can remove the need for the
above code.
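A rough sketch of that idea (all names and flag values here are hypothetical,
not the actual zheap code): the index AM asks the table AM for the extra WAL
flags through a callback, instead of testing the relation's storage type:

```c
#include <assert.h>

/* Hypothetical flag value; in the real patch this would be
 * XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP etc. */
#define XLOG_DELETE_STORAGE_NONDEFAULT 0x01

/* Per-AM callback table: each table AM reports the WAL flags that
 * index deletion records must carry for it. */
typedef struct TableAmWalFlags
{
    unsigned char (*index_delete_xlog_flags) (void);
} TableAmWalFlags;

static unsigned char
heap_index_delete_xlog_flags(void)
{
    return 0;                   /* plain heap needs no special flag */
}

static unsigned char
zheap_index_delete_xlog_flags(void)
{
    return XLOG_DELETE_STORAGE_NONDEFAULT;  /* redo must consult zheap */
}

static const TableAmWalFlags heap_wal = {heap_index_delete_xlog_flags};
static const TableAmWalFlags zheap_wal = {zheap_index_delete_xlog_flags};

/* _bt_delitems_delete-style caller: no RelationStorageIsZHeap() test. */
unsigned char
build_delete_xlrec_flags(const TableAmWalFlags *am)
{
    return am->index_delete_xlog_flags();
}
```
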
RelationAddExtraBlocks:

    if (RelationStorageIsZHeap(relation))
    {
        ZheapInitPage(page, BufferGetPageSize(buffer));
        freespace = PageGetZHeapFreeSpace(page);
    }

Adding new APIs for PageInit and PageGetHeapFreeSpace, redirecting the
calls to the specific table AM handlers, can solve this.
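As an illustration (hypothetical names and page-layout numbers, not the real
heap/zheap page format), such per-AM page callbacks could look like this, so
a caller such as RelationAddExtraBlocks needs no storage-type branch:

```c
#include <assert.h>
#include <stddef.h>

/* Per-AM page callbacks; the caller dispatches through them instead of
 * branching on the storage type.  The layout numbers are made up. */
typedef struct TableAmPageRoutine
{
    void   (*page_init) (size_t page_size, size_t *usable);
    size_t (*page_freespace) (size_t page_size, size_t used);
} TableAmPageRoutine;

static void
heap_page_init(size_t page_size, size_t *usable)
{
    *usable = page_size - 24;           /* pretend 24-byte page header */
}

static size_t
heap_page_freespace(size_t page_size, size_t used)
{
    return page_size - 24 - used;
}

static void
zheap_page_init(size_t page_size, size_t *usable)
{
    *usable = page_size - 24 - 16;      /* extra transaction-slot area */
}

static size_t
zheap_page_freespace(size_t page_size, size_t used)
{
    return page_size - 24 - 16 - used;
}

static const TableAmPageRoutine heap_pages = {heap_page_init, heap_page_freespace};
static const TableAmPageRoutine zheap_pages = {zheap_page_init, zheap_page_freespace};

/* Initialize a freshly extended page and report its free space. */
size_t
extend_and_get_freespace(const TableAmPageRoutine *am, size_t page_size)
{
    size_t usable;

    am->page_init(page_size, &usable);
    return am->page_freespace(page_size, 0);
}
```
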

visibilitymap_set:

    if (RelationStorageIsZHeap(rel))
    {
        recptr = log_zheap_visible(rel->rd_node, heapBuf, vmBuf,
                                   cutoff_xid, flags);

        /*
         * We do not have a page-wise visibility flag in zheap,
         * so there is no need to set the LSN on the zheap page.
         */
    }

Handler options may remove the need for the above code.

validate_index_heapscan:

    /* Set up for predicate or expression evaluation */
    /* For zheap relations, the tuple is locally allocated, so free it. */
    ExecStoreHeapTuple(heapTuple, slot, RelationStorageIsZHeap(heapRelation));

This will be solved after converting the validate_index_heapscan function
to use slots.

RelationTruncate:

    /* Create the meta page for zheap */
    if (RelationStorageIsZHeap(rel))
        RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
                                  InvalidTransactionId,
                                  InvalidMultiXactId);

    if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
        rel->rd_rel->relkind != 'p')
    {
        heap_create_init_fork(rel);

        if (RelationStorageIsZHeap(rel))
            ZheapInitMetaPage(rel, INIT_FORKNUM);
    }

A new API in RelationSetNewRelfilenode and heap_create_init_fork can solve it.

cluster:
    if (RelationStorageIsZHeap(rel))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot cluster a zheap table")));

No change required.

CopyFrom:

    /*
     * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL.
     * See zheap_prepare_insert for details.
     * PBORKED / ZBORKED: abstract
     */
    if (!RelationStorageIsZHeap(cstate->rel) && !XLogIsNeeded())
        hi_options |= HEAP_INSERT_SKIP_WAL;

How about requesting the options from the table AM handler and using them
here?

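Such an options request could be sketched as follows (hypothetical callback
and flag names, not the actual patch): each AM advertises which insert
optimizations it supports, and CopyFrom only enables WAL skipping when both
the AM and the WAL-level check allow it:

```c
#include <assert.h>

#define AM_OPT_SKIP_WAL (1 << 0)    /* AM supports skipping WAL on insert */

/* Hypothetical callback: each table AM reports which insert
 * optimizations it supports. */
typedef struct TableAmOptions
{
    int (*insert_options) (void);
} TableAmOptions;

static int heap_insert_options(void)  { return AM_OPT_SKIP_WAL; }
static int zheap_insert_options(void) { return 0; /* undo needs WAL */ }

static const TableAmOptions heap_opts = {heap_insert_options};
static const TableAmOptions zheap_opts = {zheap_insert_options};

/* CopyFrom-style caller; wal_not_needed models !XLogIsNeeded(). */
int
compute_hi_options(const TableAmOptions *am, int wal_not_needed)
{
    int hi_options = 0;

    if (wal_not_needed && (am->insert_options() & AM_OPT_SKIP_WAL))
        hi_options |= AM_OPT_SKIP_WAL;
    return hi_options;
}
```

The same check would serve ATRewriteTable below, since it applies the
identical optimization.
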
ExecuteTruncateGuts:

    // PBORKED: Need to abstract this
    minmulti = GetOldestMultiXactId();

    /*
     * Need the full transaction-safe pushups.
     *
     * Create a new empty storage file for the relation, and assign it
     * as the relfilenode value. The old storage file is scheduled for
     * deletion at commit.
     *
     * PBORKED: needs to be a callback
     */
    if (RelationStorageIsZHeap(rel))
        RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
                                  InvalidTransactionId, InvalidMultiXactId);
    else
        RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
                                  RecentXmin, minmulti);

    if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
    {
        heap_create_init_fork(rel);

        if (RelationStorageIsZHeap(rel))
            ZheapInitMetaPage(rel, INIT_FORKNUM);
    }

A new API inside RelationSetNewRelfilenode can handle it.

ATRewriteCatalogs:
    /* Inherit the storage_engine reloption from the parent table. */
    if (RelationStorageIsZHeap(rel))
    {
        static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
        DefElem    *storage_engine;

        storage_engine = makeDefElemExtended("toast", "storage_engine",
                                             (Node *) makeString("zheap"),
                                             DEFELEM_UNSPEC, -1);
        reloptions = transformRelOptions((Datum) 0,
                                         list_make1(storage_engine),
                                         "toast",
                                         validnsps, true, false);
    }

I don't think anything can be done through the API here.

ATRewriteTable:

    /*
     * In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL.
     * See zheap_prepare_insert for details.
     *
     * ZFIXME / PFIXME: We probably need a different abstraction for this.
     */
    if (!RelationStorageIsZHeap(newrel) && !XLogIsNeeded())
        hi_options |= HEAP_INSERT_SKIP_WAL;

Handler options can solve this one as well.

estimate_rel_size:

    if (curpages < 10 &&
        (rel->rd_rel->relpages == 0 ||
         (RelationStorageIsZHeap(rel) &&
          rel->rd_rel->relpages == ZHEAP_METAPAGE + 1)) &&
        !rel->rd_rel->relhassubclass &&
        rel->rd_rel->relkind != RELKIND_INDEX)
        curpages = 10;

    /* report estimated # pages */
    *pages = curpages;

    /* quick exit if rel is clearly empty */
    if (curpages == 0 || (RelationStorageIsZHeap(rel) &&
                          curpages == ZHEAP_METAPAGE + 1))
    {
        *tuples = 0;
        *allvisfrac = 0;
        break;
    }

    /* coerce values in pg_class to more desirable types */
    relpages = (BlockNumber) rel->rd_rel->relpages;
    reltuples = (double) rel->rd_rel->reltuples;
    relallvisible = (BlockNumber) rel->rd_rel->relallvisible;

    /*
     * If it's a zheap relation, then subtract the pages
     * to account for the metapage.
     */
    if (relpages > 0 && RelationStorageIsZHeap(rel))
    {
        curpages--;
        relpages--;
    }

An API may be needed to estimate the relation size based on the handler
type.

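One possible shape for that API (hypothetical names, not the actual patch):
each AM reports how many bookkeeping pages, such as zheap's metapage, hold
no user tuples, and estimate_rel_size subtracts them generically:

```c
#include <assert.h>

/* Hypothetical estimation callback: each table AM reports how many
 * bookkeeping pages (e.g. zheap's metapage) hold no user tuples. */
typedef struct TableAmEstimate
{
    unsigned (*overhead_pages) (void);
} TableAmEstimate;

static unsigned heap_overhead_pages(void)  { return 0; }
static unsigned zheap_overhead_pages(void) { return 1; /* metapage */ }

static const TableAmEstimate heap_est = {heap_overhead_pages};
static const TableAmEstimate zheap_est = {zheap_overhead_pages};

/* estimate_rel_size-style checks, free of zheap-specific branches. */
int
rel_is_clearly_empty(const TableAmEstimate *am, unsigned curpages)
{
    return curpages <= am->overhead_pages();
}

unsigned
estimate_data_pages(const TableAmEstimate *am, unsigned relpages)
{
    unsigned overhead = am->overhead_pages();

    return relpages > overhead ? relpages - overhead : 0;
}
```
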
pg_stat_get_tuples_hot_updated and others:
    /*
     * The tuples_hot_updated counter stores the number of hot updates for
     * a heap table, and the number of in-place updates for a zheap table.
     */
    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL ||
        RelationStorageIsZHeap(rel))
        result = 0;
    else
        result = (int64) (tabentry->tuples_hot_updated);

Is the special condition needed at all? The value should already be 0 for
zheap, right?

RelationSetNewRelfilenode:

    /* Initialize the metapage for a zheap relation. */
    if (RelationStorageIsZHeap(relation))
        ZheapInitMetaPage(relation, MAIN_FORKNUM);

A new API in RelationSetNewRelfilenode can solve this problem.
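A sketch of that new API (all names hypothetical): RelationSetNewRelfilenode
invokes an optional per-AM hook after creating the new storage, so zheap can
initialize its metapage without a hard-coded check. This same hook would
also cover the RelationTruncate and ExecuteTruncateGuts cases above:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Relation, tracking only what the sketch needs. */
typedef struct FakeRelation
{
    int metapage_initialized;
} FakeRelation;

/* Optional per-AM hook, run after the new storage file is created. */
typedef struct TableAmInitHooks
{
    void (*set_new_filenode_hook) (FakeRelation *rel);  /* may be NULL */
} TableAmInitHooks;

static void
zheap_set_new_filenode(FakeRelation *rel)
{
    rel->metapage_initialized = 1;      /* stands in for ZheapInitMetaPage */
}

static const TableAmInitHooks heap_hooks = {NULL};
static const TableAmInitHooks zheap_hooks = {zheap_set_new_filenode};

/* RelationSetNewRelfilenode-style caller: no zheap check needed. */
void
relation_set_new_relfilenode(FakeRelation *rel, const TableAmInitHooks *am)
{
    rel->metapage_initialized = 0;      /* fresh, empty storage */

    if (am->set_new_filenode_hook)
        am->set_new_filenode_hook(rel);
}
```
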

>
>> > >> And then:
>> > >> - lotsa cleanups
>> > >> - rebasing onto a newer version of the abstract slot patchset
>> > >> - splitting out smaller patches
>> > >>
>> > >>
>> > >> You'd moved the bulk insert into tableam callbacks - I don't quite
>> get
>> > >> why? There's not really anything AM specific in that code?
>> > >>
>> > >
>> > > The main reason of adding them to AM is just to provide a control to
>> > > the specific AM to decide whether they can support the bulk insert or
>> > > not?
>> > >
>> > > Current framework doesn't support AM specific bulk insert state to be
>> > > passed from one function to another and it's structure is fixed. This
>> needs
>> > > to be enhanced to add AM specific private members also.
>> > >
>> >
>> > Do you want me to work on it to make it generic to AM methods to extend
>> > the structure?
>>
>> I think the best thing here would be to *remove* all AM abstraction for
>> bulk insert, until it's actually needed. The likelihood of us getting
>> the interface right and useful without an actual user seems low. Also,
>> this already is a huge patch...
>>
>
> OK. Will remove them and share the patch.
>

Bulk insert API changes are removed.

>
>
>>
>> > @@ -308,7 +308,7 @@ static void CopyFromInsertBatch(CopyState cstate,
>> EState *estate,
>> > CommandId mycid, int hi_options,
>> > ResultRelInfo *resultRelInfo,
>> > BulkInsertState bistate,
>> > - int nBufferedTuples,
>> TupleTableSlot **bufferedSlots,
>> > + int nBufferedSlots,
>> TupleTableSlot **bufferedSlots,
>> > uint64 firstBufferedLineNo);
>> > static bool CopyReadLine(CopyState cstate);
>> > static bool CopyReadLineText(CopyState cstate);
>> > @@ -2309,11 +2309,12 @@ CopyFrom(CopyState cstate)
>> > void *bistate;
>> > uint64 processed = 0;
>> > bool useHeapMultiInsert;
>> > - int nBufferedTuples = 0;
>> > + int nBufferedSlots = 0;
>> > int prev_leaf_part_index = -1;
>>
>> > -#define MAX_BUFFERED_TUPLES 1000
>> > +#define MAX_BUFFERED_SLOTS 1000
>>
>> What's the point of these renames? We're still dealing in tuples. Just
>> seems to make the patch larger.
>>
>
> OK. I will correct it.
>
>
>>
>> > if (useHeapMultiInsert)
>> > {
>> > + int tup_size;
>> > +
>> > /* Add this tuple to the tuple
>> buffer */
>> > - if (nBufferedTuples == 0)
>> > + if (nBufferedSlots == 0)
>> > + {
>> > firstBufferedLineNo =
>> cstate->cur_lineno;
>> > -
>> Assert(bufferedSlots[nBufferedTuples] == myslot);
>> > - nBufferedTuples++;
>> > +
>> > + /*
>> > + * Find out the Tuple
>> size of the first tuple in a batch and
>> > + * use it for the rest
>> tuples in a batch. There may be scenarios
>> > + * where the first tuple
>> is very small and rest can be large, but
>> > + * that's rare and this
>> should work for majority of the scenarios.
>> > + */
>> > + tup_size =
>> heap_compute_data_size(myslot->tts_tupleDescriptor,
>> > +
>> myslot->tts_values,
>> > +
>> myslot->tts_isnull);
>> > + }
>>
>> This seems too expensive to me. I think it'd be better if we instead
>> used the amount of input data consumed for the tuple as a proxy. Does that
>> sound reasonable?
>>
>
> Yes, the cstate structure contains the line_buf member that holds the
> information of
> the line length of the row, this can be used as a tuple length to limit
> the size usage.
> comments?
>

The COPY FROM batch insert memory usage limit fix is attached, and grammar
support for the USING method is also provided for CREATE TABLE AS.

Regards,
Haribabu Kommi
Fujitsu Australia

Attachment Content-Type Size
0001-copy-memory-limit-fix.patch application/octet-stream 1.9 KB
0003-CREATE-AS-USING-method-grammer-support.patch application/octet-stream 2.0 KB
0002-Remove-of-Bulk-insert-state-API.patch application/octet-stream 6.9 KB
