Commit 3e70693

Implement index prefetch for index and index-only scans (#277)
* Implement index prefetch for index and index-only scans
* Move prefetch_blocks array to the end of BTScanOpaqueData struct
1 parent 05108c9 commit 3e70693

File tree: 9 files changed, +301 −6 lines changed


src/backend/access/nbtree/README

Lines changed: 44 additions & 0 deletions
@@ -1054,3 +1054,47 @@ item is irrelevant, and need not be stored at all. This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
 than key. Suffix truncation's negative infinity attributes behave in
 the same way.
+
+Notes About Index Scan Prefetch
+-------------------------------
+
+Prefetch can significantly improve the speed of OLAP queries.
+To be able to perform prefetch, we need to know which pages will
+be accessed during the scan. This is trivial for heap and bitmap
+scans, but requires more effort for index scans: to implement
+prefetch for index scans, we need to find out the subsequent leaf
+pages.
+
+Postgres links all pages at the same level of the B-Tree in a doubly
+linked list and uses this list for forward and backward iteration.
+This list, however, cannot trivially be used for prefetching, because
+to locate the next page we first need to load the current page. To
+prefetch more than just the next page, we can utilize the parent
+page's downlinks instead, as the parent contains references to most
+of the target page's sibling pages.
+
+Because Postgres' nbtree pages have no reference to their parent
+page, we need to remember the parent page while descending the btree
+and use it to prefetch subsequent pages. We utilize the linked list
+at the parent level to extend this prefetch scheme past the key range
+of the current parent page.
+
+We should prefetch not only leaf pages, but also the next parent
+page. The trick is to correctly calculate the moment when it will be
+needed: we should issue the prefetch request not when prefetch
+requests for all children of the current parent page have already
+been issued, but when only effective_io_concurrency line pointers are
+left to prefetch from the page.
+
+Currently there are two different prefetch implementations, one for
+index-only scans and one for index scans. An index-only scan doesn't
+need to access heap tuples, so it prefetches only B-Tree leaf pages
+(and their parents). Prefetch for an index-only scan is performed
+only if a parallel plan is not used: a parallel index scan obtains
+the next page inside a critical section, and the leaf page is loaded
+within that critical section. If most of the time is spent loading
+the page, this eliminates any concurrency and makes prefetch useless.
+For relatively small tables Postgres will not choose a parallel plan
+in any case, and for large tables it can be enforced by setting
+max_parallel_workers_per_gather=0.
+
+Prefetch for a normal (not index-only) index scan tries to prefetch
+the heap tuples referenced from the leaf page. The average number of
+items per page is about 100, which is comparable with the default
+value of effective_io_concurrency, so there is not much sense in also
+trying to prefetch the next leaf page.
+
+Since it is difficult to estimate the number of entries traversed by
+an index scan, we prefer not to prefetch a large number of pages from
+the very beginning: such useless prefetch can reduce the performance
+of point lookups. Instead, we start with the smallest prefetch
+distance and increase it by INCREASE_PREFETCH_DISTANCE_STEP after
+processing each item, until it reaches effective_io_concurrency. For
+an index-only scan we increase the prefetch distance after processing
+each leaf page, and for an index scan after processing each tuple.
+The only exception is the case when no key bounds are specified: then
+we traverse the whole relation, and it makes sense to start with the
+largest possible prefetch distance from the very beginning.

src/backend/access/nbtree/nbtinsert.c

Lines changed: 1 addition & 1 deletion
@@ -2159,7 +2159,7 @@ _bt_insert_parent(Relation rel,
 			   BlockNumberIsValid(RelationGetTargetBlock(rel))));

 	/* Find the leftmost page at the next level up */
-	pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
+	pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL, NULL);
 	/* Set up a phony stack entry pointing there */
 	stack = &fakestack;
 	stack->bts_blkno = BufferGetBlockNumber(pbuf);

src/backend/access/nbtree/nbtree.c

Lines changed: 1 addition & 0 deletions
@@ -367,6 +367,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)

 	so->killedItems = NULL;		/* until needed */
 	so->numKilled = 0;
+	so->prefetch_maximum = 0;	/* disable prefetch */

 	/*
 	 * We don't know yet whether the scan will be index-only, so we do not

src/backend/access/nbtree/nbtsearch.c

Lines changed: 210 additions & 4 deletions
@@ -17,12 +17,14 @@

 #include "access/nbtree.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "pgstat.h"
 #include "storage/predicate.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
-
+#include "utils/spccache.h"

 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
@@ -46,6 +48,7 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);

+#define INCREASE_PREFETCH_DISTANCE_STEP 1

 /*
  * _bt_drop_lock_and_maybe_pin()
@@ -841,6 +844,70 @@ _bt_compare(Relation rel,
 	return 0;
 }

+
+/*
+ * _bt_read_parent_for_prefetch - read parent page and extract references
+ * to children for prefetch. This function returns the offset of the
+ * first item.
+ */
+static int
+_bt_read_parent_for_prefetch(IndexScanDesc scan, BlockNumber parent, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	BTPageOpaque opaque;
+	OffsetNumber offnum;
+	OffsetNumber n_child;
+	int			next_parent_prefetch_index;
+	int			i, j;
+
+	buf = _bt_getbuf(rel, parent, BT_READ);
+	page = BufferGetPage(buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	offnum = P_FIRSTDATAKEY(opaque);
+	n_child = PageGetMaxOffsetNumber(page) - offnum + 1;
+
+	/*
+	 * Position where we should insert prefetch of the parent page: we
+	 * intentionally use prefetch_maximum here instead of
+	 * current_prefetch_distance, assuming that it will reach
+	 * prefetch_maximum before we reach the end of the parent page.
+	 */
+	next_parent_prefetch_index = (n_child > so->prefetch_maximum)
+		? n_child - so->prefetch_maximum : 0;
+
+	if (ScanDirectionIsForward(dir))
+	{
+		so->next_parent = opaque->btpo_next;
+		if (so->next_parent == P_NONE)
+			next_parent_prefetch_index = -1;
+		for (i = 0, j = 0; i < n_child; i++)
+		{
+			ItemId		itemid = PageGetItemId(page, offnum + i);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (i == next_parent_prefetch_index)
+				so->prefetch_blocks[j++] = so->next_parent;	/* time to prefetch next parent page */
+			so->prefetch_blocks[j++] = BTreeTupleGetDownLink(itup);
+		}
+	}
+	else
+	{
+		so->next_parent = opaque->btpo_prev;
+		if (so->next_parent == P_NONE)
+			next_parent_prefetch_index = -1;
+		for (i = 0, j = 0; i < n_child; i++)
+		{
+			ItemId		itemid = PageGetItemId(page, offnum + n_child - i - 1);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (i == next_parent_prefetch_index)
+				so->prefetch_blocks[j++] = so->next_parent;	/* time to prefetch next parent page */
+			so->prefetch_blocks[j++] = BTreeTupleGetDownLink(itup);
+		}
+	}
+	so->n_prefetch_blocks = j;
+	so->last_prefetch_index = 0;
+	_bt_relbuf(rel, buf);
+	return offnum;
+}
+
 /*
  * _bt_first() -- Find the first item in a scan.
  *
@@ -1100,6 +1167,37 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		}
 	}

+	/* Neon: initialize prefetch */
+	so->n_prefetch_requests = 0;
+	so->n_prefetch_blocks = 0;
+	so->last_prefetch_index = 0;
+	so->next_parent = P_NONE;
+	so->prefetch_maximum = IsCatalogRelation(rel)
+		? effective_io_concurrency
+		: get_tablespace_io_concurrency(rel->rd_rel->reltablespace);
+
+	if (scan->xs_want_itup)		/* index-only scan */
+	{
+		if (enable_indexonlyscan_prefetch)
+		{
+			/*
+			 * We disable prefetch for parallel index-only scans. Neon
+			 * prefetch is efficient only if prefetched blocks are accessed
+			 * by the same worker which issued the prefetch request. The
+			 * logic of splitting pages between parallel workers in an
+			 * index scan doesn't allow us to satisfy this requirement.
+			 * Also, prefetch of leaf pages is useless if the expected
+			 * number of rows fits in one page.
+			 */
+			if (scan->parallel_scan)
+				so->prefetch_maximum = 0;	/* disable prefetch */
+		}
+		else
+			so->prefetch_maximum = 0;	/* disable prefetch */
+	}
+	else if (!enable_indexscan_prefetch || !scan->heapRelation)
+		so->prefetch_maximum = 0;	/* disable prefetch */
+
+	/*
+	 * If key bounds are not specified, then we will scan the whole relation
+	 * and it makes sense to start with the largest possible prefetch
+	 * distance.
+	 */
+	so->current_prefetch_distance = (keysCount == 0) ? so->prefetch_maximum : 0;
+
 	/*
 	 * If we found no usable boundary keys, we have to start from one end of
 	 * the tree. Walk down that edge to the first or last key, and scan from
@@ -1370,6 +1468,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 */
 	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);

+	/* Start prefetching for index-only scan */
+	if (so->prefetch_maximum > 0 && stack != NULL && scan->xs_want_itup)	/* index-only scan */
+	{
+		int		first_offset = _bt_read_parent_for_prefetch(scan, stack->bts_blkno, dir);
+		int		skip = ScanDirectionIsForward(dir)
+			? stack->bts_offset - first_offset
+			: first_offset + so->n_prefetch_blocks - 1 - stack->bts_offset;
+
+		Assert(so->n_prefetch_blocks >= skip);
+		so->current_prefetch_distance = INCREASE_PREFETCH_DISTANCE_STEP;
+		so->n_prefetch_requests = Min(so->current_prefetch_distance, so->n_prefetch_blocks - skip);
+		so->last_prefetch_index = skip + so->n_prefetch_requests;
+		for (int i = skip; i < so->last_prefetch_index; i++)
+			PrefetchBuffer(rel, MAIN_FORKNUM, so->prefetch_blocks[i]);
+	}
+
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);

@@ -1497,9 +1610,63 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_heaptid = currItem->heapTid;
-	if (scan->xs_want_itup)
+	if (scan->xs_want_itup)		/* index-only scan */
+	{
 		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+	}
+	else if (so->prefetch_maximum > 0)
+	{
+		int			prefetchLimit, prefetchDistance;
+
+		/*
+		 * Neon: prefetch referenced heap pages. Since it is difficult to
+		 * predict how many items an index scan will return, we do not want
+		 * to prefetch many heap pages from the very beginning because they
+		 * may not be needed. So we increase the prefetch distance by
+		 * INCREASE_PREFETCH_DISTANCE_STEP at each index scan iteration
+		 * until it reaches prefetch_maximum.
+		 */
+
+		/* Advance prefetch distance until it reaches prefetch_maximum */
+		if (so->current_prefetch_distance + INCREASE_PREFETCH_DISTANCE_STEP <= so->prefetch_maximum)
+			so->current_prefetch_distance += INCREASE_PREFETCH_DISTANCE_STEP;
+		else
+			so->current_prefetch_distance = so->prefetch_maximum;
+
+		/* How much we can prefetch */
+		prefetchLimit = Min(so->current_prefetch_distance, so->currPos.lastItem - so->currPos.firstItem + 1);

+		/* Active prefetch requests */
+		prefetchDistance = so->n_prefetch_requests;
+
+		/*
+		 * Consume one prefetch request (if any)
+		 */
+		if (prefetchDistance != 0)
+			prefetchDistance -= 1;
+
+		/*
+		 * Keep the number of active prefetch requests equal to the current
+		 * prefetch distance. When the prefetch distance reaches prefetch
+		 * maximum, this loop performs at most one iteration, but at the
+		 * beginning of an index scan it performs up to
+		 * INCREASE_PREFETCH_DISTANCE_STEP+1 iterations.
+		 */
+		if (ScanDirectionIsForward(dir))
+		{
+			while (prefetchDistance < prefetchLimit && so->currPos.itemIndex + prefetchDistance <= so->currPos.lastItem)
+			{
+				BlockNumber blkno = BlockIdGetBlockNumber(&so->currPos.items[so->currPos.itemIndex + prefetchDistance].heapTid.ip_blkid);
+
+				PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, blkno);
+				prefetchDistance += 1;
+			}
+		}
+		else
+		{
+			while (prefetchDistance < prefetchLimit && so->currPos.itemIndex - prefetchDistance >= so->currPos.firstItem)
+			{
+				BlockNumber blkno = BlockIdGetBlockNumber(&so->currPos.items[so->currPos.itemIndex - prefetchDistance].heapTid.ip_blkid);
+
+				PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, blkno);
+				prefetchDistance += 1;
+			}
+		}
+		so->n_prefetch_requests = prefetchDistance;	/* update number of active prefetch requests */
+	}
 	return true;
 }

@@ -1906,6 +2073,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}

+	if (scan->xs_want_itup && so->prefetch_maximum > 0)	/* prefetch of leaf pages for index-only scan */
+	{
+		/* Advance prefetch distance until it reaches prefetch_maximum */
+		if (so->current_prefetch_distance + INCREASE_PREFETCH_DISTANCE_STEP <= so->prefetch_maximum)
+			so->current_prefetch_distance += INCREASE_PREFETCH_DISTANCE_STEP;
+
+		so->n_prefetch_requests -= 1;	/* we load the next leaf page, so decrement number of active prefetch requests */
+
+		/* Check if there are more children to prefetch at the current parent page */
+		if (so->last_prefetch_index == so->n_prefetch_blocks && so->next_parent != P_NONE)
+		{
+			/* we have prefetched all items from the current parent page, let's move to the next parent page */
+			_bt_read_parent_for_prefetch(scan, so->next_parent, dir);
+			so->n_prefetch_requests -= 1;	/* loading the parent page consumes one more prefetch request */
+		}
+
+		/* Try to keep the number of active prefetch requests equal to the current prefetch distance */
+		while (so->n_prefetch_requests < so->current_prefetch_distance && so->last_prefetch_index < so->n_prefetch_blocks)
+		{
+			so->n_prefetch_requests += 1;
+			PrefetchBuffer(scan->indexRelation, MAIN_FORKNUM, so->prefetch_blocks[so->last_prefetch_index++]);
+		}
+	}
+
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
@@ -2310,6 +2501,7 @@ _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
  */
 Buffer
 _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
+				 BlockNumber *parent,
 				 Snapshot snapshot)
 {
 	Buffer		buf;
@@ -2318,6 +2510,7 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 	OffsetNumber offnum;
 	BlockNumber blkno;
 	IndexTuple	itup;
+	BlockNumber parent_blocknum = P_NONE;

 	/*
 	 * If we are looking for a leaf page, okay to descend from fast root;
@@ -2335,6 +2528,7 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 	page = BufferGetPage(buf);
 	TestForOldSnapshot(snapshot, rel, page);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	blkno = BufferGetBlockNumber(buf);

 	for (;;)
 	{
@@ -2373,12 +2567,15 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 		offnum = P_FIRSTDATAKEY(opaque);

 		itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+		parent_blocknum = blkno;
 		blkno = BTreeTupleGetDownLink(itup);

 		buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
 		page = BufferGetPage(buf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	}
+	if (parent)
+		*parent = parent_blocknum;

 	return buf;
 }
@@ -2402,13 +2599,13 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	BTPageOpaque opaque;
 	OffsetNumber start;
 	BTScanPosItem *currItem;
-
+	BlockNumber parent;
 	/*
 	 * Scan down to the leftmost or rightmost leaf page. This is a simplified
 	 * version of _bt_search(). We don't maintain a stack since we know we
 	 * won't need it.
 	 */
-	buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir), scan->xs_snapshot);
+	buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir), &parent, scan->xs_snapshot);

 	if (!BufferIsValid(buf))
 	{
@@ -2421,6 +2618,15 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 		return false;
 	}

+	/* Start prefetching for index-only scan */
+	if (so->prefetch_maximum > 0 && parent != P_NONE && scan->xs_want_itup)	/* index-only scan */
+	{
+		_bt_read_parent_for_prefetch(scan, parent, dir);
+		so->n_prefetch_requests = so->last_prefetch_index = Min(so->prefetch_maximum, so->n_prefetch_blocks);
+		for (int i = 0; i < so->last_prefetch_index; i++)
+			PrefetchBuffer(rel, MAIN_FORKNUM, so->prefetch_blocks[i]);
+	}
+
 	PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);

src/backend/optimizer/path/costsize.c

Lines changed: 2 additions & 0 deletions
@@ -151,6 +151,8 @@ bool enable_parallel_hash = true;
 bool		enable_partition_pruning = true;
 bool		enable_async_append = true;
 bool		enable_seqscan_prefetch = true;
+bool		enable_indexscan_prefetch = true;
+bool		enable_indexonlyscan_prefetch = true;

 typedef struct
 {
