LucidDbIndexAccess

From Eigenpedia

Jump to: navigation, search

This document describes how Table Scan in LucidDb queries can be optimized using Index Access. It provides working knowledge of predicate recognition, costing of index access paths, and index selection from multiple candidate indexes.

Some knowledge of Bitmap Index and Table Access is helpful to understanding the concepts referenced here.

Query Predicates

The evaluation of filters with range predicates or equality predicates on leading key columns can benefit from using bitmap indexes to access only the rows that satisfy the filter, rather than evaluating the filter on all the rows returned from a full table scan.

For example, if there is an index on column DEPTNO of table EMPS, a query that retrieves "records for all employees who work for department #10" will be able to use the index to look up the RIDs for these rows and only scan the portion of the table EMPS that contain those RIDs.

explain plan for
select * from emps 
where deptno = 10;

produces this plan:

'FennelToIteratorConverter'
'  LcsRowScanRel(table=[[LOCALDB, SALES, EMPS]], projection=[*], clustered indexes=[[SYS$CLUSTERED_INDEX$EMPS$DEPTNO, SYS$CLUSTERED_INDEX$EMPS$EMPNO, SYS$CLUSTERED_INDEX$EMPS$ENAME]])'
'    LcsIndexSearchRel(table=[[LOCALDB, SALES, EMPS]], index=[DEPTNO_IX], projection=[*], inputKeyProj=[[1, 3]], inputDirectiveProj=[[0, 2]], startRidParamId=[0], rowLimitParamId=[0])'
'      FennelValuesRel(tuples=[[{ '[', 10, ']', 10 }]])'

Predicates that can be used in index search is called "sargable" predicates. Not all predicates are sargable. For example, if the predicate is changed to:

   deptno = 10 OR empno = 100

it is not correct to use the index on deptno to retrieve only rows satisfying "deptno = 10", because the result set could miss rows that only satisfy the second condition. If there is another index on empno, then it is possible to combine(OR) the resulting RIDs from the two indexes before feeding the RIDs into the table scan, as long as the OR condition is recognized and bitmap OR is supported. Whether a predicate is sargable(or contains sargable fragments) depends on the types of the predicates, the presence of suitable indexes, and the support offered by the DB.

The following predicates are not sargable due to the types of the conditions used:

  deptno is not null
  name like 'abc'

The supported conditions are defined by the following rules:

<supported condition> ::= 
    <OR condition>(COL) |
    <OR condition>(COL1) AND <supported condition>
<OR condition>(COL) :: = 
    <simple condition>(COL) |
    <simple condition>(COL) OR <OR condition>(COL) | 
    NOT <OR condition>(COL)
<simple condition>(COL) ::=
    COL =  value |
    COL >  value |
    COL >= value |
    COL <  value |
    COL <= value | 
    COL is null
and additionally for boolean types:
<simple condition>(COL) ::=
    COL |
    COL is true    |
    COL is false   |
    COL is unknown

One can tell from the rules that LucidDB currently only recognizes predicates in conjunctive form(AND'ed together) at the top level, with the exception of predicates on the same column that are OR'ed together. These are all sargable predicates supported by LucidDB:

   (deptno = 10 OR deptno = 20)
   (deptno = 10 OR deptno = 20) AND empno = 100
   (deptno = 10 OR deptno = 20) AND (empno = 100 OR empno = 200)

If there is no index on empno, these last two predicates are not sargable but the fragments referencing deptno are sargable.

Index Access Selection

Consider a more complex query with this predicate:

   deptno = 10 AND empno > 100

If there are indexes Index_D(deptno), Index_E(empno), and Index_DE(deptno, empno), LucidDb can choose from several combinations of these access methods: index access, residual filtering, regular post-scan filtering. For example,

(1)Use Index_DE to find only matching RIDs that satisfy both conditions and use those RIDs to perform table scan. The plan will look like:

'FennelToIteratorConverter'
'  LcsRowScanRel(table=[[LOCALDB, SALES, EMPS]], projection=[*], clustered indexes=[[SYS$CLUSTERED_INDEX$EMPS$DEPTNO, SYS$CLUSTERED_INDEX$EMPS$EMPNO, SYS$CLUSTERED_INDEX$EMPS$ENAME]])'
'    LcsIndexSearchRel(table=[[LOCALDB, SALES, EMPS]], index=[INDEX_DE], projection=[*], inputKeyProj=[[1, 2, 4, 5]], inputDirectiveProj=[[0, 3]], startRidParamId=[0], rowLimitParamId=[0])'
'      FennelValuesRel(tuples=[[{ '(', 10, 100, '+', 10, null }]])'

(2)Use Index_D to find matching RIDS for rows satisfying the condition "deptno = 10", and then perform column filtering on those rows so that only the rows passing both conditions are returned by the scan:

'FennelToIteratorConverter'
'  LcsRowScanRel(table=[[LOCALDB, SALES, EMPS]], projection=[*], clustered indexes=[[SYS$CLUSTERED_INDEX$EMPS$DEPTNO, SYS$CLUSTERED_INDEX$EMPS$EMPNO, SYS$CLUSTERED_INDEX$EMPS$ENAME]], residual columns=[[1]])'
'    LcsIndexSearchRel(table=[[LOCALDB, SALES, EMPS]], index=[INDEX_D], projection=[*], inputKeyProj=[[1, 3]], inputDirectiveProj=[[0, 2]], startRidParamId=[0], rowLimitParamId=[0])'
'      FennelValuesRel(tuples=[[{ '[', 10, ']', 10} ]])'
'    FennelValuesRel(tuples=[[{ '(', 100, '+', null }]])'

(3)...and other variations such as using Index_E and residual filtering on deptno = 10; residual filtering for both columns:

'FennelToIteratorConverter'
'  LcsRowScanRel(table=[[LOCALDB, SALES, EMPS]], projection=[*], clustered indexes=[[SYS$CLUSTERED_INDEX$EMPS$DEPTNO, SYS$CLUSTERED_INDEX$EMPS$EMPNO, SYS$CLUSTERED_INDEX$EMPS$ENAME]], residual columns=[[0, 1]])'
'    FennelValuesRel(tuples=[[{ '[', 10, ']', 10} ]])'
'    FennelValuesRel(tuples=[[{ '(', 100, '+', null }]])'

The optimizer computes the cost for each one of these combinations and selects the one with the lowest cost. Obviously, if a query contains many predicates against a table with many indexes, the number of combinations(the "search space") grows quickly. To contain the search space growth, LucidDB index selection algorithm does not actually consider all the valid mappings of assigning index to evaluate predicates. It uses a greedy algorithm that calculates the cost of table scan while adding to the set of indexes used by this scan. The cost should reduce at the beginning reflecting the saving if scan is avoided for non-qualifying rows. The saving will become less when enough indexes are used since fewer rows will qualify. When the cost stops reducing, or an empirical thresh hold value has been reach for the combined selectivity of the selected indexes, the optimal set of indexes has been found.

Index Access Costing

For each valid access path for a table scan, including index access path, a cost is computed and the one with the lowest cost is chosen. Some assumptions are made about the cost of performing a unit of work, such as calculating a filter, computing the ORing of two bitmaps of unit size, or to scan a row from the disk; also, it is assumed that column values for all qualifying rows reside in contiguous regions of the disk rather than in scattered areas.

The cost of scanning a table with sargable predicates consists of

   * the cost of performing the index search for the predicates using indexes
   * the cost of scanning the table using the RID stream from Index Search
   * the cost of evaluating the remaining predicates using residual filtering

and the cost of performing index search consists of

   * the cost of scanning the indexes
   * the cost of merging and sorting the bitmaps from each index
   * the cost of intersecting the resulting bitmaps from all indexes
   * the cost of intersecting the resulting bitmap with the deletion index bitmap

The intuition to get from the cost formula is that using one more index incurs a cost which is unrelated to the selectivity of the indexes already in use; while as the cost of using a residual filter is proportional to the selectivity of indexes already in use. This is why it is not always optimal to use all possible indexes.

Personal tools