LucidDbSubqueries

From Eigenpedia

Jump to: navigation, search

Supported Usage of Subqueries in LucidDB

Contents

Introduction

A subquery is a SELECT statement within another statement. With subquery, the user does not have to write complex joins to perform lookups between relations (or within the same relation). The SQL is also more readable. For example, "find the empno of the youngest employee" can be expressed as,

   Q1
   select e1.empno
   from emps e1
   where e1.age = (select min(e2.age) from emps e2);

Subqueries come in many flavors and can be used in different ways inside a query statement.

Types of Subqueries

Correlated Subquery

A correlated subquery, as opposed to uncorrelated subquery, is a subquery that references values from the outer query. For example, "find the empno of the youngest employee in his department" can be expressed by the following query,

   Q2
   select e1.empno
   from emps e1
   where e1.age = (select min(e2.age) from emps e2 where e2.deptno = e1.deptno);

Scalar Subquery

A scalar subquery has the property of returning only one value (i.e. one row of a single column). It needs to appear inside a pair of parenthesis. Scalar subqueries can also be correlated. For example, "list a employee's empno and the difference between his age and the average age in the department", can be written as this query,

   Q3
   select e1.empno,
              (e1.age - (select avg(age) from emps e2 where e2.deptno = e1.deptno))
   from emps e1;

Table Subquery

A table subquery can return multiple rows, each having multiple columns.

Usages of Subqueries

Subqueries can be used in many places inside a SQL statement, for example, in predicates, in the select list, in lateral table constructs and as operand to row operaters like aggregates.

Predicates

Subqueries are used a lot in predicates, either directly in a predicate or via operators over the subquery.

A scalar subquery can be used in a predicate where a value is allowed. For example, " find all employees whose department name is not null",

   Q4
   select empno
   from  emps
   where (select name from depts where depts.deptno = emps.deptno) is not null;

A table subquery can appear in predicates that compare a value against a set of values(IN/NOT IN, ANY, ALL, SOME), or predicates that check the property of a set of values (EXISTS/NOT EXISTS). For example, "find all employees who work for departments listed in depts table":

   Q5
   select name, empno
   from emps
   where deptno in (select deptno from depts);

This question can also be answered by the following query,

   Q6
   select name, empno
   from emps
   where exists (select deptno from depts where deptno = emps.deptno);

Select List

Scalar subqueries can also appear in select list. For example, "list all the departments and the youngest worker's age in that department",

   Q7
   select deptno, (select min(age) from emps where emps.deptno = depts.deptno)
   from depts;

Aggregates

Mondrian can generate subqueries as operands to aggregates. For example, "find the ratio of an employee's age to the sum of ages of all employees in his department". (Note that the double-parentheses are required by the standard: outer pair for sum arguments, and inner pair for scalar subquery):

   Q8
   select e1.empno, e1.age/sum((select e2.age from emps e2 where e1.deptno = e2.deptno))
   from emps e1;

Lateral Derived Table

A Lateral Derived Table is used like an inlined view when a subquery is encapsulated by the LATERAL keyword. It differs from inlined views in that it can reference expressions produced by relations left, or earlier in join order, to the subquery. For example,

   Q9
   select emps.empno, d.deptno
   from emps, lateral (select * from depts where depts.deptno = emps.deptno) as d;

LucidDB Subquery Rewrite

LucidDB does not have the concept of using a separate RelNode tree (the "nested cursor" concept) to evaluate an expression. Subqueries are always evaluated as part of the same RelNode tree as the enclosing query. Join is used to connect the two RelNode trees. Join methods like hash join can be more efficient because the inputs are read only once. Furthermore, if there is any correlation, i.e. the left input of the join provides values to the right input, a join with the ability to restart the right input and pass values from the left input to the right is required. LucidDB does not have support for this type of Join. (Cartesian product cannot pass values and Hash Join does not restart the right input). Correlated subqueries will need to be decorrelated before evaluation.

Decorrelation

While translating SQL representation to logical representation, subqueries are rewritten into joins and filters if possible. For example, Q5 is equivalent to the following query.

   Q10
   select emps.name, emps.empno
   from emps inner join (select distinct deptno from depts) d
   on emps.deptno = d.deptno;

Notice the distinct aggregate is added because IN tests membership in a set so the equivalent join cannot have duplicates in the right input. In LucidDB, this query will generate the following physical plan. The distinct aggregate is folded into the Hash Join (and makes it a LeftSemi join which removes duplicates).

    FennelToIteratorConverter                               
        FennelReshapeRel(...)
            LhxJoinRel(leftKeys=[[1]], rightKeys=[[0]], joinType=[LEFTSEMI])
                FtrsIndexScanRel(table=[[LOCALDB, SALES, EMPS]], projection=[[1, 2]],...)
                FtrsIndexScanRel(table=[[LOCALDB, SALES, DEPTS]], projection=[[0]],...)

More Complex Examples

Subquery decorrelation is the process of transforming the above logical representation into an equivalent form using only joins and filters. Before decorrelation, the logical representation looks like this for Q6:

    ProjectRel(NAME=[$1], EMPNO=[$0])
        FilterRel(condition=[IS TRUE($10)])
           CorrelatorRel(condition=[true], joinType=[left], correlations=[[var0=offset2]])
               TableAccessRel(table=[[LOCALDB, SALES, EMPS]])
               AggregateRel(groupCount=[0], agg#0=[MIN(0)])
                   ProjectRel($f0=[true])
                       ProjectRel(DEPTNO=[$0])
                           FilterRel(condition=[=($0, $cor0.DEPTNO)])
                              TableAccessRel(table=[[LOCALDB, SALES, DEPTS]])

In LucidDB, decorrelation is done by joining the subquery with value generators that provide the correlated input to the subquery. The result of decorrelation is the logical representation of this equivalent query (for Q6).

   Q11
   select from emp
                       left outer join
                       (select G.deptno, MIN(G.indicator) as indicator
                        from (select depts.deptno as deptno, TRUE as indicator
                                  from depts inner join
                                           (select distinct depts from emps) G
                                  on depts.deptno = G.deptno)
                        group by G.deptno) T
   where T.indicator is true";

Note the MIN(G.indicator) in T is simply to produce a value TRUE for each group of G.deptno. The corresponding logical plan looks like this:

    ProjectRel(NAME=[$1], EMPNO=[$0])
        FilterRel(condition=[IS TRUE($11)])
           JoinRel(condition=[=($2, $10)], joinType=[left])
              TableAccessRel(table=[[LOCALDB, SALES, EMPS]])
              AggregateRel(groupCount=[1], agg#0=[MIN(1)])
                   ProjectRel($f0=[$1], $f0=[$0])
                      ProjectRel($f0=[true], $f0=[$1])
                         ProjectRel(DEPTNO=[$0], $f0=[$2])
                             FilterRel(condition=[=($0, $2)])
                               JoinRel(condition=[true], joinType=[inner])
                                 TableAccessRel(table=[[LOCALDB, SALES, DEPTS]])
                                 AggregateRel(groupCount=[1])
                                      ProjectRel($f0=[$2])
                                          TableAccessRel(table=[[LOCALDB, SALES, EMPS]])

The correlated logical plan is only available in the trace file (org.eigenbase.sql2rel.level=FINE). The decorrelated plan is also available via sqlline be issuing "explain plan without implementation" statement. Some of the ProjectRels in the plan will be optimized away (either pushed into the scan or combined) during physical plan generation. So the explain plan output actually has fewer operations:

    FennelToIteratorConverter                                                                                                                                                                           
        FennelReshapeRel(projection=[[1, 0]], filterOp=[COMP_EQ], filterOrdinals=[[4]], filterTuple=[[true]], ...)
            LhxJoinRel(leftKeys=[[2]], rightKeys=[[0]], joinType=[LEFT])                                                                                                                                            
                FtrsIndexScanRel(table=[[LOCALDB, SALES, EMPS]], projection=[[0, 1, 2]], ....)  
                LhxAggRel(groupCount=[1], agg#0=[MIN(1)])                                                                                                                                                     
                    IteratorToFennelConverter                                                                                                                                                                   
                      IterCalcRel(expr#0..1=[{inputs}], expr#2=[true], $f0=[$t1], $f0=[$t2])                                                                                                                    
                        FennelToIteratorConverter                                                                                                                                                               
                            LhxJoinRel(leftKeys=[[0]], rightKeys=[[0]], joinType=[INNER])                                                                                                                         
                                FtrsIndexScanRel(table=[[LOCALDB, SALES, DEPTS]], projection=[[0]], ...)
                                LhxAggRel(groupCount=[1])                                                                                                                                                           
                                    FennelRenameRel(fieldNames=[[$f0]])                                                                                                                                               
                                    FtrsIndexScanRel(table=[[LOCALDB, SALES, EMPS]], projection=[[2]],...)

A subquery can itself contain other subqueries. The inner subquery can correlate to both the outer subquery and the main query, as in this example:

   Q12
   select name
   from emps
   where exists(select *
                from depts
                where depts.deptno > emps.deptno or
                      exists (select *
                              from depts2
                              where depts.name = depts2.name
                                    and depts2.deptno <> emps.empno));

The decorrelated logical plan is quite complex in this case. The hash lines below mark the subtree rooted at a certain join rel.

    ProjectRel(NAME=[$1])
        FilterRel(condition=[IS TRUE($12)])
       --- JoinRel(condition=[AND(=($2, $11), =($0, $10))], joinType=[left])
       |      TableAccessRel(table=[[LOCALDB, SALES, EMPS]])                                  
       |      AggregateRel(groupCount=[2], agg#0=[MIN(2)])                                    
       |          ProjectRel($f00=[$1], $f01=[$2], $f0=[$0])                                    
       |             ProjectRel($f0=[true], $f00=[$2], $f01=[$3])                                
       |                 ProjectRel(DEPTNO=[$0], NAME=[$1], $f00=[$3], $f01=[$5])                  
       |                    FilterRel(condition=[OR(>($0, $5), IS TRUE($4))])                       
       |                 ---- JoinRel(condition=[true], joinType=[inner])                           
       |                 |   --- JoinRel(condition=[=($1, $2)], joinType=[left])                     
       |                 |   |      TableAccessRel(table=[[LOCALDB, SALES, DEPTS]])                   
       |                 |   |      AggregateRel(groupCount=[2], agg#0=[MIN(2)])                      
       |                 |   |           ProjectRel($f0=[$1], $f00=[$2], $f0=[$0])                       
       |                 |   |               ProjectRel($f0=[true], $f0=[$2], $f00=[$3])                   
       |                 |   |                   ProjectRel(DEPTNO=[$0], NAME=[$1], $f0=[$2], $f00=[$3])     
       |                 |   |                       FilterRel(condition=[AND(=($2, $1), <>($0, $3))])         
       |                 |   |                   ----- JoinRel(condition=[true], joinType=[inner])             
       |                 |   |                   |         TableAccessRel(table=[[LOCALDB, SALES, DEPTS2]])      
       |                 |   |                   |    ---  JoinRel(condition=[true], joinType=[inner])           
       |                 |   |                   |    |        AggregateRel(groupCount=[1])                        
       |                 |   |                   |    |             ProjectRel($f0=[$1])                              
       |                 |   |                   |    |                 TableAccessRel(table=[[LOCALDB, SALES, DEPTS]]) 
       |                 |   |                   |    |       AggregateRel(groupCount=[1])                        
       |                 |   |                   |    |            ProjectRel($f0=[$0])                              
       |                 |   ---                |--  ---              TableAccessRel(table=[[LOCALDB, SALES, EMPS]])  
       |                 |       AggregateRel(groupCount=[1])                                        
       |                 |            ProjectRel($f0=[$2])                                              
       -----            ---          TableAccessRel(table=[[LOCALDB, SALES, EMPS]])

NULL Semantics

For IN predicates, a null value belongs neither to the set returned by the subquery nor to the complement of that set. Because an IN subquery is translated to an inner join which does not match nulls against any value, this semantics is preserved. However, when translating NOT IN query into logical representation, special filters must be added so that a null value does not qualify the NOT IN condition. For example, "find all employees who do not work for any department in the depts table",

       Q13
       select empno
       from emps
       where deptno NOT IN (select deptno from depts)

has the following logical plan:

        ProjectRel(NAME=[$1], EMPNO=[$0])
            FilterRel(condition=[NOT(IS TRUE($12))])
               FilterRel(condition=[IS NOT NULL($10)])
                  JoinRel(condition=[=($10, $11)], joinType=[left])
                     ProjectRel([$0...9] = {inputs}, DEPTNO=[$2])
                     TableAccessRel(table=[[LOCALDB, SALES, EMPS]])
                         AggregateRel(groupCount=[1], agg#0=[MIN(1)])
                              ProjectRel($f0=[$0], $f1=[true])
                                  ProjectRel(DEPTNO=[$0])
                                      TableAccessRel(table=[[LOCALDB, SALES, DEPTS]])

Left outer joins include null keys on the left in the non joining set, which are rows satisfying the condition "NOT(IS TRUE($12))". The second filter "IS NOT NULL($10)" removes those nulls from this set. The remaining non joining set is the result set.

Reducing Non-Correlated Subqueries To Constants

Scalar, non-correlated subqueries always evaluate to a single, constant value. Therefore, LucidDB, by default, evaluates the subquery during optimization time, and replaces the subquery with the resulting constant. If the converted non-correlated subquery is contained in the WHERE clause of a query, this can result in improved query performance because it allows for more efficient processing of the filter. For example, if the filter is on an indexed column, the index can now be used to process the filter. Or, even if there is no index, the filter can be pushed down to the row scan, avoiding the need to invoke the calculator to process the filter.

When EXPLAIN PLAN is run on the query, the conversion is not explicitly done, but it is reflected in the plan as follows:

explain plan without implementation for
select * from emps
   where deptno = (select min(deptno) from depts);
'column0'
'ProjectRel(EMPNO=[$0], NAME=[$1], DEPTNO=[$2], GENDER=[$3], CITY=[$4], EMPID=[$5], AGE=[$6], PUBLIC_KEY=[$7], SLACKER=[$8], MANAGER=[$9])'
'  FilterRel(condition=[=($2, ?0)])'
'    TableAccessRel(table=[[LOCALDB, SALES, EMPS]])'

Note the ? in the FilterRel condition. This is a dynamic parameter that corresponds to the subquery result. The dynamic parameter only appears in the EXPLAIN output as a a placeholder for the subquery result. During the actual execution of the query, dynamic parameters play no role. I.e., rather than displaying the result of the scalar subquery (which would not be possible if at the time EXPLAIN is run, the subquery returns more than one row), we instead just use the dynamic parameter to indicate that the subquery will be replaced with a constant if the query were to be executed.

This optimization will also be done for EXISTS expressions on non-correlated subqueries. In this case, the subquery does not have to be a scalar subquery. It's simply evaluated and replaced with a boolean constant indicating whether the subquery returned zero, or one or more rows. In the case of the former, the boolean is set to false, and in the case of the latter, it's set to true. The EXPLAIN output would look like the following:

explain plan without implementation for
select name from emps where exists (select * from depts);
'column0'
'ProjectRel(NAME=[$1])'
'  FilterRel(condition=[?0])'
'    TableAccessRel(table=[[LOCALDB, SALES, EMPS]])'

Again, the ?0 in the FilterRel condition is merely a placeholder dynamic parameter corresponding to the boolean literal. The actual query plan will not contain this dynamic parameter. Note also that at runtime, the optimizer will perform additional reductions for queries like the one above. Specifically, if the subquery returns no rows, then there's no need to execute the outer query.

These optimizations do not apply when the subquery contains dynamic parameters. To completely disable them, run

alter session set "reduceNonCorrelatedSubqueries" = false;

Enforcing Scalar Subquery During Execution

Scalar subquery is a property (or rather, a constraint) detected during sql translation, and can only be checked when the subquery is evaluated, as it depends on the input data set. You normally won't be able to see this in the query plan because the scalar subquery is replaced with a dynamic parameter, as noted in the previous section. But if you were to disable subquery reduction, the plan will have the special aggregate SINGLE_VALUE(), which throws an exception if the aggregate sees multiple input rows:

        ProjectRel(EMPNO=[$0])
           FilterRel(condition=[=($6, $10)])
              JoinRel(condition=[true], joinType=[left])
                TableAccessRel(table=[[LOCALDB, SALES, EMPS]])
                AggregateRel(groupCount=[0], agg#0=[SINGLE_VALUE(0)])
                    ProjectRel(AGE=[$6])
                    TableAccessRel(table=[[LOCALDB, SALES, EMPS]])

Efficient Decorrelation

The generic decorrelation algorithm using value generators can result in suboptimal plans (Examples here). In essence the algorithm inner joins all the outer relations (the "value generators") with the subquery that references them to produce all the qualifying rows (forming the "lookup table"), which is then outer joined with the outer relation to produce the result set that might be further filtered. A minimum of two joins with the outer relation are required per subquery. When the outer query result set is big, or if there are nested correlations, the joins can take up a significant amount of time. However, it is possible to cut down the number of joins. Depending on the types of correlated references, and the property of the relations in the subquery and outer query, certain subqueries can be the "lookup" tables themselves and can be evaluated with just one join with the outer query, eliminating the join with the "value generator". LucidDB recognizes some of these subqueries and transforms them to joins directly.

Scalar Subquery

Correlated scalar subquery which projects an expression in its select list.

Correlation in Projection

A query where the only correlation is in the projected expression.

       Q14
       select deptno, (select depts.name from emps)
       from depts;

has the following final plan with just one join:

        ProjectRel(DEPTNO=[$0], EXPR$1=[$2])
          ProjectRel(DEPTNO=[$0], NAME=[$1], NAME=[CASE(IS NULL($12), null:$1)])
            JoinRel(condition=[true], joinType=[left])
              FtrsIndexScanRel(table=[[LOCALDB, SALES, DEPTS]])
              AggregateRel(groupCount=[0], agg#0=[SINGLE_VALUE(0)],...agg#10=[SINGLE_VALUE(10)])
                ProjectRel(EMPS.*=[$0..$9], nullIndicator=[true])
                  FtrsIndexScanRel(table=[[LOCALDB, SALES, EMPS]])

Correlation in Join Filter

The subquery contains a join filter that references the outer query. There might be correlated reference in the select list also. If the filter conditions are all equality comparisons AND'ed together, and the comparison fields form a unique key for the subquery, as this query below:

       Q15
       select deptno, (select emps.name from depts where emps.deptno + 1 = depts.deptno)
       from emps;

It has a plan like this one:

        ProjectRel(DEPTNO=[$2], EXPR$1=[$10])
          ProjectRel(EMPS.*=[$0...$9], NAME=[CASE(IS NULL($10), null:$1)])
            JoinRel(condition=[=(+($2, 1), $10)], joinType=[left])
              FtrsIndexScanRel(table=[[LOCALDB, SALES, EMPS]])
              FtrsIndexScanRel(table=[[LOCALDB, SALES, DEPTS]])

Notice the SINGLE_VALUE() aggregates are gone because the uniqueness property of the join key from the subquery. For any outer row, it is guaranteed that the inner query returns at most one row.

Scalar Subquery with Aggregates

Correlated scalar subquery which contains aggregates or expressions of aggregates in its select list.

Correlation in Inputs to Aggregates

The only correlation is in the select list; or more precisely in the input to the aggregate. If the outer query contains a unique key, for example,

       Q16
       select deptno, (select count(depts.name) from emps)
       from depts;

is evaluated as an aggregate on the unique key over the result of only one join.

        ProjectRel(DEPTNO=[$0], EXPR$1=[$2]
          ProjectRel($f0=[$0], $f1=[$1], $f2=[CAST($2):BIGINT])
            AggregateRel(groupCount=[2], agg#0=[COUNT(3)])
              ProjectRel($f0=[$0], $f1=[$1], $f2=[$1], $f3=[$12])
                JoinRel(condition=[true], joinType=[left])
                  FtrsIndexScanRel(table=[[LOCALDB, SALES, DEPTS]])
                  ProjectRel(EMPS.*=[$0...$9], nullIndicator=[true])
                    FtrsIndexScanRel(table=[[LOCALDB, SALES, EMPS]])

Note that the final count is on a generated field "nullIndicator" because count() needs to produce value 0 if the inner query returns an empty set. If the aggregate is sum/min/max, then it will be on the original field. The group by keys are simply all the outer query output fields. In the query above, it is grouping on DEPTS.deptno, DEPTS.name.

Correlation in Join Filter

The subquery contains a join filter that references the outer query. There might be correlated reference in the select list also. If the filter conditions are all equality comparisons AND'ed together, and the comparison fields form a unique key for the outer query, as this query below:

       Q17
       select deptno, (select count(depts.name) from emps where emps.deptno = depts.deptno)
       from depts;

It has a plan similar to Q16, except for the join condition:

        ProjectRel(DEPTNO=[$0], EXPR$1=[$2])
          ProjectRel($f0=[$0], $f1=[$1], $f2=[CAST($2):BIGINT])
            AggregateRel(groupCount=[2], agg#0=[COUNT(3)])
              ProjectRel($f0=[$0], $f1=[$1], $f2=[$1], $f3=[$12])
                JoinRel(condition=[=($4, $0)], joinType=[left])
                  FtrsIndexScanRel(table=[[LOCALDB, SALES, DEPTS]])
                  ProjectRel(EMPS.*=[$0..$9], nullIndicator=[true])
                    FtrsIndexScanRel(table=[[LOCALDB, SALES, EMPS]])

There are other types of correlated subquery that can be rewrritten without using value generators, for example, IN subqueries, EXISTS subqueries. Efficient decorrelation is not implemented for these subquery usages yet.

Summary of LucidDB Subquery Support

The table below summarizes the subquery support in LucidDB. In addition to the restrictions on decorrelation mentioned in the list, decorrelation is not supported if the correlated variables are used in a subquery that is the operand to any set operation. Multiset operators which are translated to correlated subqueries are not decorrelated, because the translation introduces set operations. Subquery decorrelation uses outer joins (or semi joins). Without outer join support, many decorrelated logical plans cannot be implemented. In LucidDB, all the decorrelated plans are implementable because outer joins and semi joins are supported by hash join.

This trace event dumps out the correlated logical plan and the decorrelated plan:

       org.eigenbase.sql2rel.level=FINE


Subquery Usage/Subquery Type scalar, uncorrelated scalar, correlated table, uncorrelated table, correlated
select list Y Y not allowed not allowed
input to aggregates (including count(distinct ....) Y Y not allowed not allowed
predicates
comparison with scalar value Y except when compared against aggregates in the HAVING clause or when used in the ON clause of joins. Work around is to use WHERE clause if possible. Y with same restrictions as the uncorrelated scalar subquery. not allowed not allowed
operators testing set properties
IN/NOT IN, EXISTS/NOT EXISTS N/A N/A Y Y
ANY, ALL, SOME N/A N/A N N
lateral derived tables N/A N/A Y Y except that multiset table, which is inherently correlated, is not decorrelated

Future Work

Subqueries usually can be rewritten into equivalent joins. However, those queries tend to be hard to understand, especially if there are correlations and nested subqueries. Because a general purpose decorrelation scheme tends to produce poor query plans, as shown in LucidDB Subquery Rewrite, it is still useful to add execution support for subqueries without decorrelation.

During execution of the query plan, the RelNode tree representing the subquery is connected to the main query via a special RelNode(call it "CorrRel"). The subquery part of the CorrRel input is restarted for each fetch of the main RelNode tree, and the subquery tree is evaluated with a new outer query row as the context. This CorrRel needs to support restartable input, as well as passing in the outer query context for each restart. CorrRel fetches from the subquery tree and evaluates the expression that references the subquery, for example, the exists() condition in Q6.