LucidDbAppLib CONTIGUOUS VALUE INTERVALS

From Eigenpedia

Jump to: navigation, search

NOTES: this UDX is being developed; the contents of this page is not final.

Contents

Syntax

select * from table(
    contiguous_value_intervals(
        cursor( SQL_QUERY ), 
        row( PARTITIONING_COLUMNS ),
        row( TIMESTAMP_COLUMN )));

Purpose

The Contiguous Value Intervals UDX takes in an input table that has partitioning column(s), a timestamp column, and a clumping column. The input table must be sorted in ascending order of partitioning column(s) and timestamp column. The UDX output a table that has all input columns plus five extra columns (see details below). Within one input partition, adjacent rows having the same value of clumping column are clumped into one output row which has the value of current partition, current clump, timestamps at which the clump starts and ends. The output row also contains information about the value and timestamp of previous and next clump within the partition.

Parameters

Input:

  • SQL_QUERY: - rows to be aggregated, presorted on PARTITIONING_COLUMNS plus TIMESTAMP_COLUMN
  • PARTITIONING_COLUMNS: - one or more columns of the input which are used to break the input set into independent partitions
  • TIMESTAMP_COLUMN: - one column from the input with TIMESTAMP type

Output:

  • a table with the same columns as the input table, plus the five additional columns below
    • UNTIL_TIMESTAMP[TIMESTAMP]: indicates when current clumping ends
    • PREVIOUS_CLUMP[VARCHAR]: the previous clumping value
    • PREVIOUS_FROM_TIMESTAMP[TIMESTAMP]: when the previous clumping starts
    • NEXT_CLUMP[VARCHAR]: the next clumping value
    • NEXT_UNTIL_TIMESTAMP[TIMESTAMP]: when the next clumping ends

Calculation:

Let the input column which is not a PARTITIONING_COLUMNS or TIMESTAMP_COLUMN be referred to as the clumping column. For each distinct key within PARTITIONING_COLUMNS in the input set, identify a corresponding set of contiguous rows having that key as a partition. Within a partition, adjacent rows are clumped into contiguous runs all having the same value for the clumping column. For each partition, the result set contains a corresponding partition with number of output rows equal to corresponding number of input clumps. Ordering of partitions is preserved from input to output, as is the clumping column value within a partition.

  • For PARTITIONING_COLUMNS, the output values for a given partition are the same as the input values (which are the same for the entire partition).
  • For the clumping column, the output value for a given row (after the clumping operation described above) is the same as the input value. There should be one row of output per clump. If a partition goes through the same clumping values more than once (meaning the same value appears in non-adjacent rows), there will be multiple output rows. E.g. if an opportunity goes through stages COLD for 2 days, WARM for 3 days, COLD for 10 more days, HOT for 4 days, we'll produce two different rows for COLD since the original input rows aren't adjacent.
  • For the TIMESTAMP_COLUMN, the output value for a given row is equal to the timestamp from the first row in the corresponding clump of input rows
  • For the UNTIL_TIMESTAMP column, the output value for a given row is equal to the timestamp from the next row following the corresponding clump of input rows, or NULL for the last clump within a partition
  • For the PREVIOUS_CLUMP column, the output value for a given row is equal to the clumping column value from the previous clump within a partition, or NULL if it is the first clump within a partition
  • For the PREVIOUS_FROM_TIMESTAMP column, the output value for a given row is equal to the timestamp column value from the previous clump within a partition, or NULL if it is the first clump within a partition
  • For the NEXT_CLUMP column, the output value for a given row is equal to the clumping column value from the next clump within a partition, or NULL if it is the last clump within a partition
  • For the NEXT_UNTIL_TIMESTAMP column, the output value for a given row is equal to the until timestamp column value from the next clump within a partition, or NULL if it is the last clump within a partition

Boundary cases:

  • Empty input will return an empty table
  • There is no special handing for null values, other than for the timestamp column
    • This means that if a clumping column value is null, it will be outputted as is
    • The same goes for nulls in partitioning columns - null is a valid partition
  • Comparisons will be done using net.sf.farrago.syslib.FarragoSyslibUtil.compareKeyUsingGroupBySemantics
    • this means that empty string and a string of spaces are considered the same
  • Same Timestamp: For a partition and the clumping column, there could be multiple rows with the same Timestamp. In that case, only one output row is produced. Note that multiple rows with same timestamp but different clump values are treated as separated clumps following evaluation rules above
  • The input table can have more than one columns of type timestamp. One or more of these columns can be in PARITIONING_COLUMNS, but only one of these can be used as the TIMESTAMP_COLUMN.

Exceptions:

A fatal run-time exception is thrown under any of the following conditions:

  • Input table does not have exactly one clumping column (number of columns in the input table does not equal to the number of partitioning columns + 1 (for the timestamp column) + 1 (for clumping column))
  • The extra column in the input table (ie, the clumping column), is not of type CHAR or VARCHAR. This restriction may be lifted after enhancements has been made to the framework as specified in FarragoUdxResultTypeDerivation
  • Input table is not presorted correctly in ascending order of PARTITIONING_COLUMNS plus TIMESTAMP_COLUMN (as defined by SQL:2003 ORDER BY semantics, with NULLS first, and PAD SPACE collation characteristics for partitioning columns which are of VARCHAR datatype
    • for rules related to NULLS FIRST ordering, see SQL:2003 Part 2 Section 10.10 General Rule 1.e
    • for specifics of VARCHAR and PAD SPACE, see SQL:2003 Part 2 Section 4.2.2
  • TIMESTAMP_COLUMN is not a row of exactly one column that is of type TIMESTAMP
  • Any of the values in the timestamp column contains a NULL
  • The column in TIMESTAMP_COLUMN or the CLUMPING COLUMN is in PARTITIONING_COLUMNS

Example

select * from table(
    contiguous_value_intervals(
        cursor(
            select "OpportunityId", "OpportunityChangeTimestamp", "Stage" from opportunity_change_history
            order by "OpportunityId", "OpportunityChangeTimestamp"
        ), 
        row("OpportunityId"),
        row("OpportunityChangeTimestamp")));

INPUT_TABLES:

+ Input
OpportunityId OpportunityChangeTimestamp Stage
(null) 1800-01-01 00:00:00 (null)
(null) 1802-01-01 00:00:00 (null)
A 2007-10-12 12:23:41 F
A 2007-10-13 00:00:00 F
A 2007-10-14 00:00:00 F
A 2007-12-01 09:00:00 G
A 2007-12-01 15:00:00 F
A 2007-12-29 11:45:00 H
B 2007-09-09 05:00:23 F
C 2008-01-01 07:27:00 G
D 2007-12-04 16:56:09 F
D 2007-12-05 00:00:00 G
D 2007-12-05 01:00:00 H
D 2007-12-05 05:00:00 H
+ Output
OpportunityId OpportunityChangeTimestamp Stage UNTIL__TIMESTAMP PREV_CLUMP PREV_FROM_TIMESTAMP NEXT_CLUMP NEXT_UNTIL_TIMESTAMP
(null) 1800-01-01 00:00:00 (null) (null) (null) (null) (null) (null)
A 2007-10-12 12:23:41 F 2007-12-01 09:00:00 (null) (null) G 2007-12-01 15:00:00
A 2007-12-01 09:00:00 G 2007-12-01 15:00:00 F 2007-10-12 12:23:41 F 2007-12-29 11:45:00
A 2007-12-01 15:00:00 F 2007-12-29 11:45:00 G 2007-12-01 09:00:00 H (null)
A 2007-12-29 11:45:00 H (null) F 2007-12-01 15:00:00 (null) (null)
B 2007-09-09 05:00:23 F (null) (null) (null) (null) (null)
C 2008-01-01 07:27:00 G (null) (null) (null) (null) (null)
D 2007-12-04 16:56:09 F 2007-12-05 00:00:00 (null) (null) G 2007-12-05 01:00:00
D 2007-12-05 00:00:00 G 2007-12-05 01:00:00 F 2007-12-04 16:56:09 H (null)
D 2007-12-05 01:00:00 H (null) G 2007-12-05 00:00:00 (null) (null)

Source Code

Personal tools