LucidDbWarehouseLabels

From Eigenpedia

Jump to: navigation, search

Contents

Introduction

This page provides an overview of an upcoming feature in LucidDB, warehouse labels. The feature is targeted for release 0.8; it's currently under development, so if you want to try it out, you'll need to build from Perforce.

A warehouse label is an identifier that corresponds to database state at some particular point in time, such as after successful completion of an ETL job. By applying warehouse labels to LucidDB sessions, those sessions can execute point-in-time queries, where the point-in-time corresponds to when the specified label was created. That way, sessions can query the database based on an ETL job that finished the day before, even though there is a new ETL job running concurrently.

Note that the database state only applies to LucidDB column store tables. External, foreign tables such as those created through the LucidDB flat file wrapper are excluded; likewise, catalog information is excluded.

Also, a label does not prevent a table from being dropped if that table exists at the time the label is created. In other words, if session A runs with a label setting where it can access table X, and session B subsequently drops table X, then session A will no longer be able to access table X.

This feature is possible because of LucidDB's underlying support for page versioning.

Usage Example

The following is an example of how labels can be used by a datawarehousing application.

  • On Monday evening, a nightly ETL job runs successfully, producing warehouse label, L1.
  • All subsequent report sessions that are initiated following the nightly run will use warehouse label L1, therefore seeing a copy of the warehouse corresponding to Monday night's ETL run. This continues throughout the next day, Tuesday.
  • At the end of the day on Tuesday, the nightly ETL job fires off again. Any report sessions that are running while this ETL job is running can continue to run based on label L1.
  • The Tuesday evening ETL job completes, creating a new warehouse label, L2.
  • All new report sessions will now start to use warehouse label L2 throughout the day on Wednesday.

Creating and Dropping Warehouse Labels

The syntax for creating and dropping warehouse labels is as follows:

CREATE [ OR REPLACE ] LABEL <labelname> [ FROM LABEL <parentLabelName> ]
    [ DESCRIPTION label-description ]
DROP LABEL <labelname> [ CASCADE | RESTRICT ]

Create

If the optional FROM clause in the CREATE LABEL statement is not specified, then a base label is created. The point-in-time associated with the base label corresponds to the time of the last committed transaction, at the time the label is created.

Without the optional clause, a logical label (or label alias) is created. The point-in-time associated with that label is the point-in-time associated with the parent label that the logical label serves as an alias for.

Label names must be unique across LucidDB. Information on active labels will be available through a new system view, DBA_LABELS.

Drop

Base labels can only be dropped if the label is not currently in-use by active LucidDB sessions. The label is in-use because either the base label or one of its aliases is being referenced. In other words, when an alias is referenced, LucidDB directly maps the alias to the corresponding base label. Therefore, label aliases can be dropped at any time, even if they are currently in-use by an active session. This makes it possible to replace the alias with a later point-in-time for new sessions, even if existing sessions are still using the original alias.

It will be the responsibility of LucidDB clients to keep track of when a base label is no longer in-use.

If a label has aliases defined on top of, then it cannot be dropped unless the CASCADE option is specified. The default behavior is RESTRICT.

Replace

Replacing an existing label either results in establishing a new point in time for a base label or defining a new parent label for an alias. If an existing label has aliases defined on top of it, then replacing that label results in the aliases following the replaced label. As a result, it is possible to inadvertently define a circular label chain. This is detected and disallowed.

To utilize label aliases, you would make use of the REPLACE option as follows:

  • ETL from Sunday evening completes. Define a base label and a label alias on top off of it, representing the last good nightly ETL.
CREATE LABEL "2008-Aug-03 ETL";
CREATE OR REPLACE LABEL "CURRENT_NIGHTLY" FROM LABEL "2008-Aug-03 ETL";
  • All report sessions utilize the CURRENT_NIGHTLY label and see the data as of the Sunday evening ETL run.
  • ETL from Monday evening completes. Define a new base label and then reset the label alias to point to the new base label.
CREATE LABEL "2008-Aug-04 ETL";
CREATE OR REPLACE LABEL "CURRENT_NIGHTLY" FROM LABEL "2008-Aug-04 ETL";
  • Note that any sessions that were running at the time of the Monday ETL run will continue to see data as of the Sunday ETL run; but any new sessions initiated after the Monday ETL run completed will see data as of the Monday ETL run. Again, all report sessions are using the CURRENT_NIGHTLY label.

Setting Warehouse Labels

By default, LucidDB sessions will run without a warehouse label setting. I.e., sessions will read the latest, committed snapshot of all data.

Labels can be set in one of two ways:

  1. By passing the label name as a property in the LucidDB JDBC connect string (either as part of the URL, or in the property map). The property name is "label", all in lower-case.
  2. By explicitly executing the following SQL statement:
ALTER SESSION SET "label" = {'<labelname>' | NULL }

The specified label can either be a base label or a label alias. In the case of the latter, LucidDB will map the alias to the underlying base label. Specifying NULL reverts the session back to the default of no label setting.

When a session runs with an explicit label setting, committed data as of the timestamp when the base label was created will be the data read.

Setting a label also effectively makes the session read-only. Attempts to execute DML (upsert, insert, delete) or DDL (create, drop) statements will result in errors. This includes DDL/DML statements executed within user-defined routine (UDR) calls. UDR calls are allowed, provided they don't result in the execution of any DDL or DML statements.

It is not appropriate to set a warehouse label in an ETL session.

In addition, when a label is set, this also controls the set of column store tables visible to the session. For example, if a new column store table is created after a label has been established, any session using that earlier label will not be able to access the new table.

The system view, DBA_SESSIONS, shows each session's current label setting. DBA_OBJECTS_IN_USE will not show labels that are in-use. To see the current session's label setting:

SELECT * FROM SYS_ROOT.USER_SESSION_PARAMETERS WHERE PARAM_NAME = 'label';

Space Reclamation

The semantics of the ALTER SYSTEM DEALLOCATE OLD statement is impacted if labels have been created. See LucidDbDataStorageAndAccess#Page_Versioning for a discussion of the semantics if no labels have been created. When there are active labels, a page cannot be deallocated unless it is older than the timestamp associated with the oldest active base label.

Typically, once an ETL job completes, creating a new label, the previous label can be dropped, and then ALTER SYSTEM DEALLOCATE OLD can be run to reclaim old pages. However, if the label that we would like to drop is still in-use because of a long-running report, even though a new ETL job has just completed, it will not be possible to reclaim the data pages that have been versioned by the just completed ETL job. The old pages cannot be reclaimed until the next execution of ALTER SYSTEM DEALLOCATE OLD, after the old base label has been dropped. And the old base label cannot be dropped until the long-running report has completed. Furthermore, even if the long-running report session completes, if there are other active base labels with even earlier timestamps, then the pages cannot be reclaimed either.

Data Statistics

If warehouse labels are not in use, then only one set of statistics are kept per table, column, and index, representing statistics gathered during the last execution of ANALYZE TABLE on a table; or in the case of row count stats, the latest DML statement executed on the table. With labels, there's now the need to keep multiple sets of statistics.

Typically, only one set of stats will be generated per ETL run. However, if multiple ETL sessions for different warehouses are running concurrently within the same LucidDB instance, it is possible for more than one set of stats to be generated during a single ETL run. That's because ETL sessions don't run with label settings and therefore, their only point of reference is the timestamp of the latest label. If that latest label is associated with some other warehouse, then a new set of stats will be created unnecessarily for the current warehouse. This shouldn't be a problem unless the number of concurrent ETL sessions is high.

Because DML sessions also update row count stats, they will follow the same conventions as ANALYZE TABLE in terms of which row count stat instance is updated by the DML operation.

Since each ETL run will generate at least one set of stats, it will be important for LucidDB to clean up inactive statistics. This will be done whenever a base label is dropped or replaced. Any stats associated with that base label will be removed, unless the stat is the latest. E.g., if there is only one base label, and therefore only one set of stats, dropping the label will not remove the only set of sets.

The newest stats generated earlier than the timestamp associated with a session's warehouse label setting will be the stats that are used by the query optimizer.

Although multiple sets of stats will be stored by LucidDB, the system views that contain statistics (DBA_STORED_TABLES, DBA_COLUMN_STATS, DBA_COLUMN_HISTOGRAMS) will only return the latest set of stats.