LucidDbEtlTutorial

This page provides a tutorial for using the extract/transform/load (ETL) features built into LucidDB. It walks through a very simple example (ignoring a lot of the real complexity of ETL, like slowly changing dimensions). The example data and scripts are provided as part of LucidDB under the examples/etl directory.

If you have used standalone ETL tools in the past, the LucidDB approach will probably seem unnatural to you at first. LucidDB expresses the entire ETL process as bulk-operation SQL dataflow, without any of the usual concepts of pulling individual rows from sources and writing them into targets. The benefits of the LucidDB approach become apparent once you've tried it:

  • simplicity (everything is either pure SQL or Java transformations pipelined directly within SQL)
  • expressive power (mix standard relational operators such as join, union, and aggregation with custom Java transformation logic, with any level of nesting)
  • modularity (SQL views can be used to build and debug complex pipelined transformations step by step)
  • convenience (ETL processes are never forced to arbitrarily materialize data into temporary tables)
  • performance (batch operations can be orders of magnitude faster than row-by-row operations)

Note that if you try to use a standalone ETL tool directly with LucidDB, it may not work well, because LucidDB's column-store architecture cannot implement single-row updates efficiently. So...work through this tutorial and see if the SQL-everything approach is right for you. If you still prefer a standalone ETL tool, that's OK; see LucidDbTalend for an integration example. One simple approach is to use a tool like that to generate flat files, and then bulk-load those into LucidDB (referring to the examples below for accessing flat files); this way you can still use LucidDB as your warehouse DBMS.

Scenario

In our example, we're starting with two sources of external data:

  • a flat file containing employee timesheet data (under the examples/etl/filedata directory)
  • an hsqldb database containing human resources information (under the examples/etl/jdbcdata directory)

What we would like to do is extract this data, transform it and load it into a warehouse schema, and then run analysis queries. One of the queries we want to run is "Which employees work on weekends?", so in addition to the external table data, we are going to use Java to generate a calendar table which can be used to correlate dates in a year with weekends.

Setup

The paths in this example are relative to the installation root. For them to work, start both the server and client with the installation root as the working directory. For example, to start the server:

cd luciddb-0.6.0
bin/lucidDbServer

and then the client in another shell:

cd luciddb-0.6.0
bin/sqllineClient

(For real applications, it's better to always use absolute paths so that you don't have to rely on the working directory from which the server was started.)

If you have problems running the tutorial, check LucidDbUserFaq.

Define Sources

The first step is to define links to the sources within LucidDB. In SQL/MED terms, these are "foreign servers". From sqlline, you can run the source creation script as follows:

!run examples/etl/sql/defineSources.sql

Let's walk through the output one step at a time.

0: jdbc:luciddb:> !run examples/etl/sql/defineSources.sql
create server jdbc_link
foreign data wrapper sys_jdbc
options(
driver_class 'org.hsqldb.jdbcDriver',
url 'jdbc:hsqldb:examples/etl/jdbcdata/scott',
user_name 'SA',
table_types 'TABLE');

The first command creates a link to the hsqldb database. It specifies a predefined driver ("foreign data wrapper" in SQL/MED terms) named sys_jdbc, and configures it with the correct JDBC driver and properties.

The next command does the same thing, except using the flatfile foreign data wrapper instead of JDBC:

create server file_link
foreign data wrapper sys_file_wrapper
options (
directory 'examples/etl/filedata/',
control_file_extension 'bcp',
file_extension 'txt',
with_header 'NO',
log_directory 'trace/',
field_delimiter '\t');

The next few commands create a schema to hold extracted metadata, and then use the IMPORT FOREIGN SCHEMA command to perform the metadata extraction. This creates table definitions in LucidDB's catalog corresponding to the file structure and hsqldb JDBC table structures. For simplicity, we extract all metadata into a single schema rather than keeping it separate by source. (For management purposes, it is usually better to keep sources separate in real applications.)

create schema extraction_schema;
import foreign schema sales
from server jdbc_link
into extraction_schema;
import foreign schema bcp
from server file_link
into extraction_schema;

Once IMPORT FOREIGN SCHEMA has been run, we can query LucidDB's own administrative views to see the table definitions (see LucidDbSystemViews):

select table_name, column_name
from sys_root.dba_columns
where schema_name='EXTRACTION_SCHEMA'
order by table_name,ordinal_position;
+-------------+---------------+
| TABLE_NAME  |  COLUMN_NAME  |
+-------------+---------------+
| DEPT        | DEPTNO        |
| DEPT        | DNAME         |
| DEPT        | LOC           |
| EMP         | EMPNO         |
| EMP         | ENAME         |
| EMP         | JOB           |
| EMP         | MGR           |
| EMP         | HIREDATE      |
| EMP         | SAL           |
| EMP         | COMM          |
| EMP         | DEPTNO        |
| EMP2        | EMPNO         |
| EMP2        | ENAME         |
| EMP2        | JOB           |
| EMP2        | MGR           |
| EMP2        | HIREDATE      |
| EMP2        | SAL           |
| EMP2        | COMM          |
| EMP2        | DEPTNO        |
| SALGRADE    | GRADE         |
| SALGRADE    | LOSAL         |
| SALGRADE    | HISAL         |
| TIMESHEET   | EMPNO         |
| TIMESHEET   | WORKDAY       |
| TIMESHEET   | HOURS_WORKED  |
+-------------+---------------+

So, we have departments, two copies of the employees table, salary grades, and timesheets. How much data are we dealing with?

set schema 'extraction_schema';
select count(*) from timesheet;
+---------+
| EXPR$0  |
+---------+
| 4060    |
+---------+
select count(*) from emp;
+---------+
| EXPR$0  |
+---------+
| 14      |
+---------+
select count(*) from dept;
+---------+
| EXPR$0  |
+---------+
| 4       |
+---------+

Not very much! Finally, the script retrieves the contents of the smallest table:

select * from dept;
+---------+-------------+-----------+
| DEPTNO  |    DNAME    |    LOC    |
+---------+-------------+-----------+
| 10      | ACCOUNTING  | NEW YORK  |
| 20      | RESEARCH    | DALLAS    |
| 30      | SALES       | CHICAGO   |
| 40      | OPERATIONS  | BOSTON    |
+---------+-------------+-----------+

Note that no data has actually been loaded into LucidDB tables yet; instead, LucidDB is acting as a SQL gateway to the hsqldb database.
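
If you want to see this for yourself, you can ask LucidDB to explain its plan for a query against one of the foreign tables; the plan should show the query being handled by the remote JDBC source rather than by any local storage. (The exact plan output varies by version, so it isn't reproduced here.)

explain plan for select * from extraction_schema.dept;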

Define Transforms

The next script defines a transformation schema containing views which adapt the source data into the form in which we want it to be loaded into the warehouse schema:

0: jdbc:luciddb:> !run examples/etl/sql/defineTransforms.sql
create schema transform_schema;
set schema 'transform_schema';
create view timesheet_view as
select * from extraction_schema.timesheet;
create view emp_view as
select e.empno, e.ename, d.dname, e.job
from extraction_schema.emp e inner join extraction_schema.dept d
on e.deptno=d.deptno;
create view calendar_view as
select
time_key as calendar_date,
case when weekend='Y' then true else false end as is_weekend
from table(applib.time_dimension(2005, 1, 1, 2005, 12, 31));
  • The timesheet view is easy; it's "fact" data, so we're going to pass it straight through.
  • The employees view is a little more complicated, because we are denormalizing the source schema into a star schema; employee will be a "dimension" with department name as an attribute. That way, we won't have to join to a department table every time we want to analyze by an employee's department.
  • Finally, the calendar view demonstrates the ability of a LucidDB user-defined transform (UDX) to make up data out of thin air. In this case, the time_dimension table function (provided in the prepackaged applib schema) generates one row for each day in the year 2005. The output columns have various useful attributes such as week number, month, day name, etc., but all we are interested in at the moment is which days fall on the weekend. (A quick look at the raw UDX output follows this list.)
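
To get a feel for what the UDX produces before the view filters it down, you can call it directly over a short date range and pick out a few columns. (Only columns used in this tutorial are selected below: TIME_KEY and WEEKEND appear in calendar_view above, and WEEK_NUMBER_IN_YEAR comes up again in the exercise at the end.)

-- peek at the raw time_dimension output for the first week of 2005
select time_key, weekend, week_number_in_year
from table(applib.time_dimension(2005, 1, 1, 2005, 1, 7));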

The next part of the script verifies the row counts after transformation:

select count(*) from timesheet_view;
+---------+
| EXPR$0  |
+---------+
| 4060    |
+---------+
select count(*) from emp_view;
+---------+
| EXPR$0  |
+---------+
| 14      |
+---------+
select count(*) from calendar_view;
+---------+
| EXPR$0  |
+---------+
| 365     |
+---------+

Good, the employee/department join didn't lose any rows (otherwise we should have used an outer join to preserve employees without departments!) and 2005 was not a leap year.
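
As an aside, if employees without a department were possible in the source data, the transform view could use a left outer join and substitute a placeholder department name. A sketch (with a made-up view name, not part of the tutorial scripts) might look like this:

create view transform_schema.emp_outer_view as
select e.empno, e.ename, coalesce(d.dname, 'UNASSIGNED') as dname, e.job
from extraction_schema.emp e left outer join extraction_schema.dept d
on e.deptno=d.deptno;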

The end of the script reads the contents of the joined employees view:

select * from emp_view;
+--------+---------+-------------+------------+
| EMPNO  |  ENAME  |    DNAME    |    JOB     |
+--------+---------+-------------+------------+
| 7369   | SMITH   | RESEARCH    | CLERK      |
| 7499   | ALLEN   | SALES       | SALESMAN   |
| 7521   | WARD    | SALES       | SALESMAN   |
| 7566   | JONES   | RESEARCH    | MANAGER    |
| 7654   | MARTIN  | SALES       | SALESMAN   |
| 7698   | BLAKE   | SALES       | MANAGER    |
| 7782   | CLARK   | ACCOUNTING  | MANAGER    |
| 7788   | SCOTT   | RESEARCH    | ANALYST    |
| 7839   | KING    | ACCOUNTING  | PRESIDENT  |
| 7844   | TURNER  | SALES       | SALESMAN   |
| 7876   | ADAMS   | RESEARCH    | CLERK      |
| 7900   | JAMES   | SALES       | CLERK      |
| 7902   | FORD    | RESEARCH    | ANALYST    |
| 7934   | MILLER  | ACCOUNTING  | CLERK      |
+--------+---------+-------------+------------+

Again, no data has actually been loaded yet; LucidDB is retrieving the contents of the two tables from hsqldb and hash-joining them on the fly, returning the results to sqlline.

Define Warehouse

The next script creates the warehouse schema which will be used by our real analysis queries:

0: jdbc:luciddb:> !run examples/etl/sql/defineWarehouse.sql
create schema warehouse;
set schema 'warehouse';
create table employee_dimension(
    emp_key int generated always as identity not null primary key,
    empno int not null,
    ename varchar(128) not null,
    dname varchar(128) not null,
    job varchar(128) not null,
unique(empno)
);
create table calendar_dimension(
    calendar_key int generated always as identity not null primary key,
    calendar_date date not null,
    is_weekend boolean not null,
unique(calendar_date)
);
create index calendar_weekend_idx on calendar_dimension(is_weekend);
create table timesheet_fact(
    timesheet_key int generated always as identity not null primary key,
    emp_key int not null,
    workday_key int not null,
    hours_worked smallint not null,
unique(emp_key, workday_key)
);
create index timesheet_workday_idx on timesheet_fact(workday_key);
create index timesheet_hours_worked_idx on timesheet_fact(hours_worked);

Points to note:

  • Standard practice in ETL is to assign each row in the warehouse a "surrogate key" instead of relying on the "natural keys" from the source system. In our example, emp_key is the surrogate key (which is almost always the PRIMARY KEY as well), and empno is the natural key. LucidDB provides the GENERATED column feature to make it easy to populate tables with surrogate keys. Natural keys are declared via UNIQUE constraints. See LucidDbUniqueConstraints for more on this topic.
  • LucidDB does not yet support declaration of foreign key constraints, but if it did, they would state that the references in the fact table are to the surrogate keys of the dimension table (NOT to the natural keys). For example, timesheet_fact.emp_key corresponds to employee_dimension.emp_key, not empno. We'll see how this correspondence is created when we run loads in the next section. (A sketch of what such declarations would look like appears after this list.)
  • We create indexes on columns of large tables which are likely to be used in predicates and as join keys. LucidDB automatically adapts indexes to data distributions, so the indexes on low-cardinality columns such as is_weekend and hours_worked will be bitmap indexes, whereas the indexes implicitly created to enforce UNIQUE constraints will be standard B-trees. No extra index is needed for timesheet_fact.emp_key, since it is the first column in the UNIQUE constraint, but one is needed for workday_key, since it is the second column, and we will be using it to join to calendar_dimension.
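
For illustration only, the missing foreign key declarations would look something like the standard SQL below; LucidDB will not accept them (per the note above), so treat this purely as documentation of the intended key correspondence.

-- NOT runnable in LucidDB; standard SQL shown only to document the intended references
alter table timesheet_fact add constraint timesheet_emp_fk
    foreign key (emp_key) references employee_dimension(emp_key);
alter table timesheet_fact add constraint timesheet_workday_fk
    foreign key (workday_key) references calendar_dimension(calendar_key);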

Run Full Load

OK, now we're finally ready to pump in some data and store it:

0: jdbc:luciddb:> !run examples/etl/sql/runFullLoad.sql
set schema 'warehouse';
insert into employee_dimension(empno, ename, dname, job)
select * from transform_schema.emp_view;
insert into calendar_dimension(calendar_date, is_weekend)
select * from transform_schema.calendar_view;
insert into timesheet_fact(emp_key, workday_key, hours_worked)
select
    e.emp_key,
    c.calendar_key,
    t.hours_worked
from
    transform_schema.timesheet_view t,
    employee_dimension e,
    calendar_dimension c
where
    t.empno=e.empno and t.workday=c.calendar_date;

That was easy. Just more SQL statements; there's no separate bulk-loader utility to run.

In the INSERT target lists, we leave out the surrogate keys, so they are filled in automatically from the autoincrement sequence. The transform views already have everything else ready. However, the fact load has to join to the preloaded dimensions in order to convert natural keys into surrogate keys. (In a real ETL process, there would usually be a lot more complexity here, including an extra layer of staging tables.)

LucidDB can handle bulk-updates to indexes directly as part of bulk-loads to tables, so there's never a need to drop and recreate indexes as a load step. (Some other DBMS products require this for efficiency.)

After loading data, it's a good idea to run ANALYZE TABLE to update the statistics in the catalog; otherwise, LucidDB's optimizer won't operate effectively:

analyze table employee_dimension estimate statistics for all columns;
analyze table calendar_dimension estimate statistics for all columns;
analyze table timesheet_fact estimate statistics for all columns;

Finally, another sanity check to verify that the data was loaded correctly:

select count(*) from employee_dimension;
+---------+
| EXPR$0  |
+---------+
| 14      |
+---------+
select count(*) from calendar_dimension;
+---------+
| EXPR$0  |
+---------+
| 365     |
+---------+
select count(*) from timesheet_fact;
+---------+
| EXPR$0  |
+---------+
| 4060    |
+---------+
select * from employee_dimension;
+----------+--------+---------+-------------+------------+
| EMP_KEY  | EMPNO  |  ENAME  |    DNAME    |    JOB     |
+----------+--------+---------+-------------+------------+
| 0        | 7369   | SMITH   | RESEARCH    | CLERK      |
| 1        | 7499   | ALLEN   | SALES       | SALESMAN   |
| 2        | 7521   | WARD    | SALES       | SALESMAN   |
| 3        | 7566   | JONES   | RESEARCH    | MANAGER    |
| 4        | 7654   | MARTIN  | SALES       | SALESMAN   |
| 5        | 7698   | BLAKE   | SALES       | MANAGER    |
| 6        | 7782   | CLARK   | ACCOUNTING  | MANAGER    |
| 7        | 7788   | SCOTT   | RESEARCH    | ANALYST    |
| 8        | 7839   | KING    | ACCOUNTING  | PRESIDENT  |
| 9        | 7844   | TURNER  | SALES       | SALESMAN   |
| 10       | 7876   | ADAMS   | RESEARCH    | CLERK      |
| 11       | 7900   | JAMES   | SALES       | CLERK      |
| 12       | 7902   | FORD    | RESEARCH    | ANALYST    |
| 13       | 7934   | MILLER  | ACCOUNTING  | CLERK      |
+----------+--------+---------+-------------+------------+

Note the surrogate key sequence in the emp_key column.

Run Analysis Queries

At last we're ready to run real queries:

0: jdbc:luciddb:> !run examples/etl/sql/runAnalytics.sql
set schema 'warehouse';
-- average hours worked by employee
select ename, avg(hours_worked) as avg_hours
from timesheet_fact t, employee_dimension e
where t.emp_key=e.emp_key
group by ename;
+---------+------------+
|  ENAME  | AVG_HOURS  |
+---------+------------+
| BLAKE   | 7          |
| MARTIN  | 11         |
| CLARK   | 7          |
| SMITH   | 8          |
| FORD    | 11         |
| MILLER  | 7          |
| ALLEN   | 13         |
| WARD    | 10         |
| KING    | 4          |
| SCOTT   | 11         |
| JONES   | 7          |
| ADAMS   | 7          |
| TURNER  | 11         |
| JAMES   | 7          |
+---------+------------+

Hmmm...that's a wide range. Does it have anything to do with the job they're doing?

-- average hours worked by job
select job, avg(hours_worked) as avg_hours
from timesheet_fact t, employee_dimension e
where t.emp_key=e.emp_key
group by job;
+------------+------------+
|    JOB     | AVG_HOURS  |
+------------+------------+
| SALESMAN   | 11         |
| ANALYST    | 11         |
| CLERK      | 7          |
| MANAGER    | 7          |
| PRESIDENT  | 4          |
+------------+------------+

Someone should check the president's golf logs instead.

-- who works on weekends?
select distinct ename, job
from timesheet_fact t, employee_dimension e, calendar_dimension c
where t.emp_key=e.emp_key
and t.workday_key=c.calendar_key
and c.is_weekend;
+---------+-----------+
|  ENAME  |    JOB    |
+---------+-----------+
| ALLEN   | SALESMAN  |
| MARTIN  | SALESMAN  |
| TURNER  | SALESMAN  |
| WARD    | SALESMAN  |
+---------+-----------+

Guess those salespeople must have had crazy quotas in 2005!

Run Incremental Load

The last script gives a quick look at LucidDB's upsert feature:

0: jdbc:luciddb:> !run examples/etl/sql/runIncrementalLoad.sql
select * from extraction_schema.emp2;
+--------+--------+-------------+------+-------------+-------+-------+---------+
| EMPNO  | ENAME  |     JOB     | MGR  |  HIREDATE   |  SAL  | COMM  | DEPTNO  |
+--------+--------+-------------+------+-------------+-------+-------+---------+
| 7839   | REX    | EL SUPREMO  |      | 1981-11-17  | 5000  |       | 40      |
| 7999   | KEATS  | ANALYST     |      | 1981-09-07  | 3200  |       | 10      |
+--------+--------+-------------+------+-------------+-------+-------+---------+

The EMP2 source table has two rows. The first one corresponds to an existing empno, the employee formerly known as KING. Apparently he is suffering from delusions of grandeur, and has also been reassigned to a different department. The second row is a new one which didn't appear in the original table at all.

To perform an incremental load, the script first creates a new transform view to represent the extra rows. (In a real application it would usually just be reloading from an updated state of the EMP table, but we have to fake it here with a separate table.)

create view transform_schema.emp2_view as
select e.empno, e.ename, d.dname, e.job
from extraction_schema.emp2 e inner join extraction_schema.dept d
on e.deptno=d.deptno;

Next, the actual upsert:

merge into warehouse.employee_dimension tgt
using transform_schema.emp2_view src
on src.empno=tgt.empno
when matched then
update set ename=src.ename, dname=src.dname, job=src.job
when not matched then
insert(empno, ename, dname, job)
values(src.empno, src.ename, src.dname, src.job);

The merge statement tells LucidDB what to do based on whether the incoming row matches an existing one on empno: if it does, update the existing row (leaving the empno and emp_key alone); if it doesn't, insert it as a new row (assigning it a brand new surrogate key):

select * from warehouse.employee_dimension;
+----------+--------+---------+-------------+-------------+
| EMP_KEY  | EMPNO  |  ENAME  |    DNAME    |     JOB     |
+----------+--------+---------+-------------+-------------+
| 0        | 7369   | SMITH   | RESEARCH    | CLERK       |
| 1        | 7499   | ALLEN   | SALES       | SALESMAN    |
| 2        | 7521   | WARD    | SALES       | SALESMAN    |
| 3        | 7566   | JONES   | RESEARCH    | MANAGER     |
| 4        | 7654   | MARTIN  | SALES       | SALESMAN    |
| 5        | 7698   | BLAKE   | SALES       | MANAGER     |
| 6        | 7782   | CLARK   | ACCOUNTING  | MANAGER     |
| 7        | 7788   | SCOTT   | RESEARCH    | ANALYST     |
| 9        | 7844   | TURNER  | SALES       | SALESMAN    |
| 10       | 7876   | ADAMS   | RESEARCH    | CLERK       |
| 11       | 7900   | JAMES   | SALES       | CLERK       |
| 12       | 7902   | FORD    | RESEARCH    | ANALYST     |
| 13       | 7934   | MILLER  | ACCOUNTING  | CLERK       |
| 8        | 7839   | REX     | OPERATIONS  | EL SUPREMO  |
| 14       | 7999   | KEATS   | ACCOUNTING  | ANALYST     |
+----------+--------+---------+-------------+-------------+

KING/REX kept his EMP_KEY and EMPNO, whereas KEATS was assigned a new EMP_KEY. But why did KING's position in the table move? Because LucidDB physically implements UPSERT via a combination of DELETE and INSERT, and rows are always inserted at the end.
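
If a stable presentation order matters, just ask for one explicitly and the physical row movement becomes invisible:

select * from warehouse.employee_dimension order by emp_key;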

Exercise

To get a better understanding of everything above, it might help to try to make some changes on your own.

Here's your challenge: find out the average number of hours worked by employee per week rather than per day.

Hint: the applib.time_dimension function produces a column called WEEK_NUMBER_IN_YEAR.
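
If you get stuck, here is one possible outline, deliberately rough: extend calendar_view and calendar_dimension with a week_number column fed from WEEK_NUMBER_IN_YEAR, rerun the load, then sum hours per employee per week and average the weekly totals. The query below assumes that hypothetical week_number column exists.

-- one possible approach; assumes calendar_dimension has gained a week_number column
select ename, avg(weekly_hours) as avg_weekly_hours
from (
    select e.ename, c.week_number, sum(t.hours_worked) as weekly_hours
    from warehouse.timesheet_fact t,
         warehouse.employee_dimension e,
         warehouse.calendar_dimension c
    where t.emp_key=e.emp_key and t.workday_key=c.calendar_key
    group by e.ename, c.week_number
) as weekly
group by ename;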

Error Handling

Often during ETL, a few "bad apple" rows will cause the whole operation to fail. LucidDB provides facilities for dealing with this situation; see LucidDbErrorHandling.

OLAP

Once you've got your data into a LucidDB star schema, read about LucidDbOlap to learn how to analyze it using a multi-dimensional query language.
