LucidDbTalend

Introduction

This page describes experimental integration of LucidDB with Talend Open Studio. Talend looks like a good match for LucidDB since it supports an "ELT" model (pushing execution down to the DBMS), but some contributions to Talend will be needed to make it aware of LucidDB and to take advantage of advanced capabilities such as UDXs and upserts.

Readers are encouraged to first work through the LucidDbEtlTutorial to get a feel for the LucidDB approach to data integration before moving on to a generic system such as Talend.

Setup

Download a copy of Talend Open Studio and browse through the documentation and tutorials. The version of Talend used for the experiments described in this page was 2.3.0M2, Build id: r7640-20071218-1716 (the latest milestone build available at the time).

Start the LucidDB server as described in LucidDbGettingStarted, and then, using sqllineClient, populate some sample metadata as below:

create schema talend_test;

create view talend_test.date_view as
select
    time_key as calendar_date,
    case when weekend='Y' then true else false end as is_weekend
from table(applib.time_dimension(2005, 1, 1, 2005, 12, 31));

create table talend_test.date_list(
    calendar_key int generated always as identity not null primary key,
    calendar_date date not null,
    is_weekend boolean not null,
    unique(calendar_date)
);
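
As a quick sanity check (purely illustrative) before involving Talend, you can confirm from sqllineClient that the view is populated:

-- 2005 has 365 days
select count(*) from talend_test.date_view;

-- 2005-01-01 fell on a Saturday, so the first two rows should be weekend days
select * from talend_test.date_view
where calendar_date < date '2005-01-04'
order by calendar_date;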

Connect

Currently, Talend does not officially support the use of arbitrary JDBC drivers for extracting metadata. The real solution would be to make Talend aware of LucidDB, but in the meantime, a trick is required.

First, make sure you are using LucidDB 0.7.3 or later (the JDBC driver was modified as part of making the trick work).

Next, edit the configuration for the launcher script you use to start Talend. On Linux, TalendOpenStudio-linux-gtk-x86.ini looks like this by default:

-vmargs
-Xms40m
-Xmx500m
-XX:MaxPermSize=128m

Edit it as follows:

-vmargs
-Xbootclasspath/a:/path/to/where/you/downloaded/modified/version/of/LucidDbClient.jar
-Xms40m
-Xmx500m
-XX:MaxPermSize=128m
-Djdbc.drivers=com.lucidera.jdbc.LucidDbRmiDriver

This puts the LucidDB client driver on the bootstrap classpath and tells the JDBC DriverManager to load it automatically. The driver has to go on the bootstrap classpath (rather than the normal system classpath) because Talend is Eclipse-based, and Eclipse loads each plugin in a private classloader which can't see the system classpath, but can see the bootstrap classpath. There's probably a cleaner way to accomplish this if you know Eclipse well.

Now, close Talend if it's already open, and then re-launch it. Create a new project, or open an existing one.

In the tree, open the Metadata node, and then right-click on the Db Connections node and choose Create connection. Fill in "LucidDB Integration" for the connection name, and then click Next.

Fill in the connection info. Start by choosing Generic ODBC for DB Type, and then override the String of Connection to use a LucidDB-specific URL rather than the ODBC bridge driver. Be sure to include the port number (5434) and the schema name. Click the Check button to confirm that the connection is working.
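
Assuming the server is running locally on the default port and the schema created above, the connection string looks like this (the same URL gets reused later when configuring job components):

jdbc:luciddb:rmi://localhost:5434;schema=TALEND_TEST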

Click Finish to complete the creation of the connection in the repository.

Extract Metadata

Now, extract the metadata for the table and view we created earlier in LucidDB.

Right-click on the new LucidDB Integration 0.1 connection node in the tree, and choose Retrieve Schema. Click Next past the first empty filter screen (its purpose is unclear, and it didn't appear to do anything in this build). On the next screen, in the Name Filter box, type DATE_ and the list should narrow to just our DATE_LIST and DATE_VIEW. Check the two boxes and hit Next. On the next screen, you should see the correct column names and types for the two objects. Click Finish, and the two objects will be added to the tree (one under Table schemas, and one under View schemas).
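
If you want to double-check what Talend retrieved, you can compare against LucidDB's own catalog from sqllineClient; this sketch assumes the standard DBA views in the SYS_ROOT schema:

-- list the columns LucidDB reports for the two objects
select * from sys_root.dba_columns
where table_name in ('DATE_LIST', 'DATE_VIEW');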

Try out the query editor: right-click LucidDB Integration 0.1 again, and this time click Edit queries. You can type a SQL query into the text box, and it should offer auto-completion based on the available metadata. Click the little running-man icon to execute your query.
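
For example, here is one possible query (anything against the extracted objects will do; unqualified names resolve because the JDBC URL pins the TALEND_TEST schema):

select calendar_date, is_weekend
from date_view
order by calendar_date;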

You can also try the graphical query designer by clicking on the Designer tab.

When you're done, click OK to save your query, and give it the name Lots of Dates. It should appear in the tree under the Queries node.

Create an ETL Job

Now, create a job which will read rows from the query we just defined, transform each row by computing its CRC value and appending that as an extra column, and then write the result to a flat file.

In the tree, right-click Job Designs, choose Create job, and name the new job ComputeCRC.

The first step (reading rows from the query) is harder than it should be due to the missing LucidDB awareness. You can try dragging the metadata you extracted before (the DATE_VIEW view) into the job designer view, and it will create something automatically, but unfortunately it still thinks it is supposed to be using the ODBC bridge (despite our override of the URL). So don't do this.

  1. Instead, in the palette on the right, click Databases, and then DB JDBC. Select the tJDBCInput tool, and then click to the left in the job designer view to instantiate it. Now you have to redundantly fill in the URL (jdbc:luciddb:rmi://localhost:5434;schema=TALEND_TEST), the driver class name (com.lucidera.jdbc.LucidDbRmiDriver), the driver JAR (use the file chooser to browse to wherever you saved it), and the user name (sa). The Driver JAR box may not auto-refresh; don't worry about it.
  2. For Schema, choose Repository, and then choose DATE_VIEW from the new combo box which appears.
  3. For Query Type, choose Repository, and then choose Lots of Dates from the new combo box which appears.

Next, create the CRC generator. It's tAddCRCRow in the palette under Data Quality. Instantiate it in the design view. Link up its input by right-clicking on the JDBC input, choosing Row and then Main, and dragging the link over to the CRC generator. Click on the CRC generator to view its properties, and click Sync columns to get it to read its input metadata from the query. (You may have to click away and back to get it to refresh.) Select CALENDAR_DATE and IS_WEEKEND for the Use in CRC property.

At this point, you can actually run the job already (from the Run tab below the designer), although the output isn't going anywhere yet. Talend has a cool feature for watching the rows as they are computed; check the Traces checkbox before running to see this in action. It slows things down to a crawl, so if you try it with this job, you'll probably want to kill the job after a bit.

For the final step of writing the rows out to a file, browse to File and then Output in the palette, choose tFileOutputDelimited (or whichever output you want), and link it up from the output of the CRC generator. Set the file name property on the output component to the desired location, run the job, and verify that the file got created and that its contents are what you'd expect.

Note that you would not want to use tJDBCOutput since row-at-a-time inserts into LucidDB are not an efficient way to load data.
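
If you do need to land ETL output back in LucidDB, the bulk-load-friendly pattern is the one mentioned at the end of this page: write a flat file (as above), expose it to LucidDB via the flat-file foreign data wrapper, and load it with a single set-oriented statement. A minimal sketch, assuming the job wrote /tmp/talend_out/date_output.csv with a header row (directory, file names, delimiter, and the staging column list are all illustrative; check the LucidDB flat-file docs for the exact options):

create server flatfile_server
foreign data wrapper sys_file_wrapper
options (directory '/tmp/talend_out/', file_extension 'csv',
    with_header 'yes', field_delimiter ';');

create foreign table talend_test.date_staging(
    calendar_date date,
    is_weekend boolean,
    crc varchar(40)
)
server flatfile_server
options (filename 'date_output');

-- one set-oriented insert instead of row-at-a-time JDBC inserts
insert into talend_test.date_list (calendar_date, is_weekend)
select calendar_date, is_weekend
from talend_test.date_staging;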

Create an ELT Job

Now let's try to tell LucidDB to move some data via ELT. Talend doesn't have any generic JDBC ELT components, so we'll fake it with the Teradata ELT components instead.

Create a new job, and using the ELT palette, add the input, map, and output from under Teradata to the design view. These are fairly straightforward to set up, so we won't go into the details here; for the input, use the repository to pick DATE_VIEW, and for the output, pick DATE_LIST (the table we'll populate). Add dataflow links from input to map to output, giving them names as requested; make sure you match the actual object names DATE_VIEW and DATE_LIST.

Now, double-click on the map, and in the left-hand pane, click the green plus button to select the input. Hit the Auto map! button to set up the mapping. Then, select CALENDAR_KEY in the bottom-right schema pane and press the Delete key; we want to leave this column unmapped so that LucidDB can fill it in automatically (it is a generated identity column).

Unfortunately, now you're stuck, because if you click on the Code tab to browse the generated code, you'll notice that the code has been generated to use a Teradata JDBC URL instead of LucidDB. And the code is read-only since it's generated. So at this point, if you actually want to run the job, you have to figure out how to modify and compile the generated code. (Or copy the generated SELECT statement out, prepend the INSERT portion from the generated Java code, and execute it via sqlline.)

Well, we got pretty close.

Here's what the generated SQL would do:

INSERT INTO DATE_LIST (CALENDAR_DATE, IS_WEEKEND)
SELECT DATE_VIEW.CALENDAR_DATE, DATE_VIEW.IS_WEEKEND
FROM DATE_VIEW

Improvements Needed

Clearly, some basic improvements are needed for this integration to be usable; at a bare minimum:

  • Add LucidDB to Talend as a supported DBMS
  • Clone the Teradata ELT palette for use with LucidDB (or better, reuse a common factor if it exists)

Beyond that:

  • Look into DBMS schema support (in the sense of a table name qualifier) for Talend. Currently, Talend doesn't seem to have such a concept, but SQL:2003 does, and so does LucidDB. Without it, everything needs to live under the same qualifier (which is why we put the TALEND_TEST schema in the LucidDB JDBC URL). Note that when Talend refers to a schema, it just means the column names and types of a particular table or view.
  • Contribute support for upsert ELT mappings to Talend. Upsert is very important for incremental loads; LucidDB supports it, as do other DBMSs such as Oracle (see the MERGE sketch after this list).
  • Improve support for complex nested queries; LucidDB's optimizer excels at processing long-chain dataflow. This kind of pipelining is much more efficient than executing many individual mappings with intermediate tables.
  • Add LucidDB palette items for the UDX transformations available in LucidDbAppLib, e.g. SPLIT_STRINGS, plus other operations such as ANALYZE, TRUNCATE, REBUILD, and DEALLOCATE OLD.
  • Expose SQL expression-builder metadata for UDF's available in LucidDbAppLib, e.g. STR_REPLACE.
  • Find a way to make it easy to establish SQL/MED connectivity to foreign data servers.
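
To make the upsert point concrete, here is a minimal sketch of the statement an upsert-aware ELT mapping would need to generate; LucidDB accepts standard MERGE syntax, and this example refreshes DATE_LIST from DATE_VIEW keyed on the date:

merge into talend_test.date_list tgt
using talend_test.date_view src
on tgt.calendar_date = src.calendar_date
when matched then
    update set is_weekend = src.is_weekend
when not matched then
    -- CALENDAR_KEY is omitted so the generated identity fills it in
    insert (calendar_date, is_weekend)
    values (src.calendar_date, src.is_weekend);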

It may also be possible to create a UDX coupled with a palette tool for piping row-at-a-time output from normal Talend ETL operators (such as CRC generation) into a corresponding LucidDB bulk-load. This would allow developers to choose between ETL and ELT based on the best available tool for the job. A simpler alternative is to have the ETL dump to flat files, and then map those into LucidDB via ELT.
