LucidDbUdxJavaHowto

From Eigenpedia

Jump to: navigation, search

Contents

Introduction to Table Functions

Java UDF's are very powerful, but eventually you are likely to require a transformation that can't be expressed as a simple scalar function. LucidDB supports an advanced feature known as a "UDX" or user-defined transformation. A UDX is invoked in the FROM clause rather than in the SELECT/WHERE clauses, meaning that it can combine rows from different tables and produce new rows. Before trying out a UDX, be sure to work through LucidDbUdfJavaHowto first.

Fetch from the Web

Let's start with a UDX which operates as a "leaf" in a query, producing new rows without consuming existing rows. To make the example interesting, we'll show how a UDX can fetch external data, namely a page off of the web. (If you don't have a network connection, you can use the same UDX to read from a local file.)

Here's the source code from examples/udx/com/yoyodyne/UrlTextFetchUdx.java:

package com.yoyodyne;
import java.net.*;
import java.io.*;
import java.sql.*;
public class UrlTextFetchUdx
{
    public static void execute(
        String urlString,
        PreparedStatement resultInserter)
        throws Exception
    {
        URL url = new URL(urlString);
        InputStream inputStream = null;
        try {
            inputStream = url.openStream();
            InputStreamReader reader = new InputStreamReader(inputStream);
            LineNumberReader lineReader = new LineNumberReader(reader);
            for (;;) {
                String line = lineReader.readLine();
                if (line == null) {
                    return;
                }
                int lineNumber = lineReader.getLineNumber();
                resultInserter.setInt(1, lineNumber);
                resultInserter.setString(2, line);
                resultInserter.executeUpdate();
            }
        } finally {
            if (inputStream != null) {
                inputStream.close();
            }
        }
    }
}

The calling convention for a UDX is non-obvious, but once you understand it you'll see that makes life very easy for the author of the UDX. The implementation of the table function is a Java method which will be invoked by LucidDB to produce rows. LucidDB gives the UDX a "pipe" into which to write the output rows; this is the resultInserter parameter to the method. (The urlString is a normal input parameter which tells the UDX the source for the fetch.) You can think of the pipe interface as a JDBC PreparedStatement something like "INSERT INTO result_pipe values (?, ?, ...)" with one ? per output column. (LucidDB has already done the work of preparing this statement by the time the UDX is invoked; the pipe is just a memory buffer which connects the UDX thread with the rest of the invoking SQL statement.)


In this case, the UDX opens a connection to the specified URL, sets up a reader to process its content as lines of text, and then for each line fetched, converts it into a row and pumps this into LucidDB. The output row has two columns: the first one is the line number, and the second one is a line of text. By using a try/finally construct, the UDX is careful to clean up its resources even if an exception occurs. (This is very important for a plugin which will be loaded into a long-running server such as LucidDB!)

Build and Install

To test it out, first compile and package the plugin by following the same steps as for a UDF (the wildcards will take care of the next example too):

cd examples/udx
javac com/yoyodyne/*.java
jar cf ../../plugin/urlfetch-udx.jar com/yoyodyne/*.class

Then install it in SQL (full syntax):

create schema web;
set schema 'web';
call sqlj.install_jar('file:${FARRAGO_HOME}/plugin/urlfetch-udx.jar','udx_jar',0);
create function url_text_fetch(url varchar(1024))
returns table(line_num int, text_line varchar(65535))
language java
parameter style system defined java
no sql
external name 'web.udx_jar:com.yoyodyne.UrlTextFetchUdx.execute';

Notes:

  • The resultInserter parameter to the Java method isn't mentioned in the DDL; its presence is implied by the fact that the function returns a table. When present, it must always come at the end of the Java parameter list.
  • The return type is a table rather than a builtin type. This defines the output row, and must match the pipe calls (setInt, setString) inside of the UDX implementation.
  • We use varchar(65535) because that's the maximum supported by LucidDB, and Java strings have dynamic length.
  • The clause parameter style system defined java flags the calling convention as non-standard, since the LucidDB UDX mechanism isn't covered by SQL:2003.

Invoke from SQL

Now, invoke the UDX with a Web URL parameter, using the SQL substring function to limit the output row width to 40 characters:

select line_num,substring(text_line, 1, 40) as text_line_start
from table(web.url_text_fetch('http://www.google.com'));

The result will look something like this:

0: jdbc:luciddb:> select line_num,substring(text_line, 1, 40) as text_line_start
. . . . . . . . > from table(web.url_text_fetch('http://www.google.com'));
+-----------+-------------------------------------------+
| LINE_NUM  |              TEXT_LINE_START              |
+-----------+-------------------------------------------+
| 1         | <html><head><meta http-equiv="content-ty  |
| 2         | body,td,a,p,.h{font-family:arial,sans-se  |
| 3         | .h{font-size: 20px;}                      |
| 4         | .q{color:#0000cc;}                        |
| 5         | -->                                       |
| 6         | </style>                                  |
| 7         | <script>                                  |
| 8         | <!--                                      |
| 9         | function sf(){document.f.q.focus();}      |
| 10        | // -->                                    |
| 11        | </script>                                 |
| 12        | </head><body bgcolor=#ffffff text=#00000  |
| 13        | <form action=/search name=f><table borde  |
+-----------+-------------------------------------------+

Now try it on a local file:

select *
from table(web.url_text_fetch('file:../COPYING'));

If you started LucidDB from the bin directory, you should see the text of the GPL in the result set. Otherwise, change the statement to use an absolute path according to the file URL conventions for your operating system.

Tokenize the HTML

How about a UDX which processes input rows? Suppose we want to tokenize the HTML returned by Google, breaking each line down into multiple rows, one for each token. The following UDX (from examples/udx/com/yoyodyne/HtmlTokenizerUdx.java) is a basic implementation:

package com.yoyodyne;
import java.util.*;
import java.sql.*;
public class HtmlTokenizerUdx
{
    public static void execute(
        ResultSet lineInput,
        PreparedStatement resultInserter)
        throws Exception
    {
        while (lineInput.next()) {
            int lineNumber = lineInput.getInt(1);
            String line = lineInput.getString(2);
            StringTokenizer tokenizer = new StringTokenizer(
                line,
                "<>/-'\"{}(); \t\n\r\f");
            int tokenNumber = 0;
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                resultInserter.setInt(1, lineNumber);
                resultInserter.setInt(2, tokenNumber);
                resultInserter.setString(3, token);
                resultInserter.executeUpdate();
                ++tokenNumber;
            }
        }
    }
}

This time, the UDX takes a JDBC ResultSet parameter. This respresents the input set of rows to be processed. The UDX breaks down each line into tokens and writes one output row per token. It preserves the original line number and also provides the offset of each token within the original line.

Invoke with Cursor Input

We already built the UDX above together with the first one, so it's ready to be installed from SQL:

create function tokenize_html(input_lines cursor)
returns table(line_num int, token_num int, token_text varchar(65535))
language java
parameter style system defined java
no sql
external name 'web.udx_jar:com.yoyodyne.HtmlTokenizerUdx.execute';

The only difference from the previous example is that we declare that the function takes one cursor as input, corresponding to the ResultSet parameter in the Java method. Here's how this looks in SQL (pipelining the two transformations):

select * 
from table(web.tokenize_html(
    cursor(
        select *
        from table(web.url_text_fetch('http://www.google.com'))
    )));

The cursor construct converts the inner select into a ResultSet which can be passed into the outer transformation. Here are the first few rows of an example result:

0: jdbc:luciddb:> select *
. . . . . . . . > from table(web.tokenize_html(
. . . . . . . . >     cursor(
. . . . . . . . >         select *
. . . . . . . . >         from table(web.url_text_fetch('http://www.google.com'))
. . . . . . . . >     )));
+-----------+------------+----------------------------------+
| LINE_NUM  | TOKEN_NUM  |            TOKEN_TEXT            |
+-----------+------------+----------------------------------+
| 1         | 0          | html                             |
| 1         | 1          | head                             |
| 1         | 2          | meta                             |
| 1         | 3          | http                             |
| 1         | 4          | equiv=                           |
| 1         | 5          | content                          |
| 1         | 6          | type                             |
| 1         | 7          | content=                         |
| 1         | 8          | text                             |
| 1         | 9          | html                             |
| 1         | 10         | charset=ISO                      |
| 1         | 11         | 8859                             |
| 1         | 12         | 1                                |
| 1         | 13         | title                            |
| 1         | 14         | Google                           |
| 1         | 15         | title                            |
| 1         | 16         | style                            |
| 1         | 17         | !                                |
| 2         | 0          | body,td,a,p,.h                   |
| 2         | 1          | font                             |
| 2         | 2          | family:arial,sans                |
| 2         | 3          | serif                            |
| 3         | 0          | .h                               |
| 3         | 1          | font                             |
| 3         | 2          | size:                            |
| 3         | 3          | 20px                             |
| 4         | 0          | .q                               |
| 4         | 1          | color:#0000cc                    |
| 6         | 0          | style                            |
| 7         | 0          | script                           |
| 8         | 0          | !                                |
| 9         | 0          | function                         |
| 9         | 1          | sf                               |
| 9         | 2          | document.f.q.focus               |
| 11        | 0          | script                           |
...

Analyze This

The power of a UDX is that custom transformation logic can be combined effortlessly with the power of standard SQL constructs such as filters, joins, unions, and aggregates. Let's count the occurrences for each token and sort them in descending order:

select token_text, count(*) as occurrence_count
from (
    select * 
    from table(web.tokenize_html(
        cursor(
            select *
            from table(web.url_text_fetch('http://www.google.com'))
        )))
    )
group by token_text
order by occurrence_count desc;

Top few results:

0: jdbc:luciddb:> select token_text, count(*) as occurrence_count
. . . . . . . . > from (
. . . . . . . . >     select *
. . . . . . . . >     from table(web.tokenize_html(
. . . . . . . . >         cursor(
. . . . . . . . >             select *
. . . . . . . . >             from table(web.url_text_fetch('http://www.google.com'))
. . . . . . . . >         )))
. . . . . . . . >     )
. . . . . . . . > group by token_text
. . . . . . . . > order by occurrence_count desc;
+----------------------------------+-------------------+
|            TOKEN_TEXT            | OCCURRENCE_COUNT  |
+----------------------------------+-------------------+
|                              | 32                |
| a                                | 28                |
| href=                            | 14                |
| font                             | 12                |
| td                               | 12                |
| tr                               | 8                 |
| br                               | 7                 |
| Google                           | 6                 |
| table                            | 6                 |
| class=q                          | 6                 |
| input                            | 5                 |
| size=                            | 5                 |
| 1                                | 5                 |
| b                                | 4                 |
...

Oops, we forgot to treat the equals sign as a delimiter. Maybe we should have made the delimiter string a parameter to the UDX...

Personal tools