LucidDbFlatFileWrapper

LucidDB plug-in SQL/MED File Foreign Wrapper

Introduction

LucidDB provides a plugin implementing SQL/MED (Management of External Data) access to text files. Each data file is accessed as a single table. A data file may be associated with a control file that describes its columns. A set of data files found in a file system directory can be accessed as an external schema through the "import foreign schema" command.

Format

The primary flat file format is delimiter-separated values (such as comma-separated or tab-separated values). Each record is separated from other records by a record delimiter (usually the newline character). Within a record, columns are separated by column delimiters (usually commas or tabs). A delimiter may appear within a value if the value is quoted (usually with double quotes). A quote character may appear within a value if it is preceded by an escape character. Usually the double quote character is also the escape character, in which case it only escapes itself. Optionally, the first line of a file may be a header containing column names. For example:

Figure 1. Typical flat file data

ID,NAME,EXTRA_FIELD
123,"""S,""",F
123,Jonathan Ackerman,A
234,Grady O'Neil,B
456,"Susan, Peter and Dave",C
789,"Amelia ""meals"" Maurice",E
234,"Peter ""peg leg"", Jimmy & Samantha ""Sam""",G

Rows 1, 4, and 6 include quoted column delimiters (commas). Rows 1, 5, and 6 have values with embedded quote characters. Note that empty quotes ("") represent the empty string, while a completely empty value represents null. (For the effect of this distinction on the interpretation of non-character data, see LDB-167.)

An alternative flat file format is position-based. In this format, columns are declared as fixed-width character columns and are identified by character position rather than by delimiters. This format still uses row delimiters, but should not use column delimiters or quote characters.

Figure 2. Fixed width flat file data

123     "S,"                                     F
123     Jonathan Ackerman                        A
234     Grady O'Neil                             B
456     Susan, Peter and Dave                    C
789     Amelia "meals" Maurice                   E
234     Peter "peg leg", Jimmy & Samantha "Sam"  G

Column Descriptions

LucidDB provides several options for defining column descriptions for flat files. An effective method is to provide column descriptions through a control file. Control files use the format of SQL Server's bcp (bulk copy) format files. For example:

Figure 3. Sample control file

9.0
3
1 SQLCHAR 0 7 "\t" 1 ID ""
2 SQLCHAR 0 100 "\t" 2 NAME SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 24 "\r\n" 3 EXTRA_FIELD ""

The first two values in the control file are the version number and the number of columns in the table. The version number is unused by LucidDB.

Note that many of the fields in each column definition are unused by LucidDB. The fields used are: (2) data type, (4) data length, (7) identifier, and possibly (8) collation/format. A data file usually has the extension ".txt", while a matching control file (with the same name) usually has the extension ".bcp".

If a control file is not detected, the wrapper tries to detect column types itself. It does so by scanning a configurable number of rows (see the NUM_ROWS_SCAN server option below) and performing a simple analysis that identifies basic types such as integer and floating point. If the wrapper cannot identify the type of a column, it defaults to the character type. It labels the columns "COL1", "COL2", etc., or uses the column names found in the data file header, if one is supplied.
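
For instance, given a headerless data file such as the following (a hypothetical illustration):

100,Widget,3.5
200,Gadget,7.25

the wrapper would scan these rows and might describe the table with an integer column COL1, a character column COL2, and a floating point column COL3.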

Whether using control files or implicit typing, it is possible to override column descriptions by explicitly specifying them in a "create foreign table" command. (See below.)

Extensions. LucidDB introduces several extensions to support a greater variety of data types in BCP files. LucidDB supports the following BCP types, including new types for date and time: SQLCHAR, SQLVARCHAR, SQLDATE, SQLTIME, SQLTIMESTAMP, SQLDECIMAL, SQLINT, SQLBIGINT, SQLSMALLINT, SQLTINYINT, SQLREAL, SQLFLT8, SQLFLT4. Since BCP does not support scale for decimal types, LucidDB additionally allows scale to be specified in column 9; for clarity, precision can be repeated in column 8. For a mapping of BCP data types to LucidDB data types, see the mapping table.
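
For example, a decimal column with precision 10 and scale 2 might be declared as follows in a control file (an illustrative line; the field layout follows Figure 3, with the precision repeated in column 8 and the scale in column 9):

4 SQLDECIMAL 0 12 "\t" 4 REVENUE 10 2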

Usage

The flat file wrapper is accessed via the standard SQL/MED interface.

1. First, a data wrapper is defined. A data wrapper represents a program module for accessing external data. The user does not have to perform this step, because the flat file data wrapper is predefined in the system catalog as SYS_FILE_WRAPPER.

2. Second, a foreign server is defined. In this case, a server describes a directory of data files, corresponding to a single schema. A server also contains parameters for parsing files.

3. Finally, data files may be accessed in one of two ways:
3a. One can create a foreign table, corresponding to a single data file.
3b. One can import the foreign schema, which corresponds to the entire directory.

Eventually, the data wrapper will be able to support international data. For now, however, it only supports single-byte ISO-8859-1 data (a superset of ASCII).

For more information related to SQL/MED, please see Farrago design docs and reference [Ref 1]. The rest of this document covers SQL syntax, errors, and use cases.

Flat File Foreign Data Wrapper Definition

The plugin is included as part of the LucidDB distribution; a corresponding foreign data wrapper instance named SYS_FILE_WRAPPER is predefined by LucidDB initialization scripts. Normally, there is no need to define additional instances.

Flat File Foreign Server

A File foreign data wrapper provides the capability to mediate access to file foreign servers. You can define a flat file foreign server to access files in a particular directory. The following server-level options can be specified:

  • DIRECTORY (no default). An OS directory path accessible by LucidDB. The directory contains a collection of files that all share the same content format (the format-related attributes are specified by the other options in this list, e.g. FIELD_DELIMITER). Each data file in the directory typically has a corresponding control file describing each column in the data file. The control file must have the same name as the data file, but with a different extension; for example, if the data file is customer.dat, then the control file is customer.bcp. The syntax of the control file is based on the MS BCP file format.
  • FILE_EXTENSION (default "txt"). The data file name extension, e.g. "txt" or "dat". If this option is not specified, the wrapper assumes "txt".
  • CONTROL_FILE_EXTENSION (default "bcp"). The file extension of the control file.
  • FIELD_DELIMITER (default ","). The column delimiter character, e.g. "," (comma), '\t' (tab), or " " (space).
  • LINE_DELIMITER (default "\n"). The line delimiter character, e.g. "\n".
  • QUOTE_CHAR (default "). The character used for quoting, e.g. single quote (') or double quote (").
  • ESCAPE_CHAR (default "). The character used to escape the special characters (the quote character, field delimiter, line delimiter, and the escape character itself).
  • WITH_HEADER (default "YES"). Indicates whether the data file has a header record; the value must be either "YES" or "NO".
  • NUM_ROWS_SCAN (default 5). The number of rows to scan in order to deduce the data types of the columns. This is only applicable if a control file does not exist for a data file.
  • TRIM (default "YES"). Whether to trim column values before reading them. If column values are quoted, whitespace inside the quotes is preserved.
  • LENIENT (default "YES"). Whether to be lenient when parsing rows of data. When lenient, missing columns at the end of a row are filled in with null, and extra columns at the end of a row are discarded.
  • MAPPED (default "NO"). Whether to match source columns in a flat file with target columns specified by a control file or foreign table definition. If columns are to be mapped, the data file must have a header and the server must be lenient. Source columns, which are named in the header, are matched with target columns using a case-insensitive string comparison. By default, columns are mapped by position.
  • DATE_FORMAT (no default). A string specifying a custom datetime format for date columns. For more details on date format, see LucidDbDateFormat.
  • TIME_FORMAT (no default). A string specifying a custom datetime format for time columns. For more details on date format, see LucidDbDateFormat.
  • TIMESTAMP_FORMAT (no default). A string specifying a custom datetime format for timestamp columns. For more details on date format, see LucidDbDateFormat.

Create Foreign Server DDL Syntax
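
The general form follows the standard SQL/MED CREATE SERVER syntax (a sketch based on the example below; bracketed items are optional and each option value is a string literal):

CREATE SERVER server_name
FOREIGN DATA WRAPPER wrapper_name
[ OPTIONS ( option_name 'option_value' [, option_name 'option_value' ]... ) ]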

Example:

CREATE SERVER file_dump_staging
FOREIGN DATA WRAPPER SYS_FILE_WRAPPER
OPTIONS (
DIRECTORY '/staging/file/data',
FILE_EXTENSION 'txt',
CONTROL_FILE_EXTENSION 'bcp',
FIELD_DELIMITER ',',
LINE_DELIMITER '\n',
QUOTE_CHAR '"',
WITH_HEADER 'YES');
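
A second example shows some of the optional parsing settings (a sketch; the directory, delimiter, and format mask values are illustrative, and the supported date format masks are described in LucidDbDateFormat):

CREATE SERVER file_mapped_staging
FOREIGN DATA WRAPPER SYS_FILE_WRAPPER
OPTIONS (
DIRECTORY '/staging/file/mapped',
FIELD_DELIMITER '\t',
WITH_HEADER 'YES',
LENIENT 'YES',
MAPPED 'YES',
DATE_FORMAT 'yyyy-MM-dd');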

The server options identify the storage location of the files and specify their content format.

Drop Foreign Server

A foreign server can be dropped or altered. The DDL syntax is as follows:

DROP SERVER server_name drop_behavior
Where:
drop_behavior ::= RESTRICT | CASCADE
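
For example, to drop the server defined earlier along with any objects that depend on it (an illustrative use of the syntax above):

DROP SERVER file_dump_staging CASCADE;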

Alter Foreign Server

The standard ALTER SERVER syntax is not currently supported, but you can use CREATE OR REPLACE instead to change the options.
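
For example, to switch the server defined earlier to tab-delimited files, the definition can be re-issued with the new options (a sketch; since CREATE OR REPLACE supplies a complete new definition, any options to be retained should presumably be restated):

CREATE OR REPLACE SERVER file_dump_staging
FOREIGN DATA WRAPPER SYS_FILE_WRAPPER
OPTIONS (
DIRECTORY '/staging/file/data',
FIELD_DELIMITER '\t',
WITH_HEADER 'YES');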

Flat File Foreign Table

You can define foreign tables that are managed by a foreign server. This is similar to defining a normal SQL table, except that you must specify the foreign server name and table options. A foreign table is a schema object; you create it under a schema.

Create Foreign Table

The DDL syntax for creating a file foreign table is as follows:

CREATE FOREIGN TABLE table_name
[ left_paren basic_column_definition_list right_paren ]
SERVER server_name [ table_options ]

For a file foreign table, the only major table option supported is the file name. All other information related to the content format is inherited from the file foreign server options.

(It is also possible to specify log_filename in order to override the default error log filename.)

Example:

CREATE SCHEMA FILE_STAGING;
CREATE FOREIGN TABLE FILE_STAGING.Customer_Stage (
name varchar(128),
dnb_number varchar(50),
industry varchar(50),
revenue numeric(8))
SERVER file_dump_staging
OPTIONS (filename 'customer_dump');
-- the foreign table can now be queried via standard SQL
SELECT * FROM FILE_STAGING.Customer_Stage;
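
As noted above, log_filename can be supplied as an additional table option to override the default error log file name; for example (a sketch, with a hypothetical table name and an illustrative log file name):

CREATE FOREIGN TABLE FILE_STAGING.Customer_Stage_Logged (
name varchar(128),
dnb_number varchar(50),
industry varchar(50),
revenue numeric(8))
SERVER file_dump_staging
OPTIONS (filename 'customer_dump', log_filename 'customer_dump_errors');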


Flat File Foreign Schema

LucidDB’s file wrapper plugin supports metadata extraction from bcp control files or it can guess column metadata when bcp control files are not available. The metadata from a file foreign server can be imported into LucidDB’s catalog, using the fixed foreign schema name 'BCP'. The DDL syntax is as follows:

IMPORT FOREIGN SCHEMA BCP
[import_qualification]
FROM SERVER foreign_server_name
INTO local_schema_name
Where:
import_qualification ::=
{ LIMIT TO | EXCEPT } { ( table_name_list ) | TABLE_NAME LIKE expression }
table_name_list ::= table_name [ { comma table_name } ... ]
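
For example, to import just the customer file from the server defined earlier into the FILE_STAGING schema (a sketch; the table name CUSTOMER_DUMP assumes that imported table names are derived from the data file names):

IMPORT FOREIGN SCHEMA BCP
LIMIT TO (CUSTOMER_DUMP)
FROM SERVER file_dump_staging
INTO FILE_STAGING;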

Error Handling

Foreign server definition

Defining a foreign server involves specifying a data directory along with metadata and parsing options. The following errors and issues may arise during a definition.

  • Parsing of boolean options. The following case-insensitive values are accepted as true: "1", "t", "true", "yes", "on". All other values are interpreted as false.
  • Directory cannot be read. The server will try to open the directory and will return this error if it does not exist or cannot be read.
  • Extension does not match any file in the directory. No tables will be detected.

Metadata related errors

In addition to SQL, column descriptions for a foreign table may be specified by a control file, taken from column headers, or derived (guessed) from the data itself. The following issues may arise.

  • Invalid control (bcp) file. The control file is similar to a flat file, but it has a short header and certain restrictions: the column delimiter is a tab, and the row delimiter is a newline. These special characters are not allowed within quoted values.
    • Empty control file.
    • Unsupported version. The version of the control file, specified in the header, is unsupported.
    • Invalid number of columns. The column count, also specified in the header, must be at least one and less than the maximum.
    • Unbalanced quotes (while reading column attributes).
    • Invalid column definition.
    • Incorrect number of column definitions. The number of column definitions in the file does not match the number specified in the header.
    • Unsupported data type. The user specified a type not supported by the server.
  • Unable to gather metadata. Usually, it is acceptable to have an empty data file (for an empty table). However, data is required when inferring column types.
    • Data file doesn't exist. The file should exist if column headers are being read or data types are being derived.
    • Column name header missing. If a foreign server is defined as having files with headers, each data file should have a single-line header containing column names.
    • Unable to derive column types. If no control file exists to describe column types, then the column types must be derived. It is impossible to derive column types when there is no data.
  • Row parsing errors
    • Maximum size of row is too large. The server does not support the specified row description.
    • No row delimiter in file. The file has data, but does not contain a row delimiter. (The row delimiter may be invalid.)
    • No column delimiter in row. Returned when the file has data and a row is supposed to have more than one column, but the row does not contain a column delimiter. (The column delimiter may be invalid.)
    • Row being read has too few columns. The row description may be invalid, or the row itself may be invalid.
    • Row being read has too many columns. The row description may be invalid, or the row itself may be invalid.
    • Error converting column value. The column text could not be converted into the expected data type.
    • Row being read is too large. A row is required to fit within internal buffers; if it does not, it will be interpreted as multiple rows.
  • Errors not returned
    • Improperly quoted column. When a column value begins with a quote, the parser scans for the column value until it finds a closing quote or until the maximum length of the column has been reached. If the scan ends due to length, the parser does not look for a closing quote. Also, extra data after the end of a quoted value is ignored.

Parser behavior

  1. Any column can be quoted, regardless of type. For example, it may be desirable to quote a date value, because it may contain commas, a common field delimiter.
  2. Null values are represented by an empty field. An empty string is represented by a quoted empty value.
  3. Value too large for column description. The parser quietly truncates the data and skips to the next delimiter.
  4. Fixed-length fields without a column delimiter. If there is no column delimiter, the parser enters a special mode for fixed-position parsing. This mode does not support a quote character or an escape character. In this mode, the parser interprets columns of data by character position. (Note: similar to Unix cut -c.)
  5. Unable to derive column types. A user wants to import flat file data and would like the server to guess the column types, but no data is available yet. An error is returned saying the server is unable to derive column types.
  6. Notify of recoverable errors. If no good row has been returned, and errors have been received, then don't just log errors to a file; return an error to the top level. The user may have specified an incorrect delimiter and may want to be notified of the mistake without having to search for the log file.

Error logging

When recoverable errors have been encountered, log them to a file and notify the user of the file location. [Note: this is deferred until warnings are enabled.]

  • File location: file-root_timestamp.err
    • file-root is the path to the data file being read, without the extension (e.g. sales.dat becomes sales)
    • timestamp indicates when the file was created (e.g. 200512051348 for 2005-12-05 at 1:48pm)
    • in the case of multiple writes in the same minute, the file will be appended to
  • File contents:
    • two columns, in flat file format: reason, row-text (perfect quoting deferred for now)
    • reason and row-text will be quoted appropriately (deferred for now)

References & Related Work

[Ref 1] Jim Melton, Advanced SQL:1999: Understanding Object-Relational and Other Advanced Features, Morgan Kaufmann, 2003.
