4. BigDAWG Query Language

Fundamentally, BigDAWG is middleware that provides a common application programming interface to a collection of distinct storage engines. To the typical user, BigDAWG is viewed as a query engine for the polystore system; hence, understanding how these queries are written is key to understanding BigDAWG.

BigDAWG queries are written with the BigDAWG Query language which uses a functional syntax:

bdrel( ... )

A function token (‘bdrel’ in this case) indicates how the syntax within the parenthesis is interpreted. For example, the ‘bdrel’ function token indicates that this is a query for the relational island and any code between the parenthesis will be interpreted as SQL code.

Five function tokens are defined in BigDAWG. Three function tokens indicate the islands targeted by a query:

  • bdrel – the query targets the relational island and uses PostgreSQL.
  • bdarray – the query targets the array island and uses SciDB’s AFL query language.
  • bdtext – the query targets the text island and uses either SQL or D4M.

The remaining function tokens deal with metadata for the polystore system and the migration of data between islands:

  • bdcatalog – the query targets the BigDAWG catalog using SQL.
  • bdcast – the query is a cast operation for inter-island data migration.

Queries using the ‘bdcast’ function token behave differently than queries based on the other function tokens. A ‘bdcast’ query is always nested inside other queries to indicate migration of data between islands.

In the next few subsections, we summarize operations supported by each island and provide a formal definition of the BigDAWG query syntax. See Example Queries for examples of BigDAWG queries.

4.1. BigDAWG Syntax Definitions

4.1.1. BigDAWG Query

BigDAWG Query Syntax:

BIGDAWG_SYNTAX ::=
  BIGDAWG_RETRIEVAL_SYNTAX | CATALOG_QUERY
BIGDAWG_RETRIEVAL_SYNTAX ::=
  RELATIONAL_ISLAND_QUERY | ARRAY_ISLAND_QUERY | TEXT_ISLAND_QUERY

4.1.2. Catalog Manipulation

Catalog manipulation queries are used to directly view the content of the catalog.

You may find the list of catalog_table_name in Catalog.

CATALOG_QUERY ::=
  { bdcatalog( catalog_table_name { [ column_name ] [, ...] }) }
  | { bdcatalog( full_sql_query_applied_to_the_catalog_database ) }

4.1.3. Inter-Island Cast

The differences between two data models can give rise to ambiguities when migrating data between them. When issuing a Cast that invokes an Inter-Island migration, the user avoids such ambiguities by providing the schema used in the destination island.

Cast Syntax:

BIGDAWG_CAST ::=
  bdcast( BIGDAWG_RETRIEVAL_SYNTAX, name_of_intermediate_result, {
    {, POSTGRES_SCHEMA_DEFINITION, relational}
    | {, SCIDB_SCHEMA_DEFINITION, array}
    | {, TEXT_SCHEMA_DEFINITION, text}} )

4.1.4. Relational Island

The Relational Island follows the relational data model, where data is organized into tables. The rows of a table are termed as tuples and columns simply as columns.

The Relational Island currently supports a subset of SQL used by PostgreSQL. It allows you to issue single-layered SELECT query with filter, aggregation, sort and limit operations.

Relational Island supports the following data types:
integer, varchar, timestamp, double, float

Relational Island Syntax:

RELATIONAL_ISLAND_QUERY ::=
  bdrel( RELATIONAL_SYNTAX )
RELATIONAL_SYNTAX ::=
  SELECT [ DISTINCT ]
  { * | { SQL_EXPRESSION [ [ AS ] output_name ] [, ...] } }
  FROM FROM_ITEM [, ...]
  [ WHERE SQL_CONDITION ]
  [ GROUP BY column_name [, ...] ]
  [ ORDER BY SQL_EXPRESSION [ ASC | DESC ]
  [ LIMIT integer ]
FROM_ITEM ::=
  { table_name | BIGDAWG_CAST } [ [ AS ] alias ]
SQL_EXPRESSION ::=
  SQL_NON_AGGREGATE_EXPRESSION
  | SQL_AGGREGATE
SQL_NON_AGGREGATE_EXPRESSION ::=
  literal
  | column_name
  | { SQL_NON_AGGREGATE_EXPRESSION SQL_BINARY_ALGEBRAIC_FUNCTION  SQL_NON_AGGREGATE_EXPRESSION }
  | { - SQL_EXPRESSION }
  | {( SQL_EXPRESSION )}
  | SQL_CONDITION
SQL_BINARY_ALGEBRAIC_FUNCTION ::=
  + | - | * | / | %
SQL_CONDITION ::=
  { SQL_NON_AGGREGATE_EXPRESSION SQL_CONDITION_OPERATOR
       SQL_NON_AGGREGATE_EXPRESSION }
  | { SQL_NON_AGGREGATE_EXPRESSION SQL_BINARY_LOGICAL_OPERATOR
       SQL_NON_AGGREGATE_EXPRESSION }
SQL_CONDITION_OPERATOR ::=
  = | < | > | <= | >= | !=
SQL_BINARY_LOGICAL_OPERATOR ::=
  AND
SQL_AGGREGATE ::=
  { SQL_AGGREGATE_NAME( [ DISTINCT ] SQL_NON_AGGREGATE_EXPRESSION [ , ... ] ) }
  | { count( { * | SQL_NON_AGGREGATE_EXPRESSION } )}
  | { width_bucket( SQL_NON_AGGREGATE_EXPRESSION, double_precision_number,
      double_precision_number, integer ) }
SQL_AGGREGATE_NAME ::=
  sum | avg | min | max
POSTGRES_SCHEMA_DEFINITION ::=
  ( { column_name sql_data_type POSTGRES_COLUMN_CONSTRAINT } [, ...] )
POSTGRES_COLUMN_CONSTRAINT ::=
  { [ PRIMARY KEY ]
    | [ REFERENCES table_name [( column_of_table_referenced )] ] }
  [ [ NOT ] NULL ]

4.1.5. Array Island

The Array Island follows an array data model, where data is organized into arrays. Arrays are multi-dimensional grids, where each cell in the grid contains a number of fields. Each dimension of an array is referred to as a dimension and each field in a cell is termed an attribute. Dimensions assume unique values whereas attributes are allowed duplicates. A combination of dimension values across all dimensions in an array uniquely identify an individual cell of attributes.

The Array Island currently supports a subset of SciDB’s Array Functional Language (AFL). It allows for project, aggregation, cross_join, filter and schema reform. Array Island also allows attribute sorting; however, at the moment, only sort in ascending order is supported.

Array Island supports the following data Types:
string, int64, datetime, double, float

Array Island Syntax:

ARRAY_ISLAND_QUERY ::=
  bdarray( ARRAY_SYNTAX )
ARRAY_SYNTAX ::=
  { scan( array_name ) }
  | { project( ARRAY_ISLAND_DATA_SET [, attribute ] [...]) }
  | { filter( ARRAY_ISLAND_DATA_SET, SCIDB_EXPRESSION ) }
  | { aggregate( ARRAY_ISLAND_DATA_SET, SCIDB_AGGREGATE_CALL [, ...] [, dimension] [...] ) }
  | { apply( ARRAY_ISLAND_DATA_SET {, new_attribute, SCIDB_NON_AGGREGATE_EXPRESSION} [...] ) }
  | { cross_join( ARRAY_ISLAND_DATA_SET [ as left-alias], ARRAY_ISLAND_DATA_SET [ as right-alias ] [, [left-alias.]left_dim1, [right-alias.]right_dim1] [...] ) }
  | { redimension( ARRAY_ISLAND_DATA_SET, { array_name | SCIDB_SCHEMA_DEFINITION } ) }
  | { sort( ARRAY_ISLAND_DATA_SET [, attribute] [...] } ) }
ARRAY_ISLAND_DATA_SET ::=
  array_name | ARRAY_ISLAND_SYNTAX | BIGDAWG_CAST
SCIDB_EXPRESSION ::=
  SCIDB_AGGREGATE_CALL
  | SCIDB_NON_AGGREGATE_EXPRESSION
SCIDB_BINARY_ALGEBRAIC_FUNCTION ::=
  + | - | * | / | %
SCIDB_CONDITION ::=
  { SCIDB_NON_AGGREGATE_EXPRESSION SCIDB_CONDITION_OPERATOR SCIDB_NON_AGGREGATE_EXPRESSION }
  | { SCIDB_NON_AGGREGATE_EXPRESSION SCIDB_BINARY_LOGICAL_OPERATOR SCIDB_NON_AGGREGATE_EXPRESSION }
  | { regex( { attribute_name | dimension_name }, 'regex_expression') }
  | { iif ( SCIDB_BINARY_PREDICATE, SCIDB_ALGEBRAIC_EXPRESSION, SCIDB_ALGEBRAIC_EXPRESSION) }
SCIDB_NON_AGGREGATE_EXPRESSION ::=
  literal
  | dimension
  | attribute
  | { SCIDB_NON_AGGREGATE_EXPRESSION SCIDB_BINARY_ALGEBRAIC_FUNCTION SCIDB_NON_AGGREGATE_EXPRESSION }
  | { - SCIDB_EXPRESSION }
  | {( SCIDB_EXPRESSION )}
  | SCIDB_CONDITION
SCIDB_CONDITION_OPERATOR ::=
  = | < | > | <= | >= | !=
SCIDB_BINARY_LOGICAL_OPERATOR ::=
  AND
SCIDB_AGGREGATE_CALL ::=
  SCIDB_AGGREGATE_FUNCTION( dimension )
SCIDB_AGGREGATE_FUNCTION ::=
  sum | avg | min | max
SCIDB_SCHEMA_DEFINITION ::=
  <{attribute_name: data_type} {, ...}>
  \[ { dimension_name = { integer_lower_bound | * } : { integer_upper_bound | * } , integer_cell_size, integer_overlap} [, ...] \];

4.1.6. Text Island

The Text Island logically organizes data in tables, and retrieves data in a key-value fashion. This is modeled after the data model of the Accumulo engine. When queried for a certain table, it returns a list of key-value pairs. The key contains row label, column family label, column qualifier label, and a time stamp. The value is a string.

The Text Island query syntax adopts a JSON format using single-quote for labels and entries. The user can issue full table scan or range retrieval queries.

Text Island supports the following data Types:
string

Text Island Syntax:

TEXT_ISLAND_QUERY ::=
  bdtext( TEXT_ISLAND_SYNTAX )
TEXT_ISLAND_SYNTAX ::=
  { 'op' : 'TEXT_OPERATOR', 'table' : '(table_name | BIGDAWG_CAST)' [, 'range' : { TEXT_ISLAND_RANGE }] }
TEXT_ISLAND_RANGE ::=
  TEXT_ISLAND_RANGE_START_KEY
  | TEXT_ISLAND_RANGE_END_KEY
  | (TEXT_ISLAND_RANGE_START_KEY, TEXT_ISLAND_RANGE_END_KEY )
TEXT_ISLAND_RANGE_START_KEY ::=
  'start' : \['start_row','[start_column_family]','[start_column_qualifier]'\]
TEXT_ISLAND_RANGE_END_KEY ::=
  'end' : \['end_row','[end_column_family]','[end_column_qualifier]'\]
TEXT_OPERATOR ::=
  scan
TEXT_SCHEMA_DEFINITION ::=
  ()