This section provides tips on adapting the BigDAWG system to your own data. Specifically, we describe how to use the administrative web interface, add your own database engine, add your own tables and databases, and construct your own island. Some of these tasks require a degree of expertise, so please do not hesitate to contact us if you have any questions!

5. Personalizing the setup

5.1. Administrative Web Interface:

A very basic administrative web interface is included with this release. It lets you view the status of the BigDAWG cluster of databases, start and stop containers, and view the Catalog objects table.

You can view a video demonstration here

[Image admin_ui_status.png: Container Status and Start/Stop Interface]

[Image admin_ui_catalog.png: Catalog Objects Interface]

5.2. Formulating Example Queries:

todo: (Add information about writing other queries)
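As a starting point, consider the inventory database registered in Section 5.3 below. Assuming its products table is in the Catalog, a relational-island query takes the bdrel() form used elsewhere in this guide (Section 5.5 shows how to submit queries over HTTP):

bdrel(select * from products where Price < 1.25;)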

5.3. Adding your own data:

You can register a new database with a BigDAWG cluster by adding information about the database to the Catalog. Once the Catalog is updated, the Middleware is aware of the new database and can perform all island-compatible queries on it.

For example, assume that you want to add a simple relational database named inventory with a table named products containing product information such as the following:

ItemNumber   ItemName   Price
1            Banana     0.99
2            Apple      1.25
3            Carrot     1.30

Three parts of the Catalog must be updated. Recall that the Catalog itself is a Postgres database named bigdawg_catalog.

1.) The databases table requires the following fields:

  • dbid: serial integer for referring to the database by ID
  • engine_id: serial integer identifying the engine that this database runs on. This ID should be read from the eid value of the engines table in the Catalog (see the lookup query after this list).
  • name: name of the database. In this example, this value would be “inventory”.
  • userid: the username used to log into the new database
  • password: the password used to log into the new database
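You can read the available eid values directly from the Catalog; for example, run the following against the bigdawg_catalog database:

SELECT * FROM catalog.engines;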

For example, an INSERT statement would look like this:

INSERT INTO catalog.databases VALUES (8, 0, 'inventory', 'postgres', 'test');

2.) The objects table requires the following fields:

  • oid: serial integer for referring to the new table
  • name: name of the new data object. In this example, the value would be “products”.
  • fields: a comma-separated string of the column names in the products table
  • logical_db: the dbid of the database this object logically belongs to, referencing the databases table
  • physical_db: the dbid of the database where the object is physically stored, referencing the databases table. In this example, both values are 8.

For example, an INSERT statement would look like this:

INSERT INTO catalog.objects VALUES (52, 'products', 'ItemNumber,ItemName,Price', 8, 8);

3.) The bigdawg_schemas database requires a schema-only copy of the new table. Connect to the bigdawg_schemas database and create an empty table matching the schema of products:

CREATE TABLE products (ItemNumber integer, ItemName varchar(40), Price real);
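To sanity-check the registration, you can join the two Catalog tables (using the column names listed above) and confirm that the new object resolves to the right database:

SELECT o.name AS object_name, o.fields, d.name AS db_name
FROM catalog.objects o
JOIN catalog.databases d ON o.physical_db = d.dbid
WHERE o.name = 'products';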

5.4. Adding your own engine

This guide provides a starting point for integrating a database that has a JDBC driver into the BigDAWG middleware. For other types of databases, please reach out to us and we will work with you.

  1. Find the associated JDBC driver and add it as a dependency to pom.xml.
  2. Create the associated ConnectionInfo, DBHandler, DBInstance, etc. classes for the database engine. (See the PostgreSQL package for reference.)
  3. Create a new query generator if the existing ones are not fully compatible. You may also need a utility class that converts the engine's data type names to a common representation (as is done for PostgreSQL data types).
  4. Modify islands.TheObjectThatResolvesAllTheDifferencesAmongTheIslands.java: EngineEnum, getQConnectionInfo(), getQueryGenerator(), and anywhere else that is appropriate.
  5. Create Export and Load classes for the database engine (under migration).
  6. Create migrators to/from Postgres (or any other engines you want to migrate to/from).
  7. Register the new migrators in Migrator.java.
  8. When setting up your BigDAWG instance, add an entry to the Catalog so that the middleware knows your database engine exists; a sketch follows below. Also add entries for the schemas of the tables stored on that engine.
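For step 8, the Catalog entries follow the pattern shown in Section 5.3. The exact column layout of catalog.engines may differ between releases, so treat the values below (engine ID, name, host, port, connection properties) as a hypothetical sketch and verify them against your deployment:

-- hypothetical engine registration; check your catalog.engines schema first
INSERT INTO catalog.engines VALUES (3, 'mynewengine', 'engine-host', 9999, 'N/A');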

5.5. Connecting to existing databases

You can use the middleware distributed in this release to connect to an existing database. For this example, we assume that you have an existing Postgres instance that you would like to connect to. Let’s assume that the database name is foo and that this database has two tables, foo_table1 and foo_table2.

  1. Clone the git repository (from https://github.com/bigdawg-istc/bigdawg) to a system that can connect to the Postgres database.
  2. In the Postgres instance, create two new databases: 1) bigdawg_catalog, with a schema named catalog, and 2) bigdawg_schemas. The bigdawg_catalog database contains information such as connection properties, table names, and schemas. Look at /provisions/cluster_setup/postgres-catalog/bdsetup/catalog_inserts.sql for an example of the tables that are filled in for connecting to the various MIMIC II tables. You will need to add the engine connection information to the catalog schema: a row in catalog.engines for the existing Postgres instance, rows in catalog.databases for the bigdawg_catalog, bigdawg_schemas, and foo databases, and rows in catalog.objects for the tables foo_table1 and foo_table2 (a rough sketch follows below).
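Following the pattern from Section 5.3, the inserts look roughly like the following. Every ID, host, credential, and column list here is a hypothetical placeholder; adjust them to your deployment and check catalog_inserts.sql for the exact engines column layout:

-- hypothetical values throughout; see catalog_inserts.sql for real examples
INSERT INTO catalog.engines VALUES (0, 'postgres0', 'localhost', 5432, 'N/A');
INSERT INTO catalog.databases VALUES (1, 0, 'bigdawg_catalog', 'postgres', 'test');
INSERT INTO catalog.databases VALUES (2, 0, 'bigdawg_schemas', 'postgres', 'test');
INSERT INTO catalog.databases VALUES (3, 0, 'foo', 'postgres', 'test');
INSERT INTO catalog.objects VALUES (1, 'foo_table1', 'col1,col2', 3, 3);
INSERT INTO catalog.objects VALUES (2, 'foo_table2', 'col1,col2', 3, 3);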

  3. In the bigdawg_schemas database, create empty schemas for the foo database, similar to what we did for the MIMIC II database in ./provisions/cluster_setup/postgres-catalog/bdsetup/mimic2_schemas_ddl.sql; a sketch follows below.
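For example, connected to the bigdawg_schemas database, you would create empty tables mirroring the real ones. The column definitions below are hypothetical; copy the DDL of your actual tables, without the data:

-- hypothetical columns; mirror your real foo_table1/foo_table2 definitions
CREATE TABLE foo_table1 (id integer, val varchar(40));
CREATE TABLE foo_table2 (id integer, val varchar(40));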

  4. Now you can compile the code you downloaded.

First, edit the file profiles/dev/dev-config.properties so that the middleware knows where to find the Postgres engine. Specifically, modify the following lines:

# ==================
# Catalog database
# ==================

postgresql.url=jdbc:postgresql://host:port
postgresql.user=XXXXXXXXX
postgresql.password=XXXXXXXXX

Once you are done editing this file, save and close it; you are then ready to package the JAR from the root directory using the following command:

mvn package -P mit -DskipTests -f pom.xml -q
  5. Now that you have packaged the JAR, you should be ready to execute it using the following command:
mvn exec:java -f pom.xml -P mit -q

The above command starts the BigDAWG instance on the node you are currently on.

  6. If you are running the Postgres engine on another host, you need to launch the middleware on that host as well. For example, ssh into that node and run the same command as above:
ssh node
mvn exec:java -f pom.xml -P mit -q
  7. Now you should be ready to issue a query:
curl -X POST -d "bdrel(select * from foo_table1;)" http://localhost:8080/bigdawg/query
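The same endpoint serves any table registered in the Catalog; for example, a count over the second table:

curl -X POST -d "bdrel(select count(*) from foo_table2;)" http://localhost:8080/bigdawg/query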

5.6. Adding your own island

This guide provides a road map for adding new islands to the BigDAWG system. Creating an island involves four general steps: determining the language and functionality supported by the island; implementing support for the island language and logical representations of that functionality in the BigDAWG context; creating shims between the island and the database engines; and creating front-end support for the other BigDAWG components. We will elaborate on these steps using the current Text Island as an example.

  1. Determining the language and functionalities
We model our island on the functionality of Apache Accumulo, so by design it supports only complete or ranged table scans. We therefore need to support only one operation, Scan, with optional range parameters; consequently, there are no nested expressions. As with other islands, we will not reformat the results.
  2. Implementing support for the language and its functionalities in the BigDAWG context

For query optimization purposes, the functionality of an island is represented by implementations of the Java interface Operator and its extensions, such as SeqScan (sequential scan). In our case, we want a Text Island operator that scans a table, with an optional specification of ranges. We therefore create the class TextScan, which implements the SeqScan interface.

Note that to retain extensibility for the Text Island, we first created a parent abstract class named TextOperator that implements the Operator interface; we then extended TextOperator to create TextScan.

Language support entails parsing a user query into an Abstract Syntax Tree (AST) whose nodes are Operators. In our case, each query consists of a single TextScan, with no branches.

We therefore use JSON to implement our language. In the JSON object, we require the user to provide a field for the table name and, optionally, a JSON object specifying a range for the query. Our language parser uses org.json.simple.parser.JSONParser to create TextScan operators.
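For illustration, a Text Island query could look like the following. The field names here are hypothetical; use whatever names your parser expects:

{ "table": "tweets", "range": { "start": "row_0001", "end": "row_0999" } }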

  3. Creating shims for the BigDAWG Query Executor
At the moment, we only want to connect Accumulo to the Text Island, so we implement the Shim Java interface to create our shim, TextToAccumulo. The methods declared in Shim provide a good guideline for what needs to be done to connect Accumulo to the Text Island.
  4. Creating the planner- and executor-facing front end

We begin by creating the TextIsland class used by the Planner and Executor. TextIsland implements the Java interface Island. In the TextIsland class, we need to define the default database to which an inter-island intermediate result can be migrated; this is done by looking up the database’s dbid in the Catalog. The setup and teardown methods are intended for creating and destroying the temporary tables used during inter-island query execution. The method for constructing Literal and Constant Signatures expects a list of constants, so we return the list of values used in the range specification.

We then implement the IntraIslandQuery Java interface to create the logical intra-island execution plan for the Text Island. Here we make use of the setup and teardown methods created in TextIsland to support new tables migrated from other islands.

In other islands, an operator such as a Join can take multiple table inputs. The intra-island execution plan needs to create ‘cut points’ in the AST that divide it into containers (sub-queries over naturally co-located tables) and a remainder (a skeleton AST that executes against the migrated intermediate results). The traverse method recursively marks the natural locations of tables and sub-queries and creates containers out of any sub-query whose children are not co-located. pruneChild marks a node in the AST so that the sub-query rooted at that node is used to create a container. A positive remainderLoc value indicates that all input tables are co-located and no containers are constructed; a value of zero indicates that at least two containers exist.

The getQEPs function lists all viable Query Execution Plans (QEPs) composed from permutations of the query. A permutation produces the same result as the original query but orders its Joins differently. The permutations are run by the monitor, which records performance information for each one. getQEP (without the s) is used to extract a specific QEP.

Finally, we modify IslandAndCastResolver to finish the integration, and add new entries to the BigDAWG Catalog to make the new island and shim usable.