2. Getting Started with BigDAWG

This section describes how to start a BigDAWG cluster, load an example dataset, and run several example queries.

BigDAWG Cluster Components

A BigDAWG cluster consists of the Middleware, Query Endpoint, Catalog, and multiple database engines. You can learn more about these components in the BigDAWG Middleware Internal Components section.

The purpose of this section is to guide you through the process of setting up a BigDAWG cluster with Docker, the open-source technology that allows you to deploy applications inside software containers. You will pull baseline images from our Docker Hub repository, run the images as containers, and then run scripts to populate the engines with test data. The current release of BigDAWG includes images for PostgreSQL, SciDB, and Accumulo.

A video demonstration of these steps is also available to watch.

2.1. Prerequisites

To complete this guide, you will need basic knowledge of working with your computer’s command prompt/terminal, Docker, and Linux commands. Port 8080 on your computer must be available, and you will need administrator privileges on your system to install Docker.

Compatible Docker Installation

To follow the steps in this section, you will need to first install Docker on your system. If your system is running Mac OSX or Windows, you should install Docker Toolbox. Follow the download and installation steps from the Docker website.

Note

BigDAWG has been tested on these versions of Docker:

Docker version 1.11.1, build 5604cbe (Tested on Ubuntu 14.04)
Docker version 1.12.1, build 6f9534c (Tested on Docker Toolbox for Mac, version 0.8.1, build 41b3b25)
Docker version 1.12.6, build 78d1802 (Tested on Docker Toolbox for Mac)

Note

Do not use “Docker for Mac” or “Docker for Windows”, which are two alternative Docker applications, because of known networking limitations that interfere with this example. If your system is running Linux, then install Docker for Linux.

BigDAWG source code

Obtain the source code by cloning the git repository:

git clone https://github.com/bigdawg-istc/bigdawg.git

Alternatively, download the code directly from the website https://github.com/bigdawg-istc/bigdawg.git

2.2. BigDAWG Cluster Setup Steps

(Mac and Windows only) Open a Quickstart Terminal to Execute Docker Commands

Launch the Docker Quickstart Terminal application, which was installed with Docker Toolbox. Launching this application starts a Docker host VM and opens an initialized terminal window; this initialization can take some time. Without this terminal, you will not be able to execute docker commands.

Docker Quickstart Terminal Successfully Initialized

The status shown above means that Docker was started successfully.

Navigate to the “provisions” directory of the source code root

The source code root is a directory called “bigdawg”. All scripts executed in this tutorial assume that you are in the bigdawg/provisions directory.
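As a quick sanity check before running the setup script, the sketch below confirms that the current directory is bigdawg/provisions. The function name in_provisions_dir is our own and is not part of BigDAWG; it assumes only that setup_bigdawg_docker.sh sits in that directory, as this tutorial describes.

```shell
# Hypothetical helper (not part of BigDAWG): succeeds only when the current
# directory is named "provisions" and contains the setup script.
in_provisions_dir() {
    [ "$(basename "$PWD")" = "provisions" ] && [ -f ./setup_bigdawg_docker.sh ]
}

# Typical use, starting from the directory that holds the cloned repository:
#   cd bigdawg/provisions && in_provisions_dir && echo "Ready to run setup"
```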

Run the Docker setup script:

./setup_bigdawg_docker.sh

This script starts a BigDAWG cluster using Docker containers. It can take 15-30 minutes to complete, depending on your computer’s resources and internet connection. The script works in the following stages:

Create a Docker network called bigdawg that allows the containers to communicate with each other.
Pull “base” docker images from Docker Hub that encapsulate the database engines but contain no data.
Run the images as instantiated containers.
Download publicly available MIMIC II data. The BigDAWG project does not ship with any data itself, so all data is downloaded from external sources.
Execute scripts on the containers to insert data into the engines.
Start the BigDAWG Middleware on each container, and accept queries on the bigdawg-postgres-catalog container.
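For orientation, the first two stages correspond roughly to the Docker commands sketched below. This is an illustration, not the contents of setup_bigdawg_docker.sh: the image names are taken from the docker images listing in the optional verification step, the function name prepare_bigdawg is our own, and the real script may use different options. Stages 3 to 6 are left out because they depend on script details not covered in this section.

```shell
# Sketch of stages 1 and 2 only. prepare_bigdawg is a hypothetical name and
# the actual setup script may differ.
prepare_bigdawg() {
    # Stage 1: create the network shared by all containers.
    # (This command fails if a network named bigdawg already exists.)
    docker network create bigdawg || return 1

    # Stage 2: pull the base images (database engines without data).
    local image
    for image in bigdawg/postgres bigdawg/scidb bigdawg/accumulo; do
        docker pull "$image" || return 1
    done
}
```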

After the setup script completes, you will see output similar to the following:

Starting HTTP server on: http://bigdawg-postgres-catalog:8080/bigdawg/
2017-03-21 14:17:01,873 2767 istc.bigdawg.network.NetworkIn.receive(NetworkIn.java:39) [pool-2-thread-1] null DEBUG istc.bigdawg.network.NetworkIn - tcp://*:9991
2017-03-21 14:17:02,072 2966 istc.bigdawg.network.NetworkIn.receive(NetworkIn.java:43) [pool-2-thread-1] null DEBUG istc.bigdawg.network.NetworkIn - Wait for the next request from a client ...
Mar 21, 2017 2:17:23 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [bigdawg-postgres-catalog:8080]
Jersey app started with WADL available at http://bigdawg-postgres-catalog:8080/bigdawg/application.wadl
Hit enter to stop it...

If you hit any key, the Middleware execution will quit. Therefore, make sure to run any additional commands in a separate terminal window.

Optional setup verification

As an optional step, you can verify that the images were pulled successfully and check their running status.

To do this, open a separate Docker Quickstart Terminal and run the following commands:

Check the status of all images:

docker images

user@local:~$ docker images
REPOSITORY         TAG      IMAGE ID       CREATED          SIZE
bigdawg/accumulo   latest   804fa44f5eb4   2 seconds ago    1.656 GB
bigdawg/scidb      latest   c1b578c504bb   8 seconds ago    1.237 GB
bigdawg/postgres   latest   1a2600f05cbb   12 seconds ago   1.086 GB

You should see the three images as shown above if the pull (stage 2 above) was successful.

Check the status of all running containers:

docker ps

user@local:~$ docker ps
CONTAINER ID   IMAGE              STATUS        PORTS                                              NAMES
ef66f13c4694   bigdawg/accumulo   Up 1 minute   0.0.0.0:42424->42424/tcp                           bigdawg-accumulo-proxy
3e02a26c9da5   bigdawg/accumulo   Up 1 minute   0.0.0.0:9999->9999/tcp, 0.0.0.0:50095->50095/tcp   bigdawg-accumulo-master
13deae26bff7   bigdawg/accumulo   Up 1 minute   0.0.0.0:9997->9997/tcp                             bigdawg-accumulo-tserver0
c6e6b8185d7f   bigdawg/accumulo   Up 1 minute   0.0.0.0:2181->2181/tcp                             bigdawg-accumulo-zookeeper
7d3135d17a7e   bigdawg/accumulo   Up 1 minute                                                      bigdawg-accumulo-namenode
3b1710639c09   bigdawg/scidb      Up 1 minute   0.0.0.0:1239->1239/tcp                             bigdawg-scidb-data
4d119d50458c   bigdawg/postgres   Up 1 minute   0.0.0.0:5402->5402/tcp                             bigdawg-postgres-data2
626ba8425e5b   bigdawg/postgres   Up 1 minute   0.0.0.0:5401->5401/tcp                             bigdawg-postgres-data1
e4fe27b0c8ed   bigdawg/postgres   Up 1 minute   0.0.0.0:5400->5400/tcp, 0.0.0.0:8080->8080/tcp     bigdawg-postgres-catalog

You should see all the containers running as shown above if the run (stage 3 above) was successful.
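Comparing the docker ps listing by eye is error-prone, so the check can be scripted. The sketch below assumes the nine container names shown in the listing above; the function name check_bigdawg_containers is our own and not part of BigDAWG. It relies on the --format option of docker ps to print one container name per line.

```shell
# Hypothetical helper: prints every expected container that is not running
# and returns non-zero if at least one is missing.
check_bigdawg_containers() {
    local expected="bigdawg-accumulo-proxy bigdawg-accumulo-master
        bigdawg-accumulo-tserver0 bigdawg-accumulo-zookeeper
        bigdawg-accumulo-namenode bigdawg-scidb-data
        bigdawg-postgres-data2 bigdawg-postgres-data1
        bigdawg-postgres-catalog"
    local running missing=0 name
    # One running container name per line; return 2 if docker itself fails.
    running=$(docker ps --format '{{.Names}}') || return 2
    for name in $expected; do
        # -x matches the whole line, -F treats the name as a literal string.
        if ! printf '%s\n' "$running" | grep -qxF -- "$name"; then
            echo "not running: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_bigdawg_containers in the second terminal prints nothing and returns 0 when all nine containers are up.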

2.3. Run Example Queries

Warning

These commands will not work if you are using a VPN connection or cannot access the Docker host IP address. If VPN is necessary for your system, contact us for tips that you may be able to use to work around this.

Warning

Your system must have port 8080 available for the Middleware to initialize successfully.

Once the containers are running, the Catalog container will run the Query Endpoint (a simple HTTP server) listening on port 8080. The container is configured to publish its port 8080 to the Docker VM’s port 8080, so that queries sent to that port will be routed to the Query Endpoint. You can then submit queries to this port like so:

$ curl -X POST -d "bdrel(select * from mimic2v26.d_patients limit 4;)" http://192.168.99.100:8080/bigdawg/query/

Here, curl is a command-line tool that sends an HTTP request to a web server, in this case the Query Endpoint, and prints the server’s response.

2.3.1. Example Queries

In this section, we describe a few queries on the MIMIC II dataset that you can execute once you have successfully completed the above steps.

All queries use the following syntax:

$ curl -X POST -d "<query-goes-here>" http://192.168.99.100:8080/bigdawg/query/

We are making a POST request to send the query string as data to the Query Endpoint at the resource /bigdawg/query/. The IP address 192.168.99.100 is used by the Docker host VM, which is forwarding its port 8080 to the container running the Query Endpoint.
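To avoid retyping the URL for every query, the curl call can be wrapped in a small shell function. This is a convenience sketch rather than part of BigDAWG: the name bigdawg_query and its optional second argument are our own, and the default address is simply the one used in this tutorial.

```shell
# Hypothetical wrapper around the curl command above.
# Usage: bigdawg_query "<query>" [docker-host-ip]
bigdawg_query() {
    local query="$1"
    # 192.168.99.100 is the address used in this tutorial; pass a different
    # address as the second argument if your Docker host VM uses another one.
    local host="${2:-192.168.99.100}"
    curl -X POST -d "$query" "http://${host}:8080/bigdawg/query/"
}
```

For example, bigdawg_query "bdrel(select * from mimic2v26.d_patients limit 4;)" sends the query shown earlier. If your Docker host VM has a different address, Docker Toolbox users can usually find it with docker-machine ip default, assuming the VM kept the name "default".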

1) postgres only

bdrel(select * from mimic2v26.d_patients limit 4)

This query uses the relational island (bdrel) to select 4 entries from the table mimic2v26.d_patients.