Exporting data to Databricks using GCS staging

Introduction

Data from a Darwinium node can be configured to export to Databricks tables. This page describes the steps required to enable a data sink that represents a Databricks table and uses Google Cloud Storage (GCS) as the intermediate staging solution.

High level architecture

Darwinium event data is exported into a Databricks table by using a staging approach.

Points to note about this architecture:

The export process is configured using a Google Cloud project dedicated to managing Darwinium exports.
The Google Cloud project must be associated with a Google service account. The service account definition is stored inside a vault in Darwinium.
The service account is assigned permissions to write to the GCS bucket.
The data is exported at regular intervals into the GCS bucket configured by the customer. This bucket is henceforth referred to as the staging bucket.
The staging bucket is configured to be a Databricks storage external location and is processed at regular intervals to ingest data into the Databricks streaming table.

The following diagram depicts the high level architecture for data export into a Databricks sink using GCS as the staging location.

Points to note about exported data

The following are the salient features of the export process:

The data is exported in Parquet format.
Data is exported to a folder under the following path in the GCS staging bucket: dataexport/events/
Within the above folder, a variable path name is generated (based on the portal configuration): <node_id>/<dataset_name>. The value of the node ID can be obtained from the node settings section of the portal.
All times in the file layout are aligned to UTC.
Backfilling of data is not supported. Data is pushed to Databricks from the point in time at which the export is configured in the Darwinium portal.
Data is expected to become visible in Databricks tables at hourly intervals.
Duplicates can occur in some scenarios. Please account for this when designing downstream consumption. The best way to resolve duplicates downstream is to use the "identifier" and "update_ver" columns.
Only ASCII characters are allowed in dataset names in the portal configuration.
A dataset name may not be repeated for a given node across different sinks.
The BYOS storage bucket can be reused for the data export process. This is because the data export is written to its own prefix path.
GDPR controls for staging data and Databricks tables are not supported by Darwinium and have to be managed by the customer.
The mechanism to expire staging data is to be managed by the customer teams.
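
As an illustration of the duplicate handling mentioned above, the following is a minimal Python sketch that keeps one row per "identifier". It assumes that a larger "update_ver" value denotes a more recent version of the same event, which this page does not state explicitly; the "payload" field and the sample rows are hypothetical. The same idea can be expressed in a downstream query or job.

```python
from typing import Any, Dict, Iterable, List


def latest_versions(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep one row per identifier, preferring the highest update_ver.

    Assumption: a larger update_ver means a more recent version of the event.
    """
    best: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        key = row["identifier"]
        current = best.get(key)
        # Replace the stored row only when this one is strictly newer,
        # so exact duplicates are dropped.
        if current is None or row["update_ver"] > current["update_ver"]:
            best[key] = row
    return list(best.values())


# Hypothetical sample rows; "payload" stands in for the remaining columns.
rows = [
    {"identifier": "evt-1", "update_ver": 1, "payload": "first"},
    {"identifier": "evt-1", "update_ver": 2, "payload": "second"},
    {"identifier": "evt-2", "update_ver": 1, "payload": "only"},
    {"identifier": "evt-2", "update_ver": 1, "payload": "only"},  # exact duplicate
]
deduplicated = latest_versions(rows)
for row in deduplicated:
    print(row["identifier"], row["update_ver"], row["payload"])
```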

High level overview of the configuration process

The following is a high level summary of the configuration steps that need to be followed. The steps assume that a GCS bucket has been created and is accessible from the GCP project. An existing bucket can be used, provided there is no prefix path collision with other applications and appropriate permissions are configured.

Configure GCS bucket to allow Darwinium writes
Create a configuration in Darwinium portal to enable a Databricks export using GCS
Configure Databricks to pull data from the GCS bucket
1. Wait for some data to accumulate
2. Create a new Databricks table definition

Step 1: Create and configure GCS bucket to allow for Darwinium writes.

Create a Darwinium specific Google project [Optional but highly recommended].

We first create a Google project that will help in controlling the permissions. An existing project can also be reused if it fits the organization's administrative policies.

Create a service account.

In the IAM console, go to the Service Accounts section on the left. Create a new service account. The service account needs to be given the following role to write to the bucket. (Any role that allows the service account to write to the bucket will suffice.)

Storage Admin

Create a GCS bucket [Optional but recommended]

Choose the geographic location and other settings that suit your organisational policies for data management. It is highly advised that the bucket be allocated for Darwinium export processes (and Databricks ingestion) only, as that will keep the permission model simple and clean. For the rest of this documentation page, we will assume the bucket name being used is gcp-sb-dwn-databricks-stage.

Note down the name of the bucket, as it is required while registering the data sink in the Darwinium portal.

Grant permissions on the bucket.

Next, we grant permissions on the bucket to the service account principal. In the Google Cloud console, go to the Cloud Storage section and open the bucket. In the bucket configuration screens, click on grant access and then, in the principals field, type in the name of the service account that was created above. Also add the role "Storage Admin" to the granted roles list. The following screenshot illustrates this configuration.

The grant access configuration screen looks like this:

Export the private key for the service account

Go to the IAM section of the Google Cloud console and then to the Service Accounts section. Create a new key for the service account, using JSON as the key format. Store this JSON file securely, as it is needed in the Darwinium portal configuration section. An illustration of the key export screen is given below.

Step 2: Create a configuration in Darwinium portal

The following information is needed to complete this step.

A name that you have chosen to represent the exported dataset. Ex: darwinium_all_events
The name of the staging GCS bucket. Ex: gcp-sb-dwn-databricks-stage
The project ID that is being used to manage the permissions.
The service account private key as downloaded in the previous section.
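
Before entering these values in the portal, it can help to sanity-check the downloaded key file. The sketch below is one possible check in Python, not part of the Darwinium tooling. It relies only on fields that a Google service account JSON key file normally contains (type, project_id, private_key, client_email); the function name and the sample key content are hypothetical, and the sample key is non-functional.

```python
import json

# Fields that a Google service account JSON key file normally contains.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(key_text: str) -> dict:
    """Parse key JSON and raise ValueError if it does not look like a service account key."""
    key = json.loads(key_text)
    if not isinstance(key, dict):
        raise ValueError("key file must contain a JSON object")
    missing = [name for name in REQUIRED_FIELDS if not key.get(name)]
    if missing:
        raise ValueError("key file is missing fields: " + ", ".join(missing))
    if key["type"] != "service_account":
        raise ValueError("unexpected key type: " + repr(key["type"]))
    return key


# Hypothetical, non-functional key content used only to exercise the check.
sample_key = json.dumps({
    "type": "service_account",
    "project_id": "example-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "dwn-export@example-project.iam.gserviceaccount.com",
})
key = check_service_account_key(sample_key)
print(key["project_id"], key["client_email"])
```

The project_id reported by the key file should match the project ID entered in the portal.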

In the Darwinium portal, navigate to Admin > Nodes. Then select the "Data Storage" tab at the top of the screen.

Fill in the values by using the information mentioned above.

This completes the configuration required for Darwinium to write to the bucket. We now proceed to configure Databricks to consume from this bucket.

Step 3: Configure Databricks to consume from the GCS bucket

A high level overview of how to configure Databricks with GCS as a staging location is documented here:

Configure Databricks external location using GCS

Please register the GCS bucket as an external location by following the steps documented in the above link. Note that this step is entirely specific to the way you would like to integrate Databricks with your Google Cloud account. At the end of this step, it is expected that the following are configured and ready to go:

Credentials configured in the Unity Catalog
An external location that maps to the GCS bucket that the Darwinium dataset is being written to.

Step 3 (1): Darwinium data accumulation

This step is needed to allow for easier schema management and to work around the limitations of automatic schema generation for Databricks tables.

Darwinium data is expected to be present in the GCS staging bucket within an hour of completing step 2 above (that is, after the Darwinium portal configuration is saved).

Step 3 (2): Create a new Databricks table definition

We next create a streaming table definition that makes use of the external location defined in the previous sections.

In the example command below, replace the following values with your own:

Bucket Name: gcp-sb-dwn-databricks-stage
instance_id: 7776bd33-9ef3-4846-a5f9-0db4a34da9aa (the node ID; please obtain this value from the node settings section of the Darwinium portal)
Dataset name: darwinium_all_events
Table Name: darwiniumevents

CREATE OR REFRESH STREAMING TABLE darwiniumevents
PARTITIONED BY (p_year, p_month, p_day, p_hour)
SCHEDULE EVERY 1 HOUR
AS SELECT *
FROM STREAM read_files(
  'gs://gcp-sb-dwn-databricks-stage/dataexport/events/7776bd33-9ef3-4846-a5f9-0db4a34da9aa/darwinium_all_events',
  format => 'parquet'
);
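
The read_files path in the statement above is composed of the bucket name, the fixed dataexport/events prefix, the node ID and the dataset name. The following Python sketch shows one way to assemble that path and to enforce the ASCII-only rule for dataset names. The function and constant names are hypothetical and are not part of any Darwinium or Databricks API.

```python
EXPORT_PREFIX = "dataexport/events"  # fixed export prefix described on this page


def staging_uri(bucket: str, node_id: str, dataset_name: str) -> str:
    """Build the gs:// path that the streaming table reads from."""
    # Dataset names are restricted to ASCII characters in the portal.
    if not dataset_name or not dataset_name.isascii():
        raise ValueError("dataset name must be non-empty and ASCII only")
    return f"gs://{bucket}/{EXPORT_PREFIX}/{node_id}/{dataset_name}"


uri = staging_uri(
    "gcp-sb-dwn-databricks-stage",
    "7776bd33-9ef3-4846-a5f9-0db4a34da9aa",
    "darwinium_all_events",
)
print(uri)
```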

This completes the configuration for using Darwinium events data inside Databricks using GCS as an intermediate staging location.
