Exporting data to Snowflake using GCS staging
Introduction
Data from a Darwinium node can be configured to export to Snowflake tables. This page describes the steps required to enable a data sink that represents a Snowflake table and uses Google Cloud Storage (GCS) as the intermediate staging solution.
High level architecture
Darwinium event data is exported into a Snowflake table using a staging approach.
Points to note about this architecture:
- The export process is configured using a Google project dedicated to managing Darwinium exports.
- The Google Cloud project must be associated with a Google service account definition.
- The service account is assigned permissions to write to the GCS bucket.
- The data is exported at regular intervals into the GCS bucket configured by the customer. This bucket is henceforth referred to as the staging bucket.
- The staging bucket is configured as a Snowflake storage integration and is processed at regular intervals to ingest data into the target Snowflake table.
The following diagram depicts the high-level architecture for data export into a Snowflake sink using GCS as the staging location.

Points to note about exported data
The following are the salient features of the export process:
- The data is exported in Parquet format.
- Data is exported to a folder under the following path in the GCS staging bucket: dataexport/events/
- Within the above folder, a variable path is generated (based on the portal configuration): <node_id>/<dataset_name>. The value of the node ID can be obtained from the node settings section of the portal.
- All times are UTC aligned on the file layouts.
- No backfilling of data is supported. Data is pushed to Snowflake from the point in time at which the export is configured in the Darwinium portal.
- Data is expected to be visible in Snowflake tables in hourly intervals.
- There is a possibility of duplicates occurring in some scenarios; please consider this while designing downstream consumption. The best way to resolve duplicates downstream is to use the "identifier" and "update_ver" columns.
- Note that only ASCII characters are allowed in dataset names in the portal configuration.
- A dataset name must not be reused for a given node across different sinks.
- The BYOS storage bucket can be reused for the data export process. This is because the data export is written to its own prefix path.
- GDPR controls are not supported by Darwinium and have to be managed by the customer.
- The mechanism to expire staging data must also be managed by the customer.
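As noted above, duplicates can occur in some scenarios. A sketch of downstream de-duplication using the "identifier" and "update_ver" columns, assuming the target table is named DARWINIUM_EVENTS (the name used later in this page):

```sql
-- Keep only the latest version of each event.
-- Assumes the DARWINIUM_EVENTS table created later in this page.
SELECT *
FROM DARWINIUM_EVENTS
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY "identifier"
    ORDER BY "update_ver" DESC
) = 1;
```

The same QUALIFY filter can be embedded in a view so that downstream consumers always see de-duplicated data.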
High level overview of the configuration process
The following is a high-level summary of the configuration steps that need to be followed. The steps assume a GCS bucket has been created and is accessible from the GCP project. You can choose to use an existing bucket provided there are no prefix path collisions with other applications and appropriate permissions are configured.
- Configure GCS bucket to allow Darwinium writes
- Create a configuration in Darwinium portal to enable a Snowflake export
- Configure Snowflake to pull data from the GCS bucket
- Wait for some data to accumulate
- Create a data format definition
- Create a Snowflake staging definition
- Create a Snowflake schema definition and a table definition
- Create a Snowpipe definition to enable auto-ingestion into the Snowflake table.
Step 1: Create and configure GCS bucket to allow for Darwinium writes.
Create a Darwinium specific Google project [Optional but recommended].
We first create a Google project that will help in controlling permissions. An existing project can be reused if that fits the organization's administrative policies.
Create a service account.
In the IAM console, navigate to the Service Accounts section on the left and create a new service account. The service account needs the following role to write to the bucket:
Storage Admin
Create a GCS bucket [Optional but recommended]
Choose the geo-location and other factors that suit your organisational policies for data management. It is highly advised that the bucket be allocated to Darwinium export processes (and Snowflake ingestion) only, as that keeps the permission model simple and clean. For the rest of this page, we assume the bucket name being used is gcp-sb-dwn-snowflake-stage.
Note down the name of the bucket; it is required when registering the data sink in the Darwinium portal.
Grant permissions on the bucket.
Next, we grant permissions on the bucket to the service account principal. In the Google Cloud console, navigate to the Cloud Storage section and then to the bucket. In the bucket configuration screens, click on Grant Access and, in the principals field, type in the name of the service account created above. Also add the "Storage Admin" role to the granted roles list. The following screenshots give a figurative representation of this.
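The console steps above can also be performed from the command line. A sketch using the gcloud CLI; the project ID and service account name below are placeholders for illustration, while the bucket name matches the example used on this page:

```shell
# Placeholder names - adjust to match your project, service account and bucket.
PROJECT_ID="my-dwn-exports-project"
SA_NAME="dwn-export-writer"
BUCKET="gcp-sb-dwn-snowflake-stage"

# Create the service account (if not already created via the console).
gcloud iam service-accounts create "$SA_NAME" --project "$PROJECT_ID"

# Grant the Storage Admin role on the staging bucket to the service account.
gcloud storage buckets add-iam-policy-binding "gs://$BUCKET" \
  --member "serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/storage.admin"
```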
The grant access configuration screen looks like this:

Export the private key for the service account
Navigate to the IAM section of the Google Cloud console and then to the Service Accounts section. Create a new key, choosing JSON as the key format. Store this JSON file securely; it is needed in the Darwinium portal configuration section. A figurative representation of the key export screen is given below.

Step 2: Create a configuration in Darwinium portal
The following information is needed to complete this step.
- A name that you have chosen to represent the exported dataset. Ex: darwinium_all_events
- The name of the staging gcs bucket. Ex: gcp-sb-dwn-snowflake-stage.
- The project ID that is being used to manage the permissions.
- The service account private key (JSON) as downloaded in the previous section.
In the Darwinium portal, navigate to Admin > Nodes. Then select the "Data Storage" tab at the top of the screen.

Fill in the values by using the information mentioned above.
This completes the configuration needed for Darwinium to write to the bucket. We now proceed to configure Snowflake to consume from this bucket.
Step 3: Configure Snowflake to consume from the GCS bucket
A high level overview of how to configure Snowflake with GCS as a staging location is documented here:
Enabling Auto ingest to Snowflake from GCS
Step 3 (1): Darwinium data accumulation
This step is needed to allow for easier schema management and also to work around the limitations of Snowflake's automatic table schema generation.
Darwinium data is expected to be present in the GCS staging bucket within an hour of completing step 2 above (that is, after the Darwinium portal configuration is saved).
Step 3 (2): Create a data format definition
A data format definition needs to be created to ingest Darwinium data.
Run the following command to create the data format definition:
CREATE FILE FORMAT IF NOT EXISTS DWN_DATA_FORMAT
TYPE = PARQUET
USE_LOGICAL_TYPE = TRUE
USE_VECTORIZED_SCANNER = TRUE;
Step 3 (3): Create a Snowflake GCS storage integration
Snowflake allows data to be ingested using GCS as a staging location. Please follow the steps in the section linked below, completing only up to Step 2.
Auto ingesting from the GCS bucket
Here is a representative command that can be run to create the storage integration:
CREATE STORAGE INTEGRATION darwinium_data_export_intg
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'GCS'
ENABLED = TRUE
STORAGE_ALLOWED_LOCATIONS = ('gcs://gcp-sb-dwn-snowflake-stage/dataexport');
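Once the integration is created, Snowflake provisions its own GCS service account, which must be granted read access on the staging bucket (this is covered by the steps in the linked Snowflake guide). That account can be retrieved with:

```sql
-- The STORAGE_GCP_SERVICE_ACCOUNT property holds the service account
-- that Snowflake uses to read from the GCS bucket.
DESC STORAGE INTEGRATION darwinium_data_export_intg;
```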
Step 3 (4): Create a Snowflake staging definition
We then create a stage. Please refer to step 3 in this page:
Auto ingesting from the GCS bucket
Use the following information as guidance to create the stage. To complete this step you will need the following values:
- The name of the Snowflake Database and the Schema into which we are going to ingest the data.
- The name of the GCS bucket.
- The Darwinium node ID (INSTANCE_ID) for which you are ingesting the data. You can get this value from the node settings screen in portal administration. Alternatively, you can contact your Darwinium customer representative.
- The dataset name. The dataset name was configured in step 2 above in this page. Ex: darwinium_all_events
- The name of the storage integration as created in step 3 (3). In the example below it is shown as darwinium_data_export_intg.
Please replace the placeholder values (marked with <>) in the command below before running it in Snowflake.
-- Ensure you are replacing the nodeid and the dataset and bucket names in the below command.
USE SCHEMA <DATABASE>.<SCHEMA>;
CREATE STAGE IF NOT EXISTS darwinium_events_gcs_stage URL = 'gcs://<GCS_BUCKET_NAME>/dataexport/events/<INSTANCE_ID>/<DARWINIUM_DATASET_NAME>' STORAGE_INTEGRATION = darwinium_data_export_intg;
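To verify that the stage and storage integration are wired correctly, the staged files can be listed; this should return the Parquet files written by the Darwinium export process:

```sql
-- Lists the staged Parquet files; an empty result suggests a permission
-- or path problem, or that no data has been exported yet.
LIST @darwinium_events_gcs_stage;
```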
Step 3 (5): Create schema and table definition
Next, a table definition needs to be created. Since there are many columns, it is easier to create the table definition using schema introspection. Use the following script to create the table definition. The following information is needed to complete this step:
- Paths to a few Parquet files already present in the GCS bucket (the Darwinium export process should have created some by this time). Snowflake does not support evolving a table schema by more than a handful of columns at a time, so the table should be created with as many columns as possible at creation time; alternatively, contact Snowflake directly to raise this limit on the number of new columns. Choose a few of the largest files in the staging location, as a large file size can indicate that many columns are defined in that file. This helps ensure the initial table is created with the maximum number of columns possible.
Use the following criteria to shortlist the files used to create the schema:
- A couple of files of maximum size (the rationale being that the largest files will likely cover all columns ever used)
- A couple of the most recent files (ensuring the latest data model is captured right at the beginning)
- A couple of files from the beginning of time (optional, but might be a good idea)
For the command below, assume the chosen Parquet files are named a.parquet, b.parquet, c.parquet, d.parquet and e.parquet.
Ensure the person executing the commands below has the right permissions to create tables.
-- REPLACE DB NAME AND SCHEMA NAME accordingly.
USE SCHEMA <DATABASE>.<SCHEMA>;
CREATE TABLE DARWINIUM_EVENTS
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
WITHIN GROUP (ORDER BY order_id)
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@darwinium_events_gcs_stage/'
, FILE_FORMAT => 'DWN_DATA_FORMAT'
, FILES => ('a.parquet', 'b.parquet', 'c.parquet', 'd.parquet', 'e.parquet')
)
))
ENABLE_SCHEMA_EVOLUTION = TRUE;
-- Below Adds a few metadata columns and is used in the pipe definition in the later sections
ALTER TABLE DARWINIUM_EVENTS
ADD COLUMN
FILENAME string,
FILE_LAST_MODIFIED timestamp_ntz,
FILE_SCAN_TIME timestamp_ltz;
-- Below adds a clustering key for efficient scanning. Please alter the below to suit your scan performance needs.
ALTER TABLE DARWINIUM_EVENTS CLUSTER BY (to_date("timestamp"));
Step 3 (6): Configure Snowflake for auto-ingestion
Create the auto-ingestion pipe. For more details, please refer to step 4 in this page:
Auto ingesting from the GCS bucket
You will need the name of the subscription (notification) integration as defined in step 3 (3). Please substitute that value in the command below.
Please use the following command as a reference:
CREATE PIPE DARWINIUM_GCS_EVENTS_PIPE
AUTO_INGEST = TRUE
INTEGRATION = '<SUBSCRIPTION_NOTIFICATION_INTEGRATION_CREATED_IN_STEP3(3)>'
AS
COPY INTO DARWINIUM_EVENTS
FROM @darwinium_events_gcs_stage
FILE_FORMAT = (FORMAT_NAME = 'DWN_DATA_FORMAT')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
ON_ERROR = CONTINUE
TRUNCATECOLUMNS = TRUE
INCLUDE_METADATA = (FILE_LAST_MODIFIED=METADATA$FILE_LAST_MODIFIED, FILE_SCAN_TIME=METADATA$START_SCAN_TIME, FILENAME = METADATA$FILENAME);
-- Below ensures pipe is unpaused for continuous ingestion
ALTER PIPE DARWINIUM_GCS_EVENTS_PIPE SET PIPE_EXECUTION_PAUSED = FALSE;
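To confirm that the pipe is running and ingesting, its status and recent load history can be inspected:

```sql
-- Shows the pipe state, including the count of files pending ingestion.
SELECT SYSTEM$PIPE_STATUS('DARWINIUM_GCS_EVENTS_PIPE');

-- Shows files loaded into the table over the last 24 hours.
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'DARWINIUM_EVENTS',
    START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())
));
```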
This marks the completion of all the configuration steps. Darwinium data will now flow continuously into the table above at regular intervals (at most hourly).