Exporting data to Snowflake using AWS S3 staging

Introduction

Data from a Darwinium node can be configured to export to Snowflake tables. This page describes the steps required to enable a data sink that represents a Snowflake table.

High level architecture

Darwinium event data is exported into a Snowflake table using a staging approach. The data is exported at regular intervals into an S3 bucket configured by the customer, referred to in this page as the staging bucket. The staging bucket is configured as a Snowflake storage integration and is processed at regular intervals to ingest data into the target Snowflake table.

The following diagram depicts the high-level architecture for data export into a Snowflake sink using AWS S3 as the staging location.

SnowflakeExport.png

Points to note

The following are the salient features of the export process:

  • The data is exported in Parquet format.
  • Data is exported to a folder under the following path in the S3 staging bucket: dataexport/events/
  • Within the above folder, a variable path name is generated (based on the portal configuration): <node_id>/<dataset_name>. The value of the node id can be obtained from the node settings section of the portal.
  • All times on the file layouts are UTC aligned.
  • No backfilling of data is supported. Data is pushed to Snowflake from the point in time at which the export is configured in the Darwinium portal.
  • Data is expected to be visible in Snowflake tables at hourly intervals.
  • Duplicates can occur in some scenarios. Please consider this while designing downstream consumption. The best way to resolve duplicates downstream is to use the "identifier" and "update_ver" columns, as shown in the sketch after this list.
  • Only ASCII characters are allowed in dataset names in the portal configuration.
  • A dataset name may not be repeated for a given node across different sinks.
  • The BYOS storage bucket can be reused for the data export process, because the data export is written to its own prefix path.
  • GDPR controls are not provided by Darwinium and have to be managed by the customer.
  • The mechanism to expire staging data is to be managed by the customer teams.
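
The following is a minimal deduplication sketch. It assumes the EVENTS table created later in this page and the exported "identifier" and "update_ver" columns; adjust the table and column names to your setup.

-- Keep only the most recent version of each event when reading downstream.
SELECT *
FROM EVENTS
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY "identifier"
    ORDER BY "update_ver" DESC
) = 1;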

High level overview of the configuration process

The following is a high-level summary of the configuration steps. The steps assume an S3 bucket has already been created. You can choose to use an existing bucket provided there is no prefix path collision with other applications. If you are already using a BYOS S3 bucket for portal data, the same bucket can be used.

  1. Configure S3 bucket to allow Darwinium writes
    1. Create a Role and attach a policy for S3 Writes on the chosen bucket
  2. Create a configuration in Darwinium portal to enable a Snowflake export
  3. Complete S3 trust relationship
    1. Attach a trust policy to the role created in step 1 to allow Darwinium to assume role.
  4. Configure Snowflake to allow S3 reads
    1. Create a storage integration in Snowflake to read from S3
  5. Configure Snowflake
    1. Wait for some data to accumulate
    2. Create a data format definition
    3. Create a Snowflake staging definition
    4. Create a Snowflake schema definition and a table definition
    5. Create a Snowpipe definition to enable auto ingestion into the Snowflake table.

Step 1: Configure S3 bucket to allow Darwinium writes

First we create an AWS role and attach an S3 write policy on the staging bucket.

You will need the ARN of the S3 staging bucket which will be used to stage the data export. This ARN is referred to as "S3BUCKETARN" in the example below. An example S3 bucket ARN looks like this: arn:aws:s3:::dwn-customer-snowflake-bucket

Create an AWS policy that looks similar to the following. This policy allows anyone who assumes the role to write to the S3 bucket. Replace the value S3BUCKETARN in the snippet below with the S3 bucket ARN of the staging bucket.

{
    "Statement": [
        {
            "Action": [
                "s3:PutObject"
            ],
            "Effect": "Allow",
            "Resource": "S3BUCKETARN/dataexport/*"
        },
        {
            "Action": [
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": "S3BUCKETARN"
        }
    ],
    "Version": "2012-10-17"
}

Every role has an ARN associated with it. An example role ARN looks like the following: arn:aws:iam::123456789012:role/iamrole-darwinium-data-export. Please note the ARN of the role created in this step. This role ARN is required in the next step.

Step 2: Create a configuration in Darwinium portal

The following information is needed to complete this step:

  • A name that you have chosen to represent the exported dataset. Ex: darwinium_all_events
  • The name of the staging S3 bucket. Ex: s3-bucket-for-darwinium-data-exports
  • The region of the staging bucket. Ex: us-east-2
  • The role ARN that Darwinium will be allowed to assume (obtained in the previous step). Ex: arn:aws:iam::123456789012:role/iamrole-darwinium-data-export

In the Darwinium portal, navigate to Admin > Nodes. Then select the "Data Storage" tab at the top of the screen.

SnowflakePortalConfiguration.png

Fill in the values using the information mentioned above. The configuration screen creates an external ID. Please treat this external ID as a secret.

**The external ID will not be visible once the configuration is saved and hence needs to be noted down in a secure location.**

Step 3: Complete S3 trust relationship

We will need to attach a trust relationship to the role defined in Step 1. The trust relationship allows roles in the Darwinium AWS account to assume the role created in Step 1.

The following information is needed to complete this step.

  • The external ID as obtained in the previous step. Ex: 76e7ec0e-152b-40cc-b88b-579357b84754
  • The Role ARN that was used in Step 2 of the configuration screen.
  • Darwinium Account ID. Please get this from your Darwinium Customer Service contact. Ex: 123456789012

Here is an example trust policy that attaches to the role created in Step 1. Please replace DWNACCOUNTID (e.g. 123456789012) and EXTERNALID (e.g. 76e7ec0e-152b-40cc-b88b-579357b84754) in the example below with the values that you have.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::DWNACCOUNTID:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": [
                        "EXTERNALID"
                    ]
                }
            }
        }
    ]
}

Step 4: Create a Snowflake S3 storage integration

Snowflake allows data to be ingested using AWS S3 as a staging location. Please follow the instructions documented by Snowflake for creating a storage integration.

**Please note the storage integration name that is being configured as part of this step.**

The name DARWINIUM_S3_STORAGE is assumed as the storage integration name in the rest of this document.

See the Snowflake documentation: Create a Snowflake S3 storage integration.
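
For reference, creating the storage integration typically looks like the sketch below. The role ARN and allowed location shown here are placeholders; the full trust-policy exchange between Snowflake and AWS (using the output of DESC INTEGRATION) is described in the Snowflake documentation.

-- Sketch only: replace the role ARN and bucket path with your own values.
CREATE STORAGE INTEGRATION DARWINIUM_S3_STORAGE
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-read-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://dwn-customer-snowflake-bucket/dataexport/events/');

-- Shows the Snowflake-side IAM user and external ID needed to finish the AWS trust setup.
DESC INTEGRATION DARWINIUM_S3_STORAGE;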

Step 5 (1): Darwinium data accumulation

This step is needed to allow for easier schema management and to work around the limitations of Snowflake's automatic table schema generation.

Darwinium data is expected to be present in the S3 staging bucket within an hour of step 3 above being completed (after the S3 trust policy is created).

Step 5 (2): Create a data format definition

A data format definition needs to be created to ingest Darwinium data.

Run the following command to create the data format definition:

CREATE FILE FORMAT IF NOT EXISTS DWN_DATA_FORMAT
  TYPE = PARQUET
  USE_LOGICAL_TYPE = TRUE
  USE_VECTORIZED_SCANNER = TRUE;
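
To confirm the file format was created, you can optionally run:

-- Optional check that the file format exists.
SHOW FILE FORMATS LIKE 'DWN_DATA_FORMAT';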

Step 5 (3): Create a Snowflake staging definition

To complete this step you will need three values:

  • The name of the S3 bucket. Ex: dwn-customer-snowflake-bucket
  • The Darwinium node_id for which you are ingesting data. You can get the node_id value from the node settings screen in portal administration, or alternatively contact your Darwinium customer representative. This is referred to as node_id in the example below.
  • The dataset name, as configured in step 2 above. Ex: eventsdata

Please replace the values (marked in <>) in the command below before running it in Snowflake.

-- Ensure you replace the bucket name, node_id and dataset name in the command below.
-- Example URL: s3://dwn-customer-snowflake-bucket/dataexport/events/<node_id>/eventsdata
CREATE STAGE IF NOT EXISTS DARWINIUM_STAGE_EVENTS URL = 's3://<bucket_name>/dataexport/events/<node_id>/<dataset_name>' STORAGE_INTEGRATION = DARWINIUM_S3_STORAGE;
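
Once the stage is created, you can optionally confirm that it resolves to the exported files (this assumes step 5 (1) has completed and some Parquet files are already present in the staging location):

-- Lists the exported Parquet files visible through the stage. An empty result usually means
-- the URL, storage integration, or node_id/dataset values need to be re-checked.
LIST @DARWINIUM_STAGE_EVENTS;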

Step 5 (4): Create schema and table definition

Optionally, create a new schema in which to ingest Darwinium data if one does not exist yet. An existing schema can be used if desired.

The rest of the documentation assumes the database is named eventsdb and the schema is called public. Please replace these values according to the setup in which you want to create the table.

A table definition then needs to be created. Since there are many columns, it is easier to create the table definition using introspection. Use the following script to create the table definition. The following information is needed to complete this step.

  • The paths of a few Parquet files that are already present in the S3 bucket (the Darwinium export process should have created some Parquet files by this time). Snowflake does not support evolving a table schema by more than a handful of columns at a time, so we try to create the table with as many columns as possible at creation time. Alternatively, contact Snowflake directly to raise this limit on the number of new columns. Choose a few of the largest files in the staging location, as a larger file is a good indicator that it defines many columns; this ensures the initial table is created with as many columns as possible. The commands below assume the Parquet files are named a.parquet, b.parquet and c.parquet.
  • Ensure the person executing the commands below has the right permissions to create tables.
-- REPLACE DB NAME AND SCHEMA NAME accordingly.
USE SCHEMA eventsdb.public;

CREATE TABLE EVENTS
    USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
          WITHIN GROUP (ORDER BY order_id)
        FROM TABLE(
            INFER_SCHEMA(        
            LOCATION=>'@DARWINIUM_STAGE_EVENTS/'
            , FILE_FORMAT => 'DWN_DATA_FORMAT'
            , FILES => ('a.parquet', 'b.parquet','c.parquet')
        )
    ))
ENABLE_SCHEMA_EVOLUTION = TRUE;

-- Below Adds a few metadata columns and is used in the pipe definition in the later sections
ALTER TABLE EVENTS 
ADD COLUMN 
  FILENAME string, 
  FILE_LAST_MODIFIED timestamp_ntz, 
  FILE_SCAN_TIME timestamp_ltz;

-- Below adds a clustering key for efficient scanning. Please alter the below to suit your scan performance needs.
ALTER TABLE EVENTS CLUSTER BY (to_date("timestamp"));
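
After the table is created, you can optionally verify that the inferred columns and the three added metadata columns are present:

-- Optional check of the generated table definition.
DESC TABLE EVENTS;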

Step 5 (5): Create a Snowpipe for managing auto-ingest

The last step of the process is to define a Snowpipe that will automatically ingest data as new data files are written to the S3 bucket by the Darwinium export process.

Please use the following command to create the Snowpipe.



CREATE PIPE DARWINIUM_EVENTS_PIPE
  AUTO_INGEST = TRUE
  AS
  COPY INTO EVENTS
  FROM @DARWINIUM_STAGE_EVENTS 
  FILE_FORMAT = (FORMAT_NAME = 'DWN_DATA_FORMAT' ) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE 
  ON_ERROR = CONTINUE 
  TRUNCATECOLUMNS = TRUE
  INCLUDE_METADATA = (FILE_LAST_MODIFIED=METADATA$FILE_LAST_MODIFIED,  FILE_SCAN_TIME=METADATA$START_SCAN_TIME, FILENAME = METADATA$FILENAME);
  
-- Below ensures the pipe is unpaused for continuous ingestion
ALTER PIPE DARWINIUM_EVENTS_PIPE SET PIPE_EXECUTION_PAUSED = FALSE;

This marks the completion of all the configuration steps. Darwinium data will now flow continuously into the above table at regular intervals (at most hourly).
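
Optionally, you can check that the pipe is active and monitor ingestion with the statements below. For AUTO_INGEST pipes, Snowflake relies on S3 event notifications being delivered to the pipe's notification channel (the SQS ARN shown by SHOW PIPES); if files appear in the stage but are not loaded, see the Snowflake documentation on automating Snowpipe for Amazon S3.

-- Shows the pipe, including its notification_channel (SQS ARN) used for auto-ingest.
SHOW PIPES LIKE 'DARWINIUM_EVENTS_PIPE';

-- Reports whether the pipe is running and how many files are pending.
SELECT SYSTEM$PIPE_STATUS('DARWINIUM_EVENTS_PIPE');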

