Exporting data to AWS S3

Introduction

Data from a Darwinium node can be configured to export to S3 object stores. This page describes the steps required to enable a data sink that represents an S3 object store.

High level architecture

The data is exported at regular intervals into an S3 bucket configured by the customer. This bucket is henceforth referred to as the S3 export bucket. The export process runs from the Darwinium VPC and pushes the data into this bucket.

The following are the salient features of the export process:

  • The data is exported in Parquet format.
  • The dataset is organized into a prefix/folder structure using the following layout inside the S3 bucket:
    • A fixed folder called dataexport/events
    • A variable component (based on the portal configuration) - <node_id>/<dataset_name>. The value of the node id can be obtained from the node settings section of the portal.
    • A Hive-style partitioning layout (described in the next point)
  • The "folder" layout on S3 follows a Hive-partitioned style, with hourly granularity as the lowest dimension.
  • The partition columns are based on the time dimension and are fixed to the following hierarchy: p_year, p_month, p_day, p_hour
    • For example, all events generated between 4 PM and 5 PM UTC on 25 August 2025 fall under the partition folder p_year=2025/p_month=8/p_day=25/p_hour=16
  • All times in the file layouts are UTC aligned.
  • Assuming the node id is "eb1658c2-b947-49fe-b5b8-fbfe614b01ae" and the dataset was named "darwinium_all_events" in the portal configuration, the data for the above example time frame can be located at the following path in the S3 bucket:
dataexport/events/eb1658c2-b947-49fe-b5b8-fbfe614b01ae/darwinium_all_events/p_year=2025/p_month=8/p_day=25/p_hour=16/
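
As an illustration only, the following Python sketch (using pyarrow, which is an assumption and not part of the Darwinium configuration) builds the partition prefix for a given UTC hour and reads that hour's Parquet files; the bucket, node id and dataset name are the example values used above.

from datetime import datetime, timezone

import pyarrow.dataset as ds  # assumed dependency: pip install pyarrow

# Example values from this article -- substitute your own bucket, node id and dataset name.
BUCKET = "s3-bucket-for-darwinium-data-exports"
NODE_ID = "eb1658c2-b947-49fe-b5b8-fbfe614b01ae"
DATASET = "darwinium_all_events"

def partition_prefix(ts: datetime) -> str:
    """Return the Hive-style partition prefix for the UTC hour containing ts."""
    ts = ts.astimezone(timezone.utc)
    return (
        f"dataexport/events/{NODE_ID}/{DATASET}/"
        f"p_year={ts.year}/p_month={ts.month}/p_day={ts.day}/p_hour={ts.hour}"
    )

# Read one hourly partition (4 PM - 5 PM UTC on 25 August 2025) as a pyarrow Table.
prefix = partition_prefix(datetime(2025, 8, 25, 16, tzinfo=timezone.utc))
table = ds.dataset(f"s3://{BUCKET}/{prefix}/", format="parquet").to_table()
print(table.num_rows)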

Points to note

The following points need to be considered while configuring the data sinks inside the portal.

  • Data is expected to be visible in the export bucket within at most an hour (typically within 15 minutes under regular processing patterns).
  • Duplicates may occur in some scenarios. Please consider this when designing downstream consumption. The best way to resolve duplicates downstream is to use the "identifier" and "update_ver" columns (see the sketch after this list).
  • Only ASCII characters are allowed in dataset names in the portal configuration.
  • A dataset name may not be repeated for a given node across different sinks.
  • The BYOS storage bucket can be reused for the data export process, because the data export is written to its own prefix path.
  • GDPR controls are not supported by Darwinium and have to be managed by the customer.
  • The mechanism to expire the exported staging data has to be managed by the customer teams.
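
A minimal deduplication sketch, assuming the exported rows have been loaded into a pandas DataFrame (pandas is an assumption, not a Darwinium requirement): for each "identifier", keep the row with the highest "update_ver".

import pandas as pd

def deduplicate(events: pd.DataFrame) -> pd.DataFrame:
    """Keep the latest version of each event using the identifier and update_ver columns."""
    return (
        events.sort_values("update_ver")
        .drop_duplicates(subset="identifier", keep="last")
        .reset_index(drop=True)
    )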

High level overview of the configuration process

The following is a high-level summary of the configuration steps. The steps assume an S3 bucket has already been created. An existing bucket can be used provided there is no prefix path collision with other applications. If you are already using a BYOS S3 bucket for portal data, the same bucket can be used.

  1. Create a Role and attach a policy for S3 Writes on the chosen bucket
  2. Create a configuration in Darwinium portal to enable a S3 export
  3. Attach a trust policy to the role created in step 1 to allow Darwinium to assume the role.

Step 1 - Create a role and attach an S3 write policy

First we create an AWS IAM role and attach an S3 write policy for the S3 export bucket. The following information is needed:

  • The ARN of the S3 staging bucket which will be used to stage the data export. This ARN is referred to as "S3BUCKETARN" in the example below. An example S3 bucket ARN looks like this: arn:aws:s3:::dwn-customer-databricks-bucket

Create an AWS policy similar to the following and attach it to the role. This policy allows anyone who assumes the role to write to the S3 bucket. Replace the value S3BUCKETARN in the snippet below with the ARN of the staging bucket.

{
    "Statement": [
        {
            "Action": [
                "s3:PutObject"
            ],
            "Effect": "Allow",
            "Resource": "S3BUCKETARN/dataexport/*"
        },
        {
            "Action": [
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": "S3BUCKETARN"
        }
    ],
    "Version": "2012-10-17"
}

Every role has an ARN associated with it. An example role ARN looks like the following: arn:aws:iam::123456789012:role/iamrole-darwinium-data-export. Please note the ARN of the role created in this step. This role ARN is required in the next step.
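
As an optional illustration, the write policy above could also be attached to the role programmatically with boto3 rather than through the AWS console; the role name and policy name below are hypothetical placeholders.

import json

import boto3  # assumed dependency: pip install boto3

iam = boto3.client("iam")

S3BUCKETARN = "arn:aws:s3:::dwn-customer-databricks-bucket"  # your staging bucket ARN

write_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:PutObject"], "Resource": f"{S3BUCKETARN}/dataexport/*"},
        {"Effect": "Allow", "Action": ["s3:GetBucketLocation"], "Resource": S3BUCKETARN},
    ],
}

# Attach the write policy inline to the role created in this step (names are placeholders).
iam.put_role_policy(
    RoleName="iamrole-darwinium-data-export",
    PolicyName="darwinium-s3-export-write",
    PolicyDocument=json.dumps(write_policy),
)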

Step 2 - Create Data Sink configuration in Portal

The following information is needed to complete this step.

  • A name that you have chosen to represent the exported dataset. Ex: darwinium_all_events
  • The name of the s3 export bucket. Ex: s3-bucket-for-darwinium-data-exports
  • The region of the staging bucket. Ex: us-east-2
  • The role ARN that Darwinium will be allowed to assume (obtained in the previous step). Ex: arn:aws:iam::123456789012:role/iamrole-darwinium-data-export

In the Darwinium portal, navigate to Admin > Nodes. Then select the "Data Storage" tab at the top of the screen.


Fill in the values by using the information mentioned above. The configuration screen creates an external ID. Please treat this external ID as a secret.

**The external ID will not be visible once the configuration is saved, so note it down in a secure location.**

Step 3 - Attach trust policy

We now need to attach a trust relationship to the role defined in step 1. The trust relationship will allow roles in the Darwinium AWS account to assume the role created in step 1.

The following information is needed to complete this step.

  • The external ID as obtained in the previous step. Ex: 103a59e1-a89c-4d73-8287-271f0a7314f2
  • The Role ARN that was used in Step 2 of the configuration screen.
  • Darwinium Account ID. Please get this from your Darwinium Customer Service contact. Ex: 123456789012

Here is an example trust policy that attaches to the above role (created in step 1). Replace DWNACCOUNTID (e.g. 123456789012) and EXTERNALID (e.g. 103a59e1-a89c-4d73-8287-271f0a7314f2) in the example below with the values that you have.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::DWNACCOUNTID:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": [
                        "EXTERNALID"
                    ]
                }
            }
        }
    ]
}
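
If you prefer to apply the trust policy programmatically, a boto3 sketch along the following lines could be used; the role name, account ID and external ID are placeholders for your own values.

import json

import boto3  # assumed dependency: pip install boto3

iam = boto3.client("iam")

DWN_ACCOUNT_ID = "123456789012"                        # from your Darwinium Customer Service contact
EXTERNAL_ID = "103a59e1-a89c-4d73-8287-271f0a7314f2"   # from the portal configuration screen

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{DWN_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": [EXTERNAL_ID]}},
        }
    ],
}

# Replace the assume-role (trust) policy on the role created in step 1 (name is a placeholder).
iam.update_assume_role_policy(
    RoleName="iamrole-darwinium-data-export",
    PolicyDocument=json.dumps(trust_policy),
)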

This completes all the configuration steps to create a generic S3 export definition. Save the configuration by clicking the update button on the screen. Data is expected to be populated at regular intervals (typically between 15 minutes and an hour). All data is aligned to UTC.
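
Once the first export interval has elapsed, you can verify that data is arriving by listing the expected prefix, for example with boto3 (an assumption, not a required tool); the bucket, node id and dataset name below are the example values used earlier in this article.

import boto3  # assumed dependency: pip install boto3

s3 = boto3.client("s3")

BUCKET = "s3-bucket-for-darwinium-data-exports"
NODE_ID = "eb1658c2-b947-49fe-b5b8-fbfe614b01ae"
DATASET = "darwinium_all_events"

# List everything exported so far for this node and dataset.
prefix = f"dataexport/events/{NODE_ID}/{DATASET}/"
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])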

