S3 Data Lake
This connector is in early access, and should not be used for production workloads.
We're interested in hearing about your experience! See GitHub for more information on joining the beta.
This page guides you through the process of setting up the S3 Data Lake destination connector.
This connector writes Iceberg-format tables to S3 or an S3-compatible storage backend. It currently supports the REST, AWS Glue, and Nessie catalogs.
Setup Guide
S3 Data Lake requires configuring two components: S3 storage and your Iceberg catalog.
S3 Setup
The connector needs the following permissions to write Iceberg-format files to S3 (an example policy follows the list):
s3:ListAllMyBuckets
s3:GetObject*
s3:PutObject
s3:PutObjectAcl
s3:DeleteObject
s3:ListBucket*
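As a minimal sketch, these permissions could be granted with an IAM policy like the one below. The bucket name `my-bucket` is a placeholder for your own bucket; note that `s3:ListAllMyBuckets` must be granted on all resources.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListAllBuckets",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    },
    {
      "Sid": "BucketLevelAccess",
      "Effect": "Allow",
      "Action": "s3:ListBucket*",
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Sid": "ObjectLevelAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject*",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}
```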
Iceberg Catalog Setup
Different catalogs have different setup requirements.
AWS Glue
In addition to the S3 permissions, you should also grant these Glue permissions (an example policy follows the list):
glue:TagResource
glue:UnTagResource
glue:BatchCreatePartition
glue:BatchDeletePartition
glue:BatchDeleteTable
glue:BatchGetPartition
glue:CreateDatabase
glue:CreateTable
glue:CreatePartition
glue:DeletePartition
glue:DeleteTable
glue:GetDatabase
glue:GetDatabases
glue:GetPartition
glue:GetPartitions
glue:GetTable
glue:GetTables
glue:UpdateDatabase
glue:UpdatePartition
glue:UpdateTable
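As a sketch, these could be granted with an IAM policy like the following. The `Resource` is left as `*` for brevity; you can scope it down to your Glue catalog, database, and table ARNs.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:TagResource",
        "glue:UnTagResource",
        "glue:BatchCreatePartition",
        "glue:BatchDeletePartition",
        "glue:BatchDeleteTable",
        "glue:BatchGetPartition",
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:CreatePartition",
        "glue:DeletePartition",
        "glue:DeleteTable",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:UpdateDatabase",
        "glue:UpdatePartition",
        "glue:UpdateTable"
      ],
      "Resource": "*"
    }
  ]
}
```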
Set the "warehouse location" option to s3://<bucket name>/path/within/bucket
.
The "Role ARN" option is only usable in cloud.
REST catalog
You will need the URI of your REST catalog.
If using the Dremio catalog, set the "warehouse location" option to `<Dremio warehouse name>/path`.
Nessie
You will need the URI of your Nessie catalog, and an access token to authenticate to that catalog.
Set the "warehouse location" option to s3://<bucket name>/path/within/bucket
.
Iceberg schema generation
The top-level fields of the stream are mapped to Iceberg fields. Nested fields (objects, arrays, and unions) are mapped to `STRING` columns and written as serialized JSON. This is the full mapping between Airbyte types and Iceberg types:
| Airbyte type | Iceberg type |
|---|---|
| Boolean | Boolean |
| Date | Date |
| Integer | Long |
| Number | Double |
| String | String |
| Time with timezone | Time |
| Time without timezone | Time |
| Timestamp with timezone | Timestamp with timezone |
| Timestamp without timezone | Timestamp without timezone |
| Object | String (JSON-serialized value) |
| Array | String (JSON-serialized value) |
| Union | String (JSON-serialized value) |
Note that for the time/timestamp with timezone types, the value is first adjusted to UTC, and then written into the Iceberg file.
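For example, a record with a nested object (hypothetical field names shown below) would be written with the object serialized to a JSON string, and the timestamp adjusted to UTC:

```text
Incoming Airbyte record:
  {"id": 1, "updated_at": "2025-01-01T12:00:00+02:00", "address": {"city": "Berlin"}}

Resulting Iceberg row:
  id         (Long)                    -> 1
  updated_at (Timestamp with timezone) -> 2025-01-01T10:00:00Z
  address    (String)                  -> "{\"city\":\"Berlin\"}"
```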
Schema Evolution
This connector supports limited schema evolution. Outside of refreshes/clears, the connector will never rewrite existing data files, which means it can only handle specific schema changes:
- Adding/removing a column
- Widening column types
- Changing the primary key
If your source goes through an unsupported schema change, you must manually edit the table schema.
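As one hedged sketch of such a manual edit, you could widen a column with Spark SQL using the Iceberg runtime. The catalog, warehouse, table, and column names below are placeholders, and this assumes the Iceberg Spark runtime jar and AWS bundle are on the classpath:

```python
# Hypothetical sketch: manually widening a column after an unsupported
# schema change, using Spark SQL against an AWS Glue-backed Iceberg catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Enable Iceberg's SQL extensions (required for ALTER COLUMN ... TYPE).
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named my_catalog backed by AWS Glue (placeholder names).
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://my-bucket/path/within/bucket")
    .getOrCreate()
)

# Widen an int column to long; Iceberg supports this without rewriting data files.
spark.sql("ALTER TABLE my_catalog.my_db.my_stream ALTER COLUMN my_column TYPE bigint")
```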
Deduplication
This connector uses a merge-on-read strategy to support deduplication (see the sketch after this list):
- The stream's primary keys are translated to Iceberg's identifier columns.
- An "upsert" is an equality-based delete on that row's primary key, followed by an insertion of the new data.
Assumptions
The S3 Data Lake connector assumes that one of two things is true:
- The source will never emit the same primary key twice in a single sync attempt
- If the source emits the same PK multiple times in a single attempt, it will always emit those records in cursor order (oldest to newest)
If these conditions are not met, you may see inaccurate data in the destination (i.e. older records taking precedence over newer records). If this happens, you should use the `append` or `overwrite` sync mode.
Reference
Config fields reference
Changelog
| Version | Date | Pull Request | Subject |
|---|---|---|---|
| 0.3.2 | 2025-02-04 | #52690 | Handle special characters in stream name/namespace when using AWS Glue |
| 0.3.1 | 2025-02-03 | #52633 | Fix dedup |
| 0.3.0 | 2025-01-31 | #52639 | Make the database/namespace a required field |
| 0.2.23 | 2025-01-27 | #51600 | Internal refactor |
| 0.2.22 | 2025-01-22 | #52081 | Implement support for REST catalog |
| 0.2.21 | 2025-01-27 | #52564 | Fix crash on stream with 0 records |
| 0.2.20 | 2025-01-23 | #52068 | Add support for default namespace (/database name) |
| 0.2.19 | 2025-01-16 | #51595 | Clarifications in connector config options |
| 0.2.18 | 2025-01-15 | #51042 | Write structs as JSON strings instead of Iceberg structs. |
| 0.2.17 | 2025-01-14 | #51542 | New identifier fields should be marked as required. |
| 0.2.16 | 2025-01-14 | #51538 | Update identifier fields if incoming fields are different than existing ones |
| 0.2.15 | 2025-01-14 | #51530 | Set AWS region for S3 bucket for nessie catalog |
| 0.2.14 | 2025-01-14 | #50413 | Update existing table schema based on the incoming schema |
| 0.2.13 | 2025-01-14 | #50412 | Implement logic to determine super types between iceberg types |
| 0.2.12 | 2025-01-10 | #50876 | Add support for AWS instance profile auth |
| 0.2.11 | 2025-01-10 | #50971 | Internal refactor in AWS auth flow |
| 0.2.10 | 2025-01-09 | #50400 | Add S3DataLakeTypesComparator |
| 0.2.9 | 2025-01-09 | #51022 | Rename all classes and files from Iceberg V2 |
| 0.2.8 | 2025-01-09 | #51012 | Rename/Cleanup package from Iceberg V2 |
| 0.2.7 | 2025-01-09 | #50957 | Add support for GLUE RBAC (Assume role) |
| 0.2.6 | 2025-01-08 | #50991 | Initial public release. |