# Amazon S3

Amazon Simple Storage Service (S3) is a scalable cloud storage service offered by Amazon Web Services (AWS). It allows users to store and retrieve large amounts of data from anywhere on the web.

`omniload` supports Amazon S3 as both a data source and destination.

## URI Format

The URI for connecting to Amazon S3 is structured as follows:

```text
s3://?access_key_id=<access_key_id>&secret_access_key=<secret_access_key>
```

**URI Parameters:**

* `access_key_id`: Your AWS access key ID.
* `secret_access_key`: Your AWS secret access key.
* `endpoint_url`: URL of an S3-compatible API server (optional, for S3-compatible storage like Minio).
* `layout`: Layout template (optional, destination only).

The `access_key_id` and `secret_access_key` credentials are required to authenticate and authorize access to your S3 buckets.

The `--source-table` parameter specifies the S3 bucket and file pattern using the following format:

```
<bucket_name>/<path_to_files>
```

## Setting up an S3 Integration

To integrate `omniload` with Amazon S3, you need an `access_key_id` and a `secret_access_key`. For guidance on obtaining these credentials, refer to the dltHub documentation on [AWS credentials](https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem#get-credentials).

Once you have your credentials, you can configure the S3 URI. The `bucket_name` and `path_to_files` (file glob pattern) are specified in the `--source-table` argument.

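The URI described above can also be assembled from its parts in the shell, which makes it easier to add the optional `endpoint_url`. The following is a minimal sketch, not part of `omniload`: the helper name `s3_uri` is hypothetical, and it assumes that the credential values contain no characters that need percent-encoding and that the order of query parameters does not matter.

```sh
# Hypothetical helper (not part of omniload): build an S3 URI from its parts.
# Assumes the values contain no characters that need percent-encoding.
#   $1 = access key ID, $2 = secret access key, $3 = endpoint URL (optional)
s3_uri() {
    uri="s3://?access_key_id=$1&secret_access_key=$2"
    if [ -n "${3:-}" ]; then
        uri="$uri&endpoint_url=$3"
    fi
    printf '%s\n' "$uri"
}

# Print the URIs for the example values used on this page.
s3_uri 'AKC3YOW7E' 'XCtkpL5B'
s3_uri 'minioadmin' 'minioadmin' 'http://localhost:9000'
```

The printed value can then be passed to `--source-uri` or `--dest-uri`, for instance as `--source-uri "$(s3_uri "$KEY_ID" "$SECRET")"`, where `KEY_ID` and `SECRET` are placeholder variable names.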
### Example: Loading data from S3

Let's assume the following details:

* `access_key_id`: `AKC3YOW7E`
* `secret_access_key`: `XCtkpL5B`
* S3 bucket name: `my_bucket`
* Path to files within the bucket: `students/students_details.csv`

The following command demonstrates how to copy data from the specified S3 location to a DuckDB database:

```sh
omniload ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/students/students_details.csv' \
    --dest-uri duckdb:///s3_data.duckdb \
    --dest-table 'processed_students.student_details'
```

This command will create a table named `student_details` within the `processed_students` schema (or equivalent grouping) in the DuckDB database file located at `s3_data.duckdb`.

### Example: Uploading data to S3

For this example, we'll assume that:

* `records.db` is a DuckDB database.
* It has a table called `public.users`.
* The S3 credentials are the same as in the example above.

The following command demonstrates how to copy data from a local DuckDB database to S3:

```sh
omniload ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'
```

This will result in a file structure like the following:

```
my_bucket/
└── records
    ├── _dlt_loads
    ├── _dlt_pipeline_state
    ├── _dlt_version
    └── users
        └── <load_id>.<file_id>.parquet
```

The values of `load_id` and `file_id` are determined at runtime. The default layout creates a folder named after the source table and places the data inside it as a Parquet file. This layout is configurable using the `layout` parameter.

For example, if you would like to create a Parquet file with the same name as the source table (as opposed to a folder), you can set `layout` to `{table_name}.{ext}` in the destination URI:

```sh
omniload ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?layout={table_name}.{ext}&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'
```

Result:

```
my_bucket/
└── records
    ├── _dlt_loads
    ├── _dlt_pipeline_state
    ├── _dlt_version
    └── users.parquet
```

The list of available layout variables can be found [here](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#available-layout-placeholders).

### Working with S3-Compatible object stores

`omniload` supports S3-compatible storage services like [Minio](https://min.io/), DigitalOcean [Spaces](https://www.digitalocean.com/products/spaces) and Cloudflare [R2](https://developers.cloudflare.com/r2/). You can set the `endpoint_url` in your URI to read from or write to these object stores.

For example, if you're running Minio on `localhost:9000`, you can read data from it:

```sh
omniload ingest \
    --source-uri 's3://?endpoint_url=http://localhost:9000&access_key_id=minioadmin&secret_access_key=minioadmin' \
    --source-table 'my_bucket/data.csv' \
    --dest-uri 'duckdb:///local.duckdb' \
    --dest-table 'public.my_data'
```

Or write data to it:

```sh
omniload ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?endpoint_url=http://localhost:9000&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'
```

### File Glob Pattern Examples

::: warning
Glob patterns only apply when loading data from S3 as a source.
:::

The `<path_to_files>` in the `--source-table` argument allows for flexible file selection.
Here are some common patterns and their descriptions:

| Pattern | Description |
| :------------------------------------------ | :----------------------------------------------------------------------------------------------------------- |
| `bucket/**/*.csv` | Retrieves all CSV files recursively from `s3://bucket`. |
| `bucket/*.csv` | Retrieves all CSV files located at the root level of `s3://bucket`. |
| `bucket/myFolder/**/*.jsonl` | Retrieves all JSONL files recursively from the `myFolder` directory and its subdirectories in `s3://bucket`. |
| `bucket/myFolder/mySubFolder/users.parquet` | Retrieves the specific `users.parquet` file from the `myFolder/mySubFolder/` path in `s3://bucket`. |
| `bucket/employees.jsonl` | Retrieves the `employees.jsonl` file located at the root level of `s3://bucket`. |

### Working with compressed files

`omniload` automatically detects and handles gzipped files in your S3 bucket. You can load data from compressed files with the `.gz` extension without any additional configuration.

For example, to load data from a gzipped CSV file:

```sh
omniload ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/logs/event-data.csv.gz' \
    --dest-uri duckdb:///compressed_data.duckdb \
    --dest-table 'logs.events'
```

You can also use glob patterns to load multiple compressed files:

```sh
omniload ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/logs/**/*.csv.gz' \
    --dest-uri duckdb:///compressed_data.duckdb \
    --dest-table 'logs.events'
```

### File type hinting

If your files are properly encoded but lack the correct file extension (CSV, JSONL, or Parquet), you can provide a file type hint to inform `omniload` about the format of the files. This is done by appending a fragment identifier (`#format`) to the end of the path in your `--source-table` parameter.
For example, if you have JSONL-formatted log files stored in S3 with a non-standard extension:

```
--source-table "my_bucket/logs/event-data#jsonl"
```

This tells `omniload` to process the files as JSONL, regardless of their actual extension.

Supported format hints include:

- `#csv` - For comma-separated values files with headers
- `#csv_headless` - For CSV files without headers
- `#jsonl` - For line-delimited JSON files
- `#parquet` - For Parquet format files

::: tip
File type hinting works with `gzip` compressed files as well.
:::

### CSV files without headers

For CSV files that don't have a header row, use the `#csv_headless` format hint. You can optionally provide column names using the `--columns` flag:

```sh
# With custom column names
omniload ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/data/raw-data.csv#csv_headless' \
    --columns "id:bigint,name:text,value:double" \
    --dest-uri duckdb:///local.duckdb \
    --dest-table 'public.raw_data'
```

If no column names are provided, columns will be automatically named `unknown_col_0`, `unknown_col_1`, etc.:

```sh
# Without column names (auto-generated)
omniload ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/data/raw-data.csv#csv_headless' \
    --dest-uri duckdb:///local.duckdb \
    --dest-table 'public.raw_data'
```
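To recap how a `--source-table` value on this page is put together (bucket name, then a file path or glob, then an optional `#format` hint), the sketch below splits such a value into its parts. It is only an illustration: `parse_source_table` is a hypothetical helper, it is not how `omniload` itself parses the argument, and it assumes the value contains at least one `/`.

```sh
# Hypothetical helper (not part of omniload): split a --source-table value
# into bucket, path, and optional format hint using POSIX parameter expansion.
# Assumes the value contains at least one '/'.
parse_source_table() {
    table=$1
    hint=""
    case "$table" in
        *"#"*)
            hint=${table##*#}    # text after the last '#'
            table=${table%#*}    # text before the last '#'
            ;;
    esac
    bucket=${table%%/*}          # text before the first '/'
    path=${table#*/}             # text after the first '/'
    printf 'bucket=%s path=%s hint=%s\n' "$bucket" "$path" "$hint"
}

# Values taken from the examples on this page.
parse_source_table 'my_bucket/logs/event-data#jsonl'
parse_source_table 'my_bucket/logs/**/*.csv.gz'
```

Quoting the argument, as in the examples throughout this page, keeps the shell from expanding the glob characters before the value reaches the command.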