# Amazon S3

Amazon Simple Storage Service (S3) is a scalable cloud storage service offered by Amazon Web Services (AWS). It allows users to store and retrieve extensive amounts of data from anywhere on the web.

`omniload` supports Amazon S3 as both a data source and destination.

## URI Format

The URI for connecting to Amazon S3 is structured as follows:

```text
s3://?access_key_id=<your_access_key_id>&secret_access_key=<your_secret_access_key>
```

**URI Parameters:**

*   `access_key_id`: Your AWS access key ID.
*   `secret_access_key`: Your AWS secret access key.
*   `endpoint_url`: URL of an S3-Compatible API Server (optional, for S3-compatible storage like Minio)
*   `layout`: Layout template (optional, destination only)

These credentials are required to authenticate and authorize access to your S3 buckets.

The `--source-table` parameter specifies the S3 bucket and file pattern using the following format:

```
<bucket-name>/<file-glob-pattern>
```

## Setting up an S3 Integration

To integrate `omniload` with Amazon S3, you need an `access_key_id` and a `secret_access_key`. For guidance on obtaining these credentials, refer to the dltHub documentation on [AWS credentials](https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem#get-credentials).

Once you have your credentials, you can configure the S3 URI. The `bucket_name` and `path_to_files` (file glob pattern) are specified in the `--source-table` argument.

### Example: Loading data from S3

Let's assume the following details:
*   `access_key_id`: `AKC3YOW7E`
*   `secret_access_key`: `XCtkpL5B`
*   S3 bucket name: `my_bucket`
*   Path to files within the bucket: `students/students_details.csv`

The following command demonstrates how to copy data from the specified S3 location to a DuckDB database:

```sh
omniload ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/students/students_details.csv' \
    --dest-uri duckdb:///s3_data.duckdb \
    --dest-table 'processed_students.student_details'
```

This command will create a table named `student_details` within the `processed_students` schema (or equivalent grouping) in the DuckDB database file located at `s3_data.duckdb`.

### Example: Uploading data to S3
For this, example we'll assume that:
* `records.db` is a duckdb database.
* has a table called `public.users`.
* the S3 credentials are the same as the example above.

The following command demonstrates how to copy data from a local duckdb database to S3:
```sh
omniload ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'
```

This will result in a file structure like the following:
```
my_bucket/
└── records
    ├── _dlt_loads
    ├── _dlt_pipeline_state
    ├── _dlt_version
    └── users
        └── <load_id>.<file_id>.parquet
```

The value of `load_id` and `file_id` is determined at runtime. The default layout creates a folder with the same table name as the source and places the data inside a parquet file. This layout is configurable using the `layout` parameter.

For example, if you would like to create a parquet file with the same name as the source table (as opposed to a folder) you can set `layout` to `{table_name}.{ext}` in the commandline above:

```sh
omniload ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?layout={table_name}.{ext}&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \ 
    --dest-table 'my_bucket/records'
```

Result:
```
my_bucket/
└── records
    ├── _dlt_loads
    ├── _dlt_pipeline_state
    ├── _dlt_version
    └── users.parquet
```

List of available Layout variables is available [here](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#available-layout-placeholders)

### Working with S3-Compatible object stores
`omniload` supports S3-compatible storage services like [Minio](https://min.io/), Digital Ocean [Spaces](https://www.digitalocean.com/products/spaces) and Cloudflare [R2](https://developers.cloudflare.com/r2/). You can set the `endpoint_url` in your URI to read from or write to these object stores.

For example, if you're running Minio on `localhost:9000`, you can read data from it:
```sh
omniload ingest \
    --source-uri 's3://?endpoint_url=http://localhost:9000&access_key_id=minioadmin&secret_access_key=minioadmin' \
    --source-table 'my_bucket/data.csv' \
    --dest-uri 'duckdb:///local.duckdb' \
    --dest-table 'public.my_data'
```

Or write data to it:
```sh
omniload ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?endpoint_url=http://localhost:9000&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'
```

### File Glob Pattern Examples:

::: warning
Glob patterns only apply when loading data from S3 as source.
:::

The `<file-glob-pattern>` in the `--source-table` argument allows for flexible file selection. Here are some common patterns and their descriptions:

| Pattern                                     | Description                                                                                                |
| :------------------------------------------ | :--------------------------------------------------------------------------------------------------------- |
| `bucket/**/*.csv`                           | Retrieves all CSV files recursively from `s3://bucket`.                                                    |
| `bucket/*.csv`                              | Retrieves all CSV files located at the root level of `s3://bucket`.                                        |
| `bucket/myFolder/**/*.jsonl`                | Retrieves all JSONL files recursively from the `myFolder` directory and its subdirectories in `s3://bucket`. |
| `bucket/myFolder/mySubFolder/users.parquet` | Retrieves the specific `users.parquet` file from the `myFolder/mySubFolder/` path in `s3://bucket`.        |
| `bucket/employees.jsonl`                    | Retrieves the `employees.jsonl` file located at the root level of the `s3://bucket`.                       |


### Working with compressed files

`omniload` automatically detects and handles gzipped files in your S3 bucket. You can load data from compressed files with the `.gz` extension without any additional configuration.

For example, to load data from a gzipped CSV file:

```sh
omniload ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/logs/event-data.csv.gz' \
    --dest-uri duckdb:///compressed_data.duckdb \
    --dest-table 'logs.events'
```

You can also use glob patterns to load multiple compressed files:

```sh
omniload ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/logs/**/*.csv.gz' \
    --dest-uri duckdb:///compressed_data.duckdb \
    --dest-table 'logs.events'
```

### File type hinting

If your files are properly encoded but lack the correct file extension (CSV, JSONL, or Parquet), you can provide a file type hint to inform `omniload` about the format of the files. This is done by appending a fragment identifier (`#format`) to the end of the path in your `--source-table` parameter.

For example, if you have JSONL-formatted log files stored in S3 with a non-standard extension:

```
--source-table "my_bucket/logs/event-data#jsonl"
```

This tells `omniload` to process the files as JSONL, regardless of their actual extension.

Supported format hints include:
- `#csv` - For comma-separated values files with headers
- `#csv_headless` - For CSV files without headers
- `#jsonl` - For line-delimited JSON files
- `#parquet` - For Parquet format files

::: tip
File type hinting works with `gzip` compressed files as well.
:::

### CSV files without headers

For CSV files that don't have a header row, use the `#csv_headless` format hint. You can optionally provide column names using the `--columns` flag:

```sh
# With custom column names
omniload ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/data/raw-data.csv#csv_headless' \
    --columns "id:bigint,name:text,value:double" \
    --dest-uri duckdb:///local.duckdb \
    --dest-table 'public.raw_data'
```

If no column names are provided, columns will be automatically named `unknown_col_0`, `unknown_col_1`, etc.:

```sh
# Without column names (auto-generated)
omniload ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/data/raw-data.csv#csv_headless' \
    --dest-uri duckdb:///local.duckdb \
    --dest-table 'public.raw_data'
```