Ac docs #187

Merged · 3 commits · Nov 14, 2024
File renamed without changes
90 changes: 90 additions & 0 deletions docs/s3_integration/configuring_s3.md
@@ -0,0 +1,90 @@
# Configuring S3 for Pipeline Execution

To integrate AWS S3 into your CGAT pipeline, you need to configure S3 access so that the pipeline can read and write data in your buckets. This document explains how to set up that configuration for the CGAT pipelines.

## Overview

`configure_s3()` is a utility function provided by the CGATcore pipeline tools to handle authentication and access to AWS S3. This function allows you to provide credentials, specify regions, and set up other configurations that enable seamless integration of S3 into your workflow.

### Basic Configuration

To get started, you will need to import and use the `configure_s3()` function. Here is a basic example:

```python
from cgatcore.pipeline import configure_s3

configure_s3(aws_access_key_id="YOUR_AWS_ACCESS_KEY", aws_secret_access_key="YOUR_AWS_SECRET_KEY")
```

### Configurable Parameters

- **`aws_access_key_id`**: Your AWS access key, used to authenticate and identify the user.
- **`aws_secret_access_key`**: Your secret key, corresponding to your access key.
- **`region_name`** (optional): AWS region where your S3 bucket is located. Defaults to the region set in your environment, if available.
- **`profile_name`** (optional): Name of the AWS profile to use if you have multiple profiles configured locally.
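
These parameters can be combined in a single call. A minimal sketch, using placeholder keys and an arbitrary region:

```python
from cgatcore.pipeline import configure_s3

# Placeholder credentials and region; substitute your own values, or use a
# profile or environment variables as described below.
configure_s3(
    aws_access_key_id="YOUR_AWS_ACCESS_KEY",
    aws_secret_access_key="YOUR_AWS_SECRET_KEY",
    region_name="eu-west-2",
)
```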

### Using AWS Profiles

If you have multiple AWS profiles configured locally, you can use the `profile_name` parameter to select the appropriate one without hardcoding the access keys in your code:

```python
configure_s3(profile_name="my-profile")
```

### Configuring Endpoints

To use a custom endpoint, for example when working with MinIO or another S3-compatible service:

```python
configure_s3(
    aws_access_key_id="YOUR_AWS_ACCESS_KEY",
    aws_secret_access_key="YOUR_AWS_SECRET_KEY",
    endpoint_url="https://custom-endpoint.com"
)
```

### Security Recommendations

1. **Environment Variables**: Set credentials through environment variables rather than hardcoding them in your scripts, which avoids accidentally exposing them (a Python sketch that reads these variables follows this list):

    ```bash
    export AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY
    export AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_KEY
    ```

2. **AWS IAM Roles**: If you are running the pipeline on AWS infrastructure (such as EC2 instances), it's recommended to use IAM roles. These roles provide temporary security credentials that are automatically rotated by AWS.
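
A minimal sketch of passing those environment variables through to `configure_s3()`, assuming the variables from the snippet above are exported in your shell:

```python
import os

from cgatcore.pipeline import configure_s3

# Read the credentials exported in the shell, so no secrets appear in the
# pipeline code itself.
configure_s3(
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
```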

### Example Pipeline Integration

After configuring S3, you can seamlessly use the S3-aware methods within your pipeline. Below is an example:

```python
from ruffus import suffix  # Ruffus pattern helper used by s3_transform below
from cgatcore.pipeline import configure_s3, get_s3_pipeline

# Configure S3 access
configure_s3(profile_name="my-profile")

# Instantiate the S3 pipeline
s3_pipeline = get_s3_pipeline()

# Use S3-aware methods in the pipeline
@s3_pipeline.s3_transform("s3://my-bucket/input.txt", suffix(".txt"), ".processed")
def process_s3_file(infile, outfile):
    # Read the input, apply a simple transformation and write the result
    with open(infile, 'r') as fin:
        data = fin.read()
    processed_data = data.upper()
    with open(outfile, 'w') as fout:
        fout.write(processed_data)
```

### Summary

- Use the `configure_s3()` function to set up AWS credentials and S3 access.
- Options are available to use IAM roles, profiles, or custom endpoints.
- Use the S3-aware decorators to integrate S3 files seamlessly in your pipeline.

## Additional Resources

- [AWS IAM Roles Documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html)
- [AWS CLI Configuration and Credential Files](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
5 changes: 5 additions & 0 deletions docs/s3_integration/s3_decorators.md
@@ -0,0 +1,5 @@
# CGATcore S3 decorators

::: cgatcore.pipeline
    :members:
    :show-inheritance:
74 changes: 74 additions & 0 deletions docs/s3_integration/s3_pipeline.md
@@ -0,0 +1,74 @@
# S3 Pipeline

The `S3Pipeline` class is part of the integration with AWS S3, enabling seamless data handling in CGAT pipelines that use both local files and S3 storage. This is particularly useful when working with large datasets that are better managed in cloud storage or when collaborating across multiple locations.

## Overview

`S3Pipeline` provides the following functionalities:

- Integration of AWS S3 into CGAT pipeline workflows
- Lazy-loading of S3-specific classes and functions to avoid circular dependencies
- Facilitating operations on files that reside on S3, making it possible to apply transformations and merges without copying everything locally

### Example Usage

The `S3Pipeline` class can be accessed through the `get_s3_pipeline()` function, which returns an instance that is lazy-loaded to prevent issues related to circular imports. Below is an example of how to use it:

```python
from cgatcore.pipeline import get_s3_pipeline

# Instantiate the S3 pipeline
s3_pipeline = get_s3_pipeline()

# Use methods from s3_pipeline as needed
s3_pipeline.s3_transform(...)
```

### Building a Function Using `S3Pipeline`

To build a function that uses `S3Pipeline`, follow these steps to create a task that applies the `s3_transform` method to data held on S3:

1. **Import the required modules**: First, import `get_s3_pipeline` from `cgatcore.pipeline`.
2. **Instantiate the pipeline**: Use `get_s3_pipeline()` to create an instance of `S3Pipeline`.
3. **Define your function**: Use the S3-aware methods like `s3_transform()` to perform the desired operations on your S3 files.

#### Example Function

```python
from ruffus import suffix  # Ruffus pattern helper used by s3_transform below
from cgatcore.pipeline import get_s3_pipeline

# Instantiate the S3 pipeline
s3_pipeline = get_s3_pipeline()

# Define a function that uses s3_transform
def process_s3_data(input_s3_path, output_s3_path):
    @s3_pipeline.s3_transform(input_s3_path, suffix(".txt"), output_s3_path)
    def transform_data(infile, outfile):
        # Read the input, apply an example transformation and write the result
        with open(infile, 'r') as fin:
            data = fin.read()
        processed_data = data.upper()
        with open(outfile, 'w') as fout:
            fout.write(processed_data)

    # Run the transformation
    transform_data()
```

### Methods in `S3Pipeline`

- **`s3_transform(*args, **kwargs)`**: Perform a transformation on data stored in S3, similar to Ruffus `transform()` but adapted for S3 files.
- **`s3_merge(*args, **kwargs)`**: Merge multiple input files into one, allowing the files to be located on S3.
- **`s3_split(*args, **kwargs)`**: Split input data into smaller chunks, enabling parallel processing, even when the input resides on S3.
- **`s3_originate(*args, **kwargs)`**: Create new files directly in S3.
- **`s3_follows(*args, **kwargs)`**: Indicate a dependency on another task, ensuring correct task ordering even for S3 files.

These methods are intended to be directly equivalent to standard Ruffus methods, allowing pipelines to easily mix and match S3-based and local operations.
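
As an illustration, here is a minimal sketch of `s3_merge()`, assuming its signature mirrors Ruffus `merge()` (a list of inputs followed by a single output); the bucket paths are placeholders:

```python
from cgatcore.pipeline import get_s3_pipeline

s3_pipeline = get_s3_pipeline()

# Combine two S3 inputs into a single local output file. Assumes s3_merge()
# takes (inputs, output) like Ruffus merge(); the paths below are placeholders.
@s3_pipeline.s3_merge(
    ["s3://my-bucket/sample1.counts", "s3://my-bucket/sample2.counts"],
    "combined.counts",
)
def combine_counts(infiles, outfile):
    with open(outfile, "w") as fout:
        for infile in infiles:
            with open(infile, "r") as fin:
                fout.write(fin.read())
```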

## Why Use `S3Pipeline`?

- **Scalable Data Management**: Keeps large datasets in the cloud, reducing local storage requirements.
- **Seamless Integration**: Provides a drop-in replacement for standard decorators, enabling hybrid workflows involving both local and cloud files.
- **Lazy Loading**: Optimised to initialise S3 components only when they are needed, minimising overhead and avoiding unnecessary dependencies.
