Generic orchestration framework for data warehousing workloads using Amazon Redshift RSQL

Tens of thousands of customers run business-critical workloads on Amazon Redshift, AWS's fast, petabyte-scale cloud data warehouse delivering the best price-performance. With Amazon Redshift, you can query data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate AWS services like Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to take advantage of all the analytic capabilities in the AWS Cloud.

Amazon Redshift RSQL is a native command-line client for interacting with Amazon Redshift clusters and databases. You can connect to an Amazon Redshift cluster, describe database objects, query data, and view query results in various output formats. You can use Amazon Redshift RSQL to replace existing extract, transform, load (ETL) and automation scripts, such as Teradata BTEQ scripts. You can wrap Amazon Redshift RSQL statements within a shell script to replicate existing functionality in on-premises systems. Amazon Redshift RSQL is available for Linux, Windows, and macOS operating systems.

This post explains how you can create a generic configuration-driven orchestration framework using AWS Step Functions, Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, Amazon DynamoDB, and AWS Systems Manager to orchestrate RSQL-based ETL workloads. If you're migrating from legacy data warehouse workloads to Amazon Redshift, you can use this methodology to orchestrate your data warehousing workloads.

Solution overview

Customers migrating from legacy data warehouses to Amazon Redshift may have a significant investment in proprietary scripts like Basic Teradata Query (BTEQ) scripting for database automation, ETL, or other tasks. You can now use the AWS Schema Conversion Tool (AWS SCT) to automatically convert proprietary scripts like BTEQ scripts to Amazon Redshift RSQL scripts. The converted scripts run on Amazon Redshift with little to no changes. To learn about new options for database scripting, refer to Accelerate your data warehouse migration to Amazon Redshift – Part 4.

During such migrations, you may also want to modernize your current on-premises, third-party orchestration tools with a cloud-native framework to replicate and enhance your current orchestration capability. Orchestrating data warehouse workloads includes scheduling the jobs, checking if the pre-conditions have been met, running the business logic embedded within RSQL, monitoring the status of the jobs, and alerting if there are any failures.

This solution allows on-premises customers to migrate to a cloud-native orchestration framework that uses AWS serverless services such as Step Functions, Lambda, DynamoDB, and Systems Manager to run the Amazon Redshift RSQL jobs deployed on a persistent EC2 instance. You can also deploy the solution for greenfield implementations. In addition to meeting functional requirements, this solution also provides full auditing, logging, and monitoring of all ETL and ELT processes that are run.

To ensure high availability and resilience, you can use multiple EC2 instances that are part of an Auto Scaling group along with Amazon Elastic File System (Amazon EFS) to deploy and run the RSQL jobs. When using Auto Scaling groups, you can install RSQL onto the EC2 instance as part of the bootstrap script. You can also deploy the Amazon Redshift RSQL scripts onto the EC2 instance using AWS CodePipeline and AWS CodeDeploy. For more details, refer to Auto Scaling groups, the Amazon EFS User Guide, and Integrating CodeDeploy with Amazon EC2 Auto Scaling.

The following diagram illustrates the architecture of the orchestration framework.

Architecture Diagram

The key components of the framework are as follows:

  1. Amazon EventBridge is used as the ETL workflow scheduler, and it triggers a Lambda function at a preset schedule.
  2. The function queries a DynamoDB table for the configuration associated with the RSQL job and queries the status of the job, run mode, and restart information for that job.
  3. After receiving the configuration, the function triggers a Step Functions state machine by passing the configuration details.
  4. Step Functions starts running the different stages (like configuration iteration, run type check, and more) of the workflow.
  5. Step Functions uses the Systems Manager SendCommand API to trigger the RSQL job and goes into a paused state with a TaskToken. The RSQL scripts are persisted on an EC2 instance and are wrapped in a shell script. Systems Manager runs an AWS-RunShellScript SSM document to run the RSQL job on the EC2 instance (see the sketch after this list).
  6. The RSQL job performs ETL and ELT operations on the Amazon Redshift cluster. When it's complete, it returns a success/failure code and status message back to the calling shell script.
  7. The shell script calls a custom Python module with the success/failure code, status message, and the callback TaskToken that was received from Step Functions. The Python module logs the RSQL job status in the job audit DynamoDB table, and exports logs to the Amazon CloudWatch log group.
  8. The Python module then performs a SendTaskSuccess or SendTaskFailure API call based on the RSQL job run status. Based on the status of the RSQL job, Step Functions either resumes the flow or stops with failure.
  9. Step Functions logs the workflow status (success or failure) in the DynamoDB workflow audit table.
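
To make step 5 concrete, the following is a minimal sketch of the SendCommand call a Lambda step might make. The event shape and the wrapper script invocation are illustrative assumptions, not the repository's actual code.

# Hypothetical sketch of step 5: a Lambda step triggers the RSQL wrapper
# script over Systems Manager and forwards the Step Functions TaskToken.
import boto3

ssm = boto3.client("ssm")

def lambda_handler(event, context):
    # Injected by Step Functions when the task uses .waitForTaskToken
    task_token = event["token"]
    script_name = event["script"]

    response = ssm.send_command(
        InstanceIds=[event["ec2_instance_id"]],
        DocumentName="AWS-RunShellScript",
        Parameters={
            "commands": [
                # The wrapper forwards the token to the Python callback
                # module once the RSQL job finishes.
                f"sh /home/ec2-user/blog_test/instance_code/rsql_trigger.sh "
                f"{script_name} '{task_token}'"
            ]
        },
    )
    return {"ssm_command_id": response["Command"]["CommandId"]}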

Prerequisites

You should have the following prerequisites:

Deploy AWS CDK stacks

Complete the following steps to deploy your resources using the AWS CDK:

  1. Clone the GitHub repo:
    git clone https://github.com/aws-samples/amazon-redshift-rsql-orchestration-framework.git

  2. Update the following environment parameters in cdk.json (this file can be found in the infra directory):
    1. ec2_instance_id – The EC2 instance ID on which the RSQL jobs are deployed
    2. redshift_secret_id – The name of the Secrets Manager key that stores the Amazon Redshift database credentials
    3. rsql_script_path – The absolute directory path on the EC2 instance where the RSQL jobs are stored
    4. rsql_log_path – The absolute directory path on the EC2 instance used for storing the RSQL job logs
    5. rsql_script_wrapper – The absolute directory path of the RSQL wrapper script (rsql_trigger.sh) on the EC2 instance

    The following is a sample cdk.json file after being populated with the parameters:

        "atmosphere": {
          "ec2_instance_id" : "i-xxxx",
          "redshift_secret_id" : "blog-secret",
          "rsql_script_path" : "/house/ec2-user/blog_test/rsql_scripts/",
          "rsql_log_path" : "/house/ec2-user/blog_test/logs/",
          "rsql_script_wrapper" : "/house/ec2-user/blog_test/instance_code/rsql_trigger.sh"
        }
    

  3. Deploy the AWS CDK stack with the following code:
    cd amazon-redshift-rsql-orchestration-framework/lambdas/lambda-layer/
    sh zip_lambda_layer.sh
    cd ../../infra/
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    cdk bootstrap <AWS Account ID>/<AWS Region>
    cdk deploy --all

Let's look at the resources that the AWS CDK stack deploys in more detail.

CloudWatch log group

A CloudWatch log group (/ops/rsql-logs/) is created, which is used to store, monitor, and access log files from EC2 instances and other sources.

The log group is used to store the RSQL job run logs. For each RSQL script, all the stdout and stderr logs are stored as a log stream within this log group.
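
For illustration, the custom Python module could export a job's output with the CloudWatch Logs API along these lines; the log stream naming convention is an assumption, not the framework's actual choice.

# Hypothetical sketch: writing one RSQL job's output as a log stream
# under /ops/rsql-logs/; the stream name format is an assumption.
import time
import boto3

logs = boto3.client("logs")

def export_job_logs(job_name: str, run_id: str, output: str) -> None:
    stream_name = f"{job_name}/{run_id}"
    logs.create_log_stream(
        logGroupName="/ops/rsql-logs/", logStreamName=stream_name
    )
    logs.put_log_events(
        logGroupName="/ops/rsql-logs/",
        logStreamName=stream_name,
        logEvents=[{"timestamp": int(time.time() * 1000), "message": output}],
    )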

DynamoDB configuration table

The DynamoDB configuration table (rsql-blog-rsql-config-table) is the basic building block of this solution. All the RSQL jobs, restart information and run mode (sequential or parallel), and the sequence in which the jobs are to be run are stored in this configuration table.

The table has the following structure:

  • workflow_id – The identifier for the RSQL-based ETL workflow.
  • workflow_description – The description for the RSQL-based ETL workflow.
  • workflow_stages – The sequence of stages within a workflow.
  • execution_type – The type of run for RSQL jobs (sequential or parallel).
  • stage_description – The description for the stage.
  • scripts – The list of RSQL scripts to be run. The RSQL scripts must be placed in the location defined in a later step.

The following is an example of an entry in the configuration table. You can see the workflow_id is blog_test_workflow and the description is Test Workflow for Blog.

It has three stages that are triggered in the following order: Schema & Table Creation Stage, Data Insertion Stage 1, and Data Insertion Stage 2. The stage Schema & Table Creation Stage has two RSQL jobs running sequentially, and Data Insertion Stage 1 and Data Insertion Stage 2 each have two jobs running in parallel.

{
	"workflow_id": "blog_test_workflow",
	"workflow_description": "Take a look at Workflow for Weblog",
	"workflow_stages": [{
			"execution_flag": "y",
			"execution_type": "sequential",
			"scripts": [
				"rsql_blog_script_1.sh",
				"rsql_blog_script_2.sh"
			],
			"stage_description": "Schema & Desk Introduction Level"
		},
		{
			"execution_flag": "y",
			"execution_type": "parallel",
			"scripts": [
				"rsql_blog_script_3.sh",
				"rsql_blog_script_4.sh"
			],
			"stage_description": "Knowledge Insertion Level 1"
		},
		{
			"execution_flag": "y",
			"execution_type": "parallel",
			"scripts": [
				"rsql_blog_script_5.sh",
				"rsql_blog_script_6.sh"
			],
			"stage_description": "Knowledge Insertion Level 2"
		}
	]
}
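
If you prefer to load such an entry programmatically rather than through the console, a boto3 call along the following lines would work; it assumes workflow_id is the table's partition key.

# Hypothetical sketch: inserting the example configuration entry with
# boto3; assumes workflow_id is the partition key of the config table.
import boto3

config_table = boto3.resource("dynamodb").Table("rsql-blog-rsql-config-table")

config_table.put_item(
    Item={
        "workflow_id": "blog_test_workflow",
        "workflow_description": "Test Workflow for Blog",
        "workflow_stages": [
            {
                "execution_flag": "y",
                "execution_type": "sequential",
                "scripts": ["rsql_blog_script_1.sh", "rsql_blog_script_2.sh"],
                "stage_description": "Schema & Table Creation Stage",
            },
            # ... the two parallel stages follow the same shape
        ],
    }
)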

DynamoDB audit tables

The audit tables store the run details for each RSQL job within the ETL workflow, with a unique identifier for monitoring and reporting purposes. There are two audit tables because one table stores the audit information at the RSQL job level and the other stores it at the workflow level.

The job audit table (rsql-blog-rsql-job-audit-table) has the following structure:

  • job_name – The name of the RSQL script
  • workflow_execution_id – The run ID for the workflow
  • execution_start_ts – The start timestamp for the RSQL job
  • execution_end_ts – The end timestamp for the RSQL job
  • execution_status – The run status of the RSQL job (Running, Completed, Failed)
  • instance_id – The EC2 instance ID on which the RSQL job is run
  • ssm_command_id – The Systems Manager command ID used to trigger the RSQL job
  • workflow_id – The workflow_id under which the RSQL job is run

The workflow audit table (rsql-blog-rsql-workflow-audit-table) has the following structure:

  • workflow_execution_id – The run ID for the workflow
  • workflow_id – The identifier for a particular workflow
  • execution_start_ts – The start timestamp for the workflow
  • execution_status – The run status of the workflow or state machine (Running, Completed, Failed)
  • rsql_jobs – The list of RSQL scripts that are part of the workflow
  • execution_end_ts – The end timestamp for the workflow
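
For monitoring or reporting, a run's status can then be read back from the workflow audit table; the following sketch assumes workflow_execution_id is the table's partition key.

# Hypothetical sketch: reading a workflow run's status; assumes
# workflow_execution_id is the partition key of the workflow audit table.
import boto3

audit_table = boto3.resource("dynamodb").Table(
    "rsql-blog-rsql-workflow-audit-table"
)

def workflow_status(workflow_execution_id: str) -> str:
    item = audit_table.get_item(
        Key={"workflow_execution_id": workflow_execution_id}
    ).get("Item", {})
    return item.get("execution_status", "NOT_FOUND")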

Lambda functions

The AWS CDK creates the Lambda functions that retrieve the config data from the DynamoDB config table, update the audit details in DynamoDB, trigger the RSQL scripts on the EC2 instance, and iterate through each stage. The following is a list of the functions:

  • rsql-blog-master-iterator-lambda
  • rsql-blog-parallel-load-check-lambda
  • rsql-blog-sequential-iterator-lambda
  • rsql-blog-rsql-invoke-lambda
  • rsql-blog-update-audit-ddb-lambda

Step Functions state machines

This solution implements a Step Functions callback task integration pattern that enables Step Functions workflows to send a token to an external system via multiple AWS services.
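
In AWS CDK (Python) terms, such a callback task is declared with the waitForTaskToken service integration. The following is a minimal sketch; the construct IDs, payload fields, and function lookup are illustrative assumptions, not the repository's actual definitions.

# Hypothetical CDK (Python, aws-cdk-lib v2) sketch of a callback task;
# construct IDs and the referenced Lambda function are illustrative.
from aws_cdk import Stack
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks
from constructs import Construct

class RsqlWorkflowStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        invoke_fn = _lambda.Function.from_function_name(
            self, "InvokeFn", "rsql-blog-rsql-invoke-lambda"
        )

        # The state machine pauses here until the Python module on the
        # EC2 instance calls SendTaskSuccess or SendTaskFailure.
        trigger_job = tasks.LambdaInvoke(
            self, "TriggerRSQLJob",
            lambda_function=invoke_fn,
            integration_pattern=sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
            payload=sfn.TaskInput.from_object({
                "token": sfn.JsonPath.task_token,
                "script": sfn.JsonPath.string_at("$.script"),
            }),
        )

        sfn.StateMachine(self, "ParallelStateMachine", definition=trigger_job)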

The AWS CDK deploys the following state machines:

  • RSQLParallelStateMachine – The parallel state machine is triggered if the execution_type for a stage in the configuration table is set to parallel. The Lambda function with a callback token is triggered in parallel for each of the RSQL scripts using a Map state.
  • RSQLSequentialStateMachine – The sequential state machine is triggered if the execution_type for a stage in the configuration table is set to sequential. This state machine uses an iterator design pattern to run each RSQL job within the stage per the sequence specified in the configuration.
  • RSQLMasterStatemachine – The primary state machine iterates through each stage and triggers different state machines based on the run mode (sequential or parallel) using a Choice state.

Move the RSQL script and instance code

Copy the instance_code and rsql_scripts directories (present in the GitHub repo) to the EC2 instance. Make sure the framework directory within instance_code is copied as well.

The following screenshots show that the instance_code and rsql_scripts directories are copied to the same parent folder on the EC2 instance.

Instance Code Scripts Image
Instance Code EC2 Copy Image

RSQL Script Image
RSQL Script EC2 Copy Image

RSQL script run workflow

To further illustrate the mechanism used to run the RSQL scripts, see the following diagram.

RSQL Script Workflow Diagram

The Lambda function that gets the configuration details from the configuration DynamoDB table triggers the Step Functions workflow, which performs the following steps:

  1. A Lambda function defined as a workflow step receives the Step Functions TaskToken and configuration details.
  2. The TaskToken and configuration details are passed to the EC2 instance using the Systems Manager SendCommand API call. After the Lambda function runs, the workflow branch goes into a paused state and waits for a callback token.
  3. The RSQL scripts are run on the EC2 instance and perform ETL and ELT on Amazon Redshift. After the scripts are run, the RSQL script passes the completion status and TaskToken to a Python script. This Python script is embedded within the RSQL script.
  4. The Python script updates the RSQL job status (success/failure) in the job audit DynamoDB table. It also exports the RSQL job logs to the CloudWatch log group.
  5. The Python script passes the RSQL job status (success/failure) and the status message back to the Step Functions workflow along with the TaskToken using the SendTaskSuccess or SendTaskFailure API call (see the sketch after this list).
  6. Depending on the job status received, Step Functions either resumes the workflow or stops the workflow with a failure.
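
The callback half of this flow (steps 4 and 5) might look like the following sketch; the audit table's key schema and the output format are assumptions, not the repository's actual module.

# Hypothetical sketch of the callback module (steps 4 and 5); assumes the
# job audit table is keyed on job_name plus workflow_execution_id.
import datetime
import json
import boto3

sfn_client = boto3.client("stepfunctions")
job_audit = boto3.resource("dynamodb").Table("rsql-blog-rsql-job-audit-table")

def report_job_result(job_name: str, workflow_execution_id: str,
                      task_token: str, succeeded: bool, message: str) -> None:
    # Step 4: record the terminal status in the job audit table.
    job_audit.update_item(
        Key={"job_name": job_name,
             "workflow_execution_id": workflow_execution_id},
        UpdateExpression="SET execution_status = :s, execution_end_ts = :t",
        ExpressionAttributeValues={
            ":s": "Completed" if succeeded else "Failed",
            ":t": datetime.datetime.utcnow().isoformat(),
        },
    )
    # Step 5: resume or fail the paused Step Functions branch.
    if succeeded:
        sfn_client.send_task_success(
            taskToken=task_token, output=json.dumps({"status": message})
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token, error="RSQLJobFailed", cause=message
        )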

If EC2 Auto Scaling groups are used, you can use the Systems Manager SendCommand to ensure resilience and high availability by specifying one or more EC2 instances that are part of the Auto Scaling group. For more information, refer to Run commands at scale.

When multiple EC2 instances are used, set the max-concurrency parameter of the RunCommand API call to 1, which makes sure that the RSQL job is triggered on only one EC2 instance. For further details, refer to Using concurrency controls.
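
For example, a tag-targeted SendCommand with concurrency capped at one instance might look like the following sketch; the tag key and value are assumptions.

# Hypothetical sketch: targeting instances in an Auto Scaling group by tag
# while capping concurrency to one instance at a time; the tag is assumed.
import boto3

ssm = boto3.client("ssm")

ssm.send_command(
    Targets=[{"Key": "tag:Role", "Values": ["rsql-worker"]}],
    DocumentName="AWS-RunShellScript",
    MaxConcurrency="1",  # run the command on one instance at a time
    MaxErrors="0",       # stop dispatching if a run fails
    Parameters={
        # A real invocation would also pass the script name and TaskToken.
        "commands": ["sh /home/ec2-user/blog_test/instance_code/rsql_trigger.sh"]
    },
)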

Run the orchestration framework

To run the orchestration framework, complete the following steps:

  1. On the DynamoDB console, navigate to the configuration table and insert the configuration details provided earlier. For instructions on how to insert the example JSON configuration details, refer to Write data to a table using the console or AWS CLI.
    DynamoDB Config Insertion
  2. On the Lambda console, open the rsql-blog-rsql-workflow-trigger-lambda function and choose Test.
    Workflow Trigger Lambda Function
  3. Add a test event similar to the following code and choose Test:
    {
    	"workflow_id": "blog_test_workflow",
    	"workflow_execution_id": "demo_test_26"
    }

    Workflow Trigger Lambda function Payload

  4. On the Step Functions console, navigate to the rsql-master-state-machine state machine to open the details page.
    RSQL Master Step Function
  5. Choose Edit, then choose Workflow Studio New. The following screenshot shows the main state machine.
    RSQL Master Step Function Flow
  6. Choose Cancel to leave Workflow Studio, then choose Cancel again to leave edit mode. You're directed back to the details page.
    RSQL Master Step Function Details
  7. On the Executions tab, choose the latest run.
    RSQL Master Step Function Execution
  8. From the Graph view, you can check the status of each state by choosing it. Every state that uses an external resource has a link to it on the Details tab.
    RSQL Master Step Function Execution Graph
  9. The orchestration framework runs the ETL load, which consists of the following sample RSQL scripts:
    • rsql_blog_script_1.sh – This script creates a schema rsql_blog within the database
    • rsql_blog_script_2.sh – This script creates a table blog_table within the schema created in the preceding script
    • rsql_blog_script_3.sh – Inserts one row into the table created in the previous script
    • rsql_blog_script_4.sh – Inserts one row into the table created in the previous script
    • rsql_blog_script_5.sh – Inserts one row into the table created in the previous script
    • rsql_blog_script_6.sh – Inserts one row into the table created in the previous script

Replace these RSQL scripts with the RSQL scripts developed for your workloads by inserting the relevant configuration details into the configuration DynamoDB table (rsql-blog-rsql-config-table).

Validation

After you run the framework, you'll find a schema (called rsql_blog) with one table (called blog_table) created. This table consists of four rows.

RSQL Execution Table

You can check the logs of the RSQL jobs in the CloudWatch log group (/ops/rsql-logs/), as well as the run status of the workflow in the workflow audit DynamoDB table (rsql-blog-rsql-workflow-audit-table).

RSQL Script CloudWatch Logs
RSQL Workflow Audit Record

Clean up

To avoid ongoing charges for the resources that you created, delete them. The AWS CDK deletes all resources except data resources such as DynamoDB tables.

  • First, delete all the AWS CDK stacks (for example, with cdk destroy --all)
  • On the DynamoDB console, select the following tables and delete them:
    • rsql-blog-rsql-config-table
    • rsql-blog-rsql-job-audit-table
    • rsql-blog-rsql-workflow-audit-table

Conclusion

You can use Amazon Redshift RSQL, Systems Manager, EC2 instances, and Step Functions to create a modern and cost-effective orchestration framework for ETL workflows. There is no overhead to create and manage different state machines for each of your ETL workflows. In this post, we demonstrated how to use this configuration-based generic orchestration framework to trigger complex RSQL-based ETL workflows.

You can also trigger an email notification through Amazon Simple Notification Service (Amazon SNS) within the state machine to notify the operations team of the completion status of the ETL process. Further, you can achieve an event-driven ETL orchestration framework by using EventBridge to start the workflow trigger Lambda function.


About the Authors

Akhil is a Data Analytics Consultant at AWS Professional Services. He helps customers design and build scalable data analytics solutions and migrate data pipelines and data warehouses to AWS. In his spare time, he loves traveling, playing games, and watching movies.


Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.

Raza Hafeez is a Senior Data Architect within the Shared Delivery Practice of AWS Professional Services. He has over 12 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Dipal Mahajan is a Lead Consultant with Amazon Web Services based out of India, where he guides global customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings extensive experience in software development, architecture, and analytics from industries such as finance, telecom, retail, and healthcare.
