Implementing a Data Lakehouse Architecture in AWS — Part 4 of 4

Jonathan Reis
Published in White Prompt Blog · 6 min read · Jun 21, 2022


Intro

In our previous blog post, part 3 of this series, we walked through processing the AWS sample Tickit dataset. We used EMR to move it from the Raw zone to the Trusted zone, fully integrated with AWS Glue (Data Catalog and Schema Registry). To access and aggregate this data, we used Redshift Spectrum and then created Data Marts inside the Redshift core.

In this blog post, the last of our series, we will deploy Apache Superset, a widely used open-source data exploration and visualization platform. As data sources, we will use the Amazon Athena and Amazon Redshift resources we set up previously.

Photo by Luke Chesser on Unsplash

Journey

From part 1 up until now, the AWS resource provisioning has been done with Terraform. Now we will also use another tool, one that helps create machine images with pre-installed software requirements. This tool is called Packer, and it follows the same Infrastructure as Code (IaC) approach.

Below, we list each resource used and its role in this context.

  • Amazon EC2 — Compute instance, launched from the Packer-built AMI, that hosts the Apache Superset platform;
  • Docker — Platform for building, deploying, and managing containerized applications;
  • Docker Compose — Tool for defining and running multi-container Docker applications;
  • Packer — Tool for creating identical machine images for multiple platforms from a single source configuration;
  • Apache Superset — Modern data exploration and visualization platform, where we’ll explore data and create some charts and dashboards;
  • Amazon Athena — Connected to AWS Glue, it provides the query capability to access data previously processed by the EMR cluster;
  • Amazon Redshift — Cluster previously configured and used with Redshift Spectrum to access data from S3 buckets, hosting some Data Marts in its core.

The diagram below illustrates the proposed solution’s architectural design.

Solution architecture

Proposed Environment

AWS Service Creation

To start the deployment, we will first create the AMI that will be used to run the Apache Superset platform. Go ahead and clone the repository provided with the solution, go to the infrastructure folder, and execute the build-ami.sh script. This script downloads the Packer and Terraform binaries from their sources and automatically builds the AMI with all the requirements needed to run Apache Superset.
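For reference, the end-to-end commands look roughly like this; the repository URL is a placeholder, while the folder and script names come from the steps above:

```bash
# Clone the solution repository (URL is a placeholder; use the one provided with this series)
git clone https://github.com/<your-org>/<lakehouse-repo>.git
cd <lakehouse-repo>/infrastructure

# Build the Apache Superset AMI; the script downloads Packer and Terraform and runs the build
chmod +x build-ami.sh
./build-ami.sh
```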

As we can see in the picture below, the script validates the Packer definition and builds the AMI.

Packer build
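Under the hood, the script runs something equivalent to the following Packer commands; the template file name here is illustrative, so check the repository for the actual one:

```bash
# Validate the Packer template (the file name is an assumption, for illustration only)
packer validate superset-ami.pkr.hcl

# Build the AMI from the validated template
packer build superset-ami.pkr.hcl
```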

After a couple of minutes, the process finishes and shows some important details about the AMI, such as the region where it was built and its ID.

Packer build
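If you need to look the AMI up again later, the AWS CLI can list the images owned by your account; the name filter below is an assumption about how the Packer template names the image:

```bash
# List account-owned AMIs whose name starts with "superset" (the filter value is an assumption)
aws ec2 describe-images --owners self \
  --filters "Name=name,Values=superset*" \
  --query 'Images[].{Id:ImageId,Name:Name,Created:CreationDate}' \
  --output table
```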

Now we will need to initialize Terraform by running terraform init; this command will generate a directory named .terraform and download each module source declared in the main.tf file.

Another useful command is terraform validate, which checks that the Terraform code is valid before planning.

Following best practices, always run terraform plan -out=superset-stack-plan and review the output before creating or changing any resources.

After reviewing the plan, it’s possible to safely apply the changes by running terraform apply “superset-stack-plan”; Terraform applies exactly the changes recorded in the saved plan. Once the EC2 instance is created from the AMI built by Packer, the output will include the URL to access Apache Superset.

Terraform build
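Putting the whole Terraform workflow together, the sequence looks like this, assuming it is run from the same infrastructure folder that holds the Terraform code:

```bash
# Download providers and the module sources declared in main.tf
terraform init

# Check that the configuration is syntactically valid and internally consistent
terraform validate

# Write the execution plan to a file so that exactly this plan gets applied
terraform plan -out=superset-stack-plan

# Apply the saved plan; the outputs include the Apache Superset URL
terraform apply "superset-stack-plan"
```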

To log in and start creating charts and dashboards in Superset, use admin as both the username and the password.

Apache Superset — Login Page

Data source configuration

To start using Apache Superset, we’ll need to connect it to Amazon Athena and Amazon Redshift. For complete guidance on installing the client libraries needed to access both data sources, follow the steps in the official Superset documentation. Our Docker image was pre-built with those requirements.
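For reference, the drivers listed in the Superset documentation for these two sources are PyAthena (for Athena) and sqlalchemy-redshift (for Redshift). A rough sketch of how they would be installed, assuming the standard docker-compose layout with a local requirements file, looks like this; our image already ships with the equivalent:

```bash
# Client libraries for the two data sources, as listed in the Superset documentation
pip install "PyAthena[SQLAlchemy]" sqlalchemy-redshift

# In a docker-compose based Superset setup, the same packages are usually added to a
# local requirements file baked into the image (the path follows the common convention)
echo "PyAthena[SQLAlchemy]" >> docker/requirements-local.txt
echo "sqlalchemy-redshift" >> docker/requirements-local.txt
```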

After logging in, access the menu Data > Databases and click the +Database button on the right side. In the dropdown list, select Amazon Athena as the source. The next screen asks for the connection string, in the format defined by the documentation. Go ahead and test your connection by clicking the TEST CONNECTION button. A message will confirm that the connection looks good; click the CONNECT button to save.

Apache Superset — New data source
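The connection string follows the SQLAlchemy URI format from the Superset documentation for Athena; every value below is a placeholder, and when the EC2 instance has an IAM role with Athena access, the access key pair can usually be left out:

```bash
# SQLAlchemy URI format for Amazon Athena (PyAthena over the REST API); replace the placeholders
# Note: the s3_staging_dir value (an S3 URL for query results) must be URL-encoded
SQLALCHEMY_URI="awsathena+rest://{aws_access_key_id}:{aws_secret_access_key}@athena.{region}.amazonaws.com/{schema_name}?s3_staging_dir={s3_staging_dir}"
```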

To add the Amazon Redshift connection, follow the same process used for Amazon Athena.

Apache Superset — New data source
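The Redshift connection string uses the sqlalchemy-redshift dialect; 5439 is the default Redshift port, and the remaining values are placeholders for the cluster created earlier in this series:

```bash
# SQLAlchemy URI format for Amazon Redshift (psycopg2 driver); replace the placeholders
SQLALCHEMY_URI="redshift+psycopg2://{user}:{password}@{cluster_endpoint}:5439/{database}"
```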

The screen below shows that both data sources are registered and ready to use.

Apache Superset — Data source list

Charts and Dashboard creation

To start creating the charts that will be used within the dashboards, we first need to create datasets backed by the right SQL queries. Pictured below is the list of all datasets created.

Apache Superset — Dataset list

With the datasets in place, we can now create our charts, choosing the appropriate visual component for each piece of information. The list below contains the charts created using data from Amazon Athena and Amazon Redshift.

Apache Superset — Chart list

Now we will create two dashboards: the first uses NYC Taxi trip data with Amazon Athena as the data source; the second uses the AWS-provided Tickit data with Amazon Redshift as the data source.

Apache Superset — Dashboard list

Let’s see what each dashboard looks like after the creation.

Dashboard: NYC Trip Summary (2019, 2020)

Dashboard: Event Ticket Selling — Year 2008

Conclusion

The creation of these dashboards concludes our blog series, as they bring together every resource we built in the previous posts. They also give an overview of how we can explore our data and build visual presentations from it. We kept things simple in this concluding post, but it is possible to go much further and discover more in your data.

We think and we do!

Do you have a business that requires an efficient and powerful data architecture to succeed? Get in touch with us at White Prompt, and we will make it happen!

