How I built a Node.js Service to Clone my AWS S3 Buckets
- 2019-03-08 01:56 AM
If you are an absolute beginner with Node.js, please go through this write-up first to get an overall understanding of the why and how of Node.js.
Since this write-up is about building a fully-functional Node.js application, we will not go deep into the basics; instead, we will quickly touch upon the key points to look out for, along with a bit of insight into the application itself.
AWS (Amazon Web Services)
AWS is a cloud-based ecosystem where you can manage the whole infrastructure of your applications without needing any physical hardware or servers.
In short, AWS offers a wide range of solutions for companies seeking to store data, access data, run servers, scale existing services and much more. Among these services, one of the most important and simplest is S3 (Simple Storage Service). S3 offers object storage, meaning you can store files of any format on it, and it comes with a set of REST APIs through which you can perform CRUD operations.
Together, these two technologies make a powerful combination for various use cases: whenever your Node.js application needs to store and retrieve arbitrary files, executables or any other data format that is not suited to an ACID-compliant DB such as a SQL DB. In short, it's a file storage system where you can store any type of data and access it easily.
What’s this project about?
This project is basically building a Node.js application to clone your S3 bucket locally and recursively.
TBH, this project is a one-liner using the AWS-CLI. Yes, you heard it right. So why are we doing it anyway?
“Why should every problem have only one solution? I simply like having more than one, be it good or bad.”
Let’s look at the existing solution first. You install the AWS-CLI and run the following command:
aws s3 cp s3://my-s3-bucket/ ./ --recursive
Surprisingly, apart from the AWS CLI, I didn't find any proper Node.js script or app that would do this for medium to large buckets using the AWS-SDK. The answers I found online had several problems: half-baked scripts, scripts that tried to create files synchronously and had no way to know when they had finished, and scripts that ignored empty folders entirely. Basically, they didn't do the job right, so I decided to write one myself, properly.
I am positive that this will give you a better understanding of how a Node.js application should look and feel, whatever its size and purpose.
What can you expect out of this writeup?
As I said above, I am not going to explain the code line by line, since I am publishing the entire codebase. Instead, I will talk about how I architected the application, with some insight into the core logic and key features. Here is what you can expect by the end of this write-up:
- Project Structure of our Application
- Core Logic Surrounding the Application
- Streams in Node.js to Download a File
- Using AWS-SDK to access S3 APIs
- Entire Codebase
Before we start
Before we jump the gun, let's ask why we call this a service and not a server. I call it a service because it does just one job, and no external entities send requests to it. If external clients were sending requests to our application while it listened on a port, I would call it a server.
When I was a junior frontend developer, I thought Node.js applications consisted of a single server file, and that the code would always be messy. While that is true to some extent, I later learnt that it is not how things should be. As responsible developers, we should aim to keep files as small as possible and separate concerns properly. That is the key to developing a scalable Node.js application.
There is no single opinionated approach to project structure; it can change from project to project based on the use case. Personally, I split the code into smaller independent modules. One module does one type of task, and one type of task only.
Say I have a network.js file: it does only network operations and doesn't modify the file structure or create new files.
Let’s look at our project structure for instance,
As I said before, there is no single right way to structure your project, but it's ideal to pick a topic and group all your files under that topic. For me, that topic was activity: “what does this file handle, and how?”
Let’s start from the root and go step by step.
These are the project dependencies and supporting files, essential for development and deployment, and mostly straightforward to understand:
- git ignore/eslint configs
- license, readme etc.
Then comes the config file. The config file holds all your application configuration: API keys, bucket name, target directory, third-party links and so on. Normally we would have two config files, one for the production environment and one for development.
Once we have the skeleton of the application ready with its dependencies, we come to the core entities. In our application, the core entities are the Handler, Service and Storage.
Handler is where we glue the entire application together: it creates the Services and Storages, injects the required dependencies and exposes an API for index.js to consume.
Service is where our core-logic of the application lives and all the jobs to other dependencies are delegated from here.
Storage is where all our storage operations take place. In our application S3 is the external storage from where we retrieve our files and data from, hence the AWS-SDK operations exclusively happen only inside this storage module.
Helpers and Utils
When the service runs, it needs to carry out several tasks concurrently. For example, once we get the list of contents under a directory, we need to start creating/cloning those contents locally. That operation is delegated to cloner.js, a helper responsible only for cloning files and folders. The cloner, in turn, uses the fileOps.js module to create directories and files.
Under helpers, we have the following files doing their respective chores with or without the help of other helpers.
Cloner handles the cloning of files and directories with the help of the fileOps module.
Data Handler maps and parses the data from the S3 bucket into data the service can consume.
Downloader only downloads files from the S3 bucket and pipes the operation to a write stream; simply put, it takes care of downloading files asynchronously.
fileOps, as the name suggests, uses Node's fs module to create file streams and directories.
filePath provides the entire application with the target folder for cloning the S3 bucket's files and directories, and also returns the target bucket and target root directory on S3.
Logger, inside utils, returns an instance of a Bunyan logger which can be used to send logs to third parties like Kibana.
Core Logic Surrounding the Application
Now that we have done our project setup, let’s look into the core logic of the service module. It involves the sequence of the following actions:
- Separate the files and directories, because we clone the directories and download the files.
- Clone all the directories first, and then move on to download the files.
- Download the files through streams, logging success and failure respectively. (The AWS-SDK ListKeys API response sometimes omits directory keys, so we need to check whether a directory exists and, if not, create it before downloading the contained file.)
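The separation step above can be sketched as follows, assuming the listing gives us plain S3 keys where directory keys end with a trailing slash. Missing directory keys are derived from the file keys, per the note above; this is an illustration, not the service's exact code:

```javascript
// Split a list of S3 keys into directories and files, deriving any
// directory prefixes the listing omitted (keys ending in '/' are
// treated as directories). Illustrative sketch.
function separateKeys(keys) {
  const directories = new Set();
  const files = [];

  for (const key of keys) {
    if (key.endsWith('/')) {
      directories.add(key);
    } else {
      files.push(key);
      // Derive parent prefixes in case ListObjects omitted their keys.
      const parts = key.split('/').slice(0, -1);
      for (let i = 1; i <= parts.length; i++) {
        directories.add(parts.slice(0, i).join('/') + '/');
      }
    }
  }
  return { directories: [...directories].sort(), files };
}
```

With the keys separated like this, all the directories can be created first, and the file downloads can then run without ever hitting a missing parent folder.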
Additionally, the service not only clones the entire bucket: it can also clone specific folders without losing the folder tree structure, based on the Prefix configuration specified as the rootDirectory for cloning.
Downloading Files Using Streams
Another important Node.js concept is using streams to upload and retrieve data from an external source. In our project, that external source is AWS S3.
When downloading the files, we create a read stream from the AWS SDK getObject method and pipe it to a write stream, which closes automatically once all the data is written. The advantage of using streams is low memory consumption: the application never needs to hold the whole file in memory, regardless of its size.
Our code inside the storage module uses streams to download the data asynchronously without blocking the event loop.
To dig deeper into Node.js streams, please refer to this write-up.
Using AWS SDK to access S3
This is the most straightforward topic in the whole application: you install the AWS-SDK and start calling its methods. Taking a look at the storage file will give you a better understanding of how to import the SDK and call methods on it.
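As a flavour of what the storage module does, here is a sketch of listing every key under a prefix with the AWS SDK v2, following pagination via the continuation token. The function name is illustrative, and the require is deferred into the function so the sketch reads standalone even without the SDK installed:

```javascript
// List every key under a prefix, following ListObjectsV2 pagination.
// Sketch based on the AWS SDK v2 listObjectsV2 API, not the app's exact code.
async function listAllKeys(bucket, prefix) {
  const AWS = require('aws-sdk'); // deferred for this standalone sketch
  const s3 = new AWS.S3();
  const keys = [];
  let token;

  do {
    const page = await s3.listObjectsV2({
      Bucket: bucket,
      Prefix: prefix,
      ContinuationToken: token, // undefined on the first request
    }).promise();

    page.Contents.forEach((obj) => keys.push(obj.Key));
    token = page.IsTruncated ? page.NextContinuationToken : undefined;
  } while (token);

  return keys;
}
```

ListObjectsV2 returns at most 1,000 keys per call, so looping on the continuation token is what makes the clone work for medium to large buckets.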
The Codebase of the Application
Here you can find the entire code for this application. More than reading alone, getting hands-on will teach you a great deal and help you understand the core concepts of this application. Feel free to fork it, play with it and, if you like it, leave a star on the repo.
This marks the end of this write-up. I hope it gave you a better understanding of how to plan, build and run a Node.js service in the real world on a platform such as AWS. Feel free to leave your comments and responses, and claps on the story if you liked it. Thanks for reading!
Originally published by Rajesh Babu at https://blog.bitsrc.io