11 Gradual Methods To Scale A Database

I used to work in a web shop / app agency and now work as a full stack app and web developer. My work’s focus has always been on churning out applications fast and furious. There is not much investment, budget or love to shower upon optimization, security and, in particular, scaling concerns.

With some free time on my hands, I decided to look at the topic that has been bothering me since my first ever Ruby On Rails application: database scaling.

The methods to scale a database can be split into 3 categories.

  • Application and code
  • Database design and schema
  • Database infrastructure

11. Obliterate N + 1 Queries

This is a common problem that can overwork the database with an unnecessary number of queries. It generally stems from a lack of JOINs in the queries to prefetch data that will eventually be used in the same request.

Common problem, common solution. Use JOINs to eager load the data beforehand to spare your server multiple trips to the database.
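
As a rough sketch in Rails' ActiveRecord, which is the stack I usually work with (the Post and Comment models here are hypothetical):

# N + 1: one query for the posts, then one extra query per post
Post.all.each do |post|
  puts post.comments.count
end

# Eager loaded: the comments are fetched alongside the posts in one go
Post.includes(:comments).each do |post|
  puts post.comments.size # size uses the loaded association, no extra query
end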

As an application level optimization on the code base, it should be a basic practice for backend developers.

I myself, however, do not practise this all the time. In fairness to myself, and probably a fair number of my fellow confrères, we don’t optimize right away because we have clients who don’t really know what they want, and the project requirements were not all that clear to start with. It just does not make financial sense to spend resources optimizing the project.

From a developer’s point of view and experience building up applications from scratch, the priority is functionality, low budget and speed. Optimization is a bonus we do not get paid for.

10. Optimize Data Type

Data type optimization is an optimization done on the database schema.

Optimize the space needed for each column. An email address, for example, should not need the full VARCHAR(255) of a typical string data type.

Optimizing the space that each column in each table takes will reduce the time taken for the database to get the required data as it will “traverse less bytes”.
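
A minimal sketch of such a tweak in a Rails migration, assuming a hypothetical users table (the 100 character limit is an arbitrary choice, not a rule):

class TightenUserColumns < ActiveRecord::Migration[6.0]
  def change
    # Cap the email column below the default VARCHAR(255)
    change_column :users, :email, :string, limit: 100

    # Small integers need less space than the default integer type
    change_column :users, :age, :integer, limit: 1 # tinyint on MySQL
  end
end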

9. Normalization

Normalization is an optimization done on the database design and schema.

Split out common but less accessed data into separate tables so that there is less computation required when reading the key tables. This is a form of enumeration at the database level. It keeps reading data efficient and thus reduces the load on the database.

However, be careful not to over-normalize your tables, or it will require a lot of unnecessary JOINs that can quickly bloat up computational needs. An example that I have come across is the normalization of ‘days’ into a table of permanently 7 rows and 2 columns.
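
For illustration, here is a hedged sketch of pulling a repeated country string out of a hypothetical users table into its own table (the column names are assumptions):

class ExtractCountriesFromUsers < ActiveRecord::Migration[6.0]
  def change
    # The repeated free-text country column moves into its own table
    create_table :countries do |t|
      t.string :name, limit: 60
    end

    # Users now reference a country row instead of repeating the string
    add_reference :users, :country, foreign_key: true
    remove_column :users, :country_name, :string
  end
end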

8. Indexing

Indexing is an optimization done on the database design.

Indexing allows the database to look through a mapping table to find the required row of data in the corresponding table rather than the whole table of data itself.

A mapping table is much more lightweight, hence it reduces the time needed to get to the data, freeing up more resources for the database to handle more incoming requests.

Think of it as the table of contents in a lengthy web page, book or grocery catalog. You would do a “control F” to look for the information you want via the table of contents rather than read from start to finish until you get to your data. That is essentially what indexing is.
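
In a Rails migration, adding an index is a one-liner; this sketch assumes a hypothetical users table that is frequently looked up by email:

class AddIndexToUsersEmail < ActiveRecord::Migration[6.0]
  def change
    # Queries like User.find_by(email: ...) can now use the index
    # instead of scanning the whole table
    add_index :users, :email, unique: true
  end
end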

7. Database Views

Database views are an optimization done on both the application and database design levels.

Database views are stored queries in the database. They can store temporary results and have an index attached to them.

The advantage of using database views over only indexing your tables is that the database now only has to go through the filtered results from the SQL query in the database view, as compared to an index, which covers all rows without any filtering.

For example, if a database view has a query that only looks at records for this year, then your database will only be searching the records for this year. Using only an index, it would have to search through the records from the start of time until this year, which is a lot less efficient. This stackoverflow answer explains it better.
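
Here is a hedged sketch of creating such a view in a Rails migration with raw SQL, assuming PostgreSQL and a hypothetical orders table:

class CreateThisYearOrdersView < ActiveRecord::Migration[6.0]
  def up
    execute <<~SQL
      CREATE VIEW this_year_orders AS
      SELECT * FROM orders
      WHERE created_at >= DATE_TRUNC('year', CURRENT_DATE);
    SQL
  end

  def down
    execute "DROP VIEW this_year_orders;"
  end
end

# A read-only model can then sit on top of the view
class ThisYearOrder < ApplicationRecord
  self.table_name = "this_year_orders"

  def readonly?
    true
  end
end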

6. Caching

Caching is an optimization done on the database infrastructure level.

Some of the information we display on our websites and apps is derived data from our database. Derived data is raw data from the database that is computed within your application based on business logic. One example I can think of is tabulating the total spent by users of an online shop, which involves calculating the individual prices of each item, the quantity bought, discounts and miscellaneous fees like taxes and shipping.

This data would not change, given the same raw data and the same computing algorithm. So rather than running the same requests to get the same raw data from the database, and going through the same algorithm, it may make sense to cache it. We typically use cache servers, which are not part of the typical database and are add-ons to the infrastructure, to handle this.

Cache servers like Redis store data in RAM and not on the hard disk like typical databases. The significance of this distinction is that data in RAM can be accessed much more quickly than data on the hard disk. It is also for this exact reason that RAM is much more expensive than hard disk storage, which destroys any idea you might have of using the cache alone as the database.

This gives you much to think about regarding what data should be cached and what should not. The art of using the cache efficiently to break up the bottlenecks in your application requires much experience and experimentation.

You also have to use them wisely because most caches have a limit on how much data you can store in them. Redis, for example, as a key-value store, at the very basic level, has a limit of 512MB for each value.

On top of that, you need to be smart about when and how to auto expire and explicitly expire cached data so that they show the latest data according to your application needs. For example, when there is a change in the raw data in your database, the computation has to be done again since we are talking about new and different input values.
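
A minimal sketch with Rails' cache store, assuming Redis is configured as the backing store and a hypothetical Order model with price and quantity columns; the cache key is versioned by the latest updated_at, so a change in the raw data naturally busts the cache, and the one-hour expiry is an arbitrary cap:

def cart_total(user)
  # The key changes whenever the user's orders change, so stale totals
  # expire implicitly; expires_in guards against unbounded growth
  key = ["cart-total", user.id, user.orders.maximum(:updated_at)]

  Rails.cache.fetch(key, expires_in: 1.hour) do
    user.orders.sum { |order| order.price * order.quantity }
  end
end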

5. Read Replicas

The use of read replicas is an optimization done on the database infrastructure level.

The use of read replicas involves spinning up copies of the master database to handle read loads. This spreads the load out, leaving mainly the write requests to the master database.

Some read requests that require strong consistency still need to go through the master database. This is due to the latency of data propagation from the master database to these read replicas when new changes are made to the master database.

If your application is write intensive, this may not be the best tool for the job and it will achieve little improvement in performance.
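
On the application side, if it runs on Rails 6 or newer, the built-in multiple database support can route reads to a replica; a hedged sketch, with the database names and the Report model being assumptions:

# config/database.yml defines a primary and a primary_replica entry
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Reads that can tolerate replication lag are wrapped in the reading role
ActiveRecord::Base.connected_to(role: :reading) do
  Report.where(year: 2020).count
end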

4. Vertical Scaling

Vertical scaling is an optimization done on the database infrastructure level.

This is the oldest trick in the book: throw money at the problem. Upgrade the database instance or opt for a storage type with higher IOPS.

This is ultimately a mere stopgap solution, as there is a limit on how far it can take us. It is also a costly upgrade for a solution that is not future proof.

I perceive its main advantage as simply buying us time to prepare for the next level of scaling.

3. Vertical Partitioning

Vertical partitioning is an optimization done on the database infrastructure level.

Disclaimer: I have never experienced doing this, but I believe this is what vertical partitioning is theoretically about, and I would love to be corrected if I am wrong about it.

This step is slightly different from what most people perceive as sharding, which is more commonly the horizontal sharding that we will cover later. Vertical partitioning is a form of sharding that is easier to implement. I deem it the appetizer for sharding.

It involves splitting columns and even tables into separate databases or “shards”. This reduces the data in the main database and thus its computing load. It also spreads the traffic, in particular the write requests that replicas are not able to solve, to other shards.

Each shard itself can have its own read replica cluster to further distribute the load.

However, this complexity will seep into the application level, as your application now needs to know which database to connect to when writing or reading a given piece of data.
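
In a Rails 6 application, that routing can be expressed as a separate base class per shard; a hedged sketch, where the locations database and the model names are assumptions:

# Models in the main database keep inheriting from ApplicationRecord.
# Models moved to the "locations" shard inherit from their own base class.
class LocationsRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: { writing: :locations, reading: :locations }
end

class VehicleCoordinate < LocationsRecord
  # Reads and writes for this table now hit the locations database
end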

2. Hybrid Databases

The use of hybrid databases is an optimization done on the database infrastructure level.

Disclaimer: I have never experienced doing this, but I believe this is what using hybrid databases is theoretically about, and I would love to be corrected if I am wrong about it.

This is a follow up on vertical partitioning. We can apply newer and more appropriate technologies to the new shards that can manage that part of the application better.

For example, we can use NoSQL databases to handle the historical coordinates of vehicles for a location tracking module. The requirements of this module place it more towards availability and partition tolerance in the CAP theorem. There is no imperative need for ACID properties to be upheld in database transactions, and eventual consistency under the principles of BASE is sufficient for this module, on a very general level. This allows us to utilize the scaling capabilities of the NoSQL database, at the expense of consistency in the data, which is something we can deal with.

That said, not all NoSQL databases are made equal. They do not all sacrifice consistency for availability and scaling. There are many flavors of NoSQL that will fit the different requirements of your module, and it is all about finding the correct tool for the correct job.

Another example of using hybrid databases is when you have a highly analytical application. We can partition the tables involved in the data computation into a shard, and perform an ETL process to store the data in a more efficient structure in databases that are more appropriate for analytical functions, like Amazon Redshift and Google’s BigQuery. It takes away the computational load from your application and database.

The advantage of this partitioning is that it allows you to scale only the bottlenecks of your application in the most cost-effective manner.

An example of this vertical partitioning is done by Airbnb for their chat module. They identified it as a bottleneck in their application and acted accordingly. They did not use a different technology for that partition in this case.

1. Sharding

Sharding, or horizontal partitioning, is an optimization done on the database infrastructure level.

Disclaimer: I have never experienced doing this, but I believe this is what horizontal partitioning is theoretically about, and I would love to be corrected if I am wrong about it.

Eventually, some of your tables will have so much data that there is a need to split the rows of the tables into different shards. For example, the first 10 million rows of a table will be moved to a shard located in the USA, the next 10 million rows of the same table will be placed in another shard located in Germany, and so on.

Usually, at this point in time, you will have a handful of clusters of vertical shards. Horizontally sharding each of these clusters will not be easy to manage. I believe it is a complex mess to handle.

This also brings about new problems, like cross-shard latency at a global scale and the application complexity of routing data to the correct shard. Add in the requirement for data recovery, and it is time to update your resume, as Mr. Sugu Sougoumarane mentions in his talk about the Vitess tool, which brings me to the next point on this tool as a solution to sharding.

Before you carry out sharding, even for vertical partitions, you may want to consider Vitess. It is a database clustering management system for horizontal scaling that saves you the complexity of handling that yourself as you scale, so that you can spend your resources on improving the application itself, which is what ultimately matters.

If I ever get to the point of having to do sharding, at least this will be the first tool that I will research and study more about to tackle the problem.

How To SSH Into Private Servers Via A Bastion Without Copying SSH Keys

This is a documentation on the process of accessing the private EC2 instances from a bastion server that is created in the public subnet, as a follow up to the article on setting up a proper cloud infrastructure with basic security for applications on AWS.

Motivation

Once in a while, we need to communicate with the production servers to do checks. The proper way is to set up a bastion server in the public subnet and ssh into the private servers through it. However, setting the bastion server up with the proper configurations might be time consuming to get right.

On top of that, once we are done, there is a financial incentive to shut the bastion server down to save cost. This will translate to more time consumed to spin it up and down.

Since we might not do this often, we would tend to forget how to set up or shut down the bastion server properly. This translates to more time debugging during each process should any steps be missed along the way.

Hence, it will be nice to have these processes recorded down in code.

Terraform Setup For Bastion Server

The terraform files to set up the bastion server are shown below. This is a complete copy of the snippet in the article on setting up a standard AWS VPC using terraform. The explanation is there, so I will not be covering it here.

# bastion
resource "aws_security_group" "bastion" {
  name = "${var.project_name}${var.env}-bastion"
  description = "For bastion server ${var.env}"
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.project_name}${var.env}"
  }
}

resource "aws_security_group_rule" "ssh-bastion-world" {
  type = "ingress"
  from_port = 22
  to_port = 22
  protocol = "tcp"
  # Please restrict your ingress to only necessary IPs and ports.
  # Opening to 0.0.0.0/0 can lead to security vulnerabilities
  # You may want to set a fixed ip address if you have a static ip
  security_group_id = aws_security_group.bastion.id
  cidr_blocks = ["0.0.0.0/0"]
}

resource "aws_security_group_rule" "ssh-bastion-web_server" {
  type = "egress"
  from_port = 22
  to_port = 22
  protocol = "tcp"
  security_group_id = aws_security_group.bastion.id
  source_security_group_id = aws_security_group.web_server.id
}

resource "aws_instance" "bastion" {
  ami = "ami-061eb2b23f9f8839c"
  associate_public_ip_address = true
  instance_type = "t2.nano"
  subnet_id = aws_subnet.public-ap-southeast-1a.id
  vpc_security_group_ids = ["${aws_security_group.bastion.id}"]
  key_name = aws_key_pair.main.key_name

  tags = {
    Name = "bastion-${var.project_name}${var.env}"
  }
}

resource "aws_key_pair" "main" {
  key_name = "${var.project_name}-${var.env}"
  public_key = "ssh-rsa something"
}

output "bastion_public_ip" {
  value = aws_instance.bastion.public_ip
}

Retrieving AWS EC2 Client

I will be using ruby to carry out the ssh process because I am really bad with shell script 🙁

These snippets here are translated from a rake task which is what I use for my projects. There may be errors here and there so do understand the process rather than just plain copy!

require 'aws-sdk'

aws_access_key_id = `aws --profile #{aws_profile} configure get aws_access_key_id`.chomp
aws_secret_access_key = `aws --profile #{aws_profile} configure get aws_secret_access_key`.chomp

ec2_client = Aws::EC2::Client.new(
  region: region,
  access_key_id: aws_access_key_id,
  secret_access_key: aws_secret_access_key
)

First, initialize an instance of the EC2 client. That will require the correct access key id and secret access key. These can be easily retrieved by running the aws configure command in the shell.

Lines 3 and 4 do this, given the desired aws_profile, and store the values in the respective ruby variables for use in the rest of the script.

Note that the back ticks (`), among the other shell execution commands in ruby, should be used here as they are the only ones that return the output that we need to use. The chomp method removes the line break that is returned along with the output in the shell.

Retrieving The Bastion Server Instance

results = ec2_client.describe_instances(
  filters: [
    {
      name: 'instance.group-name',
      values: ["#{project_name}#{env}-bastion"]
    }
  ]
)

raise 'There are more than 1 reservations. Please check!' if results.reservations.count > 1

raise 'There are no reservations. Please check!' if results.reservations.count.zero?

instances = results.reservations.first.instances

raise 'There are more than 1 bastion servers. Please check!' if instances.count > 1

bastion = instances.first

Next we retrieve the bastion instance via the describe_instances method as shown on line 1.

On lines 4 and 5, we narrow down the instances to search for using the filter instance.group-name. This filter refers to the security group that we have attached to the bastion instance (see the terraform files).

The next few lines handle the unexpected scenarios of having multiple reservations and instances. I do not know the difference between these 2 entities, but I guess that is trivial to our mission here.

Eventually, we will have the access to the bastion instance.

Retrieving The Private Web Server Instance

results = ec2_client.describe_instances(
  filters: [
    {
      name: 'instance.group-name',
      values: ["#{project_name}#{env}-web-servers"]
    }
  ]
)

raise 'There are no reservations' if results.reservations.count.zero?

private_ip_addresses = results.reservations.map do |reservation|
  reservation.instances.map(&:private_ip_address)
end.flatten

raise 'There are no private_ip_addresses.' if private_ip_addresses.count.zero?

instance_ip = private_ip_addresses.first

Next, we retrieve web server instances using the same method by filtering their security group’s name. Our target here is the private ip address of any one of the instances. Adjust accordingly if there is a particular private instance you are trying to access.

SSH Into Private Web Server Via Bastion Server

Here comes the main event, the ssh operation.

sh 'ssh-add -D'
sh "ssh-add -K #{Rails.root}/#{project_name}-#{env}"
sh('ssh ' \
'-tt ' \
'-A ' \
"-i #{Rails.root}/#{project_name}-#{env} " \
"ec2-user@#{bastion.public_ip_address} " \
'-o StrictHostKeyChecking=no ' \
"-o 'UserKnownHostsFile /dev/null' " \
"\"ssh ec2-user@#{instance_ip} " \
"-o 'UserKnownHostsFile /dev/null' " \
'-o StrictHostKeyChecking=no"')
sh 'ssh-add -D'

Use the sh utility command in ruby to execute shell script. We are not going to use the output of the commands here, so using back ticks is not necessary.

In this bash session, line 1 clears all ssh identities present, if any, with the -D option. Note that this bash session is decoupled from the bash session of the current terminal. Hence, at this point in time, there should not be any identities since it is a new session. We also do not need to worry about erasing the ssh identities that we have added elsewhere. I am keeping it here for hygiene's sake.

Line 2 adds the RSA identity of the key pair, which was used to create the bastion instance, to the ssh agent in the current session.

This step is extremely pivotal. It allows us to forward our ssh agent along with the required RSA identity to the bastion server. The bastion server will subsequently be able to authenticate with the web server due to the forwarded identity.

And realise this. All this is done without the bastion server actually possessing the ssh keys at all! This is immensely beneficial on the security side of things because the bastion server, as a server on the public subnet that is exposed to the Internet, is a point of vulnerability for your private instances. It poses a security risk if attackers are able to access the private instances using the ssh keys in the bastion server. But since the keys are not there, we can make sure Gandalf sees to them.

The command from line 3 onwards is the actual main ssh command.

Line 4 forces a pseudo-terminal allocation for us to interact with the web server once we have established the connection. Specifying the t option multiple times ensures that the interactive session will be forced even if ssh does not have a local tty for interaction purposes.

Line 5 forwards the ssh agent through the tunnel. And since we have added the RSA identity to our ssh agent in this session, the authentication keys are also forwarded in the process, without making a copy on the bastion server itself.

Line 6 points to the identity file required to access the bastion server from our local machine. You would not need this line if you have created your bastion server and your web instances using the same key pair. In this case, it is an extra step that is not strictly necessary, as I have set up the web and bastion servers to use the same key pair.

Line 7 states the endpoint of the bastion server and where to ssh into.

Line 8 prevents the ssh mechanism from asking for our confirmation to carry out the operation.

Line 9 prevents our bastion’s ip address from being registered as a known host in our ssh known_hosts file. As these bastion servers are meant to be shut down after use, their ip addresses will be different each time we spin them up. Hence, this option will prevent the unnecessary and unmonitored bulging of our known_hosts file.

Line 10 is the command to run after we have successfully sshed into the bastion server. In this case, we are running a subsequent ssh command to ssh into the web server instance via its private ip address that we found earlier. Note that this command is wrapped in quotes.

To reiterate a few pointers here:

  • This subsequent ssh command does not require any identity file due to the RSA identity that is forwarded
  • The bastion server can access the private instances due to the setup of the security groups
  • The shell interaction session between the bastion server and the web instance is available to use on our local machine due to the -tt option mentioned in line 4.

Lines 11 and 12 serve the same purpose as lines 8 and 9, but this time for the bastion server’s ssh operation into the web servers.

Line 13, for hygiene's sake, clears the RSA identity from the ssh agent.

Conclusion

There you have it, an ssh session via a bastion server without copying the security keys into it to ensure minimum vulnerability and maximum security!

Connecting MSSQL Database Using Ruby On Rails

This is a documentation on how to connect to an MSSQL database in a Rails application. We will use FreeTDS as the main toolkit to establish the connection.

Motivation

I came across a gig that required me to connect to an MSSQL database to extract data via the application that I was building in Ruby On Rails. I spent quite some time experimenting and playing with it before I managed to get it to work.

It will be good to document my steps and reasons in case I come across another such request and my memory fails me.

Installation

While Ruby on Rails has a gem that serves as a wrapper around the FreeTDS library of files, it requires the FreeTDS binaries to be installed natively on the machine that is running the application.

This presents a number of challenges. First, the local machine used may be different for different users. Second, the operating system used in the servers and local machine may be different too.

For my case, I use macOS for my development work, and the Amazon flavored linux for my staging and production sites.

Installing FreeTDS on macOS

The steps listed here follow this guide closely.

First, install these libraries locally using Homebrew.

brew update
brew install unixodbc freetds

ODBC is an API that is meant for database access across different platforms. unixODBC is the driver manager that allows unix systems to connect to ODBC-capable databases.

MSSQL is one such database. However, while it uses ODBC for connection, it uses the TDS protocol on the application layer for communication. Hence, an ODBC driver alone is insufficient for the machine to process the data in the database. This is where FreeTDS comes in.

FreeTDS is a set of libraries that will do the translation and allow our application to connect to the database and retrieve the data.

Installing FreeTDS on Amazon Linux

Credits to this answer on stackoverflow. He even gave the steps required to install the packages via Elastic Beanstalk, which is convenient for me as I also use Elastic Beanstalk for deployment.

[ ! -e /home/ec2-user/freetds-1.00.86.tar.gz ] && \
wget -nc ftp://ftp.freetds.org/pub/freetds/stable/freetds-1.00.86.tar.gz -O /home/ec2-user/freetds-1.00.86.tar.gz || \
true

The first section of the code, enclosed within a pair of square brackets, is a unix command to check for the existence of the zip file, which contains the necessary libraries, in the home path of the server. In the Amazon Linux system, the home path is /home/ec2-user by default. Adjust accordingly if you are installing on a local linux machine.

Should the file exist, the subsequent command to download the file will not be executed due to the logical && operation.

The last || operation with a true ensures the command returns true, so the whole Elastic Beanstalk process will continue even if the file already exists. Of course, this step is not necessary if we are installing the libraries manually on our local linux machine.

[ ! -e /home/ec2-user/freetds-1.00.86 ] && \
tar -xvf /home/ec2-user/freetds-1.00.86.tar.gz -C /home/ec2-user/ || \
true

Similarly, this step checks for the presence of the extracted folder to prevent repeated and unnecessary unzipping of the compressed library.

[ ! -e /usr/local/etc/freetds.conf ] && cd /home/ec2-user/freetds-1.00.86 && \
sudo ./configure --prefix=/usr/local --with-tdsver=7.4 || \
true

[ ! -e /usr/local/etc/freetds.conf ] && \
( cd /home/ec2-user/freetds-1.00.86 && sudo make && sudo make install ) || \
true

The next 2 commands set up the configuration for FreeTDS and finally install its libraries. Upon installation, the config file freetds.conf will be produced, which explains the checks against its existence to prevent duplicate installation operations.

Application in Ruby on Rails

With the FreeTDS libraries installed on the machine, we can look at how to use the tiny_tds gem to communicate with the MSSQL database. After installing it via bundler, we can use the following commands to connect.

client = TinyTds::Client.new(
  username: Rails.application.credentials.dig(Rails.env.to_sym, :deltek, :username),
  password: Rails.application.credentials.dig(Rails.env.to_sym, :deltek, :password),
  host: Rails.application.credentials.dig(Rails.env.to_sym, :deltek, :host),
  port: Rails.application.credentials.dig(Rails.env.to_sym, :deltek, :port),
  database: Rails.application.credentials.dig(Rails.env.to_sym, :deltek, :database)
)

Following the new practice of using the credentials file to store secrets, I have stored all the database credentials in the encrypted credentials.yml.enc file.

client.execute("
  SET ANSI_WARNINGS ON;
  SET ANSI_PADDING ON;
  SET ANSI_NULLS ON;
  SET QUOTED_IDENTIFIER ON;
  SET ANSI_NULL_DFLT_ON ON;
  SET CONCAT_NULL_YIELDS_NULL ON;
  SELECT @@OPTIONS;
").each

This next snippet sets the settings of the connection. I will not pretend to understand the reasons for the settings made here. However, these are the final settings that worked for me to make the subsequent queries to the database tables. I came to this final configuration after googling around for the different errors that were thrown at me while getting TDS to work.

result = client.execute("SELECT TOP 1 * FROM SOME_TABLE").each

This is an example of executing a query in SQL. The result variable will be an array of hashes, where each hash represents one row of the records.
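
For example, a hedged sketch of consuming the result, with the column names being assumptions:

result = client.execute("SELECT TOP 5 Name, Email FROM SOME_TABLE").each

result.each do |row|
  # Each row is a hash keyed by column name
  puts "#{row['Name']} <#{row['Email']}>"
end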

client.close

Last but not least, make sure to close the client’s connection. This is not active record that “automagically” does that for you.

AWS Lambda and API Gateway Integration With Terraform To Collect Emails

This is a documentation on creating a service that collects emails. It runs on serverless technology utilizing AWS lambda and API Gateway. It is also made easy to deploy to the cloud with infrastructure as code via Terraform in the form of a plug-and-play methodology.

Motivation

Often, I have to make static websites that are not exactly completely static, because they require a backend to collect emails. While 3rd party services like mailchimp and sendgrid have their own SDKs to support easy integration for email collection, we have to worry about hitting the limits of their packages and plans. This translates to stress for developers, as we have to find a solution quickly and properly. If this happens on a weekend or a Friday, and somehow that is always the case as more people are surfing the net then, the intensity is amplified.

For a new website, it is very hard to gauge the traffic and thus the plan required for the 3rd party service. This poses difficulties when budgeting for the project. Under-utilizing the service also translates to unnecessary cost. The best kind of plan for such a website is a pay-as-you-go model, in my opinion, and that can be achieved by integrating with cloud providers like AWS.

Technology Stack

AWS Lambda

Enter AWS lambda, where you only pay for what you use. You do not need to fork out money at the start of your project. Instead, you will just pay for how much you use, hence relieving you of the worry of wasting money on resources you are not using. In fact, cost only becomes an issue if you are hitting 1 million potential users signing up with their emails every month. The reason is that AWS lambda has 1 million free requests every month before they start charging. This is highly unlikely for a new website, which means you now have a backend for your static website for free.

API Gateway

For the serverless function, that is AWS Lambda, to connect to the Internet via an API, we need the API Gateway. This exposes the serverless function to the World Wide Web with an HTTPS endpoint. It runs on the encrypted transport layer security (TLS) protocol to uphold security by default. This allows your websites to use the serverless function via API calls.

Terraform

To set up the infrastructure, the usual way is to navigate the AWS management console, deploy the required AWS resources and link them. This can be a challenge if you are not familiar with the required configurations. Not only will this translate to a loss of precious time debugging these issues, time which developers could otherwise have spent with their loved ones and challenging the meme below, but it will also lead to frustration.

While frustration is part and parcel of life as a programmer, we can also avoid it with our knowledge of code. Here is where Terraform enters the fray. It is an Infrastructure as Code tool where you write the configuration of the infrastructure once and can deploy it multiple times without having to go through the whole forest of the AWS console each time. This means you do not need to remember every single step, and you do not need to deal with surprise bugs because you forgot one of them, or worse, had a spelling error.

Programming is like magic. You write very specific instructions in arcane languages to invoke commands, and if you get it even a little bit wrong you risk unleashing demons and destroying everything.

— Diana Carrier (@artemis_134) June 23, 2018

Since the blueprint infrastructure is in code, this means we can leverage version control features with git, and work together to improve the code base along the way without fear of not being able to rollback to the previous successful configuration.

Terraform Files

I will start off with the terraform files required to setup the infrastructure to deploy the code. Let’s start off with the place to store our emails.

The database – AWS DynamoDB

I will store the emails collected in AWS’s own noSQL database DynamoDB. This is a fast, simply structured and schemaless storage which fits my use case very aptly.

It allows fast and simultaneous writes at high speed, so there is no fear of race conditions from a spike in the volume of signups during a PR event promoting the product and getting people to leave their emails at the website.

Since it is schemaless, we can easily add new details of the users that we would like to collect on top of their emails along the way, without having to migrate and fiddle with the structure of the database. With proper metaprogramming, you do not need to touch the backend code either, leaving only the frontend to work on adding the new text fields for data collection.

For the sake of argument, we could also use a traditional relational database management system (RDBMS) for this project. It is queried with SQL, which is a language most, if not all, developers who have ever touched a database would know. There is no need to use fancy noSQL for this simple project. In addition, the chances of leveraging the scaling advantage of noSQL over SQL databases are low, because you will need a lot of traffic for that to become a worry. For a new website, that is highly unlikely to happen.

However, highly influenced by the cost, I am still sticking with DynamoDB in this case. To set up AWS RDS to host a managed relational database, the cheapest MySQL database already goes for around 20 USD a month, as compared to the pay-as-you-go model that DynamoDB employs. On top of that, DynamoDB has a generous amount of free usage and storage under its free tier. This free tier does not last only for the first 12 months after signup, but forever, unlike its RDS counterpart. We probably will NOT incur any cost using DynamoDB unless your marketing is brilliant for your new website.

resource "aws_dynamodb_table" "main" {
  name = "${var.project_name}-dynamodb_table"
  billing_mode = "PROVISIONED"
  read_capacity = var.dynamodb-read_capacity
  write_capacity = var.dynamodb-write_capacity
  hash_key = "email"

  attribute {
    name = "email"
    type = "S"
  }
}

Provisioning the database is the simplest part. I am using Terraform variables to substitute values to set the number of read and write units required, as well as the table name, for robustness' sake.

I have set the billing mode to “provisioned” for simplicity's sake. After all, I am not expecting any insane burst of traffic for a site that is not popular. Even if it does happen, maybe due to some incredible promotion at some hugely popular event, I do not expect the load to require me to scale the read and write capacities of the database. It is going to be a quick write of a few bytes.

On top of that, provisioned capacity means fewer configurations are needed for the permissions to autoscale the capacity of the database. It can take some time to configure that, and since it is outside the topic of this article, I will stick to the “provisioned” billing mode.

The hash_key, or “partition key” in other definitions, is analogous to the primary key in a SQL database table. It requires specific details under the attribute property. You can specify the range_key, or “sort key” here if you require, and remember to add attribute to describe it as well.

Other attributes that are neither the partition key nor the sort key need not have an attribute property in this file. You can simply write them to the database and they will register. After all, this is a schemaless database.

On top of that, it is a fully managed database, so it comes with all the goodies like backup and version maintenance to spare developers from all these chores.

The backend – AWS Lambda

Next is the lambda function. It is written in Javascript using Nodejs. The file below is the configuration file to set up the infrastructure required. Let’s dive into it.

resource "aws_lambda_function" "main" {
  filename = var.zipfile_name
  function_name = "${var.project_name}"
  role = aws_iam_role.main.arn
  handler = "index.handler"

  source_code_hash = "${filebase64sha256("${var.zipfile_name}")}"

  runtime = "nodejs12.x"
}

resource "aws_iam_role" "main" {
  name = "${var.project_name}-iam_lambda"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_policy" "main" {
  name = "main"
  path = "/"
  description = "IAM policy for lambda to write to dynamodb table and logging"

  policy = templatefile("${path.module}/lambda_policy.tmpl", { dynamodb_arn = aws_dynamodb_table.main.arn })
}

resource "aws_iam_role_policy_attachment" "main" {
  role = "${aws_iam_role.main.name}"
  policy_arn = "${aws_iam_policy.main.arn}"
}

resource "aws_lambda_permission" "main" {
  statement_id = "AllowExecutionFromAPIGateway"
  action = "lambda:InvokeFunction"
  function_name = aws_lambda_function.main.function_name
  principal = "apigateway.amazonaws.com"

  source_arn = "${aws_api_gateway_rest_api.main.execution_arn}/*/*/*"
}

The backend code is uploaded using the base64 hash of the zip file of the code. The code will need to be compressed and zipped first before taking this action. We will see how to automate this process later.

This lambda function will need the permissions to write to the dynamoDB table. This is done using

  • aws_iam_role to establish trust between the 2 AWS services
  • aws_iam_policy to give the lambda function permission to access the database resource and perform the PutItem action. Details of the policy are interpolated via a template file, which we will go through later
  • aws_iam_role_policy_attachment to bind the aws_iam_role to the aws_iam_policy on the lambda function
  • aws_lambda_permission to allow API Gateway to integrate with the lambda function and invoke it

The template file for the aws_iam_policy is shown below. It lists the actions that the lambda function is permitted to perform on the specified dynamodb table. It also contains the permissions for the lambda function to push logs to AWS Cloudwatch. By the way, these logging permissions are the default permissions for a lambda function, and this template adds the DynamoDB permissions on top of them. Note the dynamodb_arn variable that is interpolated, which justifies the use of the template file instead of hardcoding the whole policy in the main terraform file, for robustness' sake.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "dynamodb:PutItem",
      "Resource": "${dynamodb_arn}",
      "Effect": "Allow"
    },
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*",
      "Effect": "Allow"
    }
  ]
}

The API layer – AWS API Gateway

The API Gateway is required to expose the lambda function to be consumed by servers and websites via a URL endpoint. The endpoint will be served over the HTTPS, which requires some extra configurations as documented below.

resource "aws_api_gateway_rest_api" "main" {
  name = var.project_name
}

resource "aws_api_gateway_resource" "main" {
  rest_api_id = aws_api_gateway_rest_api.main.id
  parent_id = aws_api_gateway_rest_api.main.root_resource_id
  path_part = "email"
}

resource "aws_api_gateway_integration" "main" {
  rest_api_id = aws_api_gateway_rest_api.main.id
  resource_id = aws_api_gateway_resource.main.id
  http_method = aws_api_gateway_method.main.http_method
  integration_http_method = aws_api_gateway_method.main.http_method
  type = "AWS_PROXY"
  uri = aws_lambda_function.main.invoke_arn
}

resource "aws_api_gateway_integration_response" "main" {
  depends_on = [aws_api_gateway_integration.main]

  rest_api_id = aws_api_gateway_rest_api.main.id
  resource_id = aws_api_gateway_resource.main.id
  http_method = aws_api_gateway_method.main.http_method
  status_code = aws_api_gateway_method_response.main.status_code
}

resource "aws_api_gateway_method" "main" {
  rest_api_id = aws_api_gateway_rest_api.main.id
  resource_id = aws_api_gateway_resource.main.id
  http_method = "POST"
  authorization = "NONE"
}

resource "aws_api_gateway_deployment" "main" {
  depends_on = [
    "aws_api_gateway_integration_response.main",
    "aws_api_gateway_method_response.main",
  ]
  rest_api_id = aws_api_gateway_rest_api.main.id
}

resource "aws_api_gateway_method_settings" "main" {
  rest_api_id = aws_api_gateway_rest_api.main.id
  stage_name = aws_api_gateway_stage.main.stage_name
  
  # settings not working when specifying the single method
  # refer to: https://github.com/hashicorp/terraform/issues/15119
  method_path = "*/*"

  settings {
    throttling_rate_limit = 5
    throttling_burst_limit = 10
  }
}

resource "aws_api_gateway_stage" "main" {
  stage_name = var.stage
  rest_api_id = aws_api_gateway_rest_api.main.id
  deployment_id = aws_api_gateway_deployment.main.id
}

resource "aws_api_gateway_method_response" "main" {
  rest_api_id = aws_api_gateway_rest_api.main.id
  resource_id = aws_api_gateway_resource.main.id
  http_method = aws_api_gateway_method.main.http_method
  status_code = "200"
}

output "endpoint" {
  value = "${aws_api_gateway_stage.main.invoke_url}${aws_api_gateway_resource.main.path}"
}

So let’s break it down.

The aws_api_gateway_rest_api represents the project in its entirety.

The aws_api_gateway_resource refers to each api route of this project, and there is only 1 in this case.

I have set up only 1 stage environment of aws_api_gateway_stage for this project using a Terraform variable. You can set up different stages to differentiate the staging and production environments.

The aws_api_gateway_stage is associated with an aws_api_gateway_method_settings that sets the throttling rate of the API to prevent spam and overloading. For the method_path property, the wildcard route is used to apply to all routes instead of the only API route that was created. It is trivial in this case, but the explanation for picking this “easy” route is simply due to a bug. If I were to specify the exact route, which is in the form of {resource_path}/{http_method}, the settings on the throttling rate would not propagate. It was documented here on github but was not properly resolved. Leaving it here for now.

The aws_api_gateway_deployment configures the deployment of the API. Note the depends_on attribute that was assigned. This explicit dependency is critical to ensure the deployment is called into effect after all the necessary resources have been provisioned.

The aws_api_gateway_integration configuration sets the integration to lambda proxy using the POST HTTP method without any authorization, as specified by the aws_api_gateway_method configuration. Lambda proxy allows us to handle the request from the server like how we would in a typical web application backend framework. The full request object is passed to the lambda function, and API Gateway plays no part in mapping any of the request parameters. The API Gateway mapping has great potential to integrate interfaces properly, but for our use case, it is not necessary. I find this article doing a great job of explaining the API Gateway features with easy to consume information and summaries, like a gameshark guide book written by the half-blood prince. Do take a look to understand AWS API Gateway better.

The aws_api_gateway_integration_response is responsible for handling the response from the lambda function. This is where we can make changes to the headers returned from the lambda function using the response_parameters property, which is not used in this case. This is also the place to map and transform the response data from the backend to fit the desired data structure using the response_templates property.

The aws_api_gateway_method_response is where we can filter what response headers and data from aws_api_gateway_integration_response to pass on to the caller.

The transformation and mapping of the headers and data from the backend (i.e. the lambda function) in aws_api_gateway_integration_response, and the filtering of headers and data before passing them to the front end in aws_api_gateway_method_response, are not needed in this sample application. They are just good knowledge to have. There are 2 reasons why we do not need them here.

First, in a bit, we will go through the front end that will make an API call that is a simple request. A simple request does not require a preflight request, which is an API call made by browsers prior to the actual API call, as simple requests are deemed safe since they use standard CORS-safelisted request headers. In the event that one does need a preflight request because one is not making a simple request, we will need to set up another API route that will transform the headers returned from the backend and allow the relevant headers to be passed on to the front end for this preflight request. This will allow the frontend website to overcome the CORS policy enabled by default in modern browsers. This will mean we need to configure a new set of aws_api_gateway_rest_api, aws_api_gateway_integration, aws_api_gateway_method, aws_api_gateway_integration_response and aws_api_gateway_method_response just for this preflight request. Things can get complicated here, so I will leave it out of this article. If you still need to implement CORS, this gist (https://gist.github.com/keeth/6bf8b67c82f9a085e03ecbb289a859d6) is a good reference.

Second, we are using lambda proxy integration, so the full response from the lambda will be passed to the front end and mapped automatically, provided the response from the lambda code is properly formatted. Refer to this documentation for more details on it.

Lastly, the output resource will print the value of the endpoint of the API for us to integrate into our frontend.

The Admin Stuff

This file contains the details that we will need to set up terraform and the variables we are using. The provider‘s region attribute here is hardcoded, which should ideally not be the case. I have yet to figure out how to make this dynamic and robust. The names with the todo- prefix should be changed to fit the project.

We are using an S3 bucket as the Terraform backend to hold the state of the infrastructure provisioned by Terraform. Creation of the bucket will be automated via a script that we will go through during the section on deployment.

provider "aws" {
  version = "~> 2.24"
  region = "eu-west-1"
}

terraform {
  required_version = "~> 0.12.0"
  backend "s3" {
    bucket = "todo-project-tfstate"
    key = "terraform.tfstate"
    region = "eu-west-1"
  }
}

variable "project_name" {
  type = string
  default = "todo-project"
}

variable "region" {
  type = string
  default = "eu-west-1"
}

variable "stage" {
  type = string
  default = "todo-stage"
}

variable "zipfile_name" {
  type = string
  default = "todo-project.zip"
}

variable "dynamodb-read_capacity" {
  type = number
  default = 1
}

variable "dynamodb-write_capacity" {
  type = number
  default = 1
}

The Application

Here is the application code, written in Nodejs. It is a simple write to the dynamodb table with basic error handling. It takes in only 1 parameter, that is the email. This code can definitely be improved by allowing more parameters to be written to the database in a dynamic way, so that the same code base can be used for a site that collects the first and last name of the user, as well as another site that collects the date of birth of the user. I will leave that as a future personal quest.

// Load the AWS SDK for Node.js
const AWS = require('aws-sdk');

// Set the region 
AWS.config.update({region: 'eu-west-1'});

// Create the DynamoDB service object
const ddb = new AWS.DynamoDB({apiVersion: '2012-08-10'});

exports.handler = async (event) => {
  console.log(JSON.stringify(event, null, 2));
  const params = {
    TableName: 'todo-project-dynamodb_table',
    Item: {
      'email' : {S: JSON.parse(event.body).email}
    }
  };

  // Call DynamoDB to add the item to the table and await the result
  
  try {
    const result = await ddb.putItem(params).promise();
    console.log("Result", result);
    const response = {
      statusCode: 204,
      headers: {
        "Access-Control-Allow-Origin" : "*",
      },
    };
    return response;
  } catch(err) {
    console.log(err);
    const response = {
      statusCode: 500,
      headers: {
        "Access-Control-Allow-Origin" : "*",
      },
      body: JSON.stringify({ error: err.message }),
    };
    return response;
  }
};

A thing to note here is the need to return the Access-Control-Allow-Origin header in the response. The response also has to follow a particular, but straightforward and common, format in order for the lambda proxy integration with API Gateway to work. This will map the response properly to the API Gateway method response and return it to the frontend websites, overcoming the CORS policy implemented by modern browsers.

Deployment

I will be using 3 ruby scripts for deployment related tasks, namely init.rb, apply.rb and destroy.rb, and a helper service object, get_aws_profile.rb for the deployment process.

Let’s take a look at them.

get_aws_profile.rb

# get_aws_profile.rb

class GetAwsProfile
  def self.call
    aws_profile = "todo-aws_profile"

    begin
      aws_access_key_id = `aws --profile #{aws_profile} configure get aws_access_key_id`.chomp
      abort('') if aws_access_key_id.empty?

      aws_secret_access_key = `aws --profile #{aws_profile} configure get aws_secret_access_key`.chomp
      abort('') if aws_secret_access_key.empty?
    rescue Errno::ENOENT => e
      abort("Make sure you have aws cli installed. Refer to https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html for more information.")
    end

    p "AWS_ACCESS_KEY_ID = #{aws_access_key_id}"
    p "AWS_SECRET_ACCESS_KEY = #{aws_secret_access_key}"

    [aws_profile, aws_access_key_id, aws_secret_access_key]
  end
end

This is a helper method that will get the aws_access_key_id and the aws_secret_access_key for use in the scripts. Note that it uses the aws cli to obtain the keys, so the cli has to be installed on your local machine prior to running the scripts. It also assumes you are using named profiles to hold your credentials.

I don’t really like this setup since it requires these prerequisites. But, well, that can be solved in the future.

init.rb

The first script to run is init.rb.

The init.rb will create the S3 bucket to be used as the terraform backend. Line 20 checks for the presence of this bucket and throws an exception if the bucket does not exist. The rescue block, if triggered, will create the non-existent bucket.

The initialization process on terraform is run via its docker image.

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'pry'
  gem 'aws-sdk-s3', '~> 1'
end

require './get_aws_profile.rb'

aws_profile, aws_access_key_id, aws_secret_access_key = GetAwsProfile.call

s3_client = Aws::S3::Client.new(
  access_key_id: aws_access_key_id,
  secret_access_key: aws_secret_access_key,
  region: 'eu-west-1'
)

begin
  s3_client.head_bucket({
    bucket: 'todo-project-tfstate',
    use_accelerate_endpoint: false
  })
rescue StandardError
  s3_client.create_bucket(
    bucket: 'todo-project-tfstate',
    create_bucket_configuration: {
      location_constraint: 'eu-west-1'
    }
  )
end

response = `docker run \
  --rm \
  --env AWS_ACCESS_KEY_ID=#{aws_access_key_id} \
  --env AWS_SECRET_ACCESS_KEY=#{aws_secret_access_key} \
  -v #{Dir.pwd}:/workspace \
  -w /workspace \
  -it \
  hashicorp/terraform:0.12.12 \
  init`

puts response

apply.rb

Once initialized, the next script to run is apply.rb.

Prior to applying the Terraform infrastructure, the backend code is packaged into a zip file. Once applied, the zip file is deleted for housekeeping.

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'pry'
  gem 'rubyzip', '>= 1.0.0'
end

require './get_aws_profile.rb'
require 'zip'

aws_profile, aws_access_key_id, aws_secret_access_key = GetAwsProfile.call

folder = Dir.pwd
input_filenames = ['index.js']
zipfile_name = File.join(Dir.pwd, 'todo-project.zip')

File.delete(zipfile_name) if File.exist?(zipfile_name)

Zip::File.open(zipfile_name, Zip::File::CREATE) do |zipfile|
  input_filenames.each do |filename|
    zipfile.add(filename, File.join(folder, filename))
  end
end

response = `docker run \
  --rm \
  --env AWS_ACCESS_KEY_ID=#{aws_access_key_id} \
  --env AWS_SECRET_ACCESS_KEY=#{aws_secret_access_key} \
  -v #{Dir.pwd}:/workspace \
  -w /workspace \
  -it \
  hashicorp/terraform:0.12.12 \
  apply -auto-approve`

puts response

File.delete(zipfile_name) if File.exist?(zipfile_name)

With this, the api is now deployed and can be called from any website. We will go through a sample front end integration in a bit.

destroy.rb

Once you are done with the project or are in the process of debugging, the destroy script will remove all the resources deployed. It will also remove the S3 backend that was created outside of Terraform.

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'pry'
  gem 'aws-sdk-s3', '~> 1'
end

require './get_aws_profile.rb'

aws_profile, aws_access_key_id, aws_secret_access_key = GetAwsProfile.call

response = `docker run \
  --rm \
  --env AWS_ACCESS_KEY_ID=#{aws_access_key_id} \
  --env AWS_SECRET_ACCESS_KEY=#{aws_secret_access_key} \
  -v #{Dir.pwd}:/workspace \
  -w /workspace \
  -it \
  hashicorp/terraform:0.12.12 \
  destroy -auto-approve`

puts response

s3_client = Aws::S3::Client.new(
  access_key_id: aws_access_key_id,
  secret_access_key: aws_secret_access_key,
  region: 'eu-west-1'
)

begin
  s3_client.head_bucket({
    bucket: 'todo-project-tfstate',
    use_accelerate_endpoint: false
  })

  s3_client.delete_object({
    bucket:  'todo-project-tfstate',
    key: 'terraform.tfstate', 
  })
  s3_client.delete_bucket(bucket: 'todo-project-tfstate')
rescue StandardError
  puts "todo-project-tfstate S3 bucket already destroyed."
end

Sample Frontend Integration

<!DOCTYPE html>
<html>
<head>
  <script
  src="https://code.jquery.com/jquery-3.4.1.min.js"
  integrity="sha256-CSXorXvZcTkaix6Yvo6HppcZGetbYMGWSFlBw8HfCJo="
  crossorigin="anonymous"></script>
</head>
<body>

  <h2>HTML Forms</h2>

  <form id="form">
    <label for="email">Email:</label><br>
    <input type="text" id="email" name="email" value="test@test.com"><br>
    <input type="submit" value="Submit">
  </form>

  <script type="text/javascript">
    $( "#form" ).submit(function(event) {
      event.preventDefault();

      $.ajax({
        type: "POST",
        url: "https://todo-endpoint.execute-api.eu-west-1.amazonaws.com/todo-stage/email",
        data: JSON.stringify({
          email: $('#email').val()
        }),
        success: function(data, textStatus, jqXHR) {
          debugger
        },
        error: function(jqXHR, textStatus, errorThrown) {
          debugger
        }
      });
    });
  </script>

</body>
</html>

Above is a simple html web page that has the email prefilled for demonstration purposes. The form will submit via jquery.ajax() using default settings so as not to trigger the need for a preflight request.

You will see that the email will be added to the DynamoDB table, and the logs of the lambda function will be recorded in AWS Cloudwatch.

Conclusion

This exercise helped me understand how lambda is integrated with API Gateway, as well as the immense potential the latter has as a robust middleware. In addition, I got to understand preflight requests and CORS better, as well as the jquery.ajax() function.

The project is saved in this repository for future reference.

How To Setup A Standard AWS VPC With Terraform

This is a documentation on how to set up a standard virtual private cloud (VPC) in AWS with basic security configurations using Terraform.

In general, I classify the basics as having the servers and databases in the private subnets, and having a bastion server for remote access. There is definitely much room to improve on this setup, and certainly much more in realms beyond my knowledge. However, as a start, this is, at the very least, essential for a production environment.

Personally, I have an AWS Certified Solutions Architect (Associate) certificate to my name, but like most of the engineering university graduates out there who have forgotten how to do dy/dx or what the hell L’Hôpital’s rule is, I have all but forgotten the exact steps to recreate such an environment.

AWS Associate Solutions Architect | vic-l

As a saving grace 😅, I should say that I do know how to set it up, just that I do not have it at my fingertips. I would not get it right the first time, but given time I will eventually set it up correctly.

This is true whenever I set up an environment for a new project. Debugging the setup can be time consuming and frustrating. It is not efficient, and that is probably one of the key reasons why infrastructure as code (IaC) has become a trending topic in recent years.

Provisioning these infrastructures using code implies:

  • version control on code and, in turn, infrastructural changes made by members of the development team
  • easily reproducible infrastructures
  • automation

One of the frontrunners in this industry is Terraform. All that is required are the configurations written in files ending with the “tf” extension placed in the same directory.

The VPC

Start by provisioning the VPC.

We set the CIDR block to provide the maximum number of private IP addresses that an AWS VPC allows. This implies that you can have up to 65,536 AWS resources in your VPC, assuming each of them requires a private IP address for communication purposes.

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # 65536 ip addresses

  tags = {
    Name = "${var.project_name}${var.env}"
  }
}

The variables project_name and env can be placed in a separate .tf file, as long as it is in the same directory when Terraform eventually runs to apply the changes.
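
As a minimal sketch, such a variables file could look like the following (the descriptions and default values here are placeholders of my own):

variable "project_name" {
  description = "Project identifier used to name and tag resources"
  default = "myproject" # hypothetical value
}

variable "env" {
  description = "Deployment environment suffix, e.g. staging or production"
  default = "staging" # hypothetical value
}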

The Gateways

Next, we setup the Internet gateway (IGW) and NAT gateway (NGW).

The IGW allows for resources in the public subnets to communicate with the outside Internet.

The NGW does the same thing, but for the resources in the private subnets. Sometimes, these resources need to download packages from the Internet for updates, etc. This is in direct conflict with the security requirements that placed them in the private subnets in the first place. The NGW balances these 2 requirements.

# IGW
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.project_name}${var.env}"
  }
}

resource "aws_route_table" "igw" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "igw-${var.project_name}${var.env}"
  }
}

resource "aws_route" "igw" {
  route_table_id = aws_route_table.igw.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id = aws_internet_gateway.main.id
}

# NGW
resource "aws_route_table" "ngw" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "ngw-${var.project_name}${var.env}"
  }
}

resource "aws_route" "ngw" {
  route_table_id = aws_route_table.ngw.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id = aws_nat_gateway.main.id
}

# The NGW requires an Elastic IP and must sit in a public subnet
resource "aws_eip" "nat" {
  vpc = true
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id = aws_subnet.public-ap-southeast-1a.id

  tags = {
    Name = "${var.project_name}${var.env}"
  }
}

Both gateways need to be associated to their respective aws_route_table via an aws_route that will route out to everywhere on the Internet, as indicated by the 0.0.0.0/0 CIDR block.

The NGW requires some additional setup.

First, a NAT gateway requires an elastic IP address due to the way it is engineered. I would not pretend I know how it works to tell you why a static IP address is required, but I do know we can easily provision one using Terraform.

This static IP address will also come in useful if your private instances need to make API calls to third party sources that require the instances’ IP address for whitelisting purposes. The outgoing requests from the private instances will bear the IP address of the NGW.

In addition, a NAT gateway needs to be placed in one of the public subnets in order to communicate with the Internet. As you can see, we have made an implicit dependency on the aws_subnet that we will define later. Terraform will ensure the NAT gateway is created after the subnets are set up.
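
On the note of whitelisting, it can be handy to surface the NGW’s elastic IP as a Terraform output so you do not have to dig for it in the console. A small sketch (the output name is my own):

output "nat_gateway_public_ip" {
  # the address third parties will see for requests coming out of the private subnets
  value = aws_eip.nat.public_ip
}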

The Subnets

Now, let’s setup the subnets.

We will set up 1 public and 1 private subnet in each availability zone that the region provides. I will be using the ap-southeast-1 (Singapore) region. That will be a total of 6 subnets to provision, as there are 3 availability zones in this region.

#### public 1a
resource "aws_subnet" "public-ap-southeast-1a" {
  vpc_id = aws_vpc.main.id
  cidr_block = "10.0.100.0/24"
  availability_zone_id = "apse1-az2"

  tags = {
    Name = "public-ap-southeast-1a-${var.project_name}${var.env}"
  }
}

resource "aws_route_table_association" "public-ap-southeast-1a" {
  subnet_id = aws_subnet.public-ap-southeast-1a.id
  route_table_id = aws_route_table.igw.id
}

#### public 1b
resource "aws_subnet" "public-ap-southeast-1b" {
  vpc_id = aws_vpc.main.id
  cidr_block = "10.0.101.0/24"
  availability_zone_id = "apse1-az1"

  tags = {
    Name = "public-ap-southeast-1b-${var.project_name}${var.env}"
  }
}

resource "aws_route_table_association" "public-ap-southeast-1b" {
  subnet_id = aws_subnet.public-ap-southeast-1b.id
  route_table_id = aws_route_table.igw.id
}

#### public 1c
resource "aws_subnet" "public-ap-southeast-1c" {
  vpc_id = aws_vpc.main.id
  cidr_block = "10.0.102.0/24"
  availability_zone_id = "apse1-az3"

  tags = {
    Name = "public-ap-southeast-1c-${var.project_name}${var.env}"
  }
}

resource "aws_route_table_association" "public-ap-southeast-1c" {
  subnet_id = aws_subnet.public-ap-southeast-1c.id
  route_table_id = aws_route_table.igw.id
}

#### private 1a
resource "aws_subnet" "private-ap-southeast-1a" {
  vpc_id = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
  availability_zone_id = "apse1-az2"

  tags = {
    Name = "private-ap-southeast-1a-${var.project_name}${var.env}"
  }
}

resource "aws_route_table_association" "private-ap-southeast-1a" {
  subnet_id = aws_subnet.private-ap-southeast-1a.id
  route_table_id = aws_route_table.ngw.id
}

#### private 1b
resource "aws_subnet" "private-ap-southeast-1b" {
  vpc_id = aws_vpc.main.id
  cidr_block = "10.0.2.0/24"
  availability_zone_id = "apse1-az1"

  tags = {
    Name = "private-ap-southeast-1b-${var.project_name}${var.env}"
  }
}

resource "aws_route_table_association" "private-ap-southeast-1b" {
  subnet_id = aws_subnet.private-ap-southeast-1b.id
  route_table_id = aws_route_table.ngw.id
}

#### private 1c
resource "aws_subnet" "private-ap-southeast-1c" {
  vpc_id = aws_vpc.main.id
  cidr_block = "10.0.3.0/24"
  availability_zone_id = "apse1-az3"

  tags = {
    Name = "private-ap-southeast-1c-${var.project_name}${var.env}"
  }
}

resource "aws_route_table_association" "private-ap-southeast-1c" {
  subnet_id = aws_subnet.private-ap-southeast-1c.id
  route_table_id = aws_route_table.ngw.id
}

Amidst this long snippet of configuration for the subnets, it is essentially a repeat of the same resource associations.

The public subnets are assigned the CIDR blocks 10.0.100.0/24, 10.0.101.0/24 and 10.0.102.0/24 respectively. Each will have up to 256 IP addresses to house 256 AWS resources that require an IP address. Taking the first subnet as an example, its addresses will range from 10.0.100.0 to 10.0.100.255.

The private subnets occupy the CIDR blocks 10.0.1.0/24, 10.0.2.0/24 and 10.0.3.0/24 respectively.

To be exact, there will be fewer than 256 usable addresses per subnet, as AWS reserves a few IP addresses in every subnet. Of course, you can provision more or fewer IP addresses per subnet with the appropriate subnet mask.
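
If you want to play with the masking, Terraform’s built-in cidrsubnet() function does the arithmetic for you. A small illustrative sketch (the local name is my own):

locals {
  # carve the 100th /24 (8 extra mask bits) out of the VPC's /16
  # cidrsubnet("10.0.0.0/16", 8, 100) evaluates to "10.0.100.0/24"
  example_subnet_cidr = cidrsubnet(aws_vpc.main.cidr_block, 8, 100)
}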

Each subnet is associated to different availability zones via the availability_zone_id to spread out the resources across the region.

Each public subnet is also associated to the aws_route_table that is related to the IGW, while each private subnet is associated to the aws_route_table related to the NGW.

The Database

Next, we set up the database. We will provision the database using RDS and place it in the private subnets for security purposes.

At this point in time, I must admit that I do not know if this is the best way to set up the database. I personally have a lot of questions on how the infrastructure will change when the application eventually scales, especially for the database. How will the database be sharded into different regions to serve a global audience? How will the database sync across the different regions? These are side quests that I will have to pursue in the future.

For now, a single instance in a private subnet.

resource "aws_db_instance" "main" {
  allocated_storage = 20
  storage_type = "gp2"
  engine = "mysql"
  engine_version = "5.7"
  instance_class = "db.t2.micro"
  identifier = "rds-${var.project_name}${var.env}"
  name = "something"
  username = "something"
  password = "something"

  skip_final_snapshot = false
  # notes time of creation of rds.tf file
  final_snapshot_identifier = "rds-${var.project_name}${var.env}-1573454102"

  vpc_security_group_ids = [aws_security_group.rds.id]
  db_subnet_group_name = aws_db_subnet_group.main.id

  lifecycle {
    prevent_destroy = true
  }

  tags = {
    Name = "rds-${var.project_name}${var.env}"
  }
}

resource "aws_db_subnet_group" "main" {
  name = "db-private-subnets"
  subnet_ids = [
    aws_subnet.private-ap-southeast-1a.id,
    aws_subnet.private-ap-southeast-1b.id,
    aws_subnet.private-ap-southeast-1c.id
  ]

  tags = {
    Name = "subnet-group-${var.project_name}${var.env}"
  }
}

As you can see, we can review the full configuration of the database in code, as compared to having to navigate around the AWS management console to complete the puzzle. We can easily tell the size of the database instance we have provisioned as well as its credentials (ok, it is debatable whether we want to commit sensitive data in our code).

In this configuration, I ensured that the database will produce a final snapshot in the event it gets destroyed.

Access to the database will be guarded by an aws_security_group that will be defined later.

The database is also associated with the aws_db_subnet_group resource. This resource consists of all the private subnets that we provisioned. This creates an implicit dependency on these subnets, ensuring that the database will only be created after the subnets are created. It also tells AWS to place the database in the custom VPC that the subnets exist in.

I also used the lifecycle configuration to ensure the database will not be destroyed by Terraform accidentally.
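
On the earlier point about sensitive data, one way to keep the credentials out of the code is to declare them as variables without defaults and supply them at apply time, for example through TF_VAR_ environment variables. A minimal sketch, assuming hypothetical variable names db_username and db_password:

variable "db_username" {}
variable "db_password" {}

# then, inside the aws_db_instance resource:
#   username = var.db_username
#   password = var.db_password
#
# and at apply time, Terraform picks the values up from the environment:
#   TF_VAR_db_username=... TF_VAR_db_password=... terraform apply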

The Bastion

The bastion server allows us to access the servers and the database instance in the private subnets. We will provision the bastion inside the public subnet.

resource "aws_instance" "bastion" {
  ami = "ami-061eb2b23f9f8839c"
  associate_public_ip_address = true
  instance_type = "t2.nano"
  subnet_id = aws_subnet.public-ap-southeast-1a.id
  vpc_security_group_ids = ["${aws_security_group.bastion.id}"]
  key_name = aws_key_pair.main.key_name

  tags = {
    Name = "bastion-${var.project_name}${var.env}"
  }
}

resource "aws_key_pair" "main" {
  key_name = "${var.project_name}-${var.env}"
  public_key = "ssh-rsa something"
}


output "bastion_public_ip" {
  value = aws_instance.bastion.public_ip
}

I am using an Ubuntu 18.04 LTS image to set up the bastion instance. Note that the AMI id will differ from region to region, even for the same operating system. The images below show the difference in the AMI id between the Singapore and Tokyo regions.

ubuntu ami in ap-southeast-1| vic-l
ubuntu ami in tokyo region | vic-l

I will mainly use the bastion to tunnel commands to the private subnets. Hence, there is no need for much computational power, and the cheapest and smallest instance size of t2.nano is chosen.

It is associated to a public subnet that we created. Any subnet will work, but make sure it is public as we need to be able to connect to it.

Its security group will be defined later.

All EC2 instances in AWS can be given an aws_key_pair. You can generate a custom key pair using the ssh-keygen command, or use the default SSH key on your local machine so that you can ssh into the bastion easily without having to specify the identity file each time you do so.
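
For instance, assuming your default public key sits at ~/.ssh/id_rsa.pub and Terraform runs somewhere that can read it (note that inside a Docker container, ~ resolves to the container’s home), the aws_key_pair above could read the key directly instead of having the key material pasted in:

resource "aws_key_pair" "main" {
  key_name = "${var.project_name}-${var.env}"
  # pathexpand() resolves ~ to the home directory of the user running Terraform
  public_key = file(pathexpand("~/.ssh/id_rsa.pub"))
}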

Then, there is the output block. After Terraform has completed its magic, it will output values defined in these output blocks. In this case, the public ip address of the bastion server will be shown on the terminal, making it easy for us to obtain the endpoint.

The Security Groups

Lastly, the connection is not complete without setting up the security groups that guard the traffic going in and out of the resources. This was the bane of my AWS Solutions Architect journey. With the required configurations spelled out in code instead of console steps that exist only in memory, Terraform has helped me greatly to further understand this feature.

There are a total of 3 aws_security_group resources to be created, representing the bastion, the web servers and the database respectively. Each of them has its own set of inbound and/or outbound rules, named “ingress” and “egress” in Terraform terms, that are configured separately.

While you can configure the inbound and outbound rules together within the resource block of the respective aws_security_group, I would recommend against that. Doing so results in tight coupling between the security groups, especially if one of the inline rules points to another aws_security_group as its source. This becomes problematic when we eventually make changes to the security groups because, for example, one of them cannot be destroyed while a security group that depends on it is not supposed to be destroyed.

And the frustrating thing is that Terraform, or maybe the underlying AWS API, does not surface the error. In fact, it takes forever to destroy security groups that are created this way, only to fail after making us wait for a long time, which makes debugging needlessly tedious.

There are many issues on GitHub mentioning this and related problems, like this. It has to do with what has been termed “enforced dependencies”, which Terraform currently has no mechanism to handle.

By decoupling the aws_security_group and their respective aws_security_group_rule into separate resources, we will give Terraform and ourselves an easier time removing and making changes to the security groups in the future.

Bastion

Let’s see how we can configure Terraform to set up the security of the subnets. We start off with the security group for the bastion server. We will make 3 rules for it.

# bastion
resource "aws_security_group" "bastion" {
  name = "${var.project_name}${var.env}-bastion"
  description = "For bastion server ${var.env}"
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.project_name}${var.env}"
  }
}

resource "aws_security_group_rule" "ssh-bastion-world" {
  type = "ingress"
  from_port = 22
  to_port = 22
  protocol = "tcp"
  # Please restrict your ingress to only necessary IPs and ports.
  # Opening to 0.0.0.0/0 can lead to security vulnerabilities
  # You may want to set a fixed ip address if you have a static ip
  security_group_id = aws_security_group.bastion.id
  cidr_blocks = ["0.0.0.0/0"]
}

resource "aws_security_group_rule" "ssh-bastion-web_server" {
  type = "egress"
  from_port = 22
  to_port = 22
  protocol = "tcp"
  security_group_id = aws_security_group.bastion.id
  source_security_group_id = aws_security_group.web_server.id
}

resource "aws_security_group_rule" "mysql-bastion-rds" {
  type = "egress"
  from_port = 3306
  to_port = 3306
  protocol = "tcp"
  security_group_id = aws_security_group.bastion.id
  source_security_group_id = aws_security_group.rds.id
}

The first is an ingress rule to allow us to ssh into the bastion from wherever we are. Of course, this is not ideal, as it means anyone from anywhere can attempt to ssh into it. We should scope it to the IP address we work from, be it home or the office. However, in my case as a digital nomad, the IP address that I work with changes so often as I move around that it just makes more sense to open it up to the world. I took a calculated risk here. Please don’t try this at home.

The second is an egress rule that allows the bastion instance to ssh into the web servers in the private subnets. The source of this rule is set as the aws_security_group of the web servers.

The third rule is another outbound rule to allow the bastion to communicate with the database. Since I am using mysql as the database engine, the port used is 3306. This allows us to run database operations on the isolated database instance in the private subnet via the bastion over the correct port securely.

Web Servers

Next will be the security group for your web servers. The only rule it requires is the ingress rule that allows the bastion to ssh into them over port 22.

resource "aws_security_group" "web_server" {
  name = "${var.project_name}${var.env}-web-servers"
  description = "For Web servers ${var.env}"
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.project_name}${var.env}"
  }
}

resource "aws_security_group_rule" "ssh-web_server-bastion" {
  type = "ingress"
  from_port = 22
  to_port = 22
  protocol = "tcp"
  security_group_id = aws_security_group.web_server.id
  source_security_group_id = aws_security_group.bastion.id
}

RDS

Lastly, the RDS instance. Its security group consists of 2 rules.

resource "aws_security_group" "rds" {
    name = "rds-${var.project_name}${var.env}"
    description = "For RDS ${var.env}"

vpc_id = aws_vpc.main.id
  tags = {
    Name = "${var.project_name}${var.env}"
  }
}

resource "aws_security_group_rule" "mysql-rds-web_server" {
  type = "ingress"
  from_port = 3306
  to_port = 3306
  protocol = "tcp"
  security_group_id = aws_security_group.rds.id
  source_security_group_id = aws_security_group.web_server.id
}

resource "aws_security_group_rule" "mysql-rds-bastion" {
  type = "ingress"
  from_port = 3306
  to_port = 3306
  protocol = "tcp"
  security_group_id = aws_security_group.rds.id
  source_security_group_id = aws_security_group.bastion.id
}

The first is, of course, to open up port 3306 to allow requests from the web servers to reach the database to run the application.

The second is to allow the bastion to communicate over port 3306. Earlier, we defined the egress rule on the bastion server itself so that it can connect out to the RDS instance. Now, this ingress rule allows that incoming request from the bastion server to reach the RDS instance instead of being blocked off.

Terraform Apply

These resources can be defined in a single terraform file or across multiple files with the .tf extension, as long as they are in the same directory.

If you are using Docker to run Terraform, you can volume mount the current directory into the workspace of the Docker container and apply the infrastructure!

Improvements

We can harden the security of this setup further by, for example, configuring the Network Access Control Lists (NACLs, or Network ACLs). In this setup, the default allows all inbound and outbound traffic for all the resources. However, a full treatment is beyond the scope of this article.
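
Still, as a rough, hedged sketch of the direction such hardening could take for the private subnets (the rule numbers and ports are my own assumptions and would need tuning for real traffic; NACLs are stateless, so return traffic on ephemeral ports has to be allowed explicitly):

resource "aws_network_acl" "private" {
  vpc_id = aws_vpc.main.id
  subnet_ids = [
    aws_subnet.private-ap-southeast-1a.id,
    aws_subnet.private-ap-southeast-1b.id,
    aws_subnet.private-ap-southeast-1c.id
  ]

  # allow SSH from within the VPC (the bastion)
  ingress {
    rule_no = 100
    protocol = "tcp"
    action = "allow"
    cidr_block = aws_vpc.main.cidr_block
    from_port = 22
    to_port = 22
  }

  # allow MySQL from within the VPC (the web servers and the bastion)
  ingress {
    rule_no = 110
    protocol = "tcp"
    action = "allow"
    cidr_block = aws_vpc.main.cidr_block
    from_port = 3306
    to_port = 3306
  }

  # allow return traffic for outbound requests made through the NGW
  ingress {
    rule_no = 120
    protocol = "tcp"
    action = "allow"
    cidr_block = "0.0.0.0/0"
    from_port = 1024
    to_port = 65535
  }

  # allow all outbound traffic
  egress {
    rule_no = 100
    protocol = "-1"
    action = "allow"
    cidr_block = "0.0.0.0/0"
    from_port = 0
    to_port = 0
  }

  tags = {
    Name = "private-${var.project_name}${var.env}"
  }
}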

What’s Next

Note that I did not provision any EC2 instances for my application to run on. At this point, feel free to provision the EC2 instances for the web servers just like the bastion server, but associate them with the private subnets.
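
If you go down that route, a minimal sketch could look like the following, reusing the AMI, key pair and security group from above (the instance type is an arbitrary choice):

resource "aws_instance" "web_server" {
  ami = "ami-061eb2b23f9f8839c" # same Ubuntu 18.04 LTS AMI in ap-southeast-1 as the bastion
  instance_type = "t2.micro"
  subnet_id = aws_subnet.private-ap-southeast-1a.id # private subnet, so no public IP
  vpc_security_group_ids = [aws_security_group.web_server.id]
  key_name = aws_key_pair.main.key_name

  tags = {
    Name = "web-server-${var.project_name}${var.env}"
  }
}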

For me, I favor AWS Elastic Beanstalk for handling the deployment. What I have done so far is only the provisioning of the infrastructure. Hence, in my case, instead of defining the EC2 instances, I will define an Elastic Beanstalk environment to host my Rails application and configure it to use this VPC to leverage all the security we have set up.
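
As a very rough, hedged sketch of that direction (the resource names are my own, the solution stack name is a placeholder, and the settings are far from exhaustive):

resource "aws_elastic_beanstalk_application" "main" {
  name = "${var.project_name}${var.env}"
}

resource "aws_elastic_beanstalk_environment" "main" {
  name = "${var.project_name}${var.env}"
  application = aws_elastic_beanstalk_application.main.name
  solution_stack_name = "..." # placeholder; look up the exact Ruby/Puma stack name for your platform version

  # place the instances in the private subnets of our VPC
  setting {
    namespace = "aws:ec2:vpc"
    name = "VPCId"
    value = aws_vpc.main.id
  }

  setting {
    namespace = "aws:ec2:vpc"
    name = "Subnets"
    value = join(",", [
      aws_subnet.private-ap-southeast-1a.id,
      aws_subnet.private-ap-southeast-1b.id,
      aws_subnet.private-ap-southeast-1c.id
    ])
  }

  # the load balancer sits in the public subnets
  setting {
    namespace = "aws:ec2:vpc"
    name = "ELBSubnets"
    value = join(",", [
      aws_subnet.public-ap-southeast-1a.id,
      aws_subnet.public-ap-southeast-1b.id,
      aws_subnet.public-ap-southeast-1c.id
    ])
  }

  # attach the web server security group so the bastion and RDS rules apply
  setting {
    namespace = "aws:autoscaling:launchconfiguration"
    name = "SecurityGroups"
    value = aws_security_group.web_server.id
  }
}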

How To Change Or Add New SSH Key for EC2

This is a documentation of how to change or add a new ssh key for your EC2 instance if you have lost, and maybe compromised, your private key.

The gist of it is to add a new key pair to the disk volume of the EC2 instance. Pretty straightforward! But how can you do that when you cannot ssh into the EC2 instance without the private key you just lost? You will need to attach the root volume of the EC2 instance to another temporary EC2 instance, which you can access with a new key pair, and add the new key pair to the original volume from there.

Summon the NewKeyPair!

First, create a new key pair. You can either generate a private and public key pair on your own and import the public one into the AWS console, or create it from the AWS management console and download the private key that they generated for you thereafter. Should you go for the latter, make sure your browser is not blocking the download.

Blocked download | vic-l

For the rest of the article, the new key pair will be referred to as NewKeyPair, and the old key pair LostKeyPair.

Retire The Veteran

Stop your old instance. Do not terminate!

NOTE: Your instance’s root volume needs to be EBS-backed and not an instance store, as instance store volumes are ephemeral. They do not persist data after power down.

Once it has successfully stopped, you will realise that its volume remains attached. That’s EBS for you!

We will come to detaching it in a while. For now, spin up a new server.

Katon: Summon-The-New-Server-Jutsu

Launch a new server with the NewKeyPair. This is a temporary server and can run any Linux distribution.

Detach The Old Volume

In the volumes page, select the old instance volume and select Detach as shown. There should be no error unless your old instance is still in the process of shutting down.

Detach old EBS Volume | vic-l

Once it is detached, you will observe that its status has changed to available and its Attachment Information will become blank. Now it is freeee! Time to attach it to the new server and receive its new key pair.

Attach To New Instance

Attach the root volume to the new instance as shown.

Then select the device to mount on.

attach volume device | vic-l

I will set /dev/sdf as suggested. The other devices are reserved for the root volume (/dev/sda) and instance store volumes (/dev/sd[b-e]). More information on device naming in AWS EC2 can be found here.

Run the command lsblk to see the new volume attached. Note that the Linux kernel has changed my device name from sdf to xvdf, as noted in the warning callout in the image above.

lsblk | vic-l

Mounting The Volume

You will not be able to use the volume right away after attaching it without mounting it in the system. Mounting tells the EC2 instance how to access this new device via its directory tree. This requires setting up a mount point. Run the commands below.

sudo mkdir /mnt/tempvol
sudo mount /dev/xvdf1 /mnt/tempvol

These commands will mount the root of the device to the directory named /mnt/tempvol. You can change directory into the volume and see that it contains content from your old server.

From the image above, you can see that the authorized_keys file containing the old public key sits in the /home/ubuntu/.ssh directory relative to the mount point. The new public key exists in the /home/ubuntu/.ssh directory at the absolute path, which lives on the root volume of the new instance.

Adding The NewKeyPair To The Old Volume

Eventually, we want to use the new key pair to access the old server, with the content of the old volume, just like the good old times. To do that, add the NewKeyPair’s public key to the .ssh folder of the old volume.

Realise that this is only possible because we attached and mounted the old volume to a new instance, which we can access thanks to its fresh setup and new key pair.

I have used the append operator, >>, instead of the overwrite operator, which is a single >. This is not strictly necessary. It is up to you to decide if you want to get rid of the old key pair or not, depending on your situation.

If you lost your old key pair, feel free to overwrite it. There is no point hoarding it, and Marie Kondo can’t help you declutter software.

Attaching The Volume Back

Next, you can shut down your new server and attach your old volume back to the old instance. Remember, the EC2 instance will not be deleted and is still available if you chose stop instead of terminate when shutting it down initially.

attach volume back | vic-l
Attach the volume, this time, to /dev/sda1.
attach to sda1 | vic-l

The reason to attach it at /dev/sda1 is that we need to give the instance back its root volume for its boot operations. If we were to attach it to another device, we would see this error when starting the server, because no root volume is detected.

error starting old instance | vic-l

Back To The Past

Now you can try to connect to your old instance after it has started up.

NOTE: you will see that the connection instructions still mention the LostKeyPair. Even if you have overwritten your ssh key pair with the new one, these outdated instructions will persist. Of course, you should connect with your NewKeyPair.

Connect to old instance again | vic-l

To ascertain this, head down to the ~/.ssh directory and see that the changes you made on the new instance, via attaching and mounting the old instance’s volume, have persisted.

Now, we have successfully added a new key pair to the old instance, and we can use it to ssh into the old instance from now on, even though we lost the original key pair that was used to create it.