Easily deploying test environments from CI builds with Travis, AWS and Slack

Our project uses a git flow process: a new branch is created for each new feature. The feature is developed in its entirety in the branch. This includes tests, database migrations and the code itself. When the feature is ready, a GitHub pull request is made and another developer tests and reviews the code.

We wanted to make testing another branch easy for developers. A developer will most likely have their own branch in progress on their local machine, and we wanted a system where nobody needs to hop between their own branch and the one under review. Initially we set up six development test machines where the original developer would deploy a new version for testing, but this was cumbersome: it required manual deployment work and a reservation system for claiming a test machine for your branch. The machines were also idling most of the time.

Here’s a TL;DR high level view:

(Sequence diagram: a developer pushes code for a branch to GitHub, which triggers a new Travis CI build; Travis stores the build artifacts in Amazon S3. A tester issues "/deploy foo" in the Slack app, which POSTs a webhook to Amazon API Gateway; the gateway invokes the deploy hook Lambda, which in turn calls the deploy Lambda. The deploy Lambda runs a new Amazon EC2 instance, the new instance fetches the build artifacts, and once it reports itself as running the app URL is sent back to the tester.)

Along comes AWS

We are using the excellent Travis CI for building our project and running the automated test suites. We had the idea that we could automatically push the build artifacts to S3 so that the latest build of a branch (along with a database dump) is always available for deployment. Luckily Travis has direct support for S3, so the configuration is a simple snippet in the .travis.yml file:

deploy:
  provider: s3
  region: eu-central-1
  skip_cleanup: true
  bucket: "harjatravis"
  acl: public_read
  local_dir: s3-deploy
  on:
    all_branches: true

This instructs Travis to push the builds of all branches to our bucket. We set the bucket configuration to automatically expire the artifacts after two weeks so that we don't accumulate old builds for too long. Our build yields two artifacts: the production build .jar file and a PostgreSQL dump of the test data, migrated to the latest version. Both are uploaded with the branch name as part of the file name. Our branch names are usually the same as the JIRA issue number, so they are predictable.
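
Because the naming is predictable, the artifact URLs can be derived directly from the branch name. A tiny sketch of the convention (the helper is hypothetical and the bucket URL is elided here just as it is in the scripts below):

# Hypothetical helper illustrating the artifact naming convention.
BUCKET_URL = 'https://...s3 bucket url...'

def artifact_urls(branch):
    # One production .jar and one PostgreSQL dump per branch.
    return {'jar': BUCKET_URL + '/harja-travis-' + branch + '.jar',
            'db': BUCKET_URL + '/harja-travis-' + branch + '.pgdump.gz'}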

Setting up the build environment

Next we had to set up the actual machine for running the builds. We launched a new EC2 instance from a CentOS Linux image and provisioned it with Ansible, installing all the pieces needed to run the builds: NGINX, Java, PostgreSQL and some shell scripts to start our service. Normally a production environment wouldn't have the front-end proxy, the application service and the database on the same machine, but for testing purposes it will suffice.

After the machine had everything we needed, we stopped it and created a new AMI from it to use as a template for a deployed build.
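
For reference, this AMI creation step can also be scripted with boto3; a minimal sketch, where the instance ID and image name are placeholders:

import boto3

ec2 = boto3.client('ec2', region_name='eu-central-1')
# Create an AMI from the stopped, provisioned instance to use as the template.
image = ec2.create_image(InstanceId='...provisioned instance id...',
                         Name='harja-test-template')
print image['ImageId']  # the '...ami-id...' used by the deployment Lambda below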

A serverless solution for starting up servers

Next we needed some way to easily start new EC2 machines for builds. AWS Lambda provides a nice way to run predefined code and hook it up to services.

Lambda is described as a “serverless” solution for building applications and has a pay-per-execution billing model. In our case, we are using Lambda to start new servers, so I guess the serverless term doesn’t really apply here.

(Image: yo_dawg_serverless meme.)

We used Python, as it can be edited right in the AWS Lambda console and has a good library available for using AWS services programmatically. The library is called boto3 and it can be used with zero configuration from Lambda Python code.

# coding=utf-8
import os
import boto3
import urllib2
import time
import ssl

initscript = '''...cloud config script here...'''

def check_branch(branch):
    # Check that Travis has uploaded a build artifact for this branch by
    # making an HTTP HEAD request against its public S3 URL.
    req = urllib2.Request('...our deployment s3 bucket public url.../harja-travis-' + branch + '.jar')
    req.get_method = lambda: 'HEAD'
    try:
        res = urllib2.urlopen(req)
        if res.getcode() == 200:
            return True
    except:
        pass
    return False

def deploy(branch):
    # Substitute the branch name into the cloud-init script and start a new
    # EC2 instance from our prebuilt AMI with the script as user data.
    script = initscript.replace('$BRANCH', branch)
    ec2 = boto3.client('ec2', region_name='eu-central-1')
    res = ec2.run_instances(ImageId='...ami-id...',
                            InstanceType='t2.medium',
                            UserData=script,
                            KeyName='...keypair...',
                            MinCount=1,
                            MaxCount=1)
    instance_id = res['Instances'][0]['InstanceId']
    # Poll until the instance has been assigned a public DNS name.
    host = None
    while host is None:
        time.sleep(2)
        instances = ec2.describe_instances(InstanceIds=[instance_id])
        for r in instances['Reservations']:
            for i in r['Instances']:
                dns = i['PublicDnsName']
                if dns != '':
                    host = dns
    return host

def wait_for_url(url):
    # Poll the URL until the application answers with HTTP 200. Certificate
    # checks are disabled because the test machines have no real certificates.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    status = None
    while status != 200:
        time.sleep(2)
        try:
            status = urllib2.urlopen(url, context=ctx).getcode()
        except:
            pass

def lambda_handler(event, context):
    # event carries two fields: 'branch' (the git branch to deploy) and
    # 'response_url' (the Slack webhook to notify when the environment is up).
    branch = event['branch']
    txt = None
    if check_branch(branch):
        host = deploy(branch)
        url = 'https://' + host + '/'
        wait_for_url(url)
        txt = 'Started ' + branch + ': ' + url
    else:
        txt = 'No Travis build for branch: ' + branch + '. Check build.'
    try:
        # Report the result back to Slack via the response_url webhook.
        payload = '{"text": "'+txt+'"}'
        urllib2.urlopen(event['response_url'], payload)
    except Exception, e:
        print 'error sending Slack response: ' + str(e)

The above script does quite a lot. It takes an event that has two fields: branch and response_url. The branch is the name of the git branch to deploy and response_url is the Slack webhook URL to send a message to when the build is ready.

First the branch is checked by making an HTTP HEAD request for the build artifact. If it doesn't exist, we bail out and notify the user that the build cannot be found.

If the build exists, we start a new EC2 instance from our previously created AMI and pass it a cloud-init script as user data (more on cloud-init later). After the machine has been started we repeatedly call describe_instances until the machine has a public DNS name, which we return.
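
As a side note, boto3 also ships with waiters that could replace this kind of hand-rolled polling loop. A minimal sketch, where instance_id stands for the ID returned by run_instances:

import boto3

ec2 = boto3.client('ec2', region_name='eu-central-1')
# Wait until EC2 reports the instance as running, then look up its DNS name.
ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])
reservations = ec2.describe_instances(InstanceIds=[instance_id])['Reservations']
host = reservations[0]['Instances'][0]['PublicDnsName']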

Finally, we wait for the machine to actually be up and running the software before notifying the user with a URL to the new machine. Running status is determined simply by making an HTTPS GET request to it (with certificate verification disabled, because we don't use real certificates for the ephemeral test machines).

Cloud init

Cloud-init is a simple way to customize servers that have been cloned from a template. With EC2 CentOS images, we can provide a cloud-init script as the instance user data.

The cloud-init configuration is a YAML file which can perform lots of different customization tasks on the starting instance. In our case, we only need to run some shell commands:

#cloud-config
runcmd:
  - echo "Start PostgreSQL 9.5";
  - sudo service postgresql-9.5 start;
  - sudo -u postgres psql -c 'create database harja;';
  - sudo wget https://...s3 bucket url.../harja-travis-$BRANCH.pgdump.gz;
  - sudo wget https://...s3 bucket url.../harja-travis-$BRANCH.jar;
  - sudo zcat harja-travis-$BRANCH.pgdump.gz | sudo -u postgres psql harja;
  - sudo service nginx start
  - /home/centos/harja.sh $BRANCH

We start up PostgreSQL, create the database, restore it from the downloaded dump, and then start nginx and our application service from the downloaded .jar file.

The above script is embedded in the Python Lambda code as the initscript string, and $BRANCH is replaced with the actual branch name parameter.

Adding the chat interface

We use Slack as our company-wide chat and all our tools report to our project channel. We already have GitHub and CI notifications and error logging going to Slack, so it makes sense to also trigger our test builds from there.

Slack makes it easy to create your own applications and uses webhooks for integrating with external services. So we made our own custom Slack app and added a slash command, "/harjadeploy", which calls an AWS Lambda function.

Unfortunately, this stage was somewhat trickier, as Slack webhook POST calls deliver their values form-encoded (application/x-www-form-urlencoded) while AWS API Gateway only accepts JSON input. Slack slash commands also have only three seconds to respond before they time out, and deployments take much longer than that, so we can't use the previous deployment Lambda directly.

The first problem was solved by a quick search: other people have had the same problem, and we found a gist on GitHub for converting form-encoded data to JSON.
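
If you would rather skip the API Gateway mapping template, the conversion could also be done in the Lambda itself. A minimal sketch, assuming the integration passes the raw request body through as event['body'] (as the Lambda proxy integration does); the helper name is hypothetical:

# Hypothetical alternative: parse the Slack slash command body in Python.
from urlparse import parse_qs

def parse_slack_event(raw_body):
    # Slack sends fields such as token, user_name, text and response_url
    # as form-encoded key=value pairs; parse_qs returns lists of values,
    # so we keep only the first value of each field.
    fields = parse_qs(raw_body)
    return {k: v[0] for k, v in fields.items()}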

The second problem required adding another Lambda function which responds to the Slack slash command. That Lambda then asynchronously calls our deployment Lambda passing it the branch and the response URL. This provided a good way to separate the Slack webhook processing from the actual deployment.

# coding=utf-8
import os
import boto3

def deploy(branch, response_url):
    # Asynchronously invoke the deployment Lambda, passing along the branch
    # name and the Slack response URL.
    lam = boto3.client('lambda', region_name='eu-central-1')
    lam.invoke(FunctionName='deployasync',
               InvocationType='Event',
               Payload=b'{"branch": "'+branch+'", "response_url": "'+response_url+'"}')

def lambda_handler(event, context):
    token = os.environ['slacktoken']
    user = event['user_name']
    allowed = os.environ['allowed_users'].split(',')
    if user not in allowed:
        return {'text': 'You are not on the allowed user list.'}
    else:
        try:
            # Verify that the call really comes from Slack before deploying.
            if event['token'] == token:
                deploy(event['text'], event['response_url'])
                return {'text': 'Starting branch: ' + event['text'] + '. Please wait.'}
        except Exception, e:
            return "error: " + str(e)

This Lambda function is much simpler. It checks that the Slack user name is on a list of allowed users, and that the webhook call really comes from Slack by comparing the token in the event to one stored in an environment variable.

If the invocation is valid, fire up an asynchronous invocation of the previously defined deployment Lambda and return a message to the user.

Finishing touches and the final result

Now that we have a way to easily start new application instances from the comfort of our chat room, we have one final thing to consider: termination. We are starting up new servers easily but we have no way of terminating them. We certainly don’t want to do that by hand and we can’t leave them running forever and costing us money.

We decided on a very simple solution that fits our working patterns quite well: a cron job. We set up a third Lambda function that runs every evening, a few hours after business hours, and its only job is to terminate all running EC2 instances. We don't have anything besides our test environments in this AWS account, so we can safely and indiscriminately terminate every running instance.

import boto3

def lambda_handler(event, context):
    ec2 = boto3.client('ec2', region_name='eu-central-1')
    # Collect the IDs of every instance in the account...
    instances = ec2.describe_instances()
    ids = []
    for r in instances['Reservations']:
        for i in r['Instances']:
            ids.append(i['InstanceId'])
    # ...and terminate them all (this account only hosts our test environments).
    if ids:
        ec2.terminate_instances(InstanceIds=ids)
    return 'Terminated instances: ' + str(ids)

Conveniently, AWS CloudWatch supports cron-like expressions for scheduled events, so we scheduled the function to run every night at 19:00.
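
A rough sketch of how such a schedule could be wired up with boto3, in case you prefer scripting it over the console; the rule name, target ID and the Lambda name and ARN below are placeholders, and CloudWatch cron expressions are evaluated in UTC:

import boto3

events = boto3.client('events', region_name='eu-central-1')
lam = boto3.client('lambda', region_name='eu-central-1')

# A rule that fires every day at 19:00 UTC.
rule = events.put_rule(Name='terminate-test-instances',
                       ScheduleExpression='cron(0 19 * * ? *)',
                       State='ENABLED')

# Point the rule at the termination Lambda and allow CloudWatch Events to invoke it.
events.put_targets(Rule='terminate-test-instances',
                   Targets=[{'Id': 'terminate-target', 'Arn': '...termination lambda arn...'}])
lam.add_permission(FunctionName='...termination lambda name...',
                   StatementId='nightly-terminate',
                   Action='lambda:InvokeFunction',
                   Principal='events.amazonaws.com',
                   SourceArn=rule['RuleArn'])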

We now can easily start up new test environments and don’t need to worry about terminating them. The whole system has a lot of moving parts but all the complexity is hidden behind the simple chat interface. The final result in all its simplicity looks like this:

(Screenshot: deploying a branch from Slack with the slash command.)

All things considered, I think the solution was surprisingly easy to set up, even for a total AWS newbie. The Slack slash command also provides an easy way to offer a user interface: it's much simpler to create a slash command than to build a shell script or a web app.