Bioinformatics Solutions on AWS for Exploratory Analysis
Aug 10, 2020
This is part 1 of a series I have in the works about Bioinformatics Solutions on AWS. Each part of the series will stand on its own, so you won't be missing anything by reading one part and not the others.
So let's dive right in!
Lay of the Land
To keep things simple, I am going to say exploratory analyses are analyses completed in the shell, with either a programming language such as Python or R, or just the terminal.
(Any other Bioinformaticians remember just how much we all used to use sed and awk? ;-) )
There can be some visualization here, but we'll draw the line at anything that can't be displayed in a Jupyter notebook or a JupyterLab instance.
In another post I'll discuss Production Analyses.
AWS Solutions for Exploratory Analyses
Common Infrastructure Components
It's always a good idea to look for terms such as "auto-scaling", "elastic", or "on demand" when using AWS. These signal that the service provisions resources only when you actually need them, which cuts down on costs.
Alongside any of these solutions you will usually have an S3 bucket holding custom startup scripts, and one or more EFS file systems for networked file storage.
Using networked file storage means your data and applications are available across any of the solutions I'll list below.
EFS is expensive, so one way to control costs is to keep only the data you're actively analyzing on EFS and archive the rest to Amazon S3 Glacier.
To summarize, what we need is:
- An S3 bucket that holds our installation instructions to customize any nodes (EC2 instances) we spin up. This can include installing packages, setting up users and user environments, or even pulling scripts from GitHub.
- EFS storage for applications installed with Conda.
- EFS storage for data under active analysis.
- S3 Glacier for archival storage.
You can create any of these resources through the AWS console, the AWS CLI, or an infrastructure-as-code tool such as Terraform.
A quick point here: I truly recommend thinking up front about how to future-proof your infrastructure. Keeping all your configuration on S3 instead of hard-coded somewhere, and your data on EFS instead of on local instance storage, lets you adopt new services as projects with different needs arise.
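As a rough sketch, the storage pieces above might look like this in Terraform. The bucket name, tags, and the 90-day archival threshold are illustrative assumptions, not recommendations:

```hcl
# Illustrative sketch only -- names and thresholds are placeholders.

# Bucket holding bootstrap/post-install scripts, with a lifecycle
# rule that moves anything under archive/ to Glacier after 90 days.
resource "aws_s3_bucket" "bioinformatics" {
  bucket = "my-bioinformatics-infra"

  lifecycle_rule {
    id      = "archive-finished-projects"
    enabled = true
    prefix  = "archive/"

    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}

# One EFS file system for Conda-installed applications...
resource "aws_efs_file_system" "apps" {
  tags = { Name = "apps" }
}

# ...and one for data under active analysis.
resource "aws_efs_file_system" "data" {
  tags = { Name = "data" }
}
```

Keeping these definitions in version control is exactly the kind of future-proofing I mean: the storage outlives any one compute solution attached to it.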
AWS ParallelCluster for Elastic HPC Clusters
One of the most straightforward solutions for exploratory and interactive analyses is to use AWS ParallelCluster to build out an auto-scaling HPC cluster on AWS.
AWS ParallelCluster builds out an HPC cluster for you based on a templated configuration file. There is, of course, a nearly endless amount of customization and configuration you can apply.
Mostly, you tell it which instance type you want (which determines memory, CPU count, and network speed), the minimum and maximum number of compute nodes to keep available, and optionally a set of installation instructions to apply to any of your nodes when they start up.
Then you say create my cluster and poof! HPC cluster at your disposal!
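For a concrete feel, here is a minimal ParallelCluster (v2.x) configuration sketch. The key pair, subnet, file system, and bucket identifiers are placeholders you would fill in yourself:

```ini
; Illustrative sketch -- IDs, names, and sizes are placeholders.
[cluster default]
key_name              = my-keypair
scheduler             = slurm
master_instance_type  = t3.medium
compute_instance_type = c5.xlarge
initial_queue_size    = 0
max_queue_size        = 10
post_install          = s3://my-bioinformatics-infra/post_install.sh
vpc_settings          = public
efs_settings          = data

[vpc public]
vpc_id           = vpc-xxxxxxxx
master_subnet_id = subnet-xxxxxxxx

[efs data]
shared_dir = /efs
efs_fs_id  = fs-xxxxxxxx

[aws]
aws_region_name = us-east-1

[global]
cluster_template = default
```

With something like that saved as `~/.parallelcluster/config`, `pcluster create mycluster` is the "poof" step.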
Pros:
- An HPC environment will feel very natural to many Bioinformatics users.
- The autoscaling is very straightforward: a user submits a job, a node spins up; the job ends, the node spins down.
- The AWS ParallelCluster wizard is really nice. Once you're set up, administering the cluster is mostly basic Linux systems administration.
- Customizing is straightforward. Create an S3 bucket with your post-install scripts to create users or hook up to a user directory, install packages, etc.
- Since you're essentially dealing with a bunch of remote servers, you can do all the things you would normally do on remote servers. Anything that can be run on the command line is fair game.
Cons:
- The biggest drawback is that you can only have one instance type (t2.medium, t2.large, etc.) per cluster. The HPC scheduler itself is smart enough to stack users on a node based on their resource requirements, but if you have a big job that needs a larger node you'll have to either spin up a separate EC2 instance or restart your cluster with a different configuration.
- People who don't like HPC really don't like HPC.
- Because the cluster is on demand, functionality such as interactive jobs with `srun` doesn't always work if a node is not already up and in the running state. You can get around this by submitting a job and then SSHing over to the node, but not everyone likes that additional step.
- If your users are unfamiliar with a terminal and SSH, an HPC system will have a steep learning curve!
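The post-install hook mentioned above is just a shell script that ParallelCluster pulls from S3 and runs on each node at boot. A minimal sketch of one is below; the EFS paths and profile contents are assumptions for illustration, and it defaults to a temp directory so you can dry-run it locally without a cluster:

```shell
#!/bin/bash
# Hypothetical post_install.sh: prepare shared directories on EFS and
# drop a profile snippet pointing users at a shared Conda install.
set -euo pipefail

# A real cluster would pass the EFS mount point as an argument;
# the /tmp default lets you dry-run this script on a laptop.
EFS_ROOT="${1:-/tmp/efs-demo}"

mkdir -p "$EFS_ROOT/apps" "$EFS_ROOT/data"

# Users source this to pick up the shared Conda environments.
cat > "$EFS_ROOT/apps/profile.sh" <<'EOF'
export PATH="/efs/apps/conda/bin:$PATH"
EOF

echo "bootstrap complete: $EFS_ROOT"
```

From here you'd add whatever your team needs: package installs, user creation, hooking into a directory service.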
Zero to JupyterHub on Kubernetes
Using JupyterHub on Kubernetes is a fantastic solution for data science teams that are already invested in the scientific Python ecosystem.
I truly can't say enough great things about this solution. It is extremely customizable, and the implementation is very well thought out.
From a bird's-eye view it looks quite similar to the HPC solution, except that you're giving your users JupyterHub or JupyterLab notebooks along with a remote server. The autoscaling looks about the same too: a user logs in to JupyterHub and requests a server, a node spins up; the user logs out or goes idle, the node spins down.
Generally, with the HPC solution I create an EFS file system for the apps and keep a centrally managed software repository of mostly Conda environments. You can still do this, because you can use networked file storage virtually anywhere, including on Kubernetes with the efs-provisioner Helm chart. You also have the option of giving your users preconfigured environments with Docker containers.
JupyterHub also has lots of nice wrappers for almost anything you would want to do through a browser. You can deploy separately proxied services for anything that runs in a browser, including services for documentation, RStudio Server, or your computer vision image labelling infrastructure.
If you or your team build out internal data visualization solutions with RShiny or Dash, you can set these up as services within JupyterHub. This gives you a single authentication strategy for an endless range of services.
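Registering one of these services in a Zero to JupyterHub deployment is a small Helm values fragment. The service name and URL below are placeholders for a hypothetical internally hosted Dash app:

```yaml
# Illustrative values.yaml fragment -- name and URL are placeholders.
hub:
  services:
    dashboard:
      url: http://dash-app.internal:8050
```

Services registered this way get proxied under `/services/<name>` on your hub, which is what puts them behind that single authentication wall.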
If you need the power of an HPC scheduler to submit multi-step, long-running pipeline jobs alongside your super-shiny notebooks, your HPC cluster and JupyterHub cluster solutions can coexist over SSH using the remote_ikernel Python package. (Side note: I really love that package. It's gold and I use it all the time.)
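Registering a cluster-backed kernel with remote_ikernel is a one-liner; the kernel name and CPU count below are examples, and this assumes remote_ikernel is installed where your notebooks run and can reach the SLURM cluster:

```shell
# Hypothetical example: register a Jupyter kernel that launches
# inside a SLURM job, so notebook code runs on a compute node.
remote_ikernel manage --add \
    --kernel_cmd="ipython kernel -f {connection_file}" \
    --name="Python 3 (SLURM)" \
    --interface=slurm \
    --cpus=2
```

After that, "Python 3 (SLURM)" just shows up in the notebook kernel picker.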
Depending on the computation you're doing, you may be able to sidestep an HPC entirely by giving your users access to auto-scaling Dask clusters. This approach is becoming ever more popular as scientific software, such as scikit-allel for statistical genomics, adds Dask-backed computation.
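To make the pattern concrete: the workloads Dask scales out are typically embarrassingly parallel maps over chunks of data. The sketch below uses the standard library's `concurrent.futures` as a local stand-in (the allele-frequency function and toy data are made up for illustration); on a Dask cluster, `client.map` would fan the same function out across auto-scaled workers:

```python
# Local stand-in for a Dask-style map: alternate allele frequency per
# variant site. On a Dask cluster, client.map distributes this across
# auto-scaling workers; here a stdlib thread pool runs it locally.
from concurrent.futures import ThreadPoolExecutor

def alt_allele_frequency(genotypes):
    """genotypes: alt-allele counts (0, 1, or 2) per diploid sample."""
    return sum(genotypes) / (2 * len(genotypes))

# One row per variant site, one entry per sample (made-up toy data).
sites = [
    [0, 1, 2, 1],
    [2, 2, 1, 0],
    [0, 0, 1, 1],
]

with ThreadPoolExecutor() as pool:
    freqs = list(pool.map(alt_allele_frequency, sites))

print(freqs)  # [0.5, 0.625, 0.25]
```

The appeal is that users keep writing plain Python functions; the cluster only changes where the map runs.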
Pros:
- Your cluster can have as many instance types as you want. This is a major plus compared to AWS ParallelCluster, where you can only have one instance type.
- You can configure your cluster as a central hub for all of your data visualization and web solutions, all behind a single authentication wall.
- This solution is much more natural for users who are already invested in scientific Python solutions.
- JupyterHub managed services! Do you run it through the web? You can run it as part of your JupyterHub cluster.
- Many authentication mechanisms are supported.
- You still get a terminal, and can interact with any of your other services and clusters through SSH.
Cons:
- It's built on Kubernetes, which depending on your view of Kubernetes is either a pro or a con. ;-)
- See previous comment about Kubernetes
- If your users don't like Jupyterhub they might be a bit cranky, but they can still get a terminal so they'll probably calm down.
- You can use your HPC cluster to launch notebooks, but you can't use JupyterHub to deploy the long-running jobs that are common in Bioinformatics.
JupyterHub Cluster + HPC Cluster Solution
You can deploy both! Use your JupyterHub cluster for exploratory analyses with Python or R, and to proxy any data visualization solutions. Then submit any multi-step, long-running jobs that require an HPC to your SLURM cluster, either on the cluster directly or by opening a terminal in your JupyterHub instance and using SSH. Your storage is kept completely separate from any one service, so any solution you choose can access your data.
Wrapping Up
Computational requirements in Bioinformatics are becoming increasingly complex as data sets grow in size and resolution.
An HPC cluster is still the backbone of many biotech companies, core groups, and labs. Nothing can quite beat HPC for managing complex pipelines with complex dependency trees and scheduling needs.
Along with the number crunching provided by many HPC clusters, many analyses, such as single-cell, are moving toward real-time data visualization solutions that need more finesse than running a pipeline to crank out some PDFs. Whether you are creating network diagrams for drug discovery, analytics dashboards, or exploratory analyses with multiple parameters, staying on the cutting edge of data analysis with data visualization solutions is becoming more and more necessary.
Listed below are resources to get started with all the technologies I discussed in this post. If you prefer a managed approach to your infrastructure you can get started today with my Jupyterhub Cluster on AWS Service or my HPC Cluster on AWS Service.