This is part 1 of a series I have in the works about Bioinformatics Solutions on AWS. Each part of the series will stand on its own, so you won't be missing anything by reading one part and not the others.
So let's dive right in!
To keep things simple, I'm going to say exploratory analyses are analyses that are done interactively, either with a programming language such as Python or R, or just in the terminal.
(Any other Bioinformaticians remember just how much we all used to use sed and awk? ;-) )
There can be some visualization here, but we'll draw the line at anything that can't be displayed in a Jupyter notebook or a JupyterLab instance.
In another post I'll discuss Production Analyses.
It's always a good idea to look for terms such as "auto-scaling", "elastic", or "on demand" when using AWS. These mean there is some smart mechanism baked in that only uses resources when you actually need them, which cuts down on costs.
Along with any of these solutions you will usually have an S3 bucket to hold custom startup scripts and configuration, and one or more EFS filesystems for networked file storage.
Using networked file storage means your data and applications can be available across any of the solutions I'll list below.
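For instance, the same EFS filesystem can be mounted on any instance in your VPC. Here's a minimal sketch, assuming an Amazon Linux instance and a placeholder filesystem ID:

```bash
# Install the EFS mount helper on Amazon Linux
sudo yum install -y amazon-efs-utils

# Mount the shared filesystem; fs-0123456789abcdef0 is a placeholder ID
sudo mkdir -p /efs
sudo mount -t efs fs-0123456789abcdef0:/ /efs
```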
EFS is expensive, so one approach to keep costs down is to use EFS only for the data you're actively analyzing, and archive the rest with Amazon S3 Glacier.
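A rough sketch of that archival step with the AWS CLI, assuming a hypothetical archive bucket:

```bash
# Copy a finished project from EFS into S3 using the Glacier storage class.
# s3://my-archive-bucket and the project path are placeholders.
aws s3 cp /efs/projects/old-project/ s3://my-archive-bucket/old-project/ \
    --recursive --storage-class GLACIER

# Once the archive is verified, reclaim the EFS space
rm -r /efs/projects/old-project
```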
To summarize, what we need is:

- Auto-scaling compute that only runs (and bills) when someone is actually using it
- An S3 bucket for configuration and startup scripts
- One or more EFS filesystems for shared, networked file storage, with S3 Glacier for archiving
You can create any of these resources through the AWS console, the AWS CLI, or an infrastructure as code tool such as Terraform.
I will make a quick point here: I truly recommend thinking up front about how you can future-proof your infrastructure. Keeping all your configuration on S3 instead of hard-coded somewhere, and your data on EFS instead of locally on the instance, lets you adopt new services as projects with different needs arise.
One of the most straightforward solutions for exploratory and interactive analyses is to use AWS ParallelCluster to build out an auto-scaling HPC cluster on AWS.
AWS ParallelCluster builds out an HPC cluster for you based on a templated configuration file. There is, of course, a nearly endless amount of customization and configuration you can apply.
Mostly, you tell it which instance type you want (which depends on how much memory, how many CPUs, and how much network bandwidth you need), the minimum and maximum number of compute nodes to keep available, and, optionally, a set of installation instructions that will run on each node when it starts up.
Then you say create my cluster and poof! HPC cluster at your disposal!
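To give you a feel for it, here is a minimal sketch of a 2.x-style config; every ID, name, and bucket below is a placeholder, and the exact fields vary with the ParallelCluster version you're running:

```bash
# Write a minimal ParallelCluster config (placeholder values throughout),
# then create the cluster. Field names follow the 2.x INI format.
mkdir -p ~/.parallelcluster
cat > ~/.parallelcluster/config <<'EOF'
[global]
cluster_template = default

[aws]
aws_region_name = us-east-1

[cluster default]
key_name = my-keypair
scheduler = slurm
master_instance_type = t3.medium
compute_instance_type = c5.4xlarge
initial_queue_size = 0
max_queue_size = 10
base_os = alinux2
post_install = s3://my-config-bucket/bootstrap.sh
vpc_settings = public
efs_settings = shared

[vpc public]
vpc_id = vpc-0123456789abcdef0
master_subnet_id = subnet-0123456789abcdef0

[efs shared]
shared_dir = efs
efs_fs_id = fs-0123456789abcdef0
EOF

pcluster create my-cluster
```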
One gotcha: srun doesn't always work if a node isn't already available and in the running state. You can get around this by submitting a job and then SSHing over to the node, but not everyone likes that additional step.
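That workaround looks roughly like the following; the job name and node name are just examples:

```bash
# Park a placeholder job on the queue so the autoscaler brings a compute node up
sbatch --job-name=keepalive --wrap "sleep 4h"

# See which node the job landed on (it may sit pending for a few minutes first)
squeue -u $USER --name=keepalive --format="%N %T"

# SSH to that node for interactive work (node name is an example)
ssh queue0-dy-c54xlarge-1
```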
Using JupyterHub on Kubernetes is a fantastic solution for data science teams that are already invested in the scientific Python ecosystem.
I truly can't say enough great things about this solution. It is extremely customizable, and the implementation is very well thought out.
From a bird's-eye view it looks quite similar to the HPC solution, except that you're giving your users JupyterHub or JupyterLab notebooks along with a remote server. The autoscaling looks about the same too: a user logs in to JupyterHub and requests a server, a node spins up; the user logs out or goes idle, and the node spins down.
Generally, with the HPC solution I create an EFS filesystem for the apps and have a centrally managed software repository of mostly conda environments. You can still do this here, because you can use networked file storage literally anywhere, including on Kubernetes with the efs-provisioner Helm chart. You also have the option of giving your users preconfigured environments with Docker containers.
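As a sketch, wiring an existing EFS filesystem into the Kubernetes cluster looks roughly like this; the filesystem ID and region are placeholders, and value names can change between chart versions:

```bash
# Add the (now archived) stable chart repo and install the efs-provisioner,
# which lets pods claim EFS-backed persistent volumes.
helm repo add stable https://charts.helm.sh/stable
helm install efs-provisioner stable/efs-provisioner \
    --set efsProvisioner.efsFileSystemId=fs-0123456789abcdef0 \
    --set efsProvisioner.awsRegion=us-east-1
```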
JupyterHub also has lots of nice wrappers for almost anything you would want to do through a browser. You can deploy separate proxied services for anything you run in a browser, including services for documentation, to run RStudio Server, or to launch your computer vision image labelling infrastructure.
If you or your team builds internal data visualization solutions with RShiny or Dash, you can set these up as services within JupyterHub. This gives you a single authentication strategy that enables a nearly endless set of services.
If you need the power of an HPC scheduler to submit multi-step, long-running pipeline jobs alongside your super-shiny notebooks, your HPC cluster and JupyterHub cluster solutions can coexist over SSH with the remote_ikernel Python package. (Side note: I really love that package. It's gold and I use it all the time.)
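A rough sketch of hooking the two together; the host name and kernel name are placeholders, and flags may differ between remote_ikernel versions:

```bash
# Install in the environment that runs your Jupyter notebooks
pip install remote_ikernel

# Register a kernel that starts over SSH on the HPC login node.
# remote_ikernel also has slurm/sge/pbs interfaces if you want the
# scheduler to place the kernel on a compute node instead.
remote_ikernel manage --add \
    --interface=ssh \
    --host=hpc-login-node \
    --name="Python 3 (HPC)" \
    --kernel_cmd="ipython kernel -f {connection_file}"
```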
Depending on the computation you're doing, you may be able to sidestep an HPC entirely by giving your users access to auto-scaling Dask clusters. This is becoming an ever more popular approach as scientific software, such as scikit-allel for statistical genomics, adds support for Dask.
You can deploy both! You could use your JupyterHub cluster for exploratory analyses with Python or R and to proxy any data visualization solutions, then submit any multi-step, long-running jobs that require an HPC to your SLURM cluster, either on the cluster directly or by opening a terminal in your JupyterHub instance and using SSH. Your storage is kept completely separate from any one service, so any solution you choose can access your data.
Computational requirements in bioinformatics are becoming increasingly complex as datasets grow in size and resolution.
An HPC cluster is still the backbone of many biotech companies, core groups, and labs. Nothing can quite beat HPC for managing complex pipelines with complex dependency trees and scheduling needs.
Along with the number crunching provided by many HPC clusters, many analyses, such as single-cell, are moving towards real-time data visualization solutions that need more finesse than running a pipeline to crank out some PDFs. Whether you are creating network diagrams for drug discovery, analytics dashboards, or exploratory analyses with multiple parameters, staying on the cutting edge of data analysis with data visualization solutions is becoming more and more necessary.
Listed below are resources to get started with all the technologies I discussed in this post. If you prefer a managed approach to your infrastructure, you can get started today with my JupyterHub Cluster on AWS Service or my HPC Cluster on AWS Service.
Subscribe to the newsletter! You'll get a weekly tutorial on all the DevOps you need to know as a Data Scientist. Build Python Apps with Docker, Design and Deploy complex analyses with Apache Airflow, build computer vision platforms, and more.