Installing demultiplexing such as bcl2fastq, CellRanger, LongRanger, demuxlet, and whatever else pops up, holds a special place in those that do Bioinformatics and Genomics hearts and potentially therapy sessions. It has been enough of an issue in my professional life that I thought I would dedicate a series to setting up servers for different analysis types.
This is my big chance to go on a total rant about bioinformatics servers!
Don't install all kinds of software as system packages. Ok? Just don't do it. It may not backfire on you today, or tomorrow, but someday it will!
I'm going to make a few caveats to that. Things like zlib, openssl, and ssh are fine. I'll even cheat sometimes and yum install some development tools. Mostly, what I am talking about here is bioinformatics software. Don't bother installing bcl2fastq, blast, augustus, R, python, dask, or pretty much anything else as system dependencies.
Amazon Web Services (AWS) gives you the ability to build out all kinds of cool infrastructure. Databases, web servers, Kubernetes backed applications, Spark clusters, machine learning models, and even High Performance Computing Clusters with AWS ParallelCluster.
One of the cooler aspects of using a cloud provider like AWS is the ability to scale up and down based on requests or need. This is generally called Elastic, and applies to a whole lot of services. Storage, kubernetes, load balancers, and compute clusters. This is first of all just awesome, because writing up something to scale up or down based on demand would be a major pain, and can give the best of all worlds.
Let's say you're running a genomics analysis. First, you run your alignment, which takes (for the sake of argument) 20 nodes. Then you do variant calling, which takes 5 compute nodes, haplotype...
Recently I saw that Dask, a distributed Python library, created some really handy wrappers for running Dask projects on a High Performance Computing Cluster, HPC.
Most people who use HPC are pretty well versed in technologies like MPI, and just generally abusing multiple compute nodes all at once, but I think technologies like Dask are really going to be game changers in the way we all work. Because really, who wants to write MPI code or vectorize?
If you've never heard of Dask and it's awesomeness before, I think the easiest way to get started is to look at their Embarrassingly Parallel Example, and don't listen to the haters who think speeding up for loops is lame. It's a superpower!
Firstly, these are all pretty much borrowed from the Dask Job Queues page. Pretty much, what you do, is you write your Python code as usual. Then, when you need to scale across nodes you leverage your HPC scheduler to get you some...