How To Become a Freelance Software or DevOps Engineer

career freelance Nov 18, 2019

Your first steps

Getting started with freelancing is one of those things you can google for the rest of your life, and it will just leave your head spinning. I know this because I've been working towards being my very own Jillian Inc for the last 6 months or so. Here are a few straightforward tasks you can set for yourself in order to get started.

Register a Business

I got a lot of mostly wishy-washy advice on this. There are people out there who say that if you're not making much, just don't bother. That may be a fair enough point, but I would say that if you truly want to pursue freelancing, even on a moonlighting basis, you should just register a business.

First of all, you can expense things to your business (legitimate business expenses only!), and that will probably save you the cost of registering the business. It will also make your taxes a lot more straightforward. As a software engineer who likes open source, this felt a bit too bureaucratic for me, but seriously, just do it.

Continue Reading...

Set Up a Bioinformatics Demultiplex Server from Scratch

Install Demultiplex Software

Installing demultiplexing software such as bcl2fastq, CellRanger, LongRanger, demuxlet, and whatever else pops up holds a special place in the hearts (and potential support groups) of those who do bioinformatics and genomics. It has been enough of an issue in my professional life that I thought I would dedicate a series to setting up servers for different analysis types.

Don't install system packages

This is my big chance to go on a total rant about bioinformatics servers!

Don't install all kinds of software as system packages. Ok? Just don't do it. It may not backfire on you today, or tomorrow, but someday it will!

I'm going to add a few caveats to that. Things like zlib, openssl, and ssh are fine. I'll even cheat sometimes and yum install some development tools. Mostly, what I am talking about here is bioinformatics software. Don't bother installing bcl2fastq, blast, augustus, R, python, dask, or pretty much anything else as system dependencies.
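Instead of system packages, per-project conda environments keep each tool isolated and disposable. A minimal sketch (the environment names are made up, and the package names assume what's currently available on the bioconda and conda-forge channels):

```shell
# One environment per analysis, pulled from conda channels,
# instead of polluting the system package manager.
conda create --name annotation -c conda-forge -c bioconda blast augustus
conda activate annotation

# When another pipeline needs its own R or Python, it gets its own env too.
conda create --name stats -c conda-forge r-base python=3.7
```

If something breaks, you delete the environment and rebuild it; the system stays untouched.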

There are better...

Continue Reading...

AWS Elastic Compute Clusters for Genomics

aws bioinformatics hpc Oct 30, 2019

Amazon Web Services (AWS) gives you the ability to build out all kinds of cool infrastructure: databases, web servers, Kubernetes-backed applications, Spark clusters, machine learning models, and even High-Performance Computing clusters with AWS ParallelCluster.


Not just clusters, but Elastic Clusters!

One of the cooler aspects of using a cloud provider like AWS is the ability to scale up and down based on requests or need. This is generally called elasticity, and it applies to a whole lot of services: storage, Kubernetes, load balancers, and compute clusters. This is, first of all, just awesome, because writing something yourself to scale up or down based on demand would be a major pain, and it gives you the best of all worlds.
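With ParallelCluster, that elasticity is just a few lines of configuration. A minimal sketch in the ParallelCluster 2.x INI format (the instance types and queue sizes here are illustrative assumptions):

```ini
# ~/.parallelcluster/config
[cluster default]
scheduler = slurm
master_instance_type = c5.large
compute_instance_type = c5.4xlarge
# Elastic bounds: start with zero compute nodes, grow to 20 on demand.
initial_queue_size = 0
max_queue_size = 20
maintain_initial_size = false
```

With `initial_queue_size = 0`, you pay for compute nodes only while jobs are actually queued.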

Example Genomic Analysis Computational Needs

Let's say you're running a genomics analysis. First, you run your alignment, which takes (for the sake of argument) 20 nodes. Then you do variant calling, which takes 5 compute nodes, haplotype...

Continue Reading...

Deploy HPC Modules From Bioconda Packages

The Struggle is Real

I have been working in bioinformatics for nearly 10 years, mostly on the computational side of things. I have spent a lot of that time building and installing software, and some of those wounds will never heal! Luckily, along came Anaconda, the scientific distribution of Python, along with the awesome Bioconda team, who took on the task of making bioinformatics software installable with relative ease! I don't know if Anaconda necessarily set out to make life easier for those installing software on HPC systems, but in any case they did.

(Disclaimer: I am technically a core team member of Bioconda, but I'm really kind of a slacker core member, and the real credit goes to the rest of the team!)

Deploy Modules with EasyBuild

One of my main goals in life is to deploy conda packages as HPC Modules. Deploying HPC Modules can be a bit of a pain. There are a lot of naming conventions, environmental variables, file permissions, recursive file...
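For a sense of what the end product looks like, here is a minimal Tcl modulefile sketch. The tool, version, and paths are made-up examples of a conda environment exposed as a module:

```tcl
#%Module1.0
## Hypothetical modulefile pointing at a conda environment.
module-whatis "samtools 1.9 installed via conda"
conflict samtools
prepend-path PATH /apps/conda/envs/samtools-1.9/bin
setenv SAMTOOLS_ROOT /apps/conda/envs/samtools-1.9
```

A user would then simply run `module load samtools` and get the right binaries on their PATH.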

Continue Reading...

Dask Tips and Tricks – HighLevelGraphs

Uncategorized Oct 23, 2019

Dask is an open source project in Python that allows you to scale your code on your laptop or on a cluster. Not only does it have a very clear syntax, you can also declare your order of operations in a data structure. This is a feature I was very interested in, as this tends to be the use case I am tasked with most often. It's cool stuff!

For those of you who have written MPI, this is kind of like that, except you don't have to write MPI! If you would like to know more about basic Dask syntax, check out my blog post on Parallelizing For Loops with Dask.

Dask Syntax

Normally when using Dask you wrap dask.delayed around a function call, and then once all those calls are queued up you tell Dask to compute your results. This is great, and I really like this syntax, but what about when you are fed a list of tasks and need to somehow feed these to Dask?
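That normal delayed pattern looks something like this minimal sketch (the `add` function is just a toy example):

```python
import dask


# Wrap an ordinary function so calls to it become lazy.
@dask.delayed
def add(x, y):
    return x + y


# Nothing runs yet; these calls just build up a task graph.
lazy_sum = add(add(1, 2), 3)

# compute() executes the graph and returns the concrete value.
result = lazy_sum.compute()
print(result)  # 6
```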

That is where a HighLevelGraph comes in!

Dask HighLevelGraphs

Dask HighLevelGraphs allow you to define a data...

Continue Reading...

Dask Tips and Tricks – Parallelize a For Loop

Uncategorized Oct 20, 2019

If you're in the scientific computing space there's a good chance you've used Python. A relatively recent addition to the family of awesome libraries in Python for scientific computing is Dask. It is a super cool library that allows you to parallelize your code with a very simple and straightforward syntax. There are a few aspects of this library that especially call to me.

In no particular order, here they are!

  • It can be dropped into an existing codebase with little to no drama! Dask is meant to wrap around existing code and simply decide what can be executed asynchronously.
  • Dask can scale either on your laptop or to an entire compute cluster. Without writing MPI code! How cool is that?
  • Dask can parallelize data structures we already know and love, such as numpy arrays and data frames.
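The drop-in quality of that first point is easiest to see with a for loop. A minimal sketch (the `square` function is a toy stand-in for whatever expensive work your loop actually does):

```python
import dask


@dask.delayed
def square(x):
    return x * x


# Build up lazy results in a perfectly ordinary for loop...
lazy_results = []
for i in range(5):
    lazy_results.append(square(i))

# ...then compute them all at once; Dask runs them in parallel.
results = dask.compute(*lazy_results)
total = sum(results)
print(results, total)  # (0, 1, 4, 9, 16) 30
```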

For those of you sitting here saying, but Spark can do all that: why yes, it can, but I don't find Spark nearly as easy to drop into an existing codebase as Dask. Also, I like to...

Continue Reading...

Dask on HPC

Recently I saw that Dask, a distributed Python library, created some really handy wrappers for running Dask projects on a High-Performance Computing (HPC) cluster.

Most people who use HPC are pretty well versed in technologies like MPI, and just generally abusing multiple compute nodes all at once, but I think technologies like Dask are really going to be game-changers in the way we all work. Because really, who wants to write MPI code or vectorize?

If you've never heard of Dask and its awesomeness before, I think the easiest way to get started is to look at their Embarrassingly Parallel example, and don't listen to the haters who think speeding up for loops is lame. It's a superpower!

Onward with examples!

Client and Scheduler

Firstly, these are all pretty much borrowed from the Dask Job Queues page. Essentially, you write your Python code as usual. Then, when you need to scale across nodes, you leverage your HPC scheduler to get you some...

Continue Reading...

Deploy Bioinformatics Modules on HPC

Uncategorized Sep 22, 2019

Deploying scientific software in an HPC environment can be challenging. Deploying bioinformatics software anywhere, let alone on a High-Performance Computing (HPC) cluster, can be especially challenging!

Life, at least in this regard, has become so much better in the last few years. Anaconda, the scientific Python distribution, along with Conda, its awesome package manager and builder, made deploying software so much more streamlined. There are amazing groups contributing packages to Conda; it's become a whole ecosystem of people working on infrastructure, software, and packaging. Bioconda and Conda-Forge are two great groups that have added a ton of value to communities that use scientific software. EasyBuild gives you some pretty great capabilities for deploying your software as modules, which makes it available to everybody!

Disclaimer - I am a core team member of Bioconda, but I'm kind of a slacker member and they are awesome all on...

Continue Reading...

Kubernetes on AWS – Getting Started with EKS

aws docker kubernetes Jul 14, 2019

AWS Elastic Kubernetes Service (EKS) is a fully managed service that AWS launched recently. Elastic in AWS means that the number of instances actually in use scales up or down based on demand. This is first of all seriously cool, and second of all it can cut down on costs. Fewer requests? Fewer nodes!

I'm just getting started with Kubernetes myself, and going through this walkthrough was a great learning exercise. 
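One hypothetical way to stand up a small EKS cluster for that kind of learning exercise is eksctl. The cluster name, region, and node counts below are placeholders, and running this requires eksctl, kubectl, and AWS credentials configured locally:

```shell
# Create a managed EKS cluster with an autoscaling node group.
eksctl create cluster \
  --name my-eks-demo \
  --region us-east-1 \
  --nodes 2 \
  --nodes-min 1 \
  --nodes-max 4

# Sanity check that kubectl can see the worker nodes.
kubectl get nodes
```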

I love deploying applications with Docker Swarm because it's fairly simple and I already know it. However, Swarm for AWS has some downsides: first, it is not elastic; second, in order to get sticky sessions you need to add an additional service such as Traefik. With session affinity you can deploy RShiny and Python Dash applications with nothing besides the built-in functionality, and that's amazing!

I also personally think the industry is moving towards Kubernetes over Swarm. It even comes installed with the Mac version of Docker. Now is a great time to get...

Continue Reading...

Launch Your First AWS EC2 Instance

aws Jul 05, 2019

I'm loving creating videos, and so here is a 3-part series on getting started with AWS and EC2 instances. Once you understand launching an EC2 instance, absolutely every other part of AWS is going to make so much more sense. Behind the scenes we are all always just spinning up servers, installing all the things, and getting our stuff done. As always, AWS has excellent documentation, and I really encourage you to check it out!
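If you prefer the command line to the console, the same launch can be sketched with the AWS CLI. The AMI ID and key pair name below are placeholders you would substitute, and the commands require the AWS CLI with credentials configured:

```shell
# Launch a single t2.micro instance from a chosen AMI.
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type t2.micro \
  --key-name my-keypair \
  --count 1

# Check the state of your instances afterwards.
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].State.Name'
```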

Learning about deployment is an excellent strategy for so many reasons. Probably number one is not having to wait around for people like me to get their acts together.

Not only that, there are just so many cool things that are possible with all the neat new deployment strategies and distributed computing libraries out there. It used to be that software engineers would need to put in a HUGE amount of time and effort in order to really optimize for speed.

These days the barrier to getting your application running fast is just so...

Continue Reading...
