Manage High Content Screening CellProfiler Pipelines with Apache Airflow

If you are running a High Content Screening Pipeline you probably have a lot of moving pieces. As a non exhaustive list you need to:

  • Trigger CellProfiler Analyses, either from a LIMS system, by watching a filesystem, or some other process.
  • Keep track of dependencies of CellProfiler Analyses - first run an illumination correction and then your analysis.
  • If you have a large dataset and you want to get it analyzed sometime this century you need to split your analysis, run, and then gather the results.
  • Once you have results you need to decide on a method of organization. You need to put your data in a database and set up in depth analysis pipelines.

These tasks are much easier to accomplish when you have a system or framework that is built for scientific workflows.

If you prefer to watch I have a video where I go through all the steps in this tutorial.

Enter Apache Airflow

Apache Airflow is :

Airflow is a platform created by the community to programmatically author, schedule and...

Continue Reading...

Setup a Bioinformatics Demultiplex Server from Scratch

Install Demultiplex Software

Installing demultiplexing such as bcl2fastq, CellRanger, LongRanger, demuxlet, and whatever else pops up, holds a special place in those that do Bioinformatics and Genomics hearts and potential support groups. It has been enough of an issue in my professional life that I thought I would dedicate a series to setting up servers for different analysis types.

Don't install system packages

This is my big chance to go on a total rant about bioinformatics servers!

Don't install all kinds of software as system packages. Ok? Just don't do it. It may not backfire on you today, or tomorrow, but someday it will!

I'm going to make a few caveats to that. Things like zlib, openssl, and ssh are fine. I'll even cheat sometimes and yum install some development tools. Mostly, what I am talking about here is bioinformatics software. Don't bother installing bcl2fastq, blast, augustus, R, python, dask, or pretty much anything else as system dependencies.

There are better...

Continue Reading...

AWS Elastic Compute Clusters for Genomics

aws bioinformatics hpc Oct 30, 2019

Amazon Web Services (AWS) gives you the ability to build out all kinds of cool infrastructure. Databases, web servers, Kubernetes backed applications, Spark clusters, machine learning models, and even High-Performance Computing Clusters with AWS ParallelCluster.

 

Not just clusters, but Elastic Clusters!

One of the cooler aspects of using a cloud provider like AWS is the ability to scale up and down based on requests or need. This is generally called Elastic, and applies to a whole lot of services. Storage, Kubernetes, load balancers, and compute clusters. This is first of all just awesome, because writing up something to scale up or down based on demand would be a major pain, and can give the best of all worlds.

Example Genomic Analysis Computional Needs

Let's say you're running a genomics analysis. First, you run your alignment, which takes (for the sake of argument) 20 nodes. Then you do variant calling, which takes 5 compute nodes, haplotype...

Continue Reading...

Deploy HPC Modules From Bioconda Packages

The Struggle is Real

I have been working in Bioinformatics for nearly 10 years, mostly on the computational side of things. I have spent a lot of that time building and installing software. Some of those wounds will never heal! Luckily, along came Anaconda, the scientific distribution of Python, along with the awesome BioConda who took on the task of installing bioinformatics software with relative ease! I don't know if Anaconda necessarily wanted to make life easier for those installing software on HPC systems, but in any case they did. 

(Disclaimer, I am technically a core team member of BioConda, but I'm really kind of a slacker core member and the real credit goes to the rest of the team!)

Deploy Modules with EasyBuild

One of my main goals in life is to deploy conda packages as HPC Modules. Deploying HPC Modules can be a bit of a pain. There are a lot of naming conventions, environmental variables, file permissions, recursive file...

Continue Reading...
Close

50% Complete

DevOps for Data Scientists Weekly Tutorials

Subscribe to the newsletter! You'll get a weekly tutorial on all the DevOps you need to know as a Data Scientist. Build Python Apps with Docker, Design and Deploy complex analyses with Apache Airflow, build computer vision platforms, and more.