Set Up a Bioinformatics Demultiplex Server from Scratch

bioinformatics easybuild hpc Nov 07, 2019

Install Demultiplex Software

Installing demultiplexing software such as bcl2fastq, CellRanger, LongRanger, demuxlet, and whatever else pops up holds a special place in the hearts (and potential support groups) of those who do bioinformatics and genomics. It has been enough of an issue in my professional life that I thought I would dedicate a series to setting up servers for different analysis types.

Don't install system packages

This is my big chance to go on a total rant about bioinformatics servers!

Don't install all kinds of software as system packages. Ok? Just don't do it. It may not backfire on you today, or tomorrow, but someday it will!

I'm going to make a few caveats to that. Things like zlib, openssl, and ssh are fine. I'll even cheat sometimes and yum install some development tools. Mostly, what I am talking about here is bioinformatics software. Don't bother installing bcl2fastq, blast, augustus, R, python, dask, or pretty much anything else as system dependencies.

There are better solutions that I promise aren't that bad to get started with, and they will mostly prevent you from killing your servers!

Prepare Your Server

The very first thing I do when I get a shiny new server is install Lmod, Miniconda, and EasyBuild.

Install Lmod

If you need to install Lmod without root permissions, I recommend checking out the EasyBuild guide on installing Lmod without root permissions.

This is going to be different depending on your system. If you're using yum, you'll need to enable the EPEL repository.


#!/usr/bin/env bash

yum install -y epel-release 

# If you're on AWS and using one of their newer AMIs 
# you'll install it from the 
# amazon-linux-extras package manager.
# amazon-linux-extras install epel
yum-config-manager --enable epel
# Optional, but nice for servers that have been sitting around forever
# yum update -y; yum upgrade -y
yum install -y Lmod

From there, to make your modules available you'll need to source the shell awesomeness that is Lmod. Depending on whether you installed from source, yum, or apt-get, the location may be different.

source /usr/share/lmod/lmod/init/bash
source /usr/local/lmod/lmod/init/bash

Try those commands. If the file doesn't exist, it will throw an error. If neither of these works, the Lmod script is someplace weird and you will have to do a bit of hunting. The easiest way to do that is with the Linux command find, which is a totally awesome command and, along with grep, runs most of my life.

# You may have to play around a bit with the maxdepth
find / -maxdepth 4 -name lmod -type d

Once you've found the lmod/init/bash file, source it and run module avail to make sure it's working.
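
If you'd rather not try the paths by hand, here is a minimal sketch that checks the two locations mentioned above and sources the first one it finds. The function name first_existing is just something I made up for illustration, and the candidate list is an assumption; add whatever path find turns up on your system.

```shell
#!/usr/bin/env bash

# Print the first existing file from a list of candidate paths.
# Returns 1 if none of them exist.
first_existing() {
    local candidate
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# The two locations mentioned above; extend this list for your system.
if lmod_init="$(first_existing \
        /usr/share/lmod/lmod/init/bash \
        /usr/local/lmod/lmod/init/bash)"; then
    # Make the module command available in this shell, then list modules.
    source "$lmod_init"
    module avail
else
    echo "No Lmod init script found; try the find command above." >&2
fi
```

Once you know which path works, you will probably want to add that one source line to your shell profile so the module command is always available.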

Install Miniconda and EasyBuild

We're going to do this in one fell swoop with a bash script. I'm not going to go through the whole thing; just know that you'll need to change the environment variables at the top to suit your setup. Make sure you have a non-root user to do all this with, because EasyBuild will complain loudly and then die if you try to run it as root.

Once you've changed the environment variables to suit your setup, just chmod +x the script, run it, and off you go!
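
The script itself isn't reproduced in this post, so here is a rough sketch of the general shape such a script might take. To be clear, this is not my actual script: the variable names (MINICONDA_DIR, DRY_RUN), the function names, and the install location are all assumptions made up for illustration. It defaults to a dry run that only prints the commands, so you can see what would happen before anything touches your server.

```shell
#!/usr/bin/env bash

# Hypothetical settings; change these to suit your setup.
MINICONDA_DIR="${MINICONDA_DIR:-$HOME/miniconda3}"
INSTALLER_URL="https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Print a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# EasyBuild refuses to run as root, so bail out early for uid 0.
# Takes an optional uid argument; defaults to the current user.
require_non_root() {
    local uid="${1:-$(id -u)}"
    if [ "$uid" -eq 0 ]; then
        echo "Run this as a regular user, not root." >&2
        return 1
    fi
}

# Install Miniconda, then EasyBuild, then a Miniconda3 base module.
install_stack() {
    require_non_root "$@" || return 1
    run curl -fsSL -o /tmp/miniconda.sh "$INSTALLER_URL"
    run bash /tmp/miniconda.sh -b -p "$MINICONDA_DIR"
    run "$MINICONDA_DIR/bin/pip" install easybuild
    # Assumes this easyconfig ships with your EasyBuild release.
    run "$MINICONDA_DIR/bin/eb" Miniconda3-4.7.10.eb
}

# Nothing runs until you call install_stack yourself, for example:
#   DRY_RUN=0 install_stack
```

Keeping the settings at the top and the dry run as the default means you can read the planned commands first and only set DRY_RUN=0 once they look right for your machine.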

Install your Software

Right now we have our basic setup: a base Miniconda for our own purposes, Lmod, EasyBuild, and a Miniconda3 module that we will use as the base module for our bioinformatics software. Let's go through installing a few common demultiplexing tools here.

A Quick Note - Installing Conda Software with EasyBuild

EasyBuild does some truly awesome things, and admittedly I don't use even a fraction of its awesomeness. EasyBuild has a ton of toolchains available, which are perfect for those who truly care about the exacting details of how their software is compiled. I've found that for bioinformatics software this just doesn't matter as much. We rarely have vectorized code, or code that could benefit from the Intel compiler. If you want to know more about this, just check out toolchains with EasyBuild.

I tend to install everything with conda and bioconda, because they have just done such a great job of making my life easier. Then, sometimes I create additional recipes with EasyBuild for software that has a do not distribute clause. GATK, bcl2fastq, and CellRanger I'm looking at you guys here!

Install a base devtools package

A devtools package is going to give us all of our low level packages that we need. Again, if you're interested in doing this from scratch check out toolchains in EasyBuild. Otherwise, just install them from conda.

You could also create eb configs for each of these packages separately.

# This is an easyconfig file for EasyBuild, see
# devtools-1.0.eb

easyblock = 'Conda'

name = "devtools"
version = "1.0"

homepage = ''
description = """Collection of tools and compilers"""

toolchain = SYSTEM

requirements = "automake autoconf perl gcc_linux-64 gxx_linux-64 gfortran_linux-64 cmake"
channels = ['conda-forge']

builddependencies = [('Miniconda3', '4.7.10')]

modextrapaths = {'PATH': ['libexec/gcc/x86_64-conda_cos6-linux-gnu/7.3.0']}

modextravars = {
    'CC': '%(installdir)s/bin/x86_64-conda_cos6-linux-gnu-cc',
    'GCC': '%(installdir)s/bin/x86_64-conda_cos6-linux-gnu-gcc',
    'GFORTRAN': '%(installdir)s/bin/x86_64-conda_cos6-linux-gnu-gfortran',
    'GXX': '%(installdir)s/bin/x86_64-conda_cos6-linux-gnu-g++',
    'FC': '%(installdir)s/bin/x86_64-conda_cos6-linux-gnu-gfortran',
    'CPP': '%(installdir)s/bin/x86_64-conda_cos6-linux-gnu-cpp',
    # The CXX path is cut off in the original listing; complete it for your install.
    'CXX': '%(installdir)s/bin/x86_64-conda_cos6-li...',
}

sanity_check_paths = {
    'files': ['bin/python'],
    'dirs': ['bin']
}

moduleclass = 'tools'
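
If you do want one eb config per package instead of a single devtools bundle, a small loop can stamp them out. This is only a sketch under a few assumptions: the function name write_conda_easyconfig is made up, the wrapper version 1.0 simply mirrors the devtools example rather than the version of the underlying conda package, and the homepage is left empty as it is above.

```shell
#!/usr/bin/env bash

# Write a minimal Conda easyconfig for one package into a directory.
# The parameters are the same ones used in the devtools config above.
write_conda_easyconfig() {
    local pkg="$1" outdir="$2" version="1.0"
    cat > "$outdir/$pkg-$version.eb" <<EOF
easyblock = 'Conda'

name = '$pkg'
version = '$version'

homepage = ''
description = """$pkg installed from conda-forge"""

toolchain = SYSTEM

requirements = "$pkg"
channels = ['conda-forge']

builddependencies = [('Miniconda3', '4.7.10')]

sanity_check_paths = {
    'files': [],
    'dirs': ['bin'],
}

moduleclass = 'tools'
EOF
}

# Write to the directory given as the first argument, or a temp directory.
outdir="${1:-$(mktemp -d)}"
for pkg in automake autoconf cmake; do
    write_conda_easyconfig "$pkg" "$outdir"
done
echo "Wrote easyconfigs to $outdir"
```

One config per package gives you one module per package, which is handy when different tools need different compiler or cmake versions. The bundle is less to maintain if they don't.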

Here's the bcl2fastq config. At the time of writing you must have your development tools installed as system dependencies. On yum-based systems this is yum groupinstall "Development Tools", and on Ubuntu and Debian systems the command is apt-get install build-essential.

# This file is an EasyBuild reciPY as per

easyblock = 'ConfigureMake'

name = 'bcl2fastq2'
version = '2.20.0'

homepage = ''
description = """bcl2fastq Conversion Software both demultiplexes data and converts BCL files generated by
 Illumina sequencing systems to standard FASTQ file formats for downstream analysis."""

toolchain = SYSTEM

source_urls = ['ftp://webdata2:[email protected]/downloads/software/bcl2fastq/']
sources = [{
    'filename': '' % (name, version.replace('.', '-')),
    'extract_cmd': 'unzip -p %s | tar -xzvf -',  # source file is a .zip that contains a .tar.gz
}]

checksums = ['8dd3044767d044aa4ce46de0de562b111c44e5b8b7348e04e665eb1b4f101fe3']

#builddependencies = [('devtools', '1.0')]
# CMake, Boost, libxml2 and libxslt are all built and used internally with specific versions

start_dir = 'src'
configopts = '--force-builddir'

sanity_check_paths = {
    'files': ['bin/bcl2fastq'],
    'dirs': ['lib']
}

moduleclass = 'bio'

Now you just install these with the awesomeness that is EasyBuild. We're going to use the robot here, which recursively installs any dependencies the config declares. In this case that would be the devtools module above, if you enable the commented-out builddependencies line.

eb --robot=$PWD bcl2fastq2-2.20.0.eb

And boom! You will see your software get installed. I'm planning on putting together a GitHub repo with some of my most commonly used EasyBuild configs, so stay tuned!
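
Once the build finishes, it's worth a quick check that the module actually loads and puts the binary on your PATH. Here is a minimal sketch, assuming EasyBuild installed into its default prefix under your home directory and named the module bcl2fastq2/2.20.0; the verify_module function and the MODULE_ROOT variable are made up for illustration, and the check also assumes that a failed module load returns a non-zero status.

```shell
#!/usr/bin/env bash

# Load a module and print where one of its commands resolves to.
# Returns 2 if Lmod is not initialised, 1 if the module or command is missing.
verify_module() {
    local mod="$1" cmd="$2"
    if ! type module >/dev/null 2>&1; then
        echo "module command not found; source the Lmod init script first" >&2
        return 2
    fi
    module load "$mod" || return 1
    command -v "$cmd" || return 1
}

# Assumed default EasyBuild module location; adjust if you set a custom prefix.
MODULE_ROOT="${MODULE_ROOT:-$HOME/.local/easybuild/modules/all}"

# Only attempt the check when Lmod is available in this shell.
if type module >/dev/null 2>&1; then
    module use "$MODULE_ROOT"
    verify_module bcl2fastq2/2.20.0 bcl2fastq
fi
```

If the printed path points inside your EasyBuild software directory rather than /usr/bin, you're using the module build and not a stray system install.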
