Bioinformatics Automation Solutions on AWS
Bioinformatics analysis is complicated. There is no one-size-fits-all solution, and most infrastructures are built from many different components.
Deploy load-balanced, fault-tolerant RShiny and Dash applications on AWS. Each deployment is completely customizable and can include additional file systems, databases, and compute clusters.
SLURM HPC Clusters
Explore your large datasets on your own HPC Cluster on AWS. Instead of manually bringing on-demand instances up and down, let the scheduler do the work for you.
Data Science Pipelines
Design complex data science pipelines with the help of Apache Airflow. Generate QC reports, label images, train models, and execute computationally intensive workloads with the help of AWS Batch or HPC.
Icons from www.flaticon.com
Blog - I have quite a lot of content on my blog for those that want to take a DIY approach.
30-minute call - I regularly chat with people facing particular challenges. It’s a good way for me to get to know people and stay current, and I am always happy to point you towards any resources or open-source software I think may be helpful.
Free Courses - At this time I have a free Docker for Data Scientists course. Knowing Docker is foundational for many other technologies and for much of cloud computing.
DIY (less than $100 USD)
Labs and Project Templates - I have several labs and project templates available. If you have a request for a lab you can let me know too.
Paid 1-hour Road-mapping and Strategy Session ($300 USD)
You can also book a paid 1-hour strategy session. I make every effort to be very flexible with these. Coming from the bioinformatics world, I understand that every team has its own mix of strengths and weaknesses, and I always try to complement those when I am able to.
- A roadmap with your next steps along with resources. (Build out AWS Batch for production analysis, version analysis for medical imaging lab, create Docker images, etc.)
- Curriculums for onboarding new bioinformatics staff to various data science infrastructures.
- A flow chart on how to split and optimize your analysis for an HPC scheduler, Airflow, etc.
- Depending on the analysis complexity and length I will also sometimes bootstrap the analysis itself, creating DAG and plugin templates using Apache Airflow. This option is most helpful for teams that have a high degree of in-house scientific knowledge of their analysis, but may be less strong on the computational side of running it.
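To give a feel for what splitting an analysis into a DAG looks like, here is a minimal sketch using only the Python standard library (`graphlib`, Python 3.9+). It is not Airflow code and not one of my templates. The task names are hypothetical placeholders, and a real pipeline would map each task to an Airflow operator or a scheduler job.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
PIPELINE = {
    "qc_report": {"fetch_samples"},
    "align_reads": {"fetch_samples"},
    "call_variants": {"align_reads"},
    "summarize": {"qc_report", "call_variants"},
}


def execution_order(dag):
    """Return one valid order in which the tasks could run."""
    return list(TopologicalSorter(dag).static_order())


def parallel_stages(dag):
    """Group tasks into stages; tasks in the same stage do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(parallel_stages(PIPELINE), start=1):
        print(f"stage {number}: {', '.join(stage)}")
```

Tasks that land in the same stage are candidates for running side by side, which is the kind of information a flow chart for a scheduler or Airflow is meant to surface.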
I have oodles of labs, project templates, documentation, infrastructure recipes, etc. that I will offer where applicable as well. For example, if what you need is a Kubernetes cluster to run your Apache Airflow analyses in production, you will get my Kubernetes on AWS Lab and my Apache Airflow Lab included with your Strategy Session.
Data Science Infrastructure and Clusters (Starts at $5K USD)
If you are a biotech startup looking to onboard new bioinformaticians while automating and scaling your infrastructure, an elastic cluster on AWS should be your next step.
These clusters start at $5K USD and go up depending on the level of customization and additional services needed. They quite commonly include customized Dask and Spark clusters, additional services such as LabelStudio for ML workflows, autoscaling RStudio Server instances, and customized software stacks for specific analysis types (Single Cell, RNASeq, Variant Calling, etc.).
Each cluster also includes 2 months of support. Add me to your Slack channel, Discourse forum, etc. and I will answer questions as needed. Support also includes extensive documentation, usually as JupyterHub notebooks and man pages, but it has also taken the form of Jira, internal documentation sites, Google Docs, etc.
Retainers for Longer-Term Projects - Minimum of $3K per month for a minimum of 3 months
I fully understand that different bioinformatics teams have different strengths and weaknesses. I frequently work with teams that have a high degree of specialization in their scientific domain, but are challenged by scaling that out on a computational infrastructure. If you’re actively working on a research project, new startup, or production pipeline that you want to scale out, you may want to engage me for a longer period of time.
Bioinformatics Analysis Optimization with Apache Airflow (Starts at $5K USD)
Do you have an analysis that you need to scale out? Maybe your analysis is still under development, or maybe you created a proof of concept with a subset of your samples and now you need to scale to hundreds or thousands of samples (or images, flies, worms, cells, etc.).
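One common first step when going from a proof of concept to thousands of samples is to fan the work out in fixed-size batches, so each batch can become one scheduler job or one Airflow task. The sketch below is only an illustration under assumptions: the sample IDs and the batch size of 96 are made up, and the right batch size depends on your analysis and your cluster.

```python
from itertools import islice


def batch_samples(sample_ids, batch_size):
    """Yield lists of at most batch_size sample IDs, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(sample_ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical sample IDs standing in for a real sample sheet.
samples = [f"sample_{i:04d}" for i in range(1, 1001)]
batches = list(batch_samples(samples, 96))

if __name__ == "__main__":
    print(f"{len(samples)} samples split into {len(batches)} batches")
```

Each batch can then be submitted as its own unit of work, which keeps any single job small and makes failed batches easy to rerun.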
Here's a Preview
Bioinformatics analyses always share common traits. These traits will point you towards the type of infrastructure you need to automate and scale your analyses or exploratory data science infrastructure.
Why Work With Me?
Over the course of my career, I have earned a robust reputation for outstanding genomics and bioinformatics DevOps, and I am known for my ability to design and integrate innovative, flexible infrastructures, leveraging in-depth client and business consultation to uncover critical, unique program needs. Throughout the years I’ve seen datasets grow in size and complexity (who remembers microarrays?) and worked with researchers to develop analysis infrastructure to accommodate the ever-growing demand for more number crunching.
I have consulted with the Bioinformatics Contract Research Organization (CRO) and BitBio to design and deploy an HPC cluster with an integrated SLURM scheduler and user / software stacks, along with elastic computational infrastructure for genomics analysis. This saved a great deal of manual labor and empowered a greater focus on high-priority projects and activities.
I also designed and deployed complex data visualization applications on AWS such as NASQAR. I am both a contributor and core team member of Bioconda as well as a contributor to the Conda Ecosystem and EasyBuild.
Prefer to DIY?
That's great! I have oodles of material for people who want to deploy their own resources.
Deploy all the things!
Here are some blog posts that I hope you find helpful! If you have any requests for tutorials please don't hesitate to reach out to me at [email protected].
Data Science and Machine Learning Pipelines
- Get Started with CellProfiler in Batch