Bioinformatics Solutions on AWS for Exploratory Analysis

This is part 1 of a series I have in the works about Bioinformatics Solutions on AWS. Each part of the series will stand on its own, so you won't be missing anything by reading one part and not the others.

So let's dive right in!

Lay of the Land

To keep things simple, I'm going to define exploratory analyses as analyses completed in the shell, either with a programming language such as Python or R, or just at the terminal.

(Any other bioinformaticians remember just how much we all used to use sed and awk? ;-) )

There can be some visualization here, but we'll draw the line at anything that can't be displayed in a JupyterHub notebook or a JupyterLab instance.

In another post I'll discuss Production Analyses.

AWS Solutions for Exploratory Analyses

Common Infrastructure Components

It's always a good idea to look for terms such as "auto-scaling", "elastic", or "on demand" when using AWS. These mean there is some smart mechanism baked in that only uses resources when you need...

Continue Reading...

Deploy Dash with Helm on Kubernetes with AWS EKS

aws dash eks kubernetes python Jul 14, 2020

Today we are going to be talking about deploying a Dash app on Kubernetes with a Helm chart, using EKS, the AWS managed Kubernetes service.

For this post, I'm going to assume that you have an EKS cluster up and running because I want to focus more on the strategy behind a real-time data visualization platform. If you don't, please check out my detailed project template for building AWS EKS Clusters.

Dash is a data visualization platform written in Python.

Dash is the most downloaded, trusted framework for building ML & data science web apps.

Dash empowers teams to build data science and ML apps that put the power of Python, R, and Julia in the hands of business users. Full stack apps that would typically require a front-end, backend, and dev ops team can now be built and deployed in hours by data scientists with Dash.

https://plotly.com/dash/
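
To give you a feel for it, here's a minimal Dash app (Dash 1.x style imports, current as of this post). This is a sketch, not the app from this post; the `server` attribute is the piece you'd point gunicorn at inside a container:

    import dash
    import dash_core_components as dcc
    import dash_html_components as html

    app = dash.Dash(__name__)
    server = app.server  # the underlying WSGI app; gunicorn serves this when containerized

    app.layout = html.Div([
        html.H1("Hello from EKS"),
        dcc.Graph(
            id="example-graph",  # illustrative component, made up for this sketch
            figure={"data": [{"x": [1, 2, 3], "y": [4, 1, 2], "type": "bar"}]},
        ),
    ])

    if __name__ == "__main__":
        app.run_server(host="0.0.0.0", port=8050)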

If you'd like to know what the Dash people say about Dash on Kubernetes you can read all about that here.

Pretty much though, Dash is...

Continue Reading...

Deploy and Scale your Dask Cluster with Kubernetes

Dask is a parallel computing library for Python. I think of it as being like MPI without actually having to write MPI code, which I greatly appreciate!

Dask natively scales Python

Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love

https://dask.org

One of the cooler aspects of Dask is that you can scale across computers/servers/nodes/pods/containers/etc. This is why I say it's like MPI.
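
As a sketch of what that looks like in practice: once the Helm chart is deployed, your client code just points at the scheduler and the work lands on the worker pods. The service address below is an assumption about how the chart exposes the scheduler in-cluster, so check your release's service name with kubectl get svc:

    from dask.distributed import Client
    import dask.array as da

    # Assumed in-cluster address for the Helm chart's scheduler Service;
    # your release name may prefix it.
    client = Client("tcp://dask-scheduler:8786")

    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    print(x.mean().compute())  # computed on the worker pods, not your laptop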

What we'll be talking about today:

  • Advantages of using Kubernetes
  • Disadvantages of using Kubernetes
  • Installing the Dask Helm Chart
  • Scaling your Dask Workers up / down with kubectl scale
  • Modifying the Dask Helm Chart to add extra packages
  • Autoscaling your Dask Workers with Horizontal Pod Autoscalers

Benefits of Dask on Kubernetes

Let's talk about some of the (many!) benefits of using Kubernetes!

Customizable Configuration

Another very important aspect of Dask, at least for me, is that I can set it up so that the infrastructure side of things is completely...

Continue Reading...

Learn Apache Airflow By Example – 4 Part Series

apacheairflow docker python Dec 02, 2019

Introduction

I've been having a blast the last few months learning Apache Airflow. It's become an indispensable tool in my getting-stuff-done toolbox.

Follow me on Twitter @jillianerowe, or send me a message with questions, topic suggestions, ice cream recommendations, or general shenanigans. 

 

Get the Source Code

...

Continue Reading...

Dask on HPC

Recently I saw that Dask, a distributed Python library, created some really handy wrappers for running Dask projects on a high-performance computing (HPC) cluster.

Most people who use HPC are pretty well versed in technologies like MPI, and just generally abusing multiple compute nodes all at once, but I think technologies like Dask are really going to be game-changers in the way we all work. Because really, who wants to write MPI code or vectorize?

If you've never heard of Dask and its awesomeness before, I think the easiest way to get started is to look at their Embarrassingly Parallel example, and don't listen to the haters who think speeding up for loops is lame. It's a superpower!
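
If you want a taste of that for-loop superpower, here's a tiny sketch (the process function is a stand-in I made up, not something from the Dask docs):

    from dask.distributed import Client

    def process(sample_id):
        # Stand-in for the body of whatever for loop you want to speed up
        return sample_id ** 2

    client = Client()  # spins up a local cluster; swap in a real cluster to scale out
    futures = client.map(process, range(100))  # one task per input, run in parallel
    results = client.gather(futures)           # collect everything back
    print(sum(results))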

Onward with examples!

Client and Scheduler

First off, these are all pretty much borrowed from the Dask Job Queues page. What you do is write your Python code as usual. Then, when you need to scale across nodes, you leverage your HPC scheduler to get you some...
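
The shape of that, using the dask-jobqueue wrappers this excerpt is describing, looks roughly like this. I'm assuming a SLURM scheduler and the resource numbers are illustrative; PBSCluster, SGECluster, LSFCluster, and friends follow the same pattern:

    from dask_jobqueue import SLURMCluster
    from dask.distributed import Client

    # Each "job" is a Dask worker submitted through your HPC scheduler's queue
    cluster = SLURMCluster(cores=8, memory="16GB", walltime="01:00:00")
    cluster.scale(jobs=10)  # ask SLURM for 10 worker jobs

    client = Client(cluster)  # from here on, it's ordinary Dask code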

Continue Reading...

Develop and Deploy Python Applications with Docker

docker python Jun 28, 2019

Dockerize all the Things!

If you've read anything on this blog you will know I am all about Docker, and I am all about world domination with distributed computing, particularly with Python applications.

This week marks something new for me. I've written plenty of documentation and oodles of PowerPoint presentations with accompanying in-person training, but this is the first time I've recorded myself on video, and it was an exciting start! It feels a bit silly, because there are probably millions of people on YouTube and other video platforms by now, but it was a first for me.

Check it out over here on the NEW Dabble of DevOps YouTube channel!

 

More Python + Docker Resources

If you want to see a deeper dive on building Python applications in Docker containers, check out my Develop a Python Application in Docker and Deploy to AWS series.

Please watch, comment, share, and let me know what you think!

Happy teching!

Continue Reading...

Apache Airflow Tutorial – Part 4 DAG Patterns

Overview

During the previous parts in this series, I introduced Apache Airflow in general, demonstrated my docker dev stack, and built out a simple linear DAG definition. I want to wrap up the series by showing a few other common DAG patterns I regularly use.

In order to follow along, get the source code!

Bring up your Airflow Development Environment

    unzip airflow-template.zip
    cd airflow-template
    docker-compose up -d
    docker-compose logs airflow_webserver

This will take a few minutes to get everything initialized, but once it's up you will see something like this:

DAG Patterns

I use three main DAG patterns: Simple (shown in Part 3), Linear, and Gather. Of course, once you master these patterns, you can combine them to make much more complex pipelines.
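
To make those names concrete before we walk through each one, here's a sketch of the Gather shape, where several independent tasks fan in to a single downstream task. The task names and dag_id are made up for illustration, using the Airflow 1.x import paths current as of this post:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG(
        dag_id="gather_example",
        start_date=datetime(2019, 12, 1),
        schedule_interval=None,
    )

    # Three independent upstream tasks...
    extract_a = DummyOperator(task_id="extract_a", dag=dag)
    extract_b = DummyOperator(task_id="extract_b", dag=dag)
    extract_c = DummyOperator(task_id="extract_c", dag=dag)
    # ...gathered by a single downstream task
    merge = DummyOperator(task_id="merge", dag=dag)

    [extract_a, extract_b, extract_c] >> merge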

Simple DAG Pattern

What I call a simple pattern (and I have no idea if any of these patterns have official names) is a chain of tasks where each task depends upon the previous task. In this...

Continue Reading...

Apache Airflow Tutorial – Part 3 Start Building

Overview

If you've read this far you should have a reasonable understanding of the Apache Airflow layout and be up and running with your own Docker dev environment. Well done! This part in the series will cover building an actual simple pipeline in Airflow.

Start building by getting the source code!

Build a Simple DAG

The simplest DAG is just a list of tasks, where each task depends upon the previous one. If you've spun up the Airflow instance and taken a look, it looks like this:

Now, if you're asking why I would choose making an ice cream sundae as my DAG, you may need to reevaluate your priorities.

Generally, if you order ice cream, the lovely deliverer of the ice cream will first ask you what kind of cone (or cup, you heathen) you want, then your flavor (or flavors!), then what toppings, and then will put them all together into sweet, creamy, cold deliciousness.

You would accomplish this awesomeness with the following Airflow code:
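
The post's actual code is elided in this excerpt, but a minimal linear DAG along those lines might look like the sketch below. The dag_id, task names, and echo commands are my own illustration (Airflow 1.x import paths):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="ice_cream_sundae",
        start_date=datetime(2019, 12, 1),
        schedule_interval=None,
    )

    pick_cone = BashOperator(task_id="pick_cone", bash_command='echo "waffle cone"', dag=dag)
    pick_flavor = BashOperator(task_id="pick_flavor", bash_command='echo "chocolate"', dag=dag)
    add_toppings = BashOperator(task_id="add_toppings", bash_command='echo "sprinkles"', dag=dag)
    assemble = BashOperator(task_id="assemble_sundae", bash_command='echo "enjoy!"', dag=dag)

    # Each task depends on the one before it
    pick_cone >> pick_flavor >> add_toppings >> assemble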

Now,...

Continue Reading...

Apache Airflow Tutorial – Part 2 Install with Docker

Install Apache Airflow with Docker

Overview

In this part of the series I will cover how to get a nice Apache Airflow instance up and running with Docker. You won't need to have anything installed locally besides Docker, which is fantastic, because configuring all these pieces individually would be kind of awful!

This is the exact same setup and configuration I use for my own Apache Airflow instances. When I run Apache Airflow in production I don't use Postgres in a Docker container, as that is not recommended, but this setup is absolutely perfect for dev and will very closely match your production requirements!

Following along with a blog post is great, but the best way to learn is to just jump in and start building. Get the Apache Airflow Docker Dev Stack here.

Celery Job Queue

Getting an instance of Apache Airflow up and running looks very similar to a Celery instance. This is because Airflow uses Celery behind the scenes to execute tasks. Read more...

Continue Reading...

Apache Airflow Tutorial – Part 1 Introduction

What is Apache Airflow?

Briefly, Apache Airflow is a workflow management system (WMS). It groups tasks into analyses and defines a logical template for when these analyses should be run. Then it gives you all kinds of amazing logging, reporting, and a nice graphical view of your analyses. I'll let you hear it directly from the folks at Apache Airflow:

Apache Airflow is a platform to programmatically author, schedule and monitor workflows.

Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

Source - ...

Continue Reading...