This is part 1 of a series I have in the works about Bioinformatics Solutions on AWS. Each part of the series will stand on its own, so you won't be missing anything by reading one part and not the others.
So let's dive right in!
To keep things simple I am going to say exploratory analyses are analyses that are completed in the shell, with either a programming language such as Python or R, or just the terminal.
(Any other Bioinformaticians remember just how much we all used to use sed and awk? ;-) )
There can be some visualization here, but we'll draw the line at anything that can't be displayed in a Jupyterhub notebook or a Jupyter Lab instance.
In another post I'll discuss Production Analyses.
It's always a good idea to look for terms such as "auto-scaling", "elastic" or "on demand" when using AWS. This means that there is some smart mechanism baked in that only uses the resources when you need...
Today we are going to be talking about Deploying a Dash App on Kubernetes with a Helm Chart using the AWS Managed Kubernetes Service EKS.
For this post, I'm going to assume that you have an EKS cluster up and running because I want to focus more on the strategy behind a real-time data visualization platform. If you don't, please check out my detailed project template for building AWS EKS Clusters.
Dash is a data visualization platform written in Python.
Dash is the most downloaded, trusted framework for building ML & data science web apps.
Dash empowers teams to build data science and ML apps that put the power of Python, R, and Julia in the hands of business users. Full stack apps that would typically require a front-end, backend, and dev ops team can now be built and deployed in hours by data scientists with Dash. https://plotly.com/dash/
If you'd like to know what the Dash people say about Dash on Kubernetes you can read all about that here.
Pretty much though, Dash is...
Dask is a parallel computing library for Python. I think of it as being like MPI without actually having to write MPI code, which I greatly appreciate!
Dask natively scales Python
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
One of the cooler aspects of Dask is that you scale across computers/servers/nodes/pods/container/etc. This is why I say it's like MPI.
What we'll be talking about today are:
Let's talk about some of the (many!) benefits to using Kubernetes!
Another very important aspect of Dask, at least for me, is that I can set it up so that the infrastructure side of things is completely...
I've been having a blast the last few months learning ApacheAirflow. It's become an indispensable tool in my getting stuff done toolbox.
Follow me on Twitter @jillianerowe, or send me a message with questions, topic suggestions, ice cream recommendations, or general shenanigans.
Recently I saw that Dask, a distributed Python library, created some really handy wrappers for running Dask projects on a High-Performance Computing Cluster, HPC.
Most people who use HPC are pretty well versed in technologies like MPI, and just generally abusing multiple compute nodes all at once, but I think technologies like Dask are really going to be game-changers in the way we all work. Because really, who wants to write MPI code or vectorize?
If you've never heard of Dask and it's awesomeness before, I think the easiest way to get started is to look at their Embarrassingly Parallel Example, and don't listen to the haters who think speeding up for loops is lame. It's a superpower!
Firstly, these are all pretty much borrowed from the Dask Job Queues page. Pretty much, what you do, is you write your Python code as usual. Then, when you need to scale across nodes you leverage your HPC scheduler to get you some...
If you've read anything on this blog you will know I am all about docker and I am all about world domination with distributed computing, particularly with Python Applications.
This week marks something new for me. I've written plenty of documentation, oodles of Power Point presentations with accompanying in-person training. This is the first for me to start recording myself for video, and it was an exciting start! It feels a bit silly, because there are probably millions of people on YouTube or other video formats by now, but it was a first for me.
If you want to see a deeper dive on building python applications in Docker containers check out my Develop a Python Application in Docker and Deploy to AWS series.
Please watch, comment, share, and let me know what you think!
During the previous parts in this series, I introduced Apache Airflow in general, demonstrated my docker dev stack, and built out a simple linear DAG definition. I want to wrap up the series by showing a few other common DAG patterns I regularly use.
In order to follow along, get the source code!
unzip airflow-template.zip cd airflow-template docker-compose up -d docker-compose logs airflow_webserver
This will take a few minutes to get everything initialized, but once its up you will see something like this:
If you've read this far you should have a reasonable understanding of the Apache Airflow layout and be up and running with your own docker dev environment. Well done! This part in the series will cover building an actual simple pipeline in Airflow.
Start building by getting the source code!
The simplest DAG is simply having a list of tasks, where each task depends upon its previous task. If you've spun up the airflow instance and taken a look, it looks like this:
Now, if you're asking why I would choose making an ice cream sundae as my DAG, you may need to reevaluate your priorities.
Generally, if you order ice cream, the lovely deliverer of the ice cream will first as you what kind of cone (or cup, you heathen) you want, then your flavor (or flavors!), what toppings, and then will put them all together into sweet, creamy, cold, deliciousness.
You would accomplish this awesomeness with the following Airflow code:
In this part of the series I will cover how to get a nice Apache Airflow instance up and running with docker. You won't need to have anything installed locally besides docker, which is fantastic, because configuring all these pieces individually would be kind of awful!
This is the exact same setup and configuration I use for my own Apache Airflow instances. When I run Apache Airflow in production I don't use Postgres in a docker container, as that is not recommended, but this setup is absolutely perfect for dev and will very closely match your production requirements!
Following along with a blog post is great, but the best way to learn is to just jump in and start building. Get the Apache Airflow Docker Dev Stack here.
Getting an instance Apache Airflow up and running looks very similar to a Celery instance. This is because Airflow uses Celery behind the scenes to execute tasks. Read more...
Briefly, Apache Airflow is a workflow management system (WMS). It groups tasks into analyses, and defines a logical template for when these analyses should be run. Then it gives you all kinds of amazing logging, reporting, and a nice graphical view of your analyses. I'll let you hear it directly from the folks at Apache Airflow
Apache Airflow is a platform to programmatically author, schedule and monitor workflows.
Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
Source - ...
Subscribe to the newsletter! You'll get a weekly tutorial on all the DevOps you need to know as a Data Scientist. Build Python Apps with Docker, Design and Deploy complex analyses with Apache Airflow, build computer vision platforms, and more.