Every time I want to get started with a new technology, I figure out how to get a stack up and running that resembles a real-world production instance as closely as possible.
This is a get-up-and-running post. It does not get into the nitty-gritty details of developing with Spark, since I am only just getting comfortable with Spark myself. Mostly I wanted to get up and running and write about some of the issues that came up along the way.
Spark is a distributed computing framework with APIs for Java, Scala, Python, and R. It's what I refer to as a world domination technology: you want to do lots of computations, and you want to do them fast. The computations can range from the embarrassingly parallel, such as parallelizing a for loop, to complex workflows, and there is support for distributed machine learning as well. You can transparently scale out your computations not only to multiple cores, but even to multiple machines, by creating a Spark cluster....
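To make the "parallelizing a for loop" case concrete, here is a minimal sketch in PySpark. The function names (`slow_square`, `run_serial`, `run_with_spark`) are made up for illustration, and the `local[*]` master (all cores on one machine) is an assumption; on a real cluster you would pass that cluster's master URL instead.

```python
def slow_square(x):
    """Stand-in for an expensive computation that is independent per item."""
    return x * x


def run_serial(values):
    # The plain for loop we would like to parallelize.
    results = []
    for v in values:
        results.append(slow_square(v))
    return results


def run_with_spark(values, master="local[*]"):
    # Imported here so the serial version still works without pyspark installed.
    from pyspark.sql import SparkSession

    # "local[*]" is an assumption: run on every core of the local machine.
    spark = (
        SparkSession.builder
        .master(master)
        .appName("parallel-for-loop")
        .getOrCreate()
    )
    try:
        # Distribute the inputs, apply the function to each one, gather results.
        rdd = spark.sparkContext.parallelize(values)
        return rdd.map(slow_square).collect()
    finally:
        spark.stop()
```

Calling `run_with_spark(list(range(10)))` should give the same list as `run_serial(list(range(10)))`, with the work split across cores. For a function this cheap the Spark overhead outweighs any gain, so this pattern only pays off when each item is expensive to compute.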