Commentary: Developers struggle to build distributed applications that stitch together MapReduce, TensorFlow, and more. The Ray project just might make developers’ lives easier.
There’s something about Berkeley. Maybe it’s the nuclear-free zone. Or maybe it’s the 99 Nobel laureates, 23 Turing Award winners, and 14 Pulitzer Prize winners that the university has fostered. But something in the Berkeley DNA keeps turning out interesting open source projects. It may have started with Michael Stonebraker and Ingres, then Postgres, but that was soon followed by BSD (with some involvement from Bill Joy and others), Mike Olson and Sleepycat, and more recently by Databricks and Apache Spark.
Today, that long history of open source innovation at Berkeley continues with the Ray project, a simple, universal API for building distributed applications, backed by the company Anyscale. In a conversation with Anyscale (and Ray) cofounder Robert Nishihara, he described Anyscale as an “infinite laptop.” That is, it enables developers to build applications on their laptops that automatically scale to run elastically in the cloud.
It’s a killer idea. Fortunately, it’s much more than just a vague notion.
Gluing together the future
This wasn’t where things started. As with many other great things at Berkeley, it all began when Nishihara was working on his Ph.D. in machine learning. What he wanted to do was design better algorithms for learning from data and solving optimization problems. What he actually spent his time on, though, was the undifferentiated engineering required to scale and speed things up: “You’d often have an algorithm that you want to try out and you could implement the algorithm fairly quickly, but…the tooling was the bottleneck.”
Nishihara felt there had to be a better way: better tools for distributed computing. Or there would be, once he built them.
This isn’t to say that tools didn’t exist–they did–but they were just too specialized. Need to run a compute workload that involves large quantities of data that requires parallel processing? MapReduce has you covered. Need to train neural networks? TensorFlow to the rescue. Batch processing? Apache Spark. The problem, Nishihara explained, is that “As soon as your application doesn’t fit neatly into one of these boxes, then you’re building a new distributed system.”
By “fit neatly” we’re not talking about picky applications that need things “just so.” It can simply be a matter of combining two systems. So if you want to build an application that does data processing (Spark!) and training of neural networks (TensorFlow!), there’s no good way to do that. Why? Because these different systems aren’t really designed to interface with each other. As just one example, Nishihara pointed to a large financial services company that wanted to build an online learning application that incrementally learns from incoming data to constantly update recommendations for users. “To build this kind of application, they had to [write a lot of glue code to] stitch together a stream processing system, a training system, a serving system. It’s very difficult.”
Or it was, until Ray came along.
The infinite laptop
Contrast the difficulty of gluing together these disparate systems with the experience a developer has on their laptop. Let’s say you’re writing a Python application. Python offers a rich ecosystem of libraries for things like manipulating data, machine learning, or linear algebra. A Python developer simply writes their Python application and imports these libraries for the necessary functionality. The libraries all play nicely together. Easy.
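To make that contrast concrete, here is a small illustrative script (my own example, not from the article) showing the single-machine experience: two standard-library modules imported into one program, composing with no glue code at all.

```python
# Single-machine Python: libraries compose in one script via plain imports.
import json
import statistics

# Parse some data with one library...
records = json.loads('[{"value": 1}, {"value": 2}, {"value": 3}]')

# ...and analyze it with another. The two never needed to know
# about each other; the Python program is the integration layer.
values = [r["value"] for r in records]
print(statistics.mean(values))  # 2
```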
However, nothing like this exists for distributed systems. Nishihara and his cofounders built Ray to serve as a general-purpose distributed system: “the go-to way for developers to build distributed applications.”
Ray is designed to be a lower-level, flexible system with a rich ecosystem of scalable libraries. This foundation allows developers to build distributed applications in the same way they might write that Python application. And, as with Python, a community is growing up around Ray to provide useful libraries like Horovod, an open source framework to make distributed training of deep neural networks fast and easy for TensorFlow, PyTorch, and Apache MXNet. Horovod, developed at Uber, is great, but scaling a Horovod job with hundreds of GPUs can be challenging. By running Horovod on Ray, elastic scalability becomes straightforward.
This is the genius of Ray and Anyscale. Too often we forget that many of the best applications start with a single developer working on her laptop. “Of course we’re thinking about making things work really well at a large scale,” Nishihara stressed, “But we also really care about the single machine experience and the process of getting started and removing all friction along the way.”
How would this work in practice? According to Nishihara, Anyscale wants to give developers the freedom to forget about infrastructure and focus on their application logic. They start prototyping their application, just like writing Python applications on their laptop. They’re not even thinking about scale or the cloud.
But once they have a prototype running and need to scale it, they install open source Ray, modify a few lines of code, and immediately they can parallelize the application across the cores of their laptop or a single machine without rearchitecting anything. As they need to scale further, they run the application on Anyscale, extending their laptop to the cloud, as it were. The application needs more CPUs? Done. More GPUs? They just appear. When the application no longer needs those resources, it scales down. “It just should work,” he declared.
It’s an incredibly bold, and useful, vision. But perhaps that’s sort of the Berkeley way.
Disclosure: I work for AWS, but the views expressed herein are mine.