Exaloop: From MIT Research to High-Performance Data Science Platform

I have been wanting to write about my journey from MIT PhD student to startup founder: we have raised a seed round and are working hard on building a data science platform that makes data scientists and software engineers far more productive. I wanted to share the story here, and I hope it’s both entertaining and educational.

The beginning

Before Exaloop was founded, my co-founder Ibrahim and I were part of MIT’s Computer Science and AI Lab (CSAIL) working on bioinformatics and computational genomics, he as a postdoc and I as a grad student. Now, computational biology, broadly speaking, is an interesting domain as it’s composed of people from a diverse set of backgrounds: you have biologists, chemists, clinicians, data scientists, software engineers and many others, often working together to solve complex multidisciplinary problems. Both my co-founder and I have been fortunate enough to work with folks from many of these different backgrounds over the years, and to learn about the different ways in which they might think about a given question or problem: while a biologist or clinician might naturally focus more on the biology or potential impact on a patient, a computationalist will often focus more on algorithms, ML, scalability and so on. Both perspectives are needed to produce meaningful innovations and solutions.

However, being a very multidisciplinary field is not without its challenges. Over time, Ibrahim and I realized that a lot of time and energy was being wasted going back and forth between the scientists, the data folks and the software engineers. This was only made worse by the fact that datasets were becoming massive, meaning many of the tools that folks historically had in their arsenals could no longer scale to the size of data being analyzed. Languages like Python were too slow, but languages like C++ required too much expertise to realistically be used by most practitioners. Because of that, we saw a dichotomy emerge: concepts were often explored and iterated on in Python, but then passed on to software engineers to re-implement so as to be performant and scale. Beyond that, resources would frequently go under-utilized because it was simply too difficult and cumbersome to write software that could utilize them effectively, whether it’s multiprocessors, GPUs or distributed systems. After all, who wants to spend hours writing CUDA code (or worse, pay someone else to do it) when they don’t know if the thing will even work in the end?

We decided to tackle this problem and develop a way to empower practitioners to write performant, scalable software without having to worry about low-level details or optimizations, all the while giving them access to things like parallel programming and GPU computing. The chasm that’s historically been present between “ease & simplicity of programming” and “performance & scalability” was, we were convinced, entirely artificial.

The bio space was our focus at the start, and we actually began with a pretty limited domain-specific language for genomics applications. We realized fairly quickly that a limited DSL couldn’t realistically handle all the use cases of the domain, and as a result started focusing on Python, which is arguably the most popular programming language today. Over time, we got to a point where anyone with basic Python programming experience could comfortably use our tech and still get orders-of-magnitude performance improvements. We published a couple of papers, notably in Nature Biotechnology and OOPSLA, where we looked at the impact on a number of genomics applications like sequence alignment, genotyping and phasing, and were actually able to outperform many of the original C/C++ implementations with much simpler Python code.

Towards the end of my PhD was when we really started looking beyond life sciences and thinking more deeply about broader applications of the technology. After all, Python was and continues to be the lingua franca of many domains, and by the same token there’s a growing population of domain experts who are well-versed in it. This, combined with the excitement both of us had for the technology and what it might unlock, was the turning point where I decided that instead of getting a “normal job” and working for a company or university that would pay me a good salary, I’d instead venture into the startup world and, together with Ibrahim, try to build a company around this idea. As a result, Exaloop was founded in late 2021, with a mission to pursue our vision of democratizing high-performance, scalable computing in all domains.

2021-2023

The process of founding a company was overwhelming at first—we didn’t even know where to start. Over time, with the help of some great mentors at MIT and elsewhere, Exaloop was incorporated and we started to focus on actually building a product.

There was another problem though: there were just too many things we could do; too many directions we could take. Everyone we talked to had a different use case in mind, and almost all of them sounded interesting and impactful—from distributed computing, to embedded systems, to running in the browser and many many more. As a 2-person company, we simply couldn’t explore all of these. We continued developing and refining the core technology, but for the first year or so we didn’t have a clear sense of direction.

Soon, the list of features and projects we wanted to get done grew far beyond what Ibrahim and I could do alone. We didn’t have enough money at the time to hire new engineers, so we instead decided to take a step back and think about what we were actually trying to build. Who’s our real target audience? What is it that they need the most? It took us a long time to answer those questions.

Exaloop today

As Exaloop has matured since its founding in 2021, so too has our understanding of the problem space we’re working in. We realized that there’s one thing in common among all the people who care about performance and scalability: data. That’s exactly what we saw in our early days working in the bio space: manageable datasets worked fine with existing tools and approaches, but as the data got larger, the need for performance and scalability became critical. The same is true, we found, in any domain, be it finance, AI or anything else.

That realization caused us to focus heavily on the people working with data directly: data scientists, data engineers, analysts, ML engineers and so on. Our mission as a company solidified into one of giving them a performant, scalable solution for tackling large datasets without relying on anyone else’s help. Stay in Python, get 100x performance improvements, access to parallel computing, GPU acceleration and beyond… that’s the core value proposition.

We’ve been working hard to achieve that lofty goal ever since. In December of 2022 we publicly released Codon, a new Python implementation that generates native code using LLVM to achieve the performance of (and sometimes even outperform) C/C++, while also offering parallel/GPU programming. We were thrilled to see tremendous community engagement and feedback following that release—in fact Codon gained over 10,000 stars on GitHub in just a few months, and has a growing community of users today!
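To make the "stay in Python" idea concrete, here is a minimal sketch of the kind of loop-heavy code that an ahead-of-time compiler is well placed to speed up. It is ordinary Python and is not taken from Codon's own examples; the function names and the file name mentioned afterwards are illustrative assumptions.

```python
# Illustrative sketch only: plain, loop-heavy Python of the kind that
# tends to be slow in an interpreter and benefits from native compilation.
# Function names here are hypothetical, not part of any Exaloop API.

def is_prime(n: int) -> bool:
    """Trial division: check divisors up to the square root of n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def count_primes(limit: int) -> int:
    """Count the primes in the half-open range [0, limit)."""
    total = 0
    for n in range(limit):
        if is_prime(n):
            total += 1
    return total


if __name__ == "__main__":
    print(count_primes(100_000))
```

Saved as, say, primes.py, this runs unchanged under the standard interpreter. Assuming a Codon installation, the same file could be compiled and run with a command along the lines of `codon run -release primes.py`; check the project documentation for the exact invocation. Codon's documentation also describes an `@par` loop annotation for parallelism, which is left out here so the sketch stays valid standard Python.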

Our team has also grown alongside our technology; what started as a two-person team has expanded into a group of over a dozen talented engineers and advisors, following our seed round earlier this year.

Exaloop tomorrow – a high-performance data science platform

Our vision has evolved substantially since we started Exaloop. The core technology that we originally developed at MIT, we realized, is not the destination but rather a stepping stone. The Exaloop of tomorrow will be a high-performance platform for data teams. Think of it as a one-stop shop where

  • your code lives… create and manage workspaces for your code, and edit it in the browser;
  • your data lives… connect to any data sources you have and import whatever you need;
  • you have access to your favorite tools and frameworks… whether that’s Jupyter, a specific library or anything else;
  • you can launch, track and monitor executions… launch and deploy your code on the cloud with one click;
  • you get the best possible performance… automatically.

The world has also changed over the last year with the meteoric rise of generative AI, which has helped shape our vision of the Exaloop platform. AI-based assistants like Copilot are quickly becoming an essential part of many programmers’ toolkits, and we’re leveraging generative AI not only to write code but also to optimize it.

I know there’s a lot here, and we’re excited to share more—our official announcement will come out in a few months. In the meantime, you can join our waitlist on our website and follow us on X/Twitter, LinkedIn and GitHub for the latest updates!
