Episode #64: How Training-At-Scale Accelerates Deployment Of Machine Learning Models Into Production

In today’s podcast, we interview William Falcon, Founder and CEO of Grid AI.

Failing fast is one of Agile development’s conceptual pillars. Embracing the principle of failing fast leads to a lower cost of failure, accelerated learning, and an innovation-driven organizational culture. In the AI & machine learning world, however, where testing & fast failing of models is heavily dependent on access to computing power, these desirable benefits are often out of reach. If your employer is computationally affluent like Facebook or Google, your machine learning models enjoy the best environment to fail fast & get rapidly fine-tuned towards a minimum viable product. If your employer doesn’t have those resources, then not so much.

William Falcon wants to remedy that inequality with a Training-at-Scale system that compresses research & testing from months into days. Expediting that process not only gets machine learning models into production faster, but can also accelerate an organization’s journey towards digital transformation. We talk with William about how his startup will provide this “superpower” and how doing so can provide enterprises with a competitive advantage.

Guy Nadivi: Welcome, everyone. My name is Guy Nadivi, and I’m the host of Intelligent Automation Radio. Our guest on today’s episode is William Falcon, founder and CEO of Grid AI, providers of a “training at scale” system for AI and machine learning models. Training at scale offers the tantalizing promise of accelerating a given AI or machine learning model’s readiness for production. Since digital transformation is becoming increasingly reliant on AI and machine learning, the ability to get those models into production more quickly can be the difference between having a competitive advantage or having to play catch-up. William Falcon certainly knows something about making learning models more scalable since he is the author of PyTorch Lightning, a popular open source Python library for high performance AI research. If he sounds like he’s a pretty busy man, you’re right, he is. So we’re very happy William was able to carve some time out of his hectic calendar to join us on the podcast today. William, welcome to Intelligent Automation Radio.

William Falcon: Hi, Guy. Thank you so much for having me. It’s an honor to meet you and to chat with everyone here today.

Guy Nadivi: William, what led you to start Grid AI? And, how did you come up with that name?

William Falcon: So, I’ve been in the research world for a few years now and I’ve been doing things like self-supervised learning. I’ve put models into production at my previous startup and at other companies. There are a lot of challenges associated with it. When I got into AI research, I realized that the number one superpower that you could have is unlimited compute, right? I experienced this firsthand at Facebook AI Research and it’s incredible, the speed at which you can move when you have a lot of machines, right? So, you can take months of research and ideas, where you’d have to go from the first version of the idea to the next version of that, and get all of that done in only a few days or a few weeks, right? This is a superpower that certain AI labs in the world have, but not everyone has that superpower. I was a little bit distraught by that because it means that the way that research moves in AI is basically limited to where you are, right? If you’re at the right company, then you have the access to that. Otherwise, you don’t. I didn’t love that.

As one of my first approaches, I kind of tried to solve that problem with Lightning, right? It’s basically, how do you create a framework that lets you move through ideas really quickly? That’s something that I’ve been basically researching since my undergrad days. The resolution of that is what you see today, which is Lightning. But Lightning can’t solve every problem for you. So, when you get into training at scale, there are a lot of issues around infrastructure and coordinating, orchestrating infrastructure that come up, especially at companies where you’re working at scale or even large research institutions and labs. Grid was really around trying to help those users scale up, right? So if you want to train on 256 GPUs, you should be able to do that. You shouldn’t have to be an expert deep learning engineer to be able to figure that out, right? You can leverage what the community does today and make that happen. That’s really kind of the premise for Grid: how do you scale up deep learning, specifically focused on PyTorch Lightning and PyTorch? But Grid supports other frameworks as well, not just those, and that’s really the premise around that.

I was really inspired by how things have evolved throughout history in terms of science, right? So, if you think about electricity, for example. You can invent the light bulb, but it’s not until you have an electric grid that you can kind of give everyone access to it. Those are some of the original motivations behind Grid. Okay, so we have this amazing research that’s being pushed by thousands of brilliant researchers, GANs, convolutional networks, transformers, et cetera. But, in order to enable more people to continue to build more things and faster, there needs to be that infrastructure in place to help them do that. That’s really what Grid is there to do: to help you scale that up when you’re doing work that really matters.

Guy Nadivi: Can you please elaborate on the simplification of the training process? How do enterprises benefit from using your training at scale system, in that respect?

William Falcon: If you’re trying to train at scale today, you have to deal with a lot of issues like throughput of data, right? So, you might be dealing with terabytes. You need to have dozens of models or hundreds of models trying to access the same data sets. You need to be able to share these data sets. There are a lot of nuances that come around there. And then, not only that, but then you have to try to figure out how to distribute across the GPUs or multiple machines, how to keep things in sync. When you read every tutorial out there, it’s like, here’s how you train this thing on this cloud machine, but it’s always one simple model, one small script on a small data set on one GPU, which is very trivial to do. It’s a little bit misleading because it makes you feel like training on the cloud at scale is easy.

But it’s not, right? It’s really complex. When you’re trying to do it at scale, you could be at a news agency trying to do recommendations in near real time where you have a latency issue where if your models train slowly, you could be behind the curve and your models won’t be recommending articles that are going to show up on the front page of Google results, right? When you’re doing medical imaging, when you’re doing finance, training and deploying is really critical and the speed there really matters. Enterprises can benefit from using Grid because they no longer have to be experts at this, right? We have expert engineers both on the framework side and also on the MLOps side who basically instill the best practices and squeeze the most performance out of these systems, right? So, you can basically take people who are mathematicians, who are coming out of their PhDs, who are working as a machine learning engineer, research engineer, data scientist, and let them do what they do best, which is math, and which is modeling, which is looking at data and trying to get insights. What they’re not great at is engineering, right? A lot of people teach themselves to do this, but it’s not because they want to, it’s because there’s no better tooling out there for that. Grid frees those people up to do what they do best and to focus on their core skill sets.

Guy Nadivi: So, will an enterprise be able to use pre-trained models currently available in TensorFlow or PyTorch, for example, using Grid AI?

William Falcon: Absolutely. So, Grid is agnostic to the framework, right? You can run PyTorch, TensorFlow, and you have these popular frameworks. There are a lot of model hubs out there with pre-trained models. You can take those and run them on Grid in a second. What’s really cool as well, is that the models that you do run on Grid are reproducible. You can actually within your team share those models around. Within Grid, it’s a function of finding the models that are pre-trained and dropping them into the platform and then adding your data to basically fine tune those, which is kind of the starting point for most production AI systems. It’s always kind of getting that first baseline.

But, the beautiful thing about Grid is that the baseline may not be enough for you a lot of times, right? If you’re maybe recommending movies, sure. It can be 80% accuracy because the difference between 80 and 90% may not be apparent, but if it’s for finance or if it’s for medical imaging or for healthcare, the difference between 80 and 95% can sometimes be losing a lot of money or saving someone’s life. It does matter at that point. You’ll have to go beyond just the baselines.

Guy Nadivi: William, can you speak about some of the more interesting use cases Grid AI has applied training at scale to, and the results you achieved?

William Falcon: Yeah, absolutely. Within Grid, we have a research focus as well, right? We’ve worked a lot on self-supervised learning, so this is a lot of the new technology that’s coming out where you don’t actually need labels. Self-supervised learning basically takes data that’s not labeled and lets you learn something from it. This is something that’s been happening for a long time, but recently started working really well. One of the first examples of this is transformers, which have taken off, right? You have BERT that came out a few years ago and those models don’t use labels, right? They use the input to generate the actual training signals. We do a lot of that kind of training at Grid and we use Grid for that. What’s really cool is that we’re able to scale across machines. We’re able to scale up models to have an enormous number of parameters. Recently, we published an article where we used Lightning and DeepSpeed to scale up a transformer model to 45 billion parameters, which is roughly a third of what OpenAI did a few years ago, which is incredible because it was only eight GPUs as well. We’re basically providing a lot of this scalability to groups that don’t necessarily have all those super advanced engineers as well, but really it’s around… So, I would say the primary use cases have been really that. We have users across many companies, news agencies, telco, healthcare, and so on, and they’re using Grid a lot to basically get through their ideas quickly as well, right? I can’t speak about the models that they’re training or what code they’re using because we don’t have insights into that, but we know the output, which you can see in newsfeeds and so on.
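
Editor’s illustration (not part of the conversation): the idea that a model can use its own input to generate the training signal can be sketched in a few lines of Python. The snippet below is a simplified, hypothetical masking routine in the spirit of masked language modeling. The names make_masked_example, MASK, and mask_prob are assumptions made for this sketch, and production models such as BERT use a more elaborate masking scheme than this.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact string is an assumption


def make_masked_example(tokens, mask_prob=0.15, rng=None):
    """Turn an unlabeled token list into an (inputs, targets) training pair.

    Each token is hidden with probability mask_prob. Where a token is hidden,
    targets holds the original token (the "label"); elsewhere it holds None.
    No human annotation is involved: the labels come from the input itself.
    """
    if rng is None:
        rng = random.Random()
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # the model sees the mask...
            targets.append(tok)   # ...and is trained to recover the token
        else:
            inputs.append(tok)
            targets.append(None)  # no loss is computed at this position
    return inputs, targets


if __name__ == "__main__":
    sentence = "training at scale turns months of research into days".split()
    inputs, targets = make_masked_example(sentence, mask_prob=0.3,
                                          rng=random.Random(0))
    print(inputs)
    print(targets)
```

Because the targets are derived from the raw text, any unlabeled corpus can serve as training data, which is what makes this style of training a natural fit for large-scale compute.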

Guy Nadivi: So, if I’m a decision maker at an enterprise looking to deploy an AI system using machine learning, what kind of ROI can I expect from a massively parallel training at scale service like Grid AI?

William Falcon: You start to move into a serverless world, right? You start to move into a world where you don’t need machines running 24/7. That costs a lot of money if you want to reserve those on any cloud provider. Grid lets you basically scale up on demand and shut down. If you also buy your own machines, Grid lets you run on those machines. It’s about equalizing the access to the infrastructure that you have. If you own a cluster, if you have bought a cluster for your company, you will notice that the distribution of usage follows a power curve, right? That means that five percent of the people there use the cluster a lot and the others don’t. What Grid does is it enables everyone to basically use a cluster just as efficiently because they don’t have to be experts anymore. It allows you to get more out of your resources. But also, it just means that you can move through your ideas faster and get something into production way, way, way quicker, something that would take you months to do before. It’s really around productionizing the work that you’ve done internally. The ROI really comes in terms of time to market. So we take that from months into days.

Guy Nadivi: Last year, there was an article in MIT Technology Review about artificial general intelligence or AGI. And, in that piece, the author, Karen Hao, who’s been on our podcast, wrote, “There are two prevailing technical theories about what it will take to reach AGI. In one, all the necessary techniques already exist; it’s just a matter of figuring out how to scale and assemble them. In the other, there needs to be an entirely new paradigm; deep learning, the current dominant technique in AI, won’t be enough”. William, where do you fall on this spectrum? Do you think we need a new paradigm to achieve AGI? Or do we have everything we need right now?

William Falcon: I want to address this question in two parts. The first is kind of the motivation for AGI, right? I think if we think about why we want to achieve AGI, right? Or what does that mean? I think it’s about seeing this as a kind of catch-all solution for everything. But, I argue, and a bunch of other people in the research world argue as well, that humans are not really general intelligences, right? We’re very specific to certain things. We may not even know what AGI looks like ourselves because we’re not AGI. Trying to get at this question is a little bit hard because we’re basically trying to do something that’s approaching like an unknown unknown. Now, if we say, okay, well we do have… That’s the first part. Now, if we do say, “Well, I don’t buy that, I do think that there is an AGI and that we can get there,” the next question is, is deep learning enough? Well, I think you have to generalize the definition of deep learning. Anything that’s differentiable can be thought about as deep learning, right? Random forests: you can approximate those using a differentiable structure. You can also approximate SVMs. Most of the popular machine learning approaches today, you can somehow code in a differentiable way where you can get close to them. If you’re not into math and that doesn’t sound super familiar, basically, what I mean by this is that I think that deep learning is very, very general and it’s a term that can encompass a lot and it’s even things that we haven’t explored today. By our current definition of deep learning today, it’s probably not enough, but as the math and research world expands and more things fall under that umbrella, I think that deep learning will become the catch-all term for kind of the new things that evolve.
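
Editor’s illustration (not part of the conversation): one way to read the remark about SVMs is that a linear SVM can be written as a loss function and trained with the same gradient-based machinery used for neural networks. The sketch below minimizes the standard hinge loss with an L2 penalty by plain (sub)gradient descent in pure Python. The function names, the toy data in the usage note, and the hyperparameters lr, steps, and lam are all assumptions chosen for illustration, not anything described in the interview.

```python
def hinge_loss_and_grad(w, b, xs, ys, lam=0.01):
    """Average hinge loss plus L2 penalty, with its (sub)gradient.

    xs is a list of feature lists and ys a list of labels in {-1, +1}.
    """
    n = len(xs)
    loss = 0.5 * lam * sum(wi * wi for wi in w)
    grad_w = [lam * wi for wi in w]
    grad_b = 0.0
    for x, y in zip(xs, ys):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin < 1.0:  # only points inside the margin contribute
            loss += (1.0 - margin) / n
            for j, xj in enumerate(x):
                grad_w[j] -= y * xj / n
            grad_b -= y / n
    return loss, grad_w, grad_b


def train_linear_svm(xs, ys, lr=0.1, steps=200, lam=0.01):
    """Fit weights and bias by repeated gradient steps on the hinge loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(steps):
        _, grad_w, grad_b = hinge_loss_and_grad(w, b, xs, ys, lam)
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

As a usage example, calling train_linear_svm on a small, linearly separable toy set such as xs = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]] with ys = [1, 1, 1, -1, -1, -1] yields a separating hyperplane for those points. The hinge loss is piecewise linear rather than smooth, so this uses a subgradient at the kink, which is a common practical choice.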

Guy Nadivi: Most people have heard of the Turing test, which basically states that if a human can’t tell if they’re communicating with another human or a machine, then that machine or computer has passed the Turing test. Now, I learned not too long ago that Steve Wozniak, of Apple fame, proposed an alternative called the “coffee test” and this tests a machine’s intelligence by seeing if it can enter an average American home and figure out how to make coffee, which, if you think about it, is not entirely straightforward. It has to find the coffee machine, find the coffee, find a coffee cup, add water into the coffee machine, and brew some coffee by pushing the correct buttons. Now, I know some humans who would have trouble passing that test, myself included, since I don’t drink coffee. William, do you have a personal favorite test when it comes to appraising machine intelligence?

William Falcon: I would definitely have trouble making coffee. I don’t think I’m the biggest coffee drinker. So, I don’t know if I would pass that test. It’s also a super specific test, right? Maybe the way that I would think about a test is… The Turing test, I wouldn’t say it’s long past, but we have systems today that can make you think that you’re speaking to a human and it’s not until you dig into it a little bit longer that you realize that it’s not. Maybe the test that I would frame is something more about humor or having a conversation that’s engaging where you’re going back and forth with people, or the system, I guess, and you can have a dynamic conversation where there can be jokes embedded in there. It’s not just about, Hey, how are you? And then you reply back, but it’s more, can you have a sassy conversation or something more interesting where you have to really read between the lines and think more, have more of the context of the conversation to be able to actually answer the questions.

Guy Nadivi: Mark Twain is often credited with saying, “The art of prophecy is very difficult, especially with respect to the future”. Given your high-level perspective, though, William, what kinds of breakthroughs can you prophesy the Grid AI service will enable for AI practitioners and researchers over the next few years?

William Falcon: I think by removing the need to think about any kind of infrastructure and just focus on the work that people are doing, it’s going to free up so many cycles of their bandwidth of day-to-day work that I think that it’ll enable people to kind of come out of this bubble of, okay, I only have so many machines or I have to even be thinking about how I’m going to do the computational stuff, but instead, focus on the actual meat of the problem that you’re trying to solve and the data that you have. If you can release 20%, 30% of bandwidth of current researchers today, I think that we can accelerate research much, much, much faster. That’s the first part. And the second part is that a lot of the research today and the success of AI comes from rapid iteration. It’s basically, you have a lot of ideas and the faster you can get through those, the faster you get to a result. I wouldn’t say that we’re particularly smart about being able to just know when an idea’s great and just go for it. I think it takes a lot of trial and failure for all researchers. So, the more that you can fail and the faster you can fail, the faster you’ll be able to find a solution that does work and Grid helps you do that because it allows you to try things that would’ve taken you six months in just a matter of days.

Guy Nadivi: William, for the CIOs, CTOs, and other IT executives listening in, what is the one big must have piece of advice you’d like them to take away from our discussion with regards to deploying training at scale for machine learning models at their organizations?

William Falcon: I think today, in organizations, the main focus is always on deploying to production, but people forget that there’s a whole process that happens before that. And, that is actually where the bulk of the slow part happens and that’s where you can get a lot of the ROI, whether it’s data processing, modeling, it all happens during that iteration phase, during kind of the training phase. The production part can be deterministic, right? We know how to do this. We’ve been doing this for many years already, and we know how to scale that up. But, on the training side, it’s really where you can get a lot of that ROI. So, investing in more robust solutions to be able to get the members in the company to move faster through their ideas without thinking about all of this overhead of all of this engineering, it will just convert immediately to a much higher ROI and be able to deliver results faster for the organization.

Guy Nadivi: All right, well, it looks like that’s all the time we have for this episode of Intelligent Automation Radio. William, there are so many aspects to AI and machine learning, which is still very much a mysterious black box to a lot of people. I think you’ve really done a great job today of demystifying training at scale, which is one aspect of AI and machine learning that most of our audience probably hadn’t even heard of yet, but now has a much greater appreciation for. I expect we’ll all be hearing more about it though, in the years ahead. Thank you for coming onto the podcast today and sharing your insights with us.

William Falcon: Thank you for having me. It’s been really fun to chat with you, Guy.

Guy Nadivi: William Falcon, Founder and CEO of Grid AI. Thank you for listening everyone. And remember, don’t hesitate, automate.

William Falcon

Founder and CEO of Grid AI

William Falcon is the creator of the popular open-source project PyTorch Lightning, and the recently announced Grid AI. William created Lightning while doing his PhD at NYU and as a PhD researcher at Facebook AI; Lightning allows users to scale models without the boilerplate and Grid enables large-scale training on the cloud.

Previously, he co-founded the now-acquired NextGenVest and spent time at Goldman Sachs. His PhD (currently on leave to focus on Lightning) is funded by Google DeepMind and the NSF. His research interest is in unsupervised learning and the intersection of AI and neuroscience. William is a native of Venezuela and holds a BA from Columbia University in Computer Science and Statistics, with a minor in Math.
