Open source isn’t working for AI

What’s the point of open sourcing code that runs at a scale no one can replicate? AI needs collaboration, but let’s think about it differently.

We need to rethink how we talk about open source and openness in general. That has been clear since at least 2006, when I rightly got smacked down for calling out Google and Yahoo! for holding back on open source. As Tim O’Reilly wrote at the time, in a cloud era of open source, “one of the motivations to share—the necessity of giving a copy of the source in order to let someone run your program—is truly gone.” In fact, he went on, “Not only is it no longer required, in the case of the largest applications, it’s no longer possible.”

That impossibility of sharing has roiled the definition of open source during the past decade, and it’s now affecting the way we think about artificial intelligence (AI), as Mike Loukides recently noted. There’s never been a more important time to collaborate on AI, yet there’s also never been a time when doing so has been more difficult. As Loukides describes, “Because of their scale, large language models have a significant problem with reproducibility.”

Just as with cloud back in 2006, the companies doing the most interesting work in AI may struggle to “open source” in the ways we traditionally have expected. Even so, this doesn’t mean they can’t still be open in meaningful ways.

Good luck running that model on your laptop

According to Loukides, though many companies claim to be involved in AI, just three are really pushing the industry forward: Facebook, OpenAI, and Google. What do they have in common? The ability to run massive models at scale. In other words, they’re doing AI in a way that you and I can’t. They’re not trying to be secretive; they simply have infrastructure, and the expertise to run it, that you and I don’t.

“You can download the source code for Facebook’s OPT-175B,” Loukides acknowledges, “but you won’t be able to train it yourself on any hardware you have access to. It’s too large even for universities and other research institutions. You still have to take Facebook’s word that it does what it says it does.” This, despite Facebook’s big announcement that it was “sharing Open Pretrained Transformer (OPT-175B) ... to allow for more community engagement in understanding this foundational new technology.”
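
To make the scale problem concrete, here is a back-of-envelope sketch (the arithmetic is mine, not Loukides’): storing 175 billion parameters in 16-bit floats takes roughly 350GB of memory before you count gradients or optimizer state, which puts it orders of magnitude beyond any laptop. The smaller OPT checkpoints Facebook published, by contrast, do run locally, assuming the Hugging Face transformers library with a PyTorch backend:

```python
# Back-of-envelope arithmetic (mine, not Loukides'): why OPT-175B
# won't fit on consumer hardware.
params = 175e9                 # 175 billion parameters
weights_gb = params * 2 / 1e9  # 2 bytes per parameter in fp16
print(f"Weights alone: {weights_gb:.0f} GB")  # ~350 GB

# Training needs far more: a common rule of thumb for Adam-style
# training is ~16 bytes per parameter (fp32 master weights,
# momentum, and variance, plus fp16 weights and gradients).
print(f"Training footprint: ~{params * 16 / 1e12:.1f} TB")  # ~2.8 TB

# The small public OPT checkpoints, by contrast, do run locally
# (assumes the transformers library with a PyTorch backend).
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
inputs = tok("Open source is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```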

That sounds great but, as Loukides insists, OPT-175B “probably can’t even be reproduced by Google and OpenAI, even though they have sufficient computing resources.” Why? “OPT-175B is too closely tied to Facebook’s infrastructure (including custom hardware) to be reproduced on Google’s infrastructure.” Again, Facebook isn’t trying to hide what it’s doing with OPT-175B. It’s just really hard to build such infrastructure, and even those with the money and know-how to do it will end up building something different.

This is exactly the point that Yahoo!’s Jeremy Zawodny and Google’s Chris DiBona made back in 2006 at OSCON. Sure, they could open source all their code, but what would anyone be able to do with it, given that it was built to run at a scale and in a way that literally couldn’t be reproduced anywhere else?

Back to AI. It’s hard to trust AI if we don’t understand the science inside the machine. We need to find ways to open up that infrastructure. Loukides has an idea, though it may not satisfy the most zealous of free software/AI folks: “The answer is to provide free access to outside researchers and early adopters so they can ask their own questions and see the wide range of results.” No, not by giving them keycard access to Facebook’s, Google’s, or OpenAI’s data centers, but through public APIs. It’s an interesting idea that just might work.
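
In practice, that kind of access looks less like cloning a repository and more like probing a hosted model. Here is a minimal sketch of what a researcher’s workflow might look like; the endpoint URL, request fields, and response shape are hypothetical illustrations, not any provider’s actual API:

```python
# Hypothetical sketch of researcher access through a public API.
# The endpoint, request fields, and response shape below are
# illustrative assumptions, not any provider's actual interface.
import requests

API_URL = "https://api.example-ai.com/v1/complete"  # hypothetical endpoint
API_KEY = "researcher-free-tier-key"                # hypothetical credential

def query_model(prompt: str, temperature: float = 0.0) -> str:
    """Send one prompt to the hosted model and return its completion."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "temperature": temperature, "max_tokens": 64},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["completion"]

# Researchers can probe for the "wide range of results" Loukides
# describes, e.g., by checking consistency across paraphrases.
paraphrases = [
    "Summarize the risks of large language models.",
    "What are the main risks posed by large language models?",
]
for p in paraphrases:
    print(p, "->", query_model(p))
```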

But it’s not “open source” in the way that many desire. That’s probably OK.

Think differently about open

In 2006, I was happy to rage against the mega open source machines (Google and Yahoo!) for not being more open, but that accusation was and is mostly meaningless. Since 2006, for example, Google has packaged and open sourced key infrastructure when doing so met its strategic needs. I’ve called projects like TensorFlow and Kubernetes the open sourcing of on-ramps and off-ramps: TensorFlow open sources an industry standard for machine learning in hopes of driving more workloads to Google Cloud, while Kubernetes ensures portability between clouds, giving Google Cloud more opportunities to win workloads. It’s smart business, but it’s not open source in some Pollyanna sense.

Nor is Google alone in this; it’s just better at open source than most companies. Because open source is inherently selfish, companies and individuals will always open code that benefits them or their own customers. It has always been this way, and it always will be.

To Loukides’ point about ways to meaningfully open up AI despite the delta between the three AI giants and everyone else: he’s not arguing for open source as we’ve traditionally understood it under the Open Source Definition. Why? Because as fantastic as that definition is (and it truly is), it has never managed to answer the cloud open source quandary—for both creators and consumers of software—that DiBona and Zawodny laid out at OSCON in 2006. We’ve had more than a decade, and we’re no closer to an answer.

Except that we sort of are.

I’ve argued that we need a new way of thinking about open source licensing, and my thinking may not be so different from how Loukides reasons about AI. The key, as I understand his argument, is to provide enough access for researchers to reproduce the successes and failures of a particular AI model. They don’t need full access to all the code and infrastructure to run those models because, as he argues, doing so is essentially pointless. In a world where a developer could run an open source program on a laptop and make derivative works, it made sense to require full access to the code. Given the scale and unique complexities of the code running at Google or Microsoft today, that requirement no longer makes sense, if it ever did. Not for all cloud code running at scale, anyway.

We need to ditch our binary view of open source. It has never been a particularly useful lens on the open source world, and it’s becoming less useful every day in our cloud era. As companies and individuals, our goal should be to open access to software in ways that benefit our customers and third-party developers, fostering access and understanding, rather than trying to retrofit a decades-old concept of open source to the cloud. That retrofit hasn’t worked for open source generally, just as it’s not working for AI. Time to think differently.
