Episode 538: Roberto Di Cosmo on Archiving Public Software at Massive Scale
Roberto Di Cosmo, professor of Computer Science at University Paris Diderot and founder of the Software Heritage Initiative, discusses the reasons for and challenges of the long-term archiving of publicly available software. SE Radio’s Gavin Henry spoke with Di Cosmo about a wide range of topics, including the selection of storage solutions, efficiently storing objects, graph databases, cryptographic integrity of archives, and protecting mirrored data from local legislation changes over time. They explore details such as ZFS, CEPH, Merkle graphs, object databases, the Software Heritage ID registered format, and why archiving our software heritage is so important. They further consider how to use certain techniques to validate and secure your software supply chain and how the timing of projects has a great impact on what is possible today.
Show Notes
- Twitter: @rdicosmo
- Twitter: @swheritage
- LinkedIn: @roberto-di-cosmo
- Wikipedia: Ceph (software)
- Wikipedia: Merkle tree
- Wikipedia: Ralph Merkle
- Wikipedia: Salt (cryptography)
Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Gavin Henry 00:00:16 Welcome to Software Engineering Radio. I’m your host, Gavin Henry, and today my guest is Roberto Di Cosmo. Your bio is very impressive, Roberto. I’m only going to mention a very small part of it, so apologies in advance. Roberto has a PhD in Computer Science from the University of Pisa. He was an Associate Professor for almost a decade at Ecole Normale Supreme in Paris. You can correct me on that. And in 1999 you became a Computer Science full professor at the University Paris, Diderot, I think.
Roberto Di Cosmo 00:00:49 The first school is École Normale Supérieure. The university is now Université Paris Cité.
Gavin Henry 00:00:56 Thank you, perfect. Roberto is a long-term free software advocate, contributing to its adoption since 1998 with the best seller Hijacking the World, running seminars, writing articles, and creating free software himself. In 2015 he created, and now directs, Software Heritage, an initiative to build the universal archive of all publicly available source code, in partnership with UNESCO. Roberto, welcome to Software Engineering Radio. Obviously, I’ve trimmed your bio, but is there anything that I missed that I should have highlighted?
Roberto Di Cosmo 00:01:29 Well no, I can just sum up, if you want. My life is really three lines: 30+ years doing research and education in computer science; a quarter of a century advocating for free software and its use in all possible ways; and, for the last 10-15 years, trying to lend a hand in building infrastructure for the common good in software, which is my main work today.
Gavin Henry 00:01:32 Thank you, perfect. So for the listeners, today we’re going to understand what Software Heritage is. Just a small disclaimer: I’m a Software Heritage ambassador, so that means I volunteer to get the message across. So we’re going to talk about what Software Heritage is. We’re going to discuss some of the issues around storing and retrieving this data at global scale. And then we’re going to finish off the show talking about Software Heritage IDs and where they come in and what they are. So let’s get cracking. So Software Heritage, Roberto, what is it?
Roberto Di Cosmo 00:02:29 Well, okay, to put it in a nutshell, Software Heritage is trying to build two things at the same time. One is a “Library of Alexandria” of source code, a place where you can find the source code of all publicly available software in the world, no matter where it has been developed, or how, or by whom. The other is a revolutionary infrastructure at the service of different kinds of needs. First, the needs of cultural heritage preservation, because software is part of our cultural heritage and needs to be preserved.
Roberto Di Cosmo 00:02:59 It is an essential infrastructure for open science and academia, which need a place to store the software used in research and to make that research reproducible. It is a tool for industry, which needs a reference repository for all the software components in use today. And it is also at the service of public administrations, which need a place to safely store and show the software used in handling citizen data, for example, for transparency and accountability. So, in a nutshell, Software Heritage is trying to address all these needs with one single infrastructure.
Gavin Henry 00:03:38 When we talk about publicly available software, is this typically things that would be on GitHub or GitLab or any of the other free open-source Git repositories, or is it not limited to Git?
Roberto Di Cosmo 00:03:50 Yeah, the ambition of Software Heritage is actually to collect every piece of publicly available software source code, no matter where it is developed. So, of course, we are archiving everything that is publicly available on GitHub or GitLab or Bitbucket, but we’re going much broader than that. We’re going after tiny forges distributed around the world, we’re going after package managers, and we’re going after distributions that share software. There are so many different places where software is developed and distributed, and we actually try to collect it from all these places. In some sense, one infrastructure to bring them all to the same place and give you access to mankind’s software in a single place.
Gavin Henry 00:04:36 Thank you. So if you didn’t do this, what problems arise here?
[...]