Transcript: SESSION IV: FOSS, Blockchain, and AI
November 2, 2018
Jerome Greene Hall, Columbia Law School
Session Description:
Everyone’s doing it. Companies that have not been traditional generators of free and open source software are now major contributors and stakeholders in FOSS Blockchain and AI projects. Companies that have built empires in proprietary software are doing AI development as FOSS. In this session, we discuss the centrality of FOSS to Blockchain and AI and the legal issues that will arise as we go on.
Presented by The Software Freedom Law Center and Columbia University Law School with James Waugh of Bidio, Scott Nicholas of the Linux Foundation, and Susan Malaika of IBM’s Digital Business Group and moderated by Eben Moglen, Professor of Law at Columbia Law School and Founder of the Software Freedom Law Center, and Mishi Choudhary, Legal Director of the Software Freedom Law Center.
See event page for further information.
Transcript
MISHI CHOUDHARY: This is our Session IV: FOSS, Blockchain, and AI. The buzzwords are everywhere. Everyone’s doing it. Companies that have not been traditional generators of free and open source software are also now major stakeholders and contributors to Blockchain and AI projects, and the companies that have built empires in proprietary software are doing AI development. Now we will discuss why FOSS is essential to Blockchain and AI, and the legal issues as they arise.
Our first session is going to be about Hyperledger and the Enterprise Ethereum Alliance, how they join forces to drive mass adoption of Blockchain technology. We have with us James Waugh. He started Bidio in 2016 to help artists get sponsored without compromising the integrity of their work in collaboration with Trevor Overman, creator of the multi-dimensional Token structure. He’s now building CRE8.XYZ to facilitate coordination with greater accountability. He’s going to talk about what does this mean for developers? How are these open source communities working together? James, thank you.
JAMES WAUGH: Thank you so much. Appreciate it. I’m James. I’ve been running the Hyperledger NYC community for a little over a year now, and I just volunteer doing that because it’s a great way to meet awesome people working on open source technology. And I love these three buzzwords: open source, Blockchain and AI. Unfortunately, we’re focused on the first two, not all three. Couldn’t jam all three into one talk, but we’re talking about Hyperledger, which is an open source community for developing frameworks and tools that help businesses create their own Blockchains.
First I’m just gonna give you a high-level overview of what Hyperledger is, because there’s a lot of misconception about the name itself, which has the word ledger in it. Hyperledger is not actually a ledger. It’s not a Blockchain or a cryptocurrency. It’s just a growing community of developers working on projects, hosted by the Linux Foundation. And these projects are open source frameworks and tools that other developers can use to build their own Blockchains. The governance model is a Technical Steering Committee which is elected every year. And that’s kind of how the decision-making process operates. You have people from members that get elected and then decide whether projects and, you know, different repositories get merged into the overall code base.
Here’s a breakdown of all 10 projects. You’ve got five frameworks and five tools. I’ll go into a few of these in detail, but the most important ones are the frameworks. You’ll see Fabric and Sawtooth are the modular frameworks that you can use to build your own Blockchain. Indy can be used with those frameworks to implement identity frameworks and systems. And Iroha is for mobile application-focused developers, and then you’ve got Burrow, which is something I’ll focus on later in the talk, but that’s the permissionable smart contract interpreter contributed by Monax. And then all those tools down there, you see Composer, Cello, Explorer and Caliper, those are also being contributed in an open source environment and facilitating the development of the frameworks along with actual implementations. Here’s just a taste of the governing board members of Hyperledger. You see American Express, so big corporations, but also contributing members who are also corporations like Intel, IBM, Oracle, etc.
And there have actually been, in the lifetime of all 10 projects, 729 unique code contributors representing more than 150 organizations, and in total there are 277 Hyperledger members today. Here’s just a few stats to throw at you to showcase the momentum we’ve built over the past few years: Like I said, 277 members, a ton of contributors; across all the projects, over 88,000 commits have been made, 11.2 million lines of code; and one of the meetups, Hyperledger NYC, is part of this growing community of over 140 groups worldwide. So today I thought I’d share a recent development within the Hyperledger and Ethereum communities, which is the Associate Membership formed October 1st between Hyperledger and the Enterprise Ethereum Alliance. And so as a developer, I am going to tell you what that means for people building, but I know I want to bring it back to the legal considerations for Blockchain in a business context.
Just to give you an idea of the difference between Ethereum, which is a public Blockchain in a public network, versus the Enterprise Ethereum stack, you can see here the application layer is totally separate from what is the public Ethereum implementation, and then you can see the blue stuff on these two slides is separate. That’s an additional implementation that provides features that permissioned networks utilize. Really the only thing that’s core to the protocol is that yellow stuff, which is from the yellow paper from years ago. But you can add on trusted execution, private state, private transactions, these things that the permissioned networks need in addition to the core technology. That’s just an idea of how Ethereum is different from the Enterprise Ethereum Alliance implementation.
Recently, after Hyperledger and the EEA joined forces, the EEA released their 2.0 spec, which introduced the permissioning model. So that’s a huge deal because prior to this, Enterprise Ethereum Alliance didn’t require a permissioning model to be compliant with their spec. Now going forward the two groups are more aligned. Hyperledger being more code first, and the EEA being a standards body, more spec first. Working together we can drive progress and adoption of this technology, because we’ll have more flexibility and interoperability. And that has a lot of implications for various use cases, which I’ll get into.
But just real quick, the permissioning model has four smart contracts. I’ll just introduce that idea: the participant contract, the participant group contract, network and permissioning decider. So the key factor here is the permissioning decider actually controls whether invited participant groups get added to the network. So participants within those groups make transactions, but then participant groups actually vote or invite new participant groups into the network and the permissioning decider approves or rejects those invitations or approved invitations.
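[Illustrative sketch, not part of the spoken remarks.] The roles described above can be modeled roughly as follows. This is a toy in-memory model, not the EEA specification or any Hyperledger API: the class names mirror the four contracts named in the talk, but every method, and the rule that only an already-admitted group may invite, is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ParticipantGroup:
    """A group whose participants may transact once the group is admitted."""
    name: str
    participants: Set[str] = field(default_factory=set)


class PermissioningDecider:
    """Hypothetical rule: approve an invitation only if the inviter is already a member group."""

    def approve(self, inviter_name: str, network: "Network") -> bool:
        return inviter_name in network.groups


class Network:
    """Holds the admitted groups and defers every admission to the decider."""

    def __init__(self, decider: PermissioningDecider, founding_group: ParticipantGroup):
        self.decider = decider
        self.groups: Dict[str, ParticipantGroup] = {founding_group.name: founding_group}

    def invite(self, inviter_name: str, new_group: ParticipantGroup) -> bool:
        # The decider, not the inviter, has the final say on admission.
        if self.decider.approve(inviter_name, self):
            self.groups[new_group.name] = new_group
            return True
        return False

    def can_transact(self, participant: str) -> bool:
        # Participants transact only through an admitted group.
        return any(participant in g.participants for g in self.groups.values())
```

In this sketch a participant's ability to transact follows entirely from group admission, which is the point the speaker makes about the decider controlling who joins.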
So really quick, this is the logic of a smart contract. I know I’m at Columbia Law School, so I’ll just, you know, admit straight up: smart contracts are not smart and they’re not contracts, but that’s what we’re calling it. So I know that’s not what I’m here to talk about, like, oh, we’re going to change the law forever, but because people matter, like that is the most important thing. Government is not the same as governance. I could ramble about that, but essentially for the inputs you’ve got the contract ID, the transaction requests, the dependencies, and the current state. That goes through the contract interpreter, which in the Hyperledger context would be Burrow, an implementation of their forked permissionable smart contract interpreter, the EVM re-purposed. And then it outputs either accepted or rejected transactions, so that’s different from a permission-less network. You can actually send something to someone that you’re transacting with and they can reject it or they can, you know, communicate with you before sending the result to the rest of the network. It’s not about whether they can just reject it. It’s about, they can say like, oh, no, that’s not good. Like without anybody else seeing it. So there’s the attestation of correctness, which is involved in the consensus algorithm. And then the state delta. If it’s not accepted, you don’t have a state delta; that’s just the change in the current state from the input.
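[Illustrative sketch, not part of the spoken remarks.] The input and output flow described above can be pictured with a small interpreter loop. This is not Burrow or the EVM; the `execute` function, the `Result` type, and the toy `transfer` contract are hypothetical names, and the contract ID, dependencies, and attestation of correctness mentioned in the talk are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

State = Dict[str, int]


@dataclass
class Result:
    accepted: bool
    state_delta: Optional[State]  # None when the transaction is rejected


def execute(contract: Callable[[State, dict], State], request: dict, current_state: State) -> Result:
    """Run a contract on a copy of the state; report acceptance and the resulting delta."""
    try:
        new_state = contract(dict(current_state), request)
    except ValueError:
        # Rejected: the current state is untouched and there is no delta.
        return Result(False, None)
    # The delta is just the set of entries that differ from the input state.
    delta = {k: v for k, v in new_state.items() if current_state.get(k) != v}
    return Result(True, delta)


def transfer(state: State, request: dict) -> State:
    """Toy contract: move `amount` from `sender` to `receiver`."""
    sender, receiver, amount = request["sender"], request["receiver"], request["amount"]
    if state.get(sender, 0) < amount:
        raise ValueError("insufficient balance")
    state[sender] = state.get(sender, 0) - amount
    state[receiver] = state.get(receiver, 0) + amount
    return state
```

A rejected request produces no delta at all, matching the remark that without acceptance there is no change to the current state.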
Bringing it back to the legal frame of mind. We’ve got three important questions in the permissioned network context, which are, you know, who has the right to vote on the correctness of a transaction? And how is that right determined? And then what happens when that right is revoked? What is that? Maybe there’s arbitration involved or, you know, some kind of legal course of action that’s necessary. So diving into some of the frameworks. Burrow is really important in this context of the unification of Hyperledger with the EEA because it is an EVM fork.
So they actually couple their EVM execution engine with the Tendermint consensus algorithm, which is like a delegated proof of stake model, and then that’s over the application consensus interface, application Blockchain interface. So in Ethereum, you have ABIs, which are just like application binary interfaces for contracts, the interface with the contract, but this is actually for interfacing with a permissioned network. So that happened. You know, originally the Burrow virtual machine was a novel concept when they first contributed it to Hyperledger, but since then, you know, there’s a lot of forks of the EVM, and the way Burrow is used is really the most important thing to consider.
So Sawtooth last year introduced the Seth transaction family, the Sawtooth Ethereum transaction family. So transaction families are just like smart contracts in Sawtooth. So it’s just like a smart contract that allows you to interact with web3 using the Seth-RPC module. So that, in the Seth transaction family, you see in the top right. That’s what enables you to deploy an EVM bytecode smart contract using Sawtooth as a framework. And it is default permission-less, so this was the first time that Hyperledger was pushing permission-less architecture, which is pretty cool because most people usually think about Hyperledger as the permissioned Blockchain ecosystem.
Moving on from Seth from last year, this past month Fabric released 1.3, which had some really cool new features, like the Identity Mixer, which uses zero-knowledge proofs to provide an identity format which is anonymized, and then also, more relevant to this conversation, the support of EVM bytecode smart contracts, so now you can write contracts with Solidity or Vyper, which are the two programming languages introduced by the Ethereum developer community, and you can use those to implement your own Blockchain with that runtime.
So you have like the Solidity contract running on your network, which is permissioned, or not if you don’t want it to be. The idea is that you don’t want smart contract development, you know, in the Ethereum community to be mutually exclusive from permissioned network development. So in addition to the support for Solidity, Fabric also released an equivalent to Seth, which is called Fab3, kind of like the Seth-RPC actually, and that allows you to interact with web3.js.
Taking a step back from that stuff, you know that’s just the trajectory of Hyperledger, going from permissioned to more permission-less and inviting Ethereum developers into the Hyperledger ecosystem. And that resulted in some really cool and really valuable collaborations, like the Decentralized Identity Foundation, which actually drove adoption of the Decentralized Identifier Standard or the DID standard. So you’ll see that across both Ethereum, or the EEA, and Hyperledger communities. So that’s one thing. Indy is the framework but Sovrin is the first implementation of that. It’s a network that provides correlation-resistant, pseudonymous identifiers. And they also have some zero-knowledge proof tools, but relevant to the legal considerations, you know, definitely worried about privacy compliance. GDPR is on everyone’s mind these days.
Getting into privacy and confidentiality, you’re probably familiar with how privacy is more social. It’s kind of fuzzy or squishy, you know, it’s not really defined in an objective way, but you can think about confidentiality, so that’s just concerned with the enforcement of acceptable policies for who has access to what info at what time. You know that allows participants to keep data about transactions confidential, including just the existence of the contract itself. They can choose when and with whom to disclose their data, which is crucial to GDPR compliance. They can actually avoid putting PII on chain entirely, so you can selectively disclose parts of the data, and, obviously, as a party, your right to disclose it is transferred with ownership transfer. In an ideal world we’d like to show that users can’t learn information that they aren’t supposed to know. We want to do that in a scalable, secure way. So here’s a breakdown of how the EEA has revised their approach to private transactions. There’s an interesting distinction, a little more technical, but just at a high level. Fabric took a different approach in providing these private data collections as opposed to the EEA and their clients. They’ve implemented more, like, an actual definition of the private data in the transaction itself. And so there’s two different types.
There is restricted or unrestricted. So metadata, if you’re not familiar, is just like data about the data, about the payload data, so you can see that you must not allow storage by non-participating nodes if it’s restricted private transactions, and that’s for the payload data. So you see, the bottom row is definitely the most important, that’s where the difference is. You should not allow storage if it’s restricted metadata, and you must not allow it if it’s restricted payload data. So the biggest advantage of this joining forces will be interoperability. And that’s what open source is all about, I think. And it’s bringing new meaning to that word. It’s actually focused more on the operations of a technology stack, so a business could benefit from being more transparent and sharing data with competitors in various ways. So this is a new possibility that’s being explored and experimented with. So Hyperledger is one of the first groups to start advocating for a more pluralistic view on Blockchain. There’s a lot of maximalists, you’ve probably heard that term if you’re interested in the Blockchain space, but Hyperledger is all about, you know, many chains. This idea of cross-ledger transactions is really important.
Quilt is a tool which is used for that kind of horizontal interoperability. So I won’t go into this too much. These are all the different considerations you need to review when thinking about interoperability of cross-ledger transactions. Here are other use cases for interoperability that don’t involve cross-ledger transactions, so like credentials and identity, merging or upgrading your ledger, logging and querying the chain itself. We’ve got the transaction interoperability and smart contract interoperability as well. Finally, last two slides, big picture. Interoperability is really important from the developer’s perspective. But how about from the real world perspective? I think the most important thing that Hyperledger and the Enterprise Ethereum Alliance can do is educate the world about Blockchain technology. They’ve already started doing so, providing certifications and open MOOCs, these massive open online courses, through edX and other providers.
And finally, governance. I think that’s a big, big part of it. Will we see Ethereum community members join the Hyperledger Technical Steering Committee or vice versa? Will they start to further join forces, not just as associate members but actual, you know, unification? And that’s kind of where I’ll leave it and throw in a plug for the Global Forum in December. If you’re interested in this and you want to get involved, it’s a great way to tap into the conversations. And it’s extremely open. I got interested in Hyperledger when I visited Chicago for the hackfest a year and a half ago, and I had the best time just listening. It’s remarkable how open and welcoming the community is. And, you know, you’ll have executives from IBM and Intel, you know, kind of like arguing across the room with everyone, you know, there just to witness. So that’s unique. I think it’s the reason why Blockchain is one of the most significant open source communities out there. Yeah, I know I ran over a little bit, but I’d love to answer questions. I know we’ve got another presentation, and I’ll be around afterwards if you have more.
EBEN MOGLEN: So I think what we’ll do is we’ll save all the questions for a panel at the end. Run straight through. Scott Nicholas is the next speaker. Scott is the senior director of strategic programs at LF. And that means he makes projects happen at the Linux Foundation. And lately that’s been the structure of the Deep Learning Foundation, which is an umbrella project for various LF AI machine learning activities.
SCOTT NICHOLAS: Thank you. As Eben said, I spend most of my time setting up open source projects and supporting them once they’re launched. In March of this year, we launched our umbrella project for AI. An umbrella project, as Mike talked about earlier today, is when we have a single funding effort that supports multiple technical projects in a space. Each technical project is free to have its own technical governance, its own technical roles, the OSI-approved license that works for it, but funding requests come from a single body.
Since we launched LF Deep Learning Foundation in March, I’ve been leading our effort in supporting that. Mishi asked me to come and speak to you today about open source software and why is it necessary or whether it’s necessary for AI. I think it’s very necessary to AI for the same reasons that it’s necessary to other areas of software, with a couple of different kickers. So, in exploring these questions right now, I am going to walk through these points: Why do organizations pursue open source strategies in general? Open source AI today, where we’re at. The promise of open source AI ecosystems, and some personal predictions, that won’t be official LF predictions but my own thinking on where things are going.
All right, why do organizations go into open source? The way I look at this is there’s three drivers, two of which are economic. The third is because it’s the only way. The first is to reduce development costs. We often refer to this as outsourced R&D or external R&D. If everybody has to do the same thing to enable a product, we’re all copying each other. Maybe there’s no secret sauce. Instead of all of us building an ecosystem, why don’t we have some of our developers contribute to an ecosystem, an operating system, as an example. Why don’t we have some developers contribute to an operating system, and we all use that as a base layer, just to give one example. That takes a large amount of investment that is duplicative, reduces it, and allows each organization to focus on what they do differently, what they do that’s unique, where they want to push their technology.
The other reason organizations are getting into open source and have been getting into open source is because it enables new business opportunities. Mike talked about this this morning, when he talked about transformational projects, projects that create sales opportunities, revenue opportunities, new services that didn’t exist before. In AI we’re seeing this take place with respect to many different areas.
One that’s in LFDL we’re witnessing in real time is the development of business models around AI models. How do you build AI models? What are the services that you can wrap around model construction? Can you deliver models? Can you have a free platform to deliver models and then build on a royalty scheme on top of that? Am I moving around too much?
AUDIENCE MEMBER: Yeah.
SCOTT NICHOLAS: [Laughter] It’s exciting stuff. That’s all I have to say. So, new business models. The third reason we see organizations interested in open source is because sometimes it’s the only way to solve a problem. The common example we use here is security. The chain is only as strong as the weakest link. OpenChain is another example of a group coming together to make the entire system more secure. These examples port very well over to AI. We see large numbers of examples of organizations that are taking internal development; they’re making it an open source project.
The hope is they will see developer diversity, organizational diversity coming into the project, and something that would otherwise be a captured expense that they would have to bear can be a shared expense throughout the community. They can still innovate on top of that, and that’s where their technical development, their R&D, internally will be focused. But the base layer can be a shared, collaborative effort. We’re not there yet. We’ll talk about where things are at in the moment. Right now, we have a lot of projects that don’t have a lot of developer diversity. They’re heavily weighted towards a single organization, even though they are open source. But that first reason applies to AI. Clearly, enabling new business models: I just gave the example of the AI model itself.
There’s lots of other ways in which new revenue opportunities for businesses are being opened up through open source collaboration in artificial intelligence. And then, in terms of solving problems that we can only solve together: Susan Malaika from IBM will be speaking next, and I’ll tee up a little bit of this for her. They’ve done a lot of really interesting work in bias and AI fairness. There’s… Unless we have a common language for talking about things like data provenance, we won’t be able to understand weaknesses and limitations that might be present in the model.
And so, together, working in a collaborative open source and open standards environment, we have the ability to address those issues. So, open source AI today: we see significant fragmentation. I’ll have a slide on this in a second. High proportion of projects with primarily one organizational backer. Code is often developed internally, in many cases to support a product or service offering, and then is open sourced. That’s not unique to AI. I think a lot of things start out that way. Oftentimes, projects are highly specialized for performance of specific tasks. Not a lot of contributor cross-pollination among projects; that’s something that LF Deep Learning Foundation is specifically focused on, not just on the three projects that it currently has, but across the broader space. Projects are often tightly coupled with original authors. Fast release cycles and significant interest from academic institutions. Fragmentation. These are just frameworks, deep learning frameworks, and it’s not an exhaustive list. It’s a partial list.
There is a duplication of efforts right now, and that’s okay. Fragmentation in and of itself isn’t a bad thing. Over time, through working together, through having these projects coordinate with each other, they can begin to focus on areas where they excel. And one of the promises of open source collaboration is to reduce fragmentation and waste. So, all the traditional reasons for developing open source ecosystems apply to artificial intelligence: reduce fragmentation; reduce waste. The waste reduction isn’t just across open source projects, but it’s also from an internal R&D development perspective. Allow projects to specialize on their core strengths and go upstream for components whose functionality is secondary to the project, and enable new business models.
And a couple additional reasons that I’ve touched on briefly, but to sum up as to why the open source software model is particularly valuable for AI. It allows us to address BIBO: Bias In is Bias Out; garbage in is garbage out. If you have ten models for assessing your data and they’re all stacked on top of each other; one is pulling out people in a video stream; the other is looking at faces; if you have bias that creeps into one of those bottom layers, you have the entire analysis chain out of alignment with what you’re trying to solve. We need to share the context of training data. We need to share what it means to have a model trained in a particular circumstance, and data is a key ingredient.
That requires cleaning, sorting, tagging, and provenance tracking of data, all things that we’re familiar with from an open source perspective, in terms of “Where did this code come from?” We have to ask the same questions with respect to data. I will close with personal predictions. We will see increasing coordination across these projects. Fragmentation will be reduced through collaboration in both open source and open standards efforts. Dominant owner projects, over time, will migrate towards community governance. We will see the emergence of data curation projects. The community will focus on AI fairness and bias. And competitive dominance will not just be driven by the technology but also access to data. And thank you.
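[Illustrative sketch, not part of the spoken remarks.] The "Where did this come from?" question for data can be approximated with a small provenance record that ties a source and a license label to a content hash. The record layout and function names here are assumptions for illustration, not an established standard, and the source and license strings in any usage are example values only.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    source: str        # where the data set was obtained (free-form label)
    license_name: str  # license under which it was shared (free-form label)
    sha256: str        # fingerprint of the exact bytes that were recorded


def record_provenance(data: bytes, source: str, license_name: str) -> ProvenanceRecord:
    """Fingerprint the data so later users can check they hold the same bytes."""
    return ProvenanceRecord(source, license_name, hashlib.sha256(data).hexdigest())


def verify(data: bytes, record: ProvenanceRecord) -> bool:
    """True only if the data is byte-for-byte what the record describes."""
    return hashlib.sha256(data).hexdigest() == record.sha256
```

A hash only shows that the bytes are unchanged; it says nothing about how the data was collected or whether it is biased, which is why the talk pairs provenance with cleaning and tagging.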
EBEN MOGLEN: I’m professionally happy that people are speaking rapidly. You understand me: because it’s making it easier for us to keep the time, but it isn’t increasing the clarity with which we are letting these themes emerge. So, soon, I hope, everybody will see that all this adds up to one thing and not just two or three things done rapidly. Susan Malaika, who, I think, is going to put it all together for us, is Senior Technical Staff in IBM’s Digital Business Group and a person whose work is causing more open source contributions to occur at IBM: a very important job I honor and admire immensely. She works on data governance matters mostly now, which brings us to the subject of the day, as far as I’m concerned, in this panel, which is how to understand the complexity of all the data that we generate and the things that we do with it. Susan, I leave it to you.
SUSAN MALAIKA: Thank you, thank you. I’ll just try and even tie things together a little bit even before I begin on my formal presentation. I only just realized kind of a day ago that there was Hyperledger as well as AI in the same session. I will mention that I do actually run hackathons. That’s one of the things that I do, and I’ve run a number of Hyperledger hackathons, and just to put it into context, I’ve also run them in the Middle East, with refugees, for example. And just to give you an idea of the use cases for Hyperledger, they may not be exactly what you expect.
People came up with using blockchain to manage their documents, their personal documents. As refugees, they lose all their materials and so on. And so, it’s not just about finance, but it’s about things as well and managing things. Another example was organs, managing donations of human organs through blockchain. So, to just give, you know, just to give you a little idea of some of the activities that come out of Hyperledger, they’re not just related to business or finance.
The other thing I wanted to say is that I was recently in Delhi at the Freight-Forwarders professional association, a global association. Freight forwarders are the people who fill in all the ships and containers. They’re not the people who manage the containers; they’re the people who fill them. And they specifically asked for a talk on AI and blockchain, and that’s what we covered, the two topics together, which was blockchain as the underlying infrastructure for the freight and the managing of the freight, and then AI systems sitting on top of the blockchain-based transactional systems.
The two technologies go together. And you can Google me. My Twitter handle is up there, @sumalaika. I tweeted out my slides from that session as well. So, relating Hyperledger and AI together, and also just to tie in a little bit with Scott, the frequency and the amount of open source contributions and consumption at IBM has grown so hugely in the last couple of years that we’ve had to automate the processes. Most of the approvals are automated now. There’s no meeting to discuss whether we’re going to contribute or we’re going to, you know, consume this. There are certain parameters, now, that we have. It’s only then that we would actually do a human review.
The reviews now are actually programmatic–and we have a whole explanation and presentation that we could give about how we’ve automated our open source activities inside the company.
Now, I’m going to move on to talk a little bit about AI. AI has prompted various predictions, such as the one that appeared in Wired magazine a couple of years ago: “no more coding.” It’s all going to be training. You’re going to train software like you train a dog. And that’s what the article goes into. And so, what you use for training is data. And that’s the food that you give the dog. And so, on the left-hand side of the slide, we see the various categories and the terms that people use. So, artificial intelligence is the overall term that people use for analyzing the past in order to predict the future, to net it out. You give the system data, and it uses reasoning, machine learning, and various techniques to predict the future, so that’s the overarching category of artificial intelligence. And within that, there’s something called machine learning, which is quite well-established.
And what that is is a human decides which features are important to analyze from the past in order to predict the future. The human says, “Okay, I think age and gender are important to predict whatever, so this is the data that I’m going to analyze, those particular characteristics, in order to predict something.” And then the software starts doing the predictions and eats up the data and is focusing on those two features. And then there’s something called deep learning, where the software itself doesn’t rely on a human to identify the features. The software decides. It’s given information saying “This person behaved,” or, this is the classic example, “This is a photo of a cat; this isn’t a photo of a cat.” And you keep feeding the deep learning neural network pictures of cats and non-cats. And then it starts understanding how to identify a cat without you needing to tell it that a cat has two ears and some eyes and so on. So, those are the broad-brush categories in AI, and deep learning is the part that’s really attracted the attention. That’s what people are focusing on. And a lot of the frameworks that Scott showed earlier (PyTorch, TensorFlow), that’s what they do; they do the deep learning; they decide what’s important; the software decides from whatever it’s fed and then starts making predictions from the past.
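[Illustrative sketch, not part of the spoken remarks.] The first case above, where a human picks the features and the software learns only from those, can be shown with a very small classifier. This is a nearest-centroid toy in plain Python, not any framework named in the talk; the feature names (`age`, `hours_online`, `shoe_size`) and the labels are invented for the example.

```python
def select(record, features):
    """Keep only the human-chosen features, in a fixed order."""
    return [record[f] for f in features]


def train_centroids(records, labels, features):
    """Average the chosen features per label; everything else in the record is ignored."""
    sums, counts = {}, {}
    for rec, label in zip(records, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(select(rec, features)):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, record, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    vec = select(record, features)

    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda label: dist(centroids[label]))
```

Here the choice of `features` is made by a person before training starts. In the deep learning case the speaker goes on to describe, that selection step is what the network is left to work out from the raw examples.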
The reason why it was saying it’s “the end of code,” or the person who wrote the article was, is because what you do now is you just give another data set to one of these frameworks, and it starts predicting something; you don’t need to reprogram; you just take one of these frameworks. So, you take a model that somebody’s already trained. A model is a mix of some data and some rules, some parameters, that we defined, that made it good; the software actually was able to predict, distinguish cats from other things easily so that you make a model. This is a model that predicts cats from photographs. And then you take that model and maybe adapt it a bit to work on something else. So, what’s happening is people aren’t just sharing code anymore; I mean, they continue to share code, but they also share data.
And so, one of the renowned places to go and look for data sets is Kaggle, which is a contest that people participate in, and it’s based… You know, and there’s many data sets, and you go in, and you take that data set and participate in a competition that’s focused on a particular data set, and then you get ranked at the end. And so, that’s a place where you can find data sets; however, the… And there’s lots of other places. Governments have open data sets, and the licensing of data sets isn’t particularly clear, nor is how you collaborate on data sets, but there is a license that came out, I think about a year ago, the Community Data License Agreement from the Linux Foundation, which I’ve put on the top right on the slide, and that may be a place to start looking and exploring how to license data sets so that they can be shared properly.
Model marketplaces is another area, and I’ll have a couple on the next slide. So, people are putting models in, and you can go in a marketplace on… And you can go and download a model and use it as a basis for an application that requires some element of prediction. And licensing, again, for models is not very clear either.
That’s another new area. So, what does it take to trust decision-making by a machine? Some of the questions you may want to ask, like “Is it fair?” And there’s been a lot in the press the past couple of years about very poor predictions because of very poor data sets that the software was trained on.
And because if it was pictures of people, only certain kinds of people were included in the training data sets, and then, so, it didn’t know how to identify people from different ethnicities, for example. It didn’t know that that was a person. “Is it easy to understand?” “Can it explain itself?”
Last night I was at a meetup which was a panel around AI and there were experts who build models on that panel, and they all said they could not… They are the experts, but they don’t know what their model, how it’s actually making predictions. They can’t explain. You know, they train it and so on, but then it starts making predictions, and they wouldn’t be able to explain it. Somehow, we need to factor that in, maybe, when we’re working in this area. It’s becoming very difficult as the area becomes more complex to know why certain things are being predicted. And, “Did anyone tamper with it?” It’s possible to go in, and, let’s say, the software that predicts a cat in the photo, you can mess with it. You can mess with the photos, so you give it pictures of cats, but you mess with the photo a little bit or/and with the input, and it could say it isn’t a cat. And that’s a known way of tampering with an AI system.
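[Illustration, not part of the spoken remarks.] The kind of tampering described above, messing with the photo a little so the system stops seeing a cat, can be sketched for a toy linear classifier. The weights, the bias, and the four-value "photo" are invented for illustration, and the perturbation is a simple sign-based nudge in the spirit of the fast gradient sign method, not a description of any particular attack tool.

```python
# Hypothetical parameters of a tiny linear "cat detector".
WEIGHTS = [0.9, -0.4, 0.3, 0.8]
BIAS = -0.5


def is_cat(pixels):
    """Classify as cat when the weighted sum of pixel values is positive."""
    return sum(w * p for w, p in zip(WEIGHTS, pixels)) + BIAS > 0


def perturb(pixels, epsilon):
    """Shift every pixel by at most epsilon in the direction that lowers
    the cat score: down where the weight is positive, up otherwise."""
    return [p - epsilon * (1 if w > 0 else -1) for w, p in zip(WEIGHTS, pixels)]


photo = [0.5, 0.5, 0.5, 0.5]        # invented pixel values in [0, 1]
tampered = perturb(photo, 0.2)      # each value moves by only 0.2
```

With these made-up numbers the original `photo` is classified as a cat and the `tampered` one is not, even though no pixel changed by more than 0.2. Toolkits of the kind mentioned in the talk aim to find weak points like this before someone else does.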
Just to summarize some projects which are available now that we’re working on, and we’d be delighted if more people joined us. One is the project Egeria, which is a project at the Linux Foundation, and it’s all about sharing metadata across tools. And so, there have been many attempts in the past to do, to sort of harmonize metadata, especially in companies. Metadata, as was said earlier, is data about data, and when your photos have descriptions in your phone when you take a photo, where it was taken, what time it was taken, and so on, so that’s the metadata for the photo. And there have been many attempts over the years to deal with metadata, and they’ve all failed. This project is taking a slightly different approach in that what it’s saying is we don’t need to identify a single place to store the metadata. It’s identifying a way that metadata can be shared. That’s what it’s focusing on. Different tools can share their metadata and ask other tools, “Oh, do you have information about this characteristic? Please tell me about it.” So, Egeria, she’s a goddess, a goddess of wisdom, and that’s the name of the… It’s a fairly new project and we’d love for people to join.
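[Illustration, not part of the spoken remarks.] The approach described above, where there is no single place to store metadata and tools ask one another instead, can be sketched as follows. This is only a toy model of the idea; the `Tool` class, its methods, and the sample photo metadata are hypothetical and are not Egeria’s actual interfaces.

```python
class Tool:
    """A tool that keeps its own metadata and can ask peers for what it lacks."""

    def __init__(self, name, metadata):
        self.name = name
        self._metadata = metadata   # {asset_id: {attribute: value}}
        self.peers = []             # other tools this one may query

    def local_lookup(self, asset_id, attribute):
        """Answer from this tool's own store, or None if it does not know."""
        return self._metadata.get(asset_id, {}).get(attribute)

    def ask(self, asset_id, attribute):
        """Return (source_tool_name, value), checking locally first and then
        asking each peer in turn; None when nobody has the answer."""
        value = self.local_lookup(asset_id, attribute)
        if value is not None:
            return self.name, value
        for peer in self.peers:
            value = peer.local_lookup(asset_id, attribute)
            if value is not None:
                return peer.name, value
        return None


# Two invented tools, each holding different metadata about the same photo.
camera_app = Tool("camera_app", {
    "photo_001": {"taken_at": "2018-11-01T19:30", "location": "New York"},
})
catalog = Tool("catalog", {"photo_001": {"owner": "alice"}})
camera_app.peers.append(catalog)
catalog.peers.append(camera_app)
```

Here `catalog` never copies the camera’s metadata into its own store; it just asks, and the answer comes back tagged with the tool that supplied it.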
Another couple of projects that have been released in the last six months from IBM. One is adversarial robustness toolkits, and that’s a toolkit that helps you understand your AI system, the places, the weak points, where people can come in and tamper. Remember I said you could tamper with a system so it stops actually recognizing cats properly. And this is a set of tools, the Robustness Toolkit, to help you understand the points where your system may be a little weak. Similarly, with the AI Fairness toolkit, the AI Fairness 360, it’s also providing a toolset to help you understand that you have biases in your input data sets or anywhere along the pipeline. The pipeline in AI is the data collection phase; the building of the classifier, the thing that says is it a cat, is it not a cat; and then the prediction.
It has pieces, tools, the AI Fairness 360, at every stage to help you understand whether you’ve got biases in your pipeline. And then, finally, I just would just like to mention something called Factsheets for AI Services. This is a paper which has been submitted to a conference, with the idea that, just like you have nutrition labels on food, on AI systems you’d have something similar, which would be–The provider of the AI system would answer a certain set of questions. So, this is early days, but I’ve put links. I’ll tweet these slides out in a moment. I’ve put links to all the things I’ve talked about, and, on Wednesday, in New York, we’re running a meetup, specifically hands-on, specifically around the AI Fairness 360 toolkit. So, just either look at my Twitter, @sumalaika, or Big Data Developers in NYC. I also have, like this gentleman here, also run a meetup here too in New York. Anyway, thank you all very much.
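[Illustration, not part of the spoken remarks.] One way to check for the kind of bias in input data sets mentioned in the talk is to compare how often each group receives the favorable outcome. The sketch below computes two commonly used measures, statistical parity difference and disparate impact, in plain Python. The rows and group names are invented, and the code does not call the AI Fairness 360 library, whose own interfaces may differ.

```python
# Invented data set: each row is one person, their group, and whether the
# recorded outcome was favorable (1) or not (0).
ROWS = (
    [{"group": "a", "favorable": v} for v in (1, 1, 1, 1, 0)]
    + [{"group": "b", "favorable": v} for v in (1, 1, 0, 0, 0)]
)


def favorable_rate(rows, group):
    """Share of the group's rows that carry the favorable outcome."""
    members = [row for row in rows if row["group"] == group]
    return sum(row["favorable"] for row in members) / len(members)


def statistical_parity_difference(rows, unprivileged, privileged):
    """Difference in favorable rates; 0 means both groups fare the same."""
    return favorable_rate(rows, unprivileged) - favorable_rate(rows, privileged)


def disparate_impact(rows, unprivileged, privileged):
    """Ratio of favorable rates; 1 means both groups fare the same."""
    return favorable_rate(rows, unprivileged) / favorable_rate(rows, privileged)


difference = statistical_parity_difference(ROWS, unprivileged="b", privileged="a")
ratio = disparate_impact(ROWS, unprivileged="b", privileged="a")
```

In this made-up data, group "b" gets the favorable outcome half as often as group "a". The same comparison can be run on the collected data, on the classifier’s training labels, or on the final predictions, which is the sense in which a check exists at every stage of the pipeline.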
EBEN MOGLEN: Thank you very much. Stay, stay, stay. Scott, James, do you want to come up, and we’ll try and make some conversation for people? So, let me see if I can ask the dumb question that will at least get us started. The reason that we need open source is that the complexity of the software development we are pursuing is so great that it could not be done any other way, right? There’s no organization in the world that could manage eleven million lines of code generated over eight months. There’s no possible way that we could understand how machine learning systems work, given that their architects can’t understand them because all they do is they throw neural nets together in a cookbook, and stuff comes out. There’s no way that all these projects could manage that reduction of fragmentation that you’re talking about unless they could all see one another’s code and understand what the common denominators were in their technical approaches. We’d lose our minds if we didn’t have open source. Is that right?
JAMES WAUGH: I believe so. I just want to echo two points they made because it’s all about the training data in this context. You know you got to be more transparent with the context of that data. You mentioned that point. So, it’s not just about sharing the code but also sharing the information you collect. It’s a big part of open source these days.
SUSAN MALAIKA: One reason, one reason for open source could be that you need more people working on the software in order to make progress because it is complicated, but the other is that through open source you get a bigger ecosystem. Open source attracts more people, and if you do it in-house, closed source, you can never keep up. So, that’s the other aspect.
SCOTT NICHOLAS: Yeah, I would agree with that.
EBEN MOGLEN: That gets us the first point, which is that we’re sharing the code because we wouldn’t be able to manage all the code; we wouldn’t be able to keep our levels of technical progress up, and because the code doesn’t matter very much because it’s all in the data. And this, of course, raises another question. I keep saying every year, “Come back to the conference next year. We’re going to talk about machine learning licensing and free software.” But I keep not doing it because there’s a new license every six months, but we don’t actually know what the principles of data licensing ought to be yet. I think we have some ideas. Let’s just try a couple of the questions which we’re accustomed to with respect to code and see if we can understand the licensing of data. In the world of code, we said copyleft was required because if we didn’t have copyleft, then users wouldn’t have any rights. What is the users’ rights concept in training data and the mixed-provenance data sets? What right ought users of data to have in this world?
JAMES WAUGH: I can start. I was at the same event you mentioned.
SUSAN MALAIKA: You were last night?
JAMES WAUGH: Yeah, I was standing in the back.
SUSAN MALAIKA: I was sitting in the front, at the podium.
JAMES WAUGH: They brought up some really interesting points, and I’m probably just going to echo that, you know, the whole we talk here, but they mentioned the example you brought up. In the context of facial recognition software, it’s a question of whether you should limit the exposure to a certain ethnicity that’s more prevalent in your market. So, if you’re Apple, and you’re building Face ID, and you got a lot of access to white faces but not other, minority groups, should you limit how many, you know, majority-group, you know, subjects you analyze, to make it fair? So, that’s a really interesting question, I think. From a capitalist perspective, they’re going to say “no.” Like, “We’re Apple. We’re going to make it as best as possible for every group,” but does that leave, you know, the minority groups at a disadvantage?
I think it does, and it hurts them in some ways. So, it’s a really important consideration, and, not to ramble on, but we mentioned the data, but I actually think, more importantly, you got to be careful with your assumptions because every assumption is based on assumptions. And you can never make a fully accurate prediction, so you got to be aware of that. You know, you don’t just blindly trust these algorithms.
EBEN MOGLEN: What should the rules be about, when we improve or modify data, to whom we have to give the improvements we have made? There’s going to be a clear distinction in our world between improvements or modifications to data and arrival at inferences or predictions. People are going to try and keep inferences or predictions to themselves because they’re going to make money on the basis of their predictions, but that doesn’t necessarily mean that their modifications to data, which help them to produce those predictions, are also property of the modifier. Is there a role for copyleft-like licensing for data in this world?
SUSAN MALAIKA: Well, the CDLA license does have two flavors. It does have the flavor of, “If you make modifications, you do give it back,” so they’ve… You know, the license hasn’t yet taken off, as far as I can tell, but that’s in there as a thought that there are at least two flavors.
EBEN MOGLEN: And the original IBM thinking about it was copyleft-like licensing for data, and the arrival of the permissive license in the process of license-making was a little bit disquieting to me because, in a world of permissive data licensing, we might as well not bother with licensing at all. It’s just data that’s out there; people make private modifications; they keep them to themselves; they have no obligations of reciprocity, and, in the long run, we are left holding a lot of free software models even, but we don’t know what the data is, and everybody’s data is his, her, and its own. At that point, any conception of fairness is gone. Fairness has disappeared into proprietary control. I’m interested in social fairness, but I’m also interested in the general advantage that we have received from this consensus we’ve arrived at about how to make software, a consensus which came in part out of the scientific method, which was a consensus about publishing and sharing data, and it’s interesting to me that we now have to reinvent those fundamental premises.
SCOTT NICHOLAS: Isn’t there also a burden, and a responsibility on the part of the user, however, to understand what they’re taking? To borrow an old telecom term – and this was the point that you had made about making assumptions about what you had – to borrow an old telecom term, these are not God boxes. They are not magical. They will not have the right answer. We, as users, must always fight the assumption that because it comes out of that process it is inherently right. We need to question the data and be part and have a voice in the communities that decide whether the community wants to have a permissive approach or the community wants to have a sharing-based approach.
EBEN MOGLEN: This was an important part of what we found ourselves doing in the early stages of the free software movement. We were changing the technical education system. I started saying twenty years ago that free software was the greatest technical library ever created in the world because it was the only body of technical information that existed where you could go from naiveté to the state of the art in anything you wanted to understand that computers could be made to do, simply by reading material that everybody had. This point that you’re making about having an educated understanding of what data can do and what pattern recognition in the data can do is another example of a need to expand the educational structure of the world.
Mishi and I were talking in Bangalore a couple of years ago with the educational nonprofit of a very wealthy Indian IT billionaire, Nandan Nilekani, and we were meeting with EkStep to talk about the way in which EkStep as a charity hoped to change Indian education, and I said, “You know, in the Twenty-first Century, the society with the most data scientists wins, and there are a lot of people in India, and if you made all of them data scientists, then you would win.” And data science is much simpler to teach than complex statistics, so now what we need is for people to be able to learn data science. But, of course, if you want to learn data science, you have to have data. And really what we have to have is a socialized approach to this question: “What do we know about data? How do we learn about data? How do we understand it?” And we have to begin teaching people fairly young the kinds of questions that you’re leading to, right? “How do you know what a data set is? How do you understand its integrity? How do you understand its biases? How do you understand what will happen if it is fed into models of various kinds?” Without the openness of all of this, including the copyleft-ness of all of this, we will not be able to keep that educational commitment, because we won’t have enough material for people to learn from.
SUSAN MALAIKA: I have a question. So, what do you think the tipping point, though, was for software? What made it really, sort of the open source movement, take off? And is there a similar thing we should be looking for with data?
EBEN MOGLEN: That’s a terrific question. It’s probably my long life with Richard Stallman, but I would say that the tipping point was the difficulty of writing C compilers that didn’t cause enormous failure, and GCC happened to be the thing which solved the problem of the awfulness of C, and it pulled towards itself everybody in the world, and everybody began contributing to the compiler because if the compiler wasn’t perfect, and if it didn’t work for your operating system, then no program was reliable. So, my own personal supposition, for whatever it may be worth, is that the tipping point was the beauty of the multi-stage compiler and what Richard made up all by himself in the middle of his ruminations, without having ever taken a compiler course.
SUSAN MALAIKA: Could car driving be the tipping point for it?
EBEN MOGLEN: Well, that was part of what I was hoping we were going to think about with respect to the cars and every challenge that the cars present. But I also think that what you’ve said here is that we’re going to use data in such complex ways. What Hyperledger is is the community of people replacing the database-oriented enterprise software structure with a whole series of new modalities of what software is, all coming together all at once, in all the different ways that you were showing, right? And the database was a comparatively simple way, and it got built inside buildings that look like disk drives, and, you know, it was an in-house kind of a thing. So, the paradigm of software-making came out of two or three places, and Oracle and DB2.
But now that post-database structure, that way of storing data and accessing it that we are calling blockchains, all of the new paradigms of software are being constructed all at once in parallel by large numbers of people working together in a welcoming open source structure. Same thing is happening with respect to the code, for pattern matching, machine learning, all of it. But the data?
SUSAN MALAIKA: No, not yet.
EBEN MOGLEN: Okay, so now we know that data is the new petroleum, and now we know that making an oligopoly out of the oil business is the need of a lot of very powerful people in the world, and we need a revolution about that the same way we needed a revolution about free software code. That’s the hypothesis; what do you think?
SCOTT NICHOLAS: I think data is the commodity, and we will see, in future years (getting back to personal predictions) the rise of very powerful players in the AI space simply by virtue of the data they can tap into.
SUSAN MALAIKA: But data has a problem, though, that maybe software didn’t quite have the same problem, the problems of privacy, the problems of government or entities taking control. And so, that’s the other side of it, which maybe software didn’t have so much.
EBEN MOGLEN: Indeed. Questions? Yes.
AUDIENCE MEMBER: Thanks a lot. That was really a fascinating panel that really identified, in the Q&A now, data is the key concern and the key resource for the revolution that you’ve been describing, and, as was pointed out, the other key theme is really peace, and I’m wondering if there’s a stronger connection than was acknowledged so far between those two in the sense that is it possible – so, that’s the hypothesis – that the consensus on open source software development and big tech companies buying into that consensus is driven, has been made possible, by the realization on the part of these big tech companies that to really retain a comparative advantage, because they have an advantage in terms of having access to data and are able to retain that advantage?
And you’ve mentioned data sharing, and one can think of use cases and rich private incentive of companies to share data, but we haven’t seen that much of that, and we have this asymmetry with open data initiatives in which governments make data available, and we have, like, really small cases of data philanthropy in the data sets you mentioned that have all these flaws, like the Enron emails that trained all these machines with emails that old white male executives sent to each other. So, I guess my question is, rather than law and privacy law being this obstacle in this world, couldn’t, isn’t there a strong, relatively strong, case for government involvement in the sense of requiring data sharing from private actors in the form of mandatory data sharing? And, if that’s the idea, like, is there… Can we rethink some of the legal apparatus that we have, like forced data localization, not as a thing that threatens the internet but as a data proliferation tool that is actually an equalizer for the digital economy? Thank you.
JAMES WAUGH: I’ll just say really quick, it’s a unique, you know, attribute of data is that it’s non-rivalrous, but it does provide significant advantages in that competitive landscape you’ve mentioned. So, withholding access to the troves of data from India would be a tremendous advantage for their open source community, but it wouldn’t be for the global open source community, I think. I think it comes down to intentionality and, you know, realizing that shared goals exist, and, you know, with a permissioned network, you can actually control what information you share and still benefit from that transparency.
EBEN MOGLEN: Others?
AUDIENCE MEMBER: One of the things I’m concerned about is if we’re trying to create a community-supported, you know, collection of data, you know, with code, code is sticky. I want to use a Linux kernel. I want to have a kernel-loadable module. There’s only a certain number of ways I can do that, and I know there’s less implications in that. Data doesn’t have that. I can use data a billion different ways. I can use it for training, and then I want to add data to that. But I don’t have to actually add it to the data set. That can just go… The next day, train it on the data, and there’s nothing that’s forcing the sharing. So, how are we going to address it?
SCOTT NICHOLAS: More data is incrementally much more useful; it grows in a nonlinear – and I’m not a data scientist; I’m an attorney, so take this with a grain of salt – but the usefulness of the data, I would posit, grows nonlinearly. And so, there’s that aspect. The other issue is – and I think this ties in somewhat to the question – we have to have a way of talking about the data. And there needs to be not just contribution of data, but we have to know how to talk about the data and how to know that we are comparing apples to apples and that data sets are compatible, and that’s an opportunity for open source collaboration.
SUSAN MALAIKA: Another thought could be that, “Okay, the pressure… We wouldn’t be putting pressure on people making data available on its own, but the data and the model together.” Maybe that’s another way of looking at it. And so, the pressure would be the two together, and then you get the intentionality a little bit, as well.
SCOTT NICHOLAS: We’re already getting requests to include data with models, and so we’re talking on how to address that.
SUSAN MALAIKA: So, the combination might be the approach.
JAMES WAUGH: And it starts with systems thinking, realizing that you’re part of a bigger market. You know, there’s information you don’t have access to; you might have an advantage with your data set, but the other data sets are out there. So, it comes to a point where there’s a group that coordinates and a group that doesn’t, and they gain advantages, and that’s what triumphs, adoption of the consortium model or cooperative model.
KAREN COPENHAVER: I will say that I don’t know what the points in time or what the events will be. But, well, Eben is obviously completely correct that the inflection point for open source software was this incredibly compelling code that came from Richard and how hard it was to build excellent code. But then there were some events in the broader community that caused companies to really embrace that opportunity. And one was a weird thing called the bubble, which was a time when all concerns about risk changed from concerns about the risk of, you know, of the FUD stuff, you know, “Where did this come from?” and all this sort of stuff, to “If I miss this window, I will miss it forever.” So, time became the biggest risk factor, and the availability of these excellent assets became the solution to that, and that never changed. The bubble went away. A lot of stuff went away. Sockpuppets stopped going public, but that never went away. The second thing that happened was, you know, most easily described through the IBM perspective, but it really, it came from a lot of different places, is that there was a dominant operating system, and there was a position where, you know, everybody else in the industry was trying to build an operating system, and none of their customers cared about that operating system. And the idea, from Irving Wladawsky-Berger, was, “Look at what you could do if you had one operating system throughout that went all the way from pervasive computing all the way up to supercomputers,” and pervasive computing, when he said this, in, what, 1999 or 2000, nobody knew that that was cellphones. But, you know, that concept that the only way to compete with complete market dominance was for everybody else in the market to build a single solution was very important. And that idea of what was possible if they did that was very important. 
And the piece on the data that I think has some possibility, I think, of being relevant, is that we will have companies with very dominant positions on data. And they will have a fear of what will happen if that position becomes so strong that regulation is absolutely required. And I think you will see public-private partnerships emerge because they will see that, if everybody else gets on the other side, that that market dominance could be altered. And so, they either want to be riding that wave or not. I don’t know what’s going to happen–
EBEN MOGLEN: But I’m going to give you the last word about that, because I think that’s a good place to stop, actually, because I need to kick everybody else off using organizer’s privilege so that I can do a little talking myself. Thank you all very, very much.