Transcript: Keynote Address - “Journalism for an Open Source World” by Julia Angwin
November 2, 2018
Jerome Greene Hall, Columbia Law School
Presented by The Software Freedom Law Center and Columbia University Law School with Julia Angwin of The Markup and moderated by Eben Moglen, Professor of Law at Columbia Law School and Founder of the Software Freedom Law Center.
See event page for further information.
EBEN MOGLEN: Good morning. It’s nice to see so many friends. I also am very grateful to Max for a wonderful talk. It made it feel very much like law school first thing in the morning. I think we can agree that what would really be good would be to approach these questions after having taught copyright law for a couple of decades in a major law school, and it would be good to have written a couple of the licenses, and it would be good to have advised a lot of the people who did the making of the software. That’s what we do. That’s how we do it. That’s how I’ve always done it.
Yeah of course you have to think about foreign law. That’s why GPLv3 was the project that it was. What really happened, if I can begin by pointing it out this way, and I’m sorry to be bragging first thing in the morning, what really happened was that the communities had the better legal advice for the longest time. I take some responsibility for that. I was busily trying to train lawyers to work in the companies. That’s one of what SFLC’s major purposes was from the beginning, from my point of view, was to train lawyers, and we sent a lot of people out into the world, some of them like Richard Fontana, to the primary commercial distributors of software, some to nonprofits of various kinds. But always I thought the trick was to make really good law school in the way that you have in mind, and we are now in a world in which so broad is the consensus in favor of what we do that we need an enormous cadre of lawyers taught the way you are proposing to teach them, Max. And I think that’s terrific, but it’s a really big job, because there’s a lot of law in the world not just a spread of a couple of cases here and there from the Missouri Court of Appeals or the Ninth Circuit.
And what Mishi and I have been trying to do for the last decade is to make India, which is a place where the software is going to be made, also a place where the law is clearly understood, and it would be fair to say at a venture that Indian copyright law has not been a subject that U.S. corporations were much interested in. But they’re going to be. So I think that the thing to do at the beginning of the morning is to see what the theme is that emerges from this part of the conversation and all the news that we have in the back of our minds at the moment.
The theme is that we now have global consensus in favor of a certain way of making things by collaborating and sharing. There’s nobody left who represents a formal antagonist to that way of making software. This is what it means to be welcoming Microsoft into our home as we are now doing with such satisfaction, on my part, at any rate. We all now agree that there is nobody left in the world of making information technology, nobody left in the world of computer science research, nobody left in the world of security and privacy, who doesn’t agree that the way we make software is by sharing, and that that process of making software by sharing has begun affecting many other parts of society very strongly.
I don’t hold these conferences mostly to declare victories. I hold these conferences mostly to ask questions of all the smart people I’ve met in the course of decades of doing this work. But today I want to make a small claim of victory. Namely, we won.
The way we said, Richard Stallman and I and a bunch of other people who really cared about this, the way we said we ought to make this stuff in the world, we now have agreement about. That represents the end of an era in which there needed to be some defensiveness in our posture. We needed to protect ourselves against something. We were concerned that somebody out there rich and powerful and with political influence available to them through the weight of money in society would stop our way of doing things. And so we had a little truculence on us from time to time. I plead guilty. I was the lawyer after all. So I had to put on the suit and go out and be truculent. But there’s no cost for that.
What we wanted to do, we have done in that regard. There are no more enemies. Now this has enormous implications, which we are beginning to explore in various conversations today, because the consensus is not perfect and it isn’t complete and it doesn’t embrace everything. It leaves an awful lot of uncertainty around the edges. There is still plenty of darkness at the edge of town, and the larger the town by mere geometry the more darkness at the edge of it.
But we are not any longer in a world in which the idea is somebody thinks we’re cancer. It’s unimaginable that that was how it was. But that was how it was. My practice was built on having to deal with the idea that somebody out there thought we were cancer and was pretty, well, you remember Mr. Ballmer. I don’t need to explain.
But that was then and this is now, and now is the beginning of another phase. We have moved out of the world in which the ideas that we were contesting for had real adversaries, into a world instead of co-operative puzzlement. Globalization in the largest sense. We exist in every copyright system in the world, because in every copyright system in the world, somebody is making contributions to some software or offering pull requests. And we exist in everything from single person projects generated by one genius who just has to do what she has to do all the way up to things as complicated as the manufacture of the dominant single computer operating system kernel involving every kind of hardware at every scale across the entirety of everything we use sand grains to weld together to make.
So, obviously we now have a very big system supported by a very large consensus about which it is correct to say, as Max has done, that it expanded so rapidly from its beginnings that most of the people who were part of it may never have had occasion to think about its first principles or the nature of its law. About that there is good news. We welcome you into a well-built house. It didn’t just work. It just worked because it was extremely well designed by people who weren’t me so I don’t have to claim anything about that. And then with a little bit of thorough legal skill and some moxie, we made people believe it. That’s true. That’s true. And they made trillions of dollars with it, trillions of dollars with it. Entire industries, a whole new paradigm of how I.T. works, which we call the Cloud, which basically means using our stuff in every possible configuration, and depending upon the ability to roll software together in a few nanoseconds and hand it to somebody and say here’s a container you might want, would you like me to execute it for you or would you like it to go somewhere else in the world to start running? Extraordinary technology adaptable and capable of almost everything that we could ever want to do being offered around the world to people as utility. And all of it based upon a set of legal arrangements, which of course were originally machined to U.S. tolerances only, and have remained as simple and non-contractual and please let me add the word, non-contractual because all that stuff you said is fine if they are contracts, which they’re not.
So all of this happened, and now we’re all in it together. We don’t have anybody left who wishes to identify him, her or itself as the antagonist to this, and we haven’t even gotten to 10:00 in the morning yet. So that tells you what an extraordinary time we’re in. The consequence of that is to be seen in a number of different contexts, only some of which are about software.
Yes, David, is there something urgent you wanted to add?
DAVID LEVINE: Yes, well, I just wanted to add you’re talking about the sort of confirmation of sort of where we are, and this week, there was sort of another confirmation that we won, which is that IBM just agreed to pay 34 billion dollars for Red Hat, which is the largest amount of money ever paid for a software company and it’s for a company that, our CEO will say, has no intellectual property. We have intellectual people, and it’s all about open source. It’s also seven times more than IBM has ever paid for another acquisition. And so what more confirmation do you need than that we’ve won?
EBEN MOGLEN: I’m very glad you brought it up, because it’s nice that it lies in your mouth to point that out. Of course, the possibility that somebody is paying too much for something has to be considered in a formal structure of analysis here. But let us assume, because I have always felt that it was worth all the money it would cost to buy it whenever IBM finally decided to do it, that it’s a terrific decision at the current market pricing. And you’re right, of course.
I mean there’s many pieces of that, and I’m sure we’ll talk about them. One is that I have had occasion to say in many conference rooms over the years that the Red Hat company and what I do both provide you with insurance. The difference is that Red Hat charges you premiums. The importance of what has happened is partly that everybody understands that the risk of actually being the hand distributing the software has fallen drastically, and it suddenly makes it possible to pay very large prices for things that you used to think you needed to keep separate in case there was a liability issue.
And that too is a very important sign of what has happened to the underlying legal understandings about how all this stuff works. They’re not just good. They don’t just make people a lot of money. They are things on which you would bet very large sums of cash that they will keep working the way they have been working, and the smart money believes that. And I think that’s terrific because I labored long and hard to create something that the smart money could really believe in, and it worked.
But that tells us that we have two things to consider about that consensus. Because as I say, we’re still in the early part of the morning and that’s not even the big news of the week, which you conveniently brought in. What we have to ask is: So what does this consensus now represent that we have? That in Microsoft’s coming to ask to be like us, all together in one happy view of all of this, a wonderful thing. That my eyes have lived to see it is a joy to me. What does that consensus now mean?
There are two questions this morning yet to be discussed. One question is how can we get some benefits, some dividend, from this peace for the people who make the software? What more can we squeeze out of the overarching consensus that will allow us to make things still stronger and still better? And the other thing that we must ask about this consensus is what is it doing elsewhere in society? What has this consensus wrought beyond the corners of our industry? This is why it’s such a pleasure to me that Julia was willing to come and talk this morning.
In law school, we value good explaining. The late Marvin Chirelstein, my dear colleague, was the greatest explainer of tax law who ever existed. And one of the finest of all explainers I ever saw in a law school, and deserves undying honor for excellent explaining. Julia Angwin is one of the people who has actually won a Pulitzer Prize for explaining. In 2003, Julia and her colleagues at The Wall Street Journal were awarded a Pulitzer for explanatory reporting. That was explaining the role of corporate corruption in American life. It was done in 2003, perfectly. As you see, the only thing we did was to ignore all of the consequences.
Julia left The Wall Street Journal to go to ProPublica when we began to watch the changing of journalism and has now become founder of The Markup, a new form of, well, open-source newsroom journalism. I was reminded when Julia and I were chatting before the conference began of a moment in December of 2010, almost exactly eight years ago, when Julia and I were standing in a House hearing room in the Longworth Building for a hearing on privacy that was about to commence. Facebook was trying to get me thrown out of testifying on the grounds that I was a liar, if you can imagine somebody trying to prevent congressional testimony on the grounds of untruthfulness. But it didn’t work, and I testified truthfully as far as I’m concerned.
But Julia and I, in the early days in December 2010, we were talking about the effect of WikiLeaks on journalism. And we were very sure about it. And we were completely wrong. And the history of why we were completely wrong and what happened next is really the history of how somebody with a Pulitzer Prize for explanatory journalism becomes the world’s leading exponent of what it means to run an open source newsroom. And so while we’re going to spend an awful lot of today on other forms of the consequences of this consensus, I thought we should begin by understanding the echoes of our consensuses. They roll off the hills and slay and create fake news at the same time. Julia.
JULIA ANGWIN: Thank you, Eben. As usual, your introductions are so spectacular that I’m not sure I can follow up. But let me see if I can get my slides. It really feels like much longer ago than 2010 that we were in that hearing room talking about privacy. That was hilarious because that was when the Obama administration first announced, I think, their attempt at legislation for the Privacy Bill of Rights, which was too weak and was defeated, defeated by both sides. It was too strong for the corporations and too weak for the advocates. And now it seems so strong that we would never actually get to it.
As you guys probably know, we’re the only Western nation without a baseline privacy law protecting consumer data. But that’s not what I’m here to talk about. So I’m going to just talk about journalism for an open source world and maybe that doesn’t quite make sense. And I’ll explain what I mean by that.
I’m going to start with actually just a small bit about who I am, why am I here. I’m not a lawyer. I grew up in Palo Alto, and my parents were really like they had moved there to be part of the personal computer revolution. They were super excited, you know, a mathematician and a chemist. And they wanted to be involved because computers had gone from the size of this lectern, right, to you know a box like this. And so this was our first computer at home.
I learned to program in fifth grade. I never had a typewriter. We had at any given time, five or six computers in our house. And I actually thought that there were two choices in life: you could choose to go into software or hardware, and that was it. I didn’t really understand that there were more options because that’s what I knew, and also, I knew the obvious answer was software because those are two worlds, you know, very separate. And my family was a software family.
I went on to college at the University of Chicago where I studied math. They didn’t allow you to study computer science, but I did programming in LISP, an adorable language that we all are fond of. But I felt like rebelling. And so I was also editor of the college newspaper. When I graduated, although I had a job waiting for me at Hewlett-Packard, I actually decided to try my hand at journalism for a little while. I thought I’ll just have some fun, then I’ll go back to software. I never went back. I first went to the San Francisco Chronicle, and I found that every newsroom was so excited that there was somebody who knew something about computers in the newsroom.
I wasn’t actually a very good programmer, I can just say that. But I knew what computers could do and couldn’t do, which actually is like a weird thing. A lot of people just still think it’s magic, and they think computers can do everything. And they don’t have an understanding of what the constraints are.
And so eventually I ended up at the Wall Street Journal. I joined in 2000, and they asked me to cover the Internet. And I was like, well anything in particular about the Internet? And they were like, no, just anything, you know about computers. It was the dot-com boom, and they had so many advertising dollars that they were just collecting reporters who knew what the Internet was as fast as possible. And I sat there for 14 years. I did a series there on privacy, etc. And during that time I developed a thought about what journalism should be, and what I found, and this is since then as well, is that journalism has changed. Our role in the world has changed, but we haven’t adapted to it as an industry. And so I’m going to talk a little bit about that and come back to why I’ve decided to found my own newsroom to make a better type of newsroom.
So, traditionally, journalism has been what I call a witness plus publisher, right? So you go somewhere that people can’t ordinarily go, you witness it, and you publish it. And these two things were both things where journalists had an advantage. They could get places other people couldn’t go, because you needed a press pass or you needed a travel budget. And then they needed a way to transmit their images or their stories back and publish. Now both of those things, as you probably already know, are no longer monopolies held by journalists. Right? Anyone can witness and publish, and in fact some of the most important events are witnessed by citizens because you can’t have enough journalists to be ready for when the cop sprays pepper spray into the guy’s eyes. And so witnesses are everywhere and publishers are everywhere. That means that journalism probably needs to change, a little bit, about what it thinks of itself as.
And I guess I would argue it hasn’t changed enough. But this is one of the reasons we have the world of fake news, or whatever you want to call it, actually is because of this ability to witness and publish. You can create fake witness and publish. And so we have this incredible decentralized world that means that people when they go to look for news, they have a million sources, hard to distinguish, because some of what citizens witness is actually the truth and some of it is fake. And how do you distinguish? And so I think that in this situation journalists should think about a different role for themselves, actually. We still need to witness certain things, and there are certain places we will still always be able to get in.
But we also need to do things that no one else can do. And one of those things is forensics, the idea of verifying and authenticating witness accounts, and building a forensic trail. If you’re leaked a document, one of the most important jobs as a journalist is to verify its authenticity. This is why Dan Rather lost his job–he wasn’t able to adequately verify the authenticity of a document. And that’s something that I think journalists are going to need to do more and more of and are not particularly well equipped to do because there’s not a forensics desk in most newsrooms. And there should be.
And similarly, we need to go and witness things that the public can’t see, right? We don’t have a competitive advantage as journalists in places where citizens are all standing around with their cameras. But there are places we can go that other people can’t go. I would say one of those places is actually what I’ve made my career on: going inside the machine. A lot of my investigations have been about how algorithms actually work. And so that’s something that is very hard for a regular person to figure out. But when I have been able to apply public records and computer scientists to these problems, and I’ll talk about some of that work later, we have been able to find things out about what’s happening in automated systems that people can’t see.
There’s new kinds of journalism that need to be done, in my opinion. And I think also some of what needs to be done is we need to have a new language around news. The language of news has always been about breaking. Things are breaking. It’s like a wave is breaking onto the shore. When [my son] runs out and chases the wave back out to the water–that’s what journalists sort of have this mental model, like you’re chasing this thing as it rolls away from you. And then you send back a little dispatch, like how far did the wave get, you know? And I just think that that model really assumes that news is something external to you and it is like an event that happens. And sometimes that’s true.
If the Challenger explodes, that’s a news event that happens that you just have to address at the moment. But many of the issues that we face as a society, some of the biggest issues, are not really events, right? But there are things the news needs to investigate. I personally would like to know about Trump’s taxes. What did he pay? When did he pay them? But that’s not an event. That’s not a breaking event, but it is something that we need to know as a society. And there are things we need to know like are we spending our budget appropriately as a nation, as a municipality. These are the kinds of things journalists should be looking at, should be acting as watchdogs on, and they are not breaking news.
And so, I have been proposing that we should think of ourselves more as scientists and base our news on the scientific method, because the scientific method is a proven model. Take a hypothesis and then you test it. And do you have enough data to support that? Hypothesis: the U.S. has a secret program to tap, to log every single phone call of Americans and share it with the NSA, a violation of what we understand to be the rules governing intelligence collection within the U.S. In that particular case when Snowden leaked his documents, there was only one document. The FISA order, the secret court order that allowed for the approval of that program. And so the evidence needed to prove that hypothesis was one data point. But in some cases, you need 10,000 data points.
When I did a story about software used in criminal sentencing, and I asked the question of the data: is this software providing biased results towards black defendants? We needed 10,000 data points to prove that hypothesis. I think that you can still do all the stories that we called news but if you frame it around a hypothesis, you have a little bit more control. Journalists are asking the questions that are most important as opposed to chasing a lot of events which are obviously manufactured to distract us.
I mean it’s certainly true, when I was looking at the paper this morning, that all sorts of things are not events that we need to address but make it onto the front page. Like the caravan of immigrants. I’m not the first person to propose this new model for journalism. Actually this wonderful man Philip Meyer proposed it in 1973. He wrote a book called Precision Journalism and basically his idea was that journalism should be based on the scientific method. And I love his quote. I don’t know if you can actually read it, but he says at the end a journalist has to be a database manager, a data processor, and a data analyst. And I think that’s really true.
Journalism is actually about collecting data. Every time a journalist conducts an interview with someone, that is a data collection. And we just have generally had a rule, sadly, in journalism that you kind of needed three interviews for a story. Sample size 3. My argument is sample size 3 is appropriate for some hypotheses and for some hypotheses you need a larger sample size. That is basically how I would like to reinvent journalism.
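The gap between a sample of 3 and a sample of 10,000 can be made concrete with a rough calculation. The sketch below is an illustration only, not a method described in the talk: the function name `worst_case_margin` is hypothetical, and it assumes the question is a yes/no proportion, simple random sampling, and the textbook normal approximation, which is crude for very small samples.

```python
import math

def worst_case_margin(n, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from n samples.

    Assumes the worst case p = 0.5 and the normal approximation. For tiny n
    the approximation is poor; the point is only to show the scale.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(0.25 / n)

# Compare three interviews with larger data sets.
margins = {n: worst_case_margin(n) for n in (3, 100, 10_000)}
for n, m in margins.items():
    print(f"n = {n:>6}: about +/- {m * 100:.1f} percentage points")
```

Under these assumptions, three observations leave a margin of more than fifty percentage points, while 10,000 bring it down to about one point. A single authentic document, as in the FISA order example, is a different kind of evidence and is not governed by this arithmetic.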
But his dream was not realized, in case you’re interested, and this may be just too wonky for all of you lawyers. But maybe you’re interested. Basically, the way most newsrooms work is they have got data analysts in the newsroom, and they put them in something called the Data desk. And the data desk is kind of far down the chain. So first a journalist goes out, and they do their three interviews, and they get a hypothesis. Ok, I think this thing is happening. Then they go to the data desk and say, find me the data to support my hypothesis, and they kind of order it. It’s kind of like a hamburger window, and it’s also hard to get to the data, just like a long line. It’s a little bit like the DMV because everybody in the newsroom wants data, so you have to get in line and you wait until you get your turn. And then the data guy is really overworked, and he says, it’s always a he, but there’s sometimes a she, that that data set doesn’t exist. That data almost never exists, or if it does exist, that data shows something other than your hypothesis. And then there’s a fight, and then in that fight, unfortunately, the journalist usually wins, because the managers of the newsroom are journalists. They’re not data people. And so those numbers seem maybe irrelevant to them, or they say you can write around them.
That’s a very common thing: write around the facts. And so, this model is the one I grew up in in every newsroom I’ve been in. And I just think it’s the wrong model. And I think that’s probably because I grew up in a software culture. And so I don’t understand why all of our beautiful automated tools to help us understand the world are so far down the line of production.
Also, it’s worth pointing out that good programmers don’t stay on the data desk, because you’re working for probably one third of your market rate. And because you want to do good, but then you’re dealing with a bunch of people who aren’t data literate. And oftentimes when you do process the data for them, they don’t like your conclusions or they write around them. And so what I’ve found is that most talented programmers leave the data desks fairly soon, within five years of starting.
I have decided to do a different approach. And I’ve done this now for about ten years, first at the Wall Street Journal and then at ProPublica, where basically what I did is I stole people from the data desk and borrowed them for my team projects. And so what I would do is bring them in from the beginning and together me and a programmer would develop a hypothesis, we’d think about what are the different ways we could collect data to test this hypothesis, and fail a whole bunch of times, and eventually come up with an approach that worked and do our investigations together.
The investigations I’ve done at ProPublica, here’s a few examples. Machine bias is probably our best known one where we looked at software used in criminal sentencing. How many people know about this software? Should I explain it? Ok, nobody does. Okay so there’s software called risk assessment tools. And they’re used throughout the United States at various stages of the criminal justice system, often at pre-trial, sometimes at sentencing, often at parole. And the idea is that they ask a whole bunch of questions of a defendant about their home life, do they live in a risky neighborhood, do they have stable housing, do they have drug problems, do they have a job? The questions vary based on the tool. There are dozens of these tools out there. Most of them homegrown private companies, closed source, black box algorithms, and then the software spits out a score, one through ten, of how likely this person is to commit a crime in the future. And that future, by the way, is defined as during the next two years, and it’s based on theories of criminality that, you know, basically if you have like unstable home life or lots of family members who are in prison, you’re maybe more likely to commit a crime. And judges look at them. For a pretrial release, they would say, you’ve been arrested, you know, you score 5. They take that into consideration when setting bond or letting somebody out.
And this software has spread like wildfire throughout this country for the past 15 years, mostly because jurisdictions have wanted an excuse to let more people out of jail because their jails are overcrowded. And so it’s actually been a tool of reformers seeking to combat mass incarceration. And they want to be able to say to the public, don’t worry, we’re only letting out scientifically proven, non-risky people. And so this software provides us perfect cover like, oh look, these people are just all below the score of five. So it’s fine. However, it was not at all clear to me whether this software was biased against black defendants, which is obviously a huge question for our criminal justice system and every stage of it.
And Attorney General Eric Holder had actually asked the U.S. Sentencing Commission to investigate whether these types of tools were exacerbating racial inequalities. However, the Sentencing Commission did not take up that request and did a different study on something else. I was like, ok, I’ll do it. I’m that girl. I sit in the front of class too and wave my hand. And so, we went and FOIA-ed a bunch of scores and tested and found that in fact the software was biased against black defendants. They were more likely to be over-classified as risky when they were not, twice as likely to be classified incorrectly as high risk when they were not compared to white defendants, and white defendants were actually twice as likely to get an unjustified low-risk score when they were actually higher risk. So there was a huge disparity in the error rates for the software. And by the way, the software was also only 60 percent accurate under its own metric of success, which was, two years later, within the two years, did you commit another crime or at least get arrested for another crime.
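The kind of error-rate comparison described here can be sketched in a few lines of Python. This is a minimal illustration, not ProPublica’s published analysis: the group names, the counts, and the helper names `error_rates`, `accuracy`, and `make_group` are all made up for the example, and each record is simplified to a pair of booleans (labeled high risk, reoffended within two years).

```python
from collections import Counter

def error_rates(records):
    """Return (false_positive_rate, false_negative_rate) for a list of
    (labeled_high_risk, reoffended) boolean pairs."""
    counts = Counter(records)
    fp = counts[(True, False)]   # labeled high risk, did not reoffend
    tn = counts[(False, False)]  # labeled low risk, did not reoffend
    fn = counts[(False, True)]   # labeled low risk, did reoffend
    tp = counts[(True, True)]    # labeled high risk, did reoffend
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr

def accuracy(records):
    """Share of records where the label matched the outcome."""
    return sum(1 for label, outcome in records if label == outcome) / len(records)

def make_group(tp, fp, tn, fn):
    """Build a list of records from confusion-matrix counts."""
    return ([(True, True)] * tp + [(True, False)] * fp
            + [(False, False)] * tn + [(False, True)] * fn)

# Made-up counts, chosen only to show that two groups can have the same
# overall accuracy while the errors fall in opposite directions.
groups = {
    "group_a": make_group(tp=60, fp=40, tn=60, fn=40),
    "group_b": make_group(tp=40, fp=20, tn=80, fn=60),
}

rates = {name: error_rates(recs) for name, recs in groups.items()}
for name, (fpr, fnr) in rates.items():
    print(f"{name}: accuracy={accuracy(groups[name]):.2f} "
          f"false_positive_rate={fpr:.2f} false_negative_rate={fnr:.2f}")
```

With these invented numbers both groups are scored with 60 percent accuracy, yet one group is wrongly labeled high risk twice as often as the other. A single accuracy figure hides that, which is why the comparison is done on error rates per group.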
That’s the kind of investigation that took two programmers and me and a researcher a year or a little over a year to do. And we all did it together as a team. And so those are the kinds of things that I have done over time, and I’m not going to bore you with all of them. I’m throwing this one in only for Eben’s fun, as one of my favorite stories where me and a programmer went through the Snowden documents and compiled every single bit of evidence that suggested that AT&T had gone above and beyond the call of duty in helping the NSA. And that was super fun and required a programmer too because so many of those documents were extremely technical.
And then at the Wall Street Journal, I had done the series on privacy, and all of that was done similarly with people that I borrowed and stole from the data desk, where we basically would run big scans of which Internet companies were stealing your data, where we found Google, this is one of my favorite stories, where we found Google had done this secret thing they were placing in ads that would secretly turn off Safari’s third party cookie blocker so that they could serve their cookies onto that page, for which they then paid a 22 million dollar fine to the Federal Trade Commission, which was five hours’ worth of Google revenue at the time. But I don’t believe they have done it again.
In my investigations, we have always chosen to publish our data and code whenever possible, so we have a model of full transparency. Post everything on GitHub. So we post the original data when we can. Sometimes if you buy it or it comes under certain terms, you can’t post it. And I believe that this model, actually, what I’ve really found is that we’ve been able to get a lot of impact by being sort of transparent in our methods and being so quantitative. And the reason I say that is because I feel that sometimes journalists can be like, there’s a lot of stories these days like is Facebook too big? Well, is that a solvable question, right? Like too big, it’s not.
But when we publish our stories they’re like very precise, like the error rates in the criminal sentencing software were unbalanced and they could be balanced, and if they were balanced it would make a difference for fairness. But the company had chosen to not consider that as part of its idea of fairness, and so that is something that could be solved. It’s actually been two years since we published that article, but there is a movement that is starting to push for that algorithm to be tweaked throughout the criminal risk assessment tools area.
And so I feel like by presenting our findings in the most quantitative way with such a specific diagnosis, it does allow for solutions. Society doesn’t always choose to solve the things that we present as problems, but at least we have presented them that way. We did a big study about car insurance where we showed this was another project of two programmers, me and a researcher for 18 months. And we showed that there’s always been this fact that if you live, you are the same driver, but if you move from a more white neighborhood to a more minority neighborhood, your car insurance rates go up, generally. And the reason for that is most of the car insurance companies in the U.S. have a pricing mechanism that is just based on the zip code of where your car is housed or garaged.
And so the idea is that there’s some inherent risk to those neighborhoods. So what we did was we got all the car insurance rates across the nation for the top 30 companies by zip code. And then we actually did FOIA requests in all 50 states for the true risk, meaning how much did insurers pay out for accidents in each zip code in aggregate, and compared, and found that the pricing disparity was not based on the true risk of accidents in those areas. Now, once again many of those carriers have chosen to ignore our findings. But California did force the top two companies that we highlighted to change their premium pricing policies.
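[Illustrative sketch, not part of the spoken remarks: the comparison described above can be thought of as dividing each zip code's premium by its average payout and then comparing that ratio across neighborhood types. The field names and the numbers below are hypothetical, and the published analysis may well have used a different statistical method.]

```python
from statistics import mean


def premium_to_payout_ratio(rows):
    """Average premium-to-payout ratio for two kinds of zip codes.

    rows: iterable of dicts with the hypothetical keys
    'premium', 'avg_payout', and 'minority_majority' (bool).
    """
    ratios = {True: [], False: []}
    for row in rows:
        if row["avg_payout"] > 0:  # skip zip codes with no usable payout data
            ratios[row["minority_majority"]].append(
                row["premium"] / row["avg_payout"]
            )
    labels = {True: "minority_majority", False: "other"}
    return {labels[key]: mean(values) for key, values in ratios.items() if values}


# Made-up numbers, for illustration only.
toy_rows = [
    {"premium": 1200, "avg_payout": 400, "minority_majority": True},
    {"premium": 1500, "avg_payout": 500, "minority_majority": True},
    {"premium": 800, "avg_payout": 400, "minority_majority": False},
    {"premium": 1000, "avg_payout": 500, "minority_majority": False},
]
toy_ratios = premium_to_payout_ratio(toy_rows)
```

[If pricing tracked risk alone, the two ratios would be roughly equal; a persistently higher ratio in one group of zip codes is the kind of disparity the investigation was testing for.]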
My point is that this type of journalism can lead to some impact, which is I think ultimately what journalism should be for: to highlight changes so that people in society can choose to debate on the facts when they’re making policy. But, as you probably know, there’s not enough of this type of journalism out there. I don’t know if you can see these, but basically the one that’s collapsing on the left, that’s revenues for the newspaper industry, and the one that’s sort of sliding down is the number of employees in newsrooms across the nation.
And this is largely because the revenue model has collapsed for journalism, and there are a lot of reasons for it. But one of the main reasons is that online these companies have not been able to sell ads online for the prices that they sell in print. And that’s largely because the model of distributed advertising, so where companies follow you around and paste ads on your behavior, means that the Wall Street Journal, where I worked for 14 years, like they used to have a lock, like if you wanted men with Mercedes who golfed, you would go buy it at the Wall Street Journal. Now if you want men with Mercedes who golf, you can just follow people around on the Internet and make that assumption.
That has blown up the model of newspapers being able to sell their audience. And so the industry is collapsing, and that means that it’s really hard for people like me to make the argument that not only should we do journalism totally differently, but we actually have to do it really expensively because everything’s going to take a really long time and we’re going to need a whole bunch of programmers who make even more money than your normal journalists. So that is why I’m starting a new site to do this, to really invest. I’ve got donors to invest in this particular model, so I’m super lucky. Craig Newmark of Craigslist gave us 20 million dollars. And me and my two cofounders, we’ve raised about 23 million so far, and we’re going to start publishing next year in 2019.
It’ll be a newsroom of half programmers, half journalists, traditional journalists, meaning people more focused maybe on writing and analysis. In the end, I would like there to be no distinction actually between the programmers and the journalists. I would like everyone to have both skills, but that’s not the reality right now. And as you can see, this is a non-profit. It’s going to be Creative Commons-licensed content, so free to republish anywhere, and just we’re doing it as a public service. We’re going to focus on the impact of technology on society. Our goals are to, first of all, study the impact of technology on society by using technology to investigate, but we also want to serve as a model for other newsrooms. I think of FiveThirtyEight, when they started, they were the first people to like basically bring sort of sophisticated statistical analysis to a news context, and then everyone copied them. The Upshot at New York Times, Vox. And so I actually think one of our success markers might be if people copy us, because I would like other newsrooms to be able to integrate programmers and journalists together. I think that’s like the future of everything is to bring more tech literacy to it. And so I’m hoping that we will serve as a model and also train another generation of reporters and data journalists who will probably go on to work other places that pay more.
Our philosophy is to embrace the scientific method. One thing that means is that we don’t have to do quotes from both sides of… We don’t have to do “on the one hand, Democrats say, on the other hand, Republicans say,” right? We’re just going to say our hypothesis is this and our findings are this and then the limitations of our findings. The limitations are fine. There will always be limitations. Our sample size is good, but not great. It’s not statistically valid because it was an opt-in, but it gives an indication of where trends are going. You know, there will always be some limitations, but those we can just be honest about and not have to do a he said, she said.
We’re going to do something we call adversarial peer review, so whenever we do one of these big data analyses, we actually bring it to the person that we’re investigating. We bring them our data and our code months before publication and allow them to test it. To me, that’s the best form of peer review because they have the most incentive to find flaws. It’s not like sending it to your colleagues in academia where they just read it and they say, nice job, Bryan, or whatever. We have found that we brought all our findings to the insurance industry. We brought all of them to the criminal software company, and that is a great way to test our findings. And then also we are hoping that we can maybe move the conversation a little bit away.
Journalism tends to focus on villains, individual villains, which is of course always a great story, but many of the issues in our society are systemic. And so by bringing data to the table, I’m hoping we can bring some of those structural problems to the fore in ways that hopefully aren’t going to be so boring to read about. We’ll try to make them sexy, but also not just always focus on just one bad actor.
This is our website, The Markup. We call it The Markup because of HTML. It’s the language of the web. It’s also what you do when you mark up a story when you edit it. And I just thought you guys would appreciate that it is tracker-free. We are going to make every effort to treat our readers with dignity. There are some tools that we have to use that will involve setting third party cookies. For instance, on our donation page, Stripe requires a cookie set, so we have actually engineered it so that it doesn’t happen until you press the donate button, because they have it set, no matter what. We do what we can, but we hope that that provides some relationship between the reader and the site that is built on trust. We have an ethics policy. We have a data ethics policy, which nobody has. You can read it. It says we’re not going to do P-hacking, it says we’re going to clean our data. It’s a delight. Honestly, I read it sometimes just to make myself feel better at night. Also we’re in this great period where we haven’t published any stories yet, so we haven’t made any terrible mistakes. Everything seems really great right now.
I’m going to end on this kind of ridiculous note, I’m so sorry to say this, but sometimes I just say, oh, we’re doing scientific method. But then I actually just joke what we’re really trying to do is bring the best parts of software culture into the newsroom. And agile journalism, agile programming has become somewhat of a joke. I mean it’s still a thing. But it has become such a cliche. But I do joke sometimes that we are trying to do agile journalism. We’re trying to bring that whole, the good parts of failing fast, not the break things and fix them later part, but some of the good parts of journalism and teamwork that Silicon Valley culture and open source culture has brought to us. I don’t know what our time is left but I would love to take questions if anyone has any.
AUDIENCE MEMBER: Like a lot of people, yourself I’m sure, I’m very distressed the way media has gone, in part because of the financial pressures on the media. So I’m a lifelong reader of the Times and dismayed that the Times are becoming something of an echo chamber in response to its own readership. So instead of one story on the latest ridiculous thing Trump has done, there were three stories, where the result is there’s three stories of space in that day’s edition that could have covered something else. At a minimum there’s an opportunity cost to serving your economic base, which leads me to think that the future is in nonprofit journalism if we’re to have any really solid journalism, which leads me to my question which is do you see a future for citizen journalism? Could you imagine an organization like yours, for example, having a community contribution element so that you could cover more? Would it be realistic to get people to buy into your model and be able to work with them in a way that you are satisfied and thereby bring even more quality journalism to the marketplace?
JULIA ANGWIN: Yeah that’s a great question. I’m going to start with the business model and then go to that. So, you know, there’s really only three options for the business model. One is the for-profit model that has currently been failing journalism. One is the European model, which is actually government-funded primarily. And then there’s what we have, there’s an emerging philanthropy model in the U.S. All of them have their flaws, right? Philanthropy model also has its concerns, right? Like thank God Craig Newmark said to me he doesn’t want to have any involvement in the journalism. He’s fine, you know, he’s not going to dictate stories. But like, who knows, right? Like other donors could come along and demand things. That model has its peril. All of them have their perils, right? And I think it’s just a matter of picking your poison.
We’re going to try as hard as possible to convince our readers to donate, because I think being aligned with your readers is a great model. When you come to citizen journalism–my co-founder Sue Gardner is actually the woman who built the Wikimedia Foundation, so she knows a lot about running communities and citizens. I would call that citizen journalism. And so I think we don’t necessarily see this vision that way, but we think that that’s a cool thing in the world, but it is extremely hard. Wikipedia is really the only successfully managed crowdsourced publication that I know of. And it has a whole bunch of quirks that make it that way. And I have always wanted to see more things like that come up, and I haven’t. And so, I don’t know how easy it is to pull that off.
AUDIENCE MEMBER: Hi, Julia, I have two questions. One, how can I subscribe? And two, when we think about the data work that you’re doing and these black boxes and we think about open source software is there a future where so many things are kind of being run by open source code underneath and then dictated by a bunch of data we can’t see, the training set, so that the open source nature of it doesn’t actually give us a lot of understanding into it? And as more things sort of start looking like that, what does the role for open source look like in that world?
JULIA ANGWIN: Such a great question. That’s already happening, right? So a lot of the machine learning classifications that people use routinely at Google and other places are already open source, and we use them too. And we also build machine learning to do tests and things like that. But the truth is that the real power is in the training data. Right? And that’s sort of like, you know, the classic story is the Google image recognition that tagged black faces as gorillas and that had to do with the training data, and that actually was a really sort of poignant post by one of the engineers involved in that who talked about the fact that it is not even really the fact that they had poor training data, it has to do with the way film and cameras were developed to how they recognize faces and based on all the way back when Kodak chose a certain skin tone as the perfect skin tone for cameras.
And so these kind of original sins, they travel through data and through time. And that’s why I personally think like of course it would be really great for companies to make these training data sets open, and maybe you can convince them to do that even, but I feel like those are people’s competitive advantage. And so I feel like that’s not likely. And so I guess what I see more of is it’s important to do auditing. Right, so what I see myself as is actually an auditor. And because there’s no auditors out there right now. We don’t have a government that’s building an auditing division, maybe they will one day or something, but auditing outcomes is kind of the only way we can really have visibility, I think, into those training data at the moment. You could subscribe to the Markup, just sign up. Just sign up, send your email address, and we will bother you when we start publishing.
AUDIENCE MEMBER: How do you avoid the appearance of the aforementioned echo chamber in this model, right? So, the solution that you’re proposing seems in part orthogonal to that question, but to the extent that that bears on the state of the journalism market today, does what you are aiming for involve trying to avoid this?
JULIA ANGWIN: I don’t know if I totally understand the question, but let me just try to address it and then tell me if I’m not answering it. But let me just say, I don’t really believe in objectivity, so if we’re talking about objectivity and some neutral fairness, I don’t believe anyone has that, and I certainly don’t have it myself. So, we’re not going to aspire for that. Our choice, our hypothesis, the questions we ask will ultimately be biased by our own biases. We are going to ask the questions we think are most important, and we are also going to seek input about what those questions should be.
But it’s not, at the moment, we don’t view it as a democracy, right, where people are going to vote about what we get to investigate or anything like that. But I think we have a narrow mandate, which is the impact of technology on society. And so we’re planning to investigate three areas to begin with. One area is what I call poverty profiling, this idea of those algorithmic investigations, how we use software and automated tools to distinguish what we call the deserving poor from the undeserving poor. This is a special feature of America that we love to do, and I think it’s an important topic to investigate, because we’re actually putting a lot of automated tools into that area first. Secondly, we’re going to be investigating the platforms themselves. The tech platforms don’t have a lot of places holding them accountable. And so that’s something we have expertise to do, and we will do. And thirdly we’re going to cover the issues broadly known as privacy and cyber security. I hate both of those words, but essentially the abuse of people’s data in various ways.
AUDIENCE MEMBER: We broke a story this time last year in the way you identified it, which was sort of a general ecosystem systems thing we were pointing out, which was trackers in Google Play. Since then we have had a lot of journalists reach out to us. And the trickle is becoming more than a trickle. There are a lot of journalists who are I think identifying what you’re saying, which is they need to be working with technical people. They need to be crunching real data and looking at things in this sort of way. So, again, very exciting what you’re doing. The question for me, I guess, is as these requests are coming in as people want to sort of stick, you know, reputation on these stories and the reasons they reach out is because of prestige of the university and so on and so forth, which, you know, whatever. But are you planning on working with other groups? You know let’s say the EFF or some of these digital rights groups? Is there a way that they can sort of work with you to be those data people? Or is everything going to be in-house?
JULIA ANGWIN: That’s a great question. We haven’t really made a determination about that. My feeling is that I tend to be not so dogmatic. It’s sort of like the scientific, right? Like on this particular case, if there is some data we need or expertise we need, we would probably go out and seek a partnership on it. I would also say though that the friction involved in partnership is really high, always. So, you know, we’re going to have this great advantage of having our own in-house expertise, which most journalists don’t have. I’m hoping in a way that we bring a model that then spreads to other newsrooms and means that the people calling you are smarter. That would be a great outcome.
EBEN MOGLEN: Julia, let me ask a question about the consequences of this in the longer term, because we’ve all seen pictures of those few people against the window with a start-up and an investor, and we know what happens after that.
JULIA ANGWIN: I know it’s sad.
EBEN MOGLEN: No, I’m not so sure. About the time that they finally figure out what went wrong with Les Moonves, they’ll discover that journalism has changed, and it’s all gone upside down on them and open source will win. I know that story, but I want to ask a little bit about how we deal with the question of journalist access. Two generations ago, the Supreme Court said in Houchins against KQED that journalists have no particular right of access to public places or things, constitutionally speaking. Public radio wanted to be in the prisons of California, which was a useful thing to do, but they didn’t have a right to go. We have a statute. We have the Freedom of Information Act, and we have state FOIAs, and we have Sunshine laws, and so on, but we are going to have a policy crunch out of this new consensus that we’re part of. You’re right of course that it won’t look like privacy law from 2010 or anything like that, but we are going to have a major effort at federal legislation. Our friends in the platform companies scared by California state law are going to seek to preempt it with federal legislation, and we’re going to have a great big moment for discussion about data rights in the United States.
Mishi and I are deeply engaged in thinking about data rights in India where we will also, as a result of changes in law, have an enormous moment of legislation. What is the right way to think about access to data for scientific journalism of your kind? Let us suppose that we were attempting as part of the policy compromise, which is going to be a great big Christmas tree of policy compromises around data. Let’s suppose that we got ambitious knowing that your kernel is going to grow into a great big oak, what should the right of access to data look like for people like you? Public data, private data. What should we think of now in the First Amendment for the 21st century? The Supreme Court that made Houchins against KQED was not more friendly to journalists than the current Supreme Court. But let’s ask what Congress should do for us? What should the right of access to data be for people in your line of work?
JULIA ANGWIN: Well, I mean jeez… So, three days after we announced that we had got this great funding, and there were tons of big stories. Forbes wrote a story that said, “Will the Markup be sued out of existence before it begins publishing?” And the reason they wrote that story was because when you do automated data collection like we do across the Internet, you can often be accused of violating terms of service. And as you probably all know, in the U.S. the Computer Fraud and Abuse Act makes that sometimes a criminal act. And so that is an existential legal risk for my startup. I wish it weren’t. I wish I wasn’t saying those words, but that’s true. And so that is a real challenge because there has been one court case that has been friendly to journalists on this front saying that if the data is publicly available, there is no difference in collecting it in an automated way than in a not automated way, but that is not the consensus of most courts. And so this is something that is a huge, huge risk, because right now our ability to monitor the public square, which is the Internet right now, in any meaningful way relies on automated collection techniques, and so I don’t know. I mean, we’ve bought an enormous amount of insurance. [Laughing] Fingers crossed. I don’t know. I would like it, so if you could put together a Christmas tree that takes that off the table, that would be fantastic.
EBEN MOGLEN: And, of course, we’re going to fight about terms of service in a variety of contexts in all of this, because as you’ve pointed out, there’s no privacy law in the baseline. There’s FTC’s view that people who make promises should keep them, which has generated a market for promises that you couldn’t break if you tried, but would sound good, and we’re going to have a fight about that. And one of the consequences ought to be something that says that there are people called journalists and they do stuff that involves informing the public. And terms of service cannot prevent them from doing what they do in this, that, or the other way. That’s the ball on the Christmas tree, I think.
JULIA ANGWIN: Yeah.
EBEN MOGLEN: But we’re going to need a bunch of people who understand that that’s an issue of great importance.
JULIA ANGWIN: Right.
EBEN MOGLEN: And this is one of the places where the new consensus might help. One more question, and then we should let Julia go.
AUDIENCE MEMBER: Hi, my question is closely related to yours. You were talking about terms of service and acceptable use policies. My focus is more on customer citizen privacy. From any given large dataset across the population, if you collect enough data, I guarantee that I can extract one person, even if you don’t have their SSN, home address, and biometrics in it. Do you have any practical or actionable or suggestive approaches towards solving this problem of, you want access to this data. We would like to have journalists to have access to this data to do this kind of auditing. But what do we do about this other intractable issue that people don’t want to have that kind of data about themselves up for anybody who calls themselves a journalist analyzed?
JULIA ANGWIN: That’s a great question. You know, we try to be extremely selective about data we collect. So for instance, last year we built a tool called the Facebook political ad collector, because after the 2016 elections there were a lot of rumors, which have all borne out, that there was dark stuff going on on Facebook in ads. But no one knew because those ads are ephemeral, they’re targeted, and they disappear, right? So my colleague Jeff and I, we built this tool that’s a browser extension that people could add to their browser, and when they’re on Facebook, it actually would look at the ads. We built machine learning to identify which ones are political and contribute those political ads to a public repository. So we could start building a public repository of political ads. What we did though was we actually stripped out every single possible identifier before it arrived in our system. So no IP address, no Facebook identifier. The only ID that arrived was the unique ID of the advertisement.
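[Illustrative sketch, not part of the spoken remarks: one common way to make sure identifiers never reach a collection system is to pass each record through an allowlist on the sending side, so that only explicitly approved fields leave the browser. The field names below are hypothetical and are not taken from the actual tool.]

```python
# Hypothetical allowlist: only these fields are ever submitted.
ALLOWED_FIELDS = ("ad_id", "ad_text", "is_political")


def strip_identifiers(raw_record):
    """Return a copy of raw_record containing only allowlisted fields.

    Anything not named in ALLOWED_FIELDS (user IDs, IP addresses,
    session tokens, and fields nobody anticipated) is dropped.
    """
    return {key: raw_record[key] for key in ALLOWED_FIELDS if key in raw_record}


# Made-up record, for illustration only.
toy_raw = {
    "ad_id": "ad-123",
    "ad_text": "Vote on Tuesday",
    "is_political": True,
    "facebook_user_id": "u-999",
    "ip_address": "203.0.113.7",
}
toy_clean = strip_identifiers(toy_raw)
```

[An allowlist fails safe: a new, unexpected field is dropped by default, whereas a blocklist would let it through. A record-level filter like this does not by itself remove network-level metadata such as the sender's IP address, which would have to be discarded separately on the receiving side.]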
And so we are in the practice of building tools that bring us the data we need and not the data that’s radioactive, right? And that’s actually part of the scientific model, right, which is like when you have a hypothesis and you’re really clear about the question you’re asking, you can be really clear about what you need to answer it, right? And so, in every case, we would like to discard as much of that data that’s personally sensitive as possible. However, there are going to be cases like the criminal risk scores, which are the complete opposite. In order to test bias, we needed the race of everybody, right? And we actually knew their names because we needed to check, did they go on to commit that future crime, to tell whether that software was accurate. Now all that data was already public, right? You can go, and this was in the Florida jurisdiction, you can go in, look up anybody’s name on their website, and see their criminal record. And as you know, sadly you can usually see people’s mugshots. I mean all of this criminal data is available in the U.S. That’s not normal for other countries. But here we are into that. And so the only thing additional that we brought to that data set was we had done a public records request to get the score that was assigned to those people, the risk score.
Now in every case, the person involved, when I interviewed them, they never, they did not know they were scored. In fact, in that jurisdiction, in Broward County, the public defenders didn’t even know the score was being used, so it was news to everybody. So we made a pretty controversial decision to publish those scores, because the criminal records were already available, and we thought maybe those people would like to contest them. And there were 10,000 of them. That’s a controversial decision, but it was one we thought about. And you could take a side, but I guess all I would say is we try to make really considered decisions about personal data and that’s like the best you can hope for I guess.
EBEN MOGLEN: I think we might say that when we get to the policy conversation, the question of whether people have a right to be anonymous is going to be a very important early decision. I don’t think American society is prepared to acknowledge the right of anonymity by legislative fiat at the moment. But I think it’s going to be very important to have the conversation. It bears not on just what journalists collect, after all, it’s a uniform subject throughout society, and it’s not going to be dealt with at the level of the journalists’ access. It’s going to be dealt with at the level of what research ethics are required for people receiving federal money. And it’s going to be dealt with on the basis of whether we’re prepared to create statutory rights for anonymous behavior in society or whether the forces that believe that everything should be attributable to somebody are going to prevail in the setting of our basic data policy. Next five years in the United States are likely to tell us the answer to that question, and it’s good that there will be some serious reporting. Hypothesis: data is the new petroleum and the people beginning to engross the world’s data are like the people in…world’s petroleum, that is to say they’re not friendly to journalists, and they might sometimes do harm to them if they were permitted. It’s a courageous thing to do, I have to say. It’s also fun, and I hope it’s right, but it’s courageous.
JULIA ANGWIN: Thank you.
EBEN MOGLEN: And I admire the hell out of it. Thank you for coming to talk to people about it. It was wonderful. [Clapping] Ok, I think we should take a five-minute break, and then we will begin with all kinds of arm-waving law.