Black Duck Report is Meaningless Without Source Code
By Aaron Williamson | July 6, 2009
Black Duck Software recently published some summary statistics about free and open source software license adoption, based on data it collected by crawling the web. The report lists “top 20 licenses that are used in open source projects” and the proportion of projects which use each license, as well as historical figures purportedly representing the number of projects using and planning to use GPLv3 variants for each month of the last two years.
In its press release, Black Duck focuses on a supposed 5% decline in the use of GPL-variant licenses in the year since its 2008 report. Taking Black Duck’s cue, commentators have drawn all sorts of conclusions from this figure, including that licensing is becoming increasingly irrelevant as web services replace traditional software, or that more software is being produced by universities. Black Duck’s own (carefully implicit) conclusion is that the community is simply warming up to proprietary software companies: “Many developers are selecting licenses that are less restrictive, a move that underscores the broader adoption and value of open source in today’s multi-source development environments.”
Any of these conclusions might be reasonable if the 5% figure were meaningful, but Black Duck has given us no reason to believe it is; if anything, its own statements suggest it isn’t. Programmatically capturing data on the prevalence of various FOSS licenses is inherently difficult. At its edges, it’s an artificial intelligence problem (specifically a natural language understanding problem), because the many home-grown and modified licenses in the world don’t necessarily adhere to standard language. But even the core task of cataloging the use of the most common licenses is complicated by the wide dispersal of projects across centralized hosting services and single-project sites (and the movement of individual projects between them), by inconsistencies in how developers apply licenses to code (e.g. idiosyncratic headers and directory structures), and by countless other variables.
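To make the difficulty concrete, here is a minimal Python sketch of why a modified license defeats a naive check. It is not Black Duck’s method, which is undisclosed. The canonical sentence is the opening of the BSD license’s permission grant; the reworded variant and the helper names `exact_match` and `similarity` are invented for illustration.

```python
from difflib import SequenceMatcher

# Opening sentence of the BSD license's permission grant.
CANONICAL = (
    "Redistribution and use in source and binary forms, with or without "
    "modification, are permitted provided that the following conditions are met"
)

# Hypothetical reworded variant, invented for this illustration.
VARIANT = (
    "Redistribution and use of this software in source and binary forms, with or "
    "without modification, are permitted provided that the conditions below are met"
)

# Opening of the standard GPLv3 notice, used here as an unrelated text.
UNRELATED = (
    "This program is free software: you can redistribute it and/or modify it "
    "under the terms of the GNU General Public License"
)


def exact_match(candidate: str, reference: str) -> bool:
    """True only if the texts are identical after whitespace normalization."""
    return candidate.split() == reference.split()


def similarity(candidate: str, reference: str) -> float:
    """Word-level similarity in [0, 1] computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, candidate.split(), reference.split()).ratio()


for name, text in [("variant", VARIANT), ("unrelated", UNRELATED)]:
    print(name, exact_match(text, CANONICAL), round(similarity(text, CANONICAL), 2))
```

The exact comparison rejects the variant outright, while the similarity score rates it far above the unrelated text. A crawler therefore has to pick a threshold, and every adjustment to that threshold changes which projects get counted under which license.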
Black Duck’s techniques and algorithms for dealing with these difficulties did not emerge fully formed. The company’s engineers no doubt continually refine the license-identification code. If these refinements affected the data on all licenses equally, their effect on the figures from year to year would probably be insignificant. But in reality, GPL variants are much easier to identify than the whole set of permissive licenses whose use has supposedly increased over the last year. Each GPL variant’s text is fixed. On the other hand, there exists a whole category of licenses which are popularly referred to as “BSD-style” licenses because, while the individual licenses resemble the original BSD license in scope and style, they have been adapted and rewritten liberally by various developers, universities, and companies. These variations make permissive licenses particularly difficult to identify, and for this reason improvements in Black Duck’s algorithms are likely to uncover disproportionately more previously unidentified uses of permissive licenses than of GPL-variant licenses. That effect alone would lower the GPL’s measured share, even if not a single project had changed its license.
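A small worked example shows how large this measurement effect could be. Every number below is hypothetical: the population of 700 GPL and 300 permissive projects, the detection rates, and the function name `measured_gpl_share` are all assumptions chosen for illustration, not figures from Black Duck.

```python
def measured_gpl_share(true_gpl, true_permissive, gpl_recall_pct, permissive_recall_pct):
    """GPL share among *identified* projects, given per-family detection rates (percent)."""
    found_gpl = true_gpl * gpl_recall_pct // 100
    found_permissive = true_permissive * permissive_recall_pct // 100
    return found_gpl / (found_gpl + found_permissive)


# Hypothetical population that does not change between the two years.
TRUE_GPL, TRUE_PERMISSIVE = 700, 300

# Hypothetical detection rates: only the permissive rate improves in year two.
year_one = measured_gpl_share(TRUE_GPL, TRUE_PERMISSIVE, 95, 50)
year_two = measured_gpl_share(TRUE_GPL, TRUE_PERMISSIVE, 95, 80)

print(f"year one: {year_one:.1%}, year two: {year_two:.1%}")
# prints "year one: 81.6%, year two: 73.5%"
```

In this sketch the true GPL share is 70% in both years, yet the reported share falls by about eight points purely because the crawler got better at recognizing permissive licenses.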
Black Duck’s dataset has also changed: the company has begun crawling 300 new sites (7.5% of its current total) just since last year’s report. We the nonpaying public have no way of knowing the extent of the effect, because Black Duck’s system is a black box: the company doesn’t disclose how the inclusion of new sources of data affects its numbers.
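The possible size of this effect can be bounded with simple arithmetic. If 300 new sites are 7.5% of the current total, the total is 4,000 sites and 3,700 of them were crawled before. The sketch below assumes a 70% GPL share on the old sites and, by default, the same number of projects per site; both are assumptions of mine, as is the function name `blended_gpl_share`.

```python
def blended_gpl_share(old_sites, new_sites, old_share, new_share,
                      projects_per_old_site=1.0, projects_per_new_site=1.0):
    """Overall GPL share after merging newly crawled sites into the old corpus."""
    old_projects = old_sites * projects_per_old_site
    new_projects = new_sites * projects_per_new_site
    gpl_projects = old_projects * old_share + new_projects * new_share
    return gpl_projects / (old_projects + new_projects)


# 300 new sites are 7.5% of 4,000, leaving 3,700 previously crawled sites.
OLD_SITES, NEW_SITES = 3700, 300

# Hypothetical GPL shares on the new sites; 0.70 is the assumed share on old sites.
for new_share in (0.0, 0.2, 0.7, 1.0):
    overall = blended_gpl_share(OLD_SITES, NEW_SITES, 0.70, new_share)
    print(f"new-site GPL share {new_share:.0%} -> overall {overall:.2%}")
```

With equal-sized sites, the new sources can move the overall figure by a few points in either direction. If the new sites each host many more projects than a typical old one (pass a larger `projects_per_new_site`), the swing grows well beyond that. Without knowing what the new sites contain, we cannot tell how much of the reported 5% comes from them.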
For these reasons, it is impossible to know whether the 5% GPL delta is meaningful until we know how the source data and the algorithms have changed from one year to the next. The process of cataloging and quantifying the use of FOSS licenses is a scientific one, requiring the application of principles of computer science and statistical analysis. As with any scientific pursuit, the methods used must be verifiable before the results can be considered trustworthy. I encourage Black Duck Software to release its own software under a free software license—whether by joining the alleged groundswell and using a permissive license, or by resort to a retrograde copyleft license—so that its methods can be evaluated by the community (not to mention its customers) and its reports can be rendered meaningful.
Correction: this post previously said that Black Duck only began crawling Microsoft’s CodePlex site in May 2009. The press release cited in fact says that Microsoft began pushing data from CodePlex to Black Duck in May. Peter Vescuso of Black Duck says that the company has been crawling CodePlex “for years.”
Please email any comments on this entry to press@softwarefreedom.org.