Free data
November 21, 2005
Like many policy-oriented research organizations in Washington, DC, and around the country, the Center for Economic and Policy Research (CEPR), where I work, uses a lot of large-scale survey data, including the Current Population Survey (CPS), the Survey of Income and Program Participation (SIPP), and the Public-Use Microdata Sample (PUMS) of the decennial census.
Over the years, my CEPR colleague, Heather Boushey, and I have spent thousands of hours reading, coding, recoding, decoding, and analyzing data from these surveys. While extensive documentation exists for most of the commonly used national surveys, the only way to really master the ins and outs of the data sets is through a combination of learning from the experience of other researchers and putting in the hours working with the data firsthand. Along the way, we've benefited enormously from the tremendous staff at the Bureau of Labor Statistics (BLS) and the Census Bureau, as well as our friends and colleagues at the New School (Heather), the Centre for Economic Performance (John), and the Economic Policy Institute (both of us).
A couple of years ago, Heather and I decided to port the philosophy of free software, pioneered by Richard Stallman, the Free Software Foundation (FSF), and many others, to the world of policy research. As the Free Software Foundation explains:
Free software is a matter of the users' freedom to run, copy, distribute, study, change and improve the software. More precisely, it refers to four kinds of freedom, for the users of the software:
- The freedom to run the program, for any purpose (freedom 0).
- The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.
- The freedom to redistribute copies so you can help your neighbor (freedom 2).
- The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.
We thought policy researchers who use many of the standard, but often complicated, government microdata sets such as the CPS and the SIPP would benefit significantly from "free data extracts" based on these principles of free software.
Currently, extracts of standard data sets typically take one of three forms. Some extracts are strictly proprietary: available publicly, but with restrictions on how the extracts are used, copied, changed, or redistributed. The Unicon Corporation, for example, sells excellent extracts of the CPS and other data sets for a reasonable fee, but with substantial restrictions on their use (including modifying and copying the software). The second kind are quasi-proprietary extracts, which are typically developed, maintained, and held in-house at major policy-research organizations or universities. In the policy world, these extracts tend to be fairly jealously guarded since they embody years, sometimes decades, of institutional knowledge and expertise, which give those who maintain and update the extracts a substantial advantage over their "competitors" in the policy research world. In more academic research centers, these quasi-proprietary extracts tend to be more accessible, but less formally maintained, documented, and updated. They float around academic departments and move across departments and universities informally.
The third and rarest kind of extracts are "free" extracts in the same sense that the Free Software Foundation uses this term. Probably the best-known examples in economics are the CPS extracts sponsored by the National Bureau of Economic Research (NBER), which maintains, updates, and posts all the code to produce various extracts of the CPS and some other data sets, all under the FSF's GNU General Public License (GPL). Our CEPR extracts of the CPS and SIPP, which we've made available for over two years under the GNU GPL, are another example.
We believe that free extracts have considerable private as well as social advantages over the currently much more common proprietary and quasi-proprietary forms of extracts. First, extracting, coding, and mastering a new data set can involve hundreds of hours of researchers' time. Free, fully documented, "open source" extracts could potentially save new researchers, as a group, thousands or even tens of thousands of hours. To the extent that social-science research actually performs a socially useful function, these are tens of thousands of hours that could be dedicated to understanding social phenomena, not to reinventing data extracts that a host of earlier researchers have already created in nearly identical form.
Second, free extracts would be far more reliable than proprietary or quasi-proprietary extracts. Strictly proprietary extracts typically shield the code used to produce them. Without access to the code, researchers are far less likely to spot coding errors (and are in the uncomfortable position of having to trust the coding and judgment of those producing the extracts). Quasi-proprietary extracts can also be less reliable. The closed, quasi-proprietary extracts commonly used by non-academic policy researchers are frequently the product of a relatively small group of researchers and programmers with little outside review or feedback on the actual code. Meanwhile, the more open quasi-proprietary extracts found in more academic contexts typically lack a central coordinator, maintainer, and updater with sufficient organizational capacity and institutional interest in ensuring that corrections and improvements in the extracts appear in subsequent versions.
Third, free extracts can help to spur innovations and improvements. Free extracts allow new researchers to get up to speed much more quickly than they otherwise would. Researchers can then devote at least some of the time they save at startup to creating new variables or developing new procedures for processing the data (to correct for top-coding or survey changes, for example).
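To make the top-coding example concrete, here is a minimal sketch of one simple kind of adjustment a researcher might share alongside an extract: replacing every censored value with a fixed multiple of the top-code. This is an illustration only, not the procedure used in the CEPR extracts. The function name, the top-code of 2000.0, and the factor of 1.5 are all hypothetical choices made for the example.

```python
def adjust_topcoded(values, topcode, factor=1.5):
    """Replace each value at or above `topcode` with `topcode * factor`.

    Both `topcode` and `factor` are illustrative assumptions; real surveys
    publish their own top-codes, and researchers differ on the adjustment.
    """
    return [topcode * factor if v >= topcode else v for v in values]


# Hypothetical weekly earnings, censored at an assumed top-code of 2000.0
weekly_earnings = [450.0, 1200.0, 2000.0, 2000.0]
adjusted = adjust_topcoded(weekly_earnings, topcode=2000.0, factor=1.5)
print(adjusted)
```

Because the adjustment lives in a few lines of posted code, another researcher who prefers a different factor, or a different method altogether, can see exactly what was done and change it.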
Finally, free extracts directly benefit those who make them free. At CEPR, we believe that making our extracts free makes sense socially, primarily because free extracts will help to improve the quantity and quality of social-science research. But we also realize that CEPR's research will be better if a community of researchers develops around our particular extracts. The community of outsiders who benefit from our code also has an interest in spotting coding errors and writing improvements and extensions to our programs, thereby improving the quality of the extracts that we use every day in our own research. Furthermore, to the extent that free extracts become the standard in policy research, CEPR will also benefit substantially because the "entry cost" to using a new data set for a particular piece of research will fall dramatically, possibly allowing CEPR to work for the first time with the General Social Survey, or the notoriously tricky Survey of Consumer Finances, for example.
I'm writing about free extracts now because CEPR has just released the new and improved version of the CEPR data extracts, available at: www.ceprDATA.org. The newly re-launched site will eventually hold all of the code for all of the data sets that CEPR uses to produce our research reports on the US labor market. For the moment, www.ceprDATA.org, which Heather, Ben Zipperer, and I will maintain, contains my extracts of the CPS Outgoing Rotation Group (ORG) and Heather's SIPP extracts (a monumental amount of work!). If you work with either of these data sets, please take a look at our extracts and let us know what you think.