The NIH recently issued a request for information (RFI) seeking input on the need for an administrative data enclave. The RFI is here and a blogpost related to the RFI is here. Given the lack of information about the NIH-funded workforce, and particularly the non-investigator workforce it supports, we have submitted a response, which follows the text of the RFI copied below.

RFI

Purpose

The National Institutes of Health (NIH), Office of the Director, Office of Extramural Research (OER) issues this Request for Information (RFI) to gauge interest in NIH expending funds to develop, host, and maintain a secure environment (data enclave) that would allow approved research organizations-controlled access to structured, de-identifiable NIH administrative and scientific information not made available to the public. (NOT-OD-19-085)

Background

The NIH is committed to transparency about its research investments and currently makes grant award information available to stakeholders (e.g. grantee institutions, researchers, professional organizations, the public) through web-based self-service tools. Currently RePORTER provides the public a searchable public repository of NIH-funded projects, and ExPORTER provides bulk files on funded projects for download. These tools contain non-sensitive information on NIH funded projects, including the institutions and principal investigators funded by NIH, with project abstracts and basic administrative data on those grant awards.

In recent years NIH has noted an increasing demand for access to sensitive information collected via the grants process. Such data includes information on peer review outcomes, progress reports, as well as, demographic information such as age range, sex/gender, race and ethnicity of individuals listed in NIH grant applications, etc. A recent report released by the Advisory Committee to the NIH Director on Next Generation Researchers calls for an increase in NIH administrative data for members of the biomedical research community.  To address this demand, the NIH is considering making sensitive data available in accordance with the federal system of record; collection, maintenance and dissemination of the data governed by the Privacy Act 1974, as amended, in a secure data enclave accessible only upon request through an approved Special Data Access Agreement (SDAA).

Data Access

NIH has an obligation to ensure the protection of sensitive information collected through the grants process in accordance with federal laws.  Prior to receiving data access, organizations would be required to enter into an SDAA with NIH to gain access to records that would be made available in a controlled virtual environment or designated physical location. The SDAA would require inclusion of a previously funded research plan that would need to be approved by the organization for submission to NIH. The SDAA and accompanying research plan would undergo review by an internal NIH committee and the OER Privacy Officer before access would be granted.

There will be costs, likely in the millions of dollars, for NIH to create a secure data enclave that allows NIH-funded researchers interested in using the secure data enclave to establish “seats”. These “seats” would allow researchers to access data within the secure enclave and would provide researchers the ability to export data. The organization and researcher would agree in writing to adhere to Federal IT security and privacy policies as well as other statutes, policies, and regulations as appropriate. An NIH federal official or authorized system manager would need to pre-approve any de-identified aggregate data exported from the secure environment.

Information Requested

The NIH seeks input on any of the following:

  • Examples of NIH mission relevant biomedical and behavioral research using a data enclave that cannot be pursued currently.
  • Whether the benefits of the proposed data enclave are worth repurposing NIH research funds to establish, maintain and operate the data enclave.
  • Preferences and considerations about accessing a data enclave only at a designated physical location or within a virtual environment.
  • Quantity of seats desired if NIH decides to make a substantial investment to sponsor access to sensitive data as allowable under the applicable federal laws in a secure virtual or physical environment.
  • Examples of procedures an organization would implement to ensure the highest level of data protections, as well as to monitor, document, and notify NIH of any unauthorized and/or inadvertent data breaches.
  • Examples of outputs from approved research and how these may be shared with NIH.

See Guide Notice NOT-OD-19-085 for more information.

 

FoR Response

Examples of NIH mission relevant biomedical and behavioral research using a data enclave that cannot be pursued currently.

Our organization has been part of a community of researchers attempting to address the gaps that currently exist in the literature about the size, structure, and diversity of the biomedical research workforce. As an example, data that are not currently available include basic information about the NIH’s workforce, particularly its training workforce. When the NIH refers to its workforce, it generally means its funded investigators, whereas in fact these investigators represent a small portion of the actual workforce that NIH supports.

The number of pre-independent researchers currently supported by NIH is one such unknown, and it hinders many efforts to understand the nation’s investment in biomedical training and labor. To identify these individuals, the National Institutes of Health would need to identify and count graduate students and postdoctoral researchers supported on both training mechanisms and research project grants. While a distinction may be made between “trainee” and “staff” depending on the type of funding support, in practice all carry out work and research of a similar nature. Indeed, under Title 2, section 200.400(f) of the Code of Federal Regulations, “For non-Federal entities that educate and engage students in research, the dual role of students as both trainees and employees contributing to the completion of Federal awards for research must be recognized in the application of these principles.” Identifying these researchers would address issues brought to the Advisory Committee to the Director in the Biomedical Workforce Working Group Report of 2012 (https://acd.od.nih.gov/documents/reports/Biomedical_research_wgreport.pdf) and allow analyses building on recent work investigating the importance of, for example, training awards in the NIH-funded biomedical workforce (https://www.biorxiv.org/content/10.1101/622886v1). Such data would also enable desirable causal analyses of various funding mechanisms, along with analyses of the types of research submitted to NIH, the researchers who submit it, and who does or does not obtain funding.

In addition, such a data enclave could facilitate other causal analyses of the returns to investment in biomedical research in terms of scientific output, technological improvements, and the economy.

 

Whether the benefits of the proposed data enclave are worth repurposing NIH research funds to establish, maintain and operate the data enclave.

Supporting such critical infrastructure would be a valuable way of supporting and encouraging research that has high potential value to NIH. NIH’s inability to accurately report the number of postdoctoral researchers currently in its workforce has become a cause for concern across the research enterprise and has led to greater Congressional interest. It is in part an issue that has led to the Congressional push for the Next Generation Researchers Initiative, and to the related National Academies of Sciences, Engineering and Medicine study on barriers facing the next generation of researchers (https://www.nap.edu/catalog/25008/the-next-generation-of-biomedical-and-behavioral-sciences-researchers-breaking), which re-identified the lack of transparency about the pre-independence workforce as a major factor preventing success in fostering the next generation of researchers. It would therefore appear to be in NIH’s interest to establish a data enclave that allows such research to take place, particularly if the research can be carried out outside of NIH, which would reduce NIH staff costs for studying the system and allow the wider research community to conduct such studies instead.

 

Preferences and considerations about accessing a data enclave only at a designated physical location or within a virtual environment.

The NIH could leverage existing secure data systems. A virtual environment such as the NYU Administrative Data Research Facility (ADRF, https://coleridgeinitiative.org/assets/docs/adrf_summary.pdf) or the Institute for Research on Innovation and Science (IRIS, http://iris.isr.umich.edu/) would greatly improve access relative to a physical enclave. Among physical enclaves, the Census’s Federal Statistical Research Data Center (FSRDC, https://www.census.gov/fsrdc) network would be the most natural environment. All three enclaves already contain a wealth of complementary data, are designed to facilitate secure access and data linkages, and have established affiliated researchers. They are also likely to be less expensive to operate than an NIH-only system and accessible to more researchers.

Indeed, we would again point the attention of NIH to the National Academies of Sciences, Engineering and Medicine report, “Breaking Through” (https://www.nap.edu/catalog/25008/the-next-generation-of-biomedical-and-behavioral-sciences-researchers-breaking), which was mandated by the US Congress under the 21st Century Cures Act to recommend steps to NIH to improve the transition to independence of early career biomedical and behavioral sciences researchers. The report contains the following recommendation, which is directed primarily at NSF, which has the mission of tracking the nation’s scientific workforce:

“Recommendation 3.4:

The National Science Foundation (NSF) should develop and implement a plan to improve sectorwide data collection and analysis in a manner that is easily accessible by policymakers and integrates data from numerous other sources. NSF should expeditiously link the Survey of Doctorate Recipients and the Survey of Earned Doctorates to U.S. Census data, and those linked data, under strict confidentiality protocols, should be made available for qualified researchers to use at Federal Statistical Research Data Centers to better understand the biomedical workforce.”

We would endorse NIH seeking to establish a collaborative effort involving any or all of the organizations mentioned here.

 

Quantity of seats desired if NIH decides to make a substantial investment to sponsor access to sensitive data as allowable under the applicable federal laws in a secure virtual or physical environment.

Estimates of the size of the current research community suggest that demand is likely to be on the order of 20-40 research teams per year, with each team having multiple users, provided that costs and terms of access are reasonable and that the data are, or can be, linked to other data of interest.

 

List examples of outputs from approved research and how these may be shared with NIH.

NIH has convened a number of conferences and working groups over the years drawing on Science of Science and Innovation researchers, and we would highlight the work of the SciSIP community that NIGMS is currently supporting (https://loop.nigms.nih.gov/tag/science-of-science/). A meeting of this community took place at NIH in 2016 with attendance from members of our organization, and a standing annual meeting would be a natural way for NIH to identify ideas of interest and provide feedback to researchers, and for researchers to disseminate their findings. Indeed, such research would likely be of use to working groups at the National Institutes of Health, particularly in helping to predict the effects of policy recommendations from groups such as the Next Generation Researchers Initiative. It could be integrated into such discussions by including researchers who are using the data enclave for this very purpose.

 

Examples of procedures an organization would implement to ensure the highest level of data protections, as well as to monitor, document, and notify NIH of any unauthorized and/or inadvertent data breaches.

As mentioned above, NYU’s ADRF, Michigan’s IRIS Enclave, and Census’s FSRDC network already implement best practices to ensure the highest level of data protections, as well as monitoring, documenting, and notifying stakeholders of unauthorized access and/or inadvertent data breaches.

 

Examples of outputs from approved research and how these may be shared with NIH.

See above.

 

We would like to thank Dr. Bruce Weinberg for sharing comments and thoughts related to this RFI in a public forum.