Collaborative Research: SaTC: CORE: Medium: A Large-Scale, Longitudinal Resource to Advance Technical and Legal Understanding of Textual Privacy Information

Project: Research project

Project Details


A key aspect of consumer privacy protection is transparency. Yet despite a wealth of privacy policies and other texts that are pertinent to organizations' privacy practices, analyzing those documents has been an obstacle to understanding digital privacy at scale. Efforts to leverage information written in privacy policies, terms of service agreements, cookie notices, privacy laws, and other privacy-related documents suffer from a lack of existing resources with sufficient breadth and depth to cover the privacy landscape. In response, this project is creating an infrastructure and repository for the large-scale collection of privacy-related documents online. The collection enables surveying the privacy landscape with previously untenable coverage and accuracy, which also supports legal and public policy analysis. Additionally, it enables researchers to build technologies that bridge the gap between internet users' privacy expectations and the contents of the documents that influence or describe organizations' privacy practices. Research topics being addressed include advancing natural language processing of legal text, identifying privacy norms and outliers for sectors of commerce, and finding ways to effectively communicate changes in privacy-related documents to consumers, researchers, and policymakers.

The research team is building a large-scale, longitudinal, annotated, and searchable resource of privacy-related documents: privacy policies, terms of service agreements, cookie notices, privacy laws in the U.S. and around the world, regulatory guidelines, and other related texts on the internet. The team is advancing natural language processing (NLP) techniques for large-scale interpretation of privacy-related documents, analyzing the state of privacy at an unprecedented scale, and removing barriers to creating tools that use large amounts of data about privacy practices and regulations to provide insights and recommendations. A core focus of the project is the dissemination of resources for privacy researchers, practitioners, and policymakers, including search engines, corpora, pre-trained language models, APIs, and analysis tools and results. This research is helping to realize long-standing goals to make privacy manageable for consumers, regulators, and others invested in the world's evolving information society.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Effective start/end date: 7/1/21 to 6/30/24


  • National Science Foundation: $720,744.00

