TY - JOUR
T1 - Analyzing privacy policies at scale
T2 - From crowdsourcing to automated annotations
AU - Wilson, Shomir
AU - Schaub, Florian
AU - Liu, Frederick
AU - Sathyendra, Kanthashree Mysore
AU - Smullen, Daniel
AU - Zimmeck, Sebastian
AU - Ramanath, Rohan
AU - Story, Peter
AU - Liu, Fei
AU - Sadeh, Norman
AU - Smith, Noah A.
N1 - Funding Information:
This research has been partially funded by the National Science Foundation under grant agreement CNS-1330596. Authors’ addresses: S. Wilson, College of Information Sciences and Technology, Westgate Building, Pennsylvania State University, University Park, PA 16802 USA; email: shomir@psu.edu; F. Schaub, University of Michigan, School of Information, 105 S State St, Ann Arbor, MI 48109 USA; email: fschaub@umich.edu; K. M. Sathyendra, D. Smullen, R. Ramanath, P. Story, and N. Sadeh, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA; emails: kanthashree.ms@gmail.com, dsmullen@cs.cmu.edu, ronramanath@gmail.com, pstory@andrew.cmu.edu, sadeh@cs.cmu.edu; S. Zimmeck, Department of Mathematics and Computer Science, Wesleyan University, Science Tower 655, 265 Church St, Middletown, CT 06459-0128 USA; email: szimmeck@wesleyan.edu; F. Liu, Computer Science Department, University of Central Florida, 4328 Scorpius St, Orlando, FL 32816-2362 USA; email: feiliu@cs.ucf.edu; N. A. Smith, Paul G. Allen School of Computer Science & Engineering, University of Washington, Box 352350, 185 E Stevens Way NE, Seattle, WA 98195 USA; email: nasmith@gmail.com.
Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/12
Y1 - 2018/12
N2 - Website privacy policies are often long and difficult to understand. While research shows that Internet users care about their privacy, they do not have the time to understand the policies of every website they visit, and most users hardly ever read privacy policies. Some recent efforts have aimed to use a combination of crowdsourcing, machine learning, and natural language processing to interpret privacy policies at scale, thus producing annotations for use in interfaces that inform Internet users of salient policy details. However, little attention has been devoted to studying the accuracy of crowdsourced privacy policy annotations, how crowdworker productivity can be enhanced for such a task, and the levels of granularity that are feasible for automatic analysis of privacy policies. In this article, we present a trajectory of work addressing each of these topics. We include analyses of crowdworker performance, evaluation of a method to make a privacy-policy oriented task easier for crowdworkers, a coarse-grained approach to labeling segments of policy text with descriptive themes, and a fine-grained approach to identifying user choices described in policy text. Together, the results from these efforts show the effectiveness of using automated and semi-automated methods for extracting from privacy policies the data practice details that are salient to Internet users' interests.
AB - Website privacy policies are often long and difficult to understand. While research shows that Internet users care about their privacy, they do not have the time to understand the policies of every website they visit, and most users hardly ever read privacy policies. Some recent efforts have aimed to use a combination of crowdsourcing, machine learning, and natural language processing to interpret privacy policies at scale, thus producing annotations for use in interfaces that inform Internet users of salient policy details. However, little attention has been devoted to studying the accuracy of crowdsourced privacy policy annotations, how crowdworker productivity can be enhanced for such a task, and the levels of granularity that are feasible for automatic analysis of privacy policies. In this article, we present a trajectory of work addressing each of these topics. We include analyses of crowdworker performance, evaluation of a method to make a privacy-policy oriented task easier for crowdworkers, a coarse-grained approach to labeling segments of policy text with descriptive themes, and a fine-grained approach to identifying user choices described in policy text. Together, the results from these efforts show the effectiveness of using automated and semi-automated methods for extracting from privacy policies the data practice details that are salient to Internet users' interests.
UR - http://www.scopus.com/inward/record.url?scp=85058278785&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85058278785&partnerID=8YFLogxK
U2 - 10.1145/3230665
DO - 10.1145/3230665
M3 - Article
AN - SCOPUS:85058278785
SN - 1559-1131
VL - 13
JO - ACM Transactions on the Web
JF - ACM Transactions on the Web
IS - 1
M1 - 1
ER -