Analyzing privacy policies at scale: From crowdsourcing to automated annotations

Shomir Wilson, Florian Schaub, Frederick Liu, Kanthashree Mysore Sathyendra, Daniel Smullen, Sebastian Zimmeck, Rohan Ramanath, Peter Story, Fei Liu, Norman Sadeh, Noah A. Smith

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Website privacy policies are often long and difficult to understand. While research shows that Internet users care about their privacy, they do not have the time to understand the policies of every website they visit, and most users hardly ever read privacy policies. Some recent efforts have aimed to use a combination of crowdsourcing, machine learning, and natural language processing to interpret privacy policies at scale, thus producing annotations for use in interfaces that inform Internet users of salient policy details. However, little attention has been devoted to studying the accuracy of crowdsourced privacy policy annotations, how crowdworker productivity can be enhanced for such a task, and the levels of granularity that are feasible for automatic analysis of privacy policies. In this article, we present a trajectory of work addressing each of these topics. We include analyses of crowdworker performance, evaluation of a method to make a privacy-policy-oriented task easier for crowdworkers, a coarse-grained approach to labeling segments of policy text with descriptive themes, and a fine-grained approach to identifying user choices described in policy text. Together, the results from these efforts show the effectiveness of using automated and semi-automated methods for extracting from privacy policies the data practice details that are salient to Internet users' interests. © 2018 Copyright is held by the owner/author(s).

Original language: English (US)
Article number: 1
Journal: ACM Transactions on the Web
Volume: 13
Issue number: 1
DOI: https://doi.org/10.1145/3230665
State: Published - Dec 1 2018

Fingerprint

Internet
Websites
Labeling
Interfaces (computer)
Learning systems
Productivity
Trajectories
Processing

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications

Cite this

Wilson, S., Schaub, F., Liu, F., Sathyendra, K. M., Smullen, D., Zimmeck, S., ... Smith, N. A. (2018). Analyzing privacy policies at scale: From crowdsourcing to automated annotations. ACM Transactions on the Web, 13(1), [1]. https://doi.org/10.1145/3230665
Wilson, Shomir ; Schaub, Florian ; Liu, Frederick ; Sathyendra, Kanthashree Mysore ; Smullen, Daniel ; Zimmeck, Sebastian ; Ramanath, Rohan ; Story, Peter ; Liu, Fei ; Sadeh, Norman ; Smith, Noah A. / Analyzing privacy policies at scale : From crowdsourcing to automated annotations. In: ACM Transactions on the Web. 2018 ; Vol. 13, No. 1.
@article{aea0d0d2796b4e438ee9b810c0c2e7f5,
title = "Analyzing privacy policies at scale: From crowdsourcing to automated annotations",
author = "Shomir Wilson and Florian Schaub and Frederick Liu and Sathyendra, {Kanthashree Mysore} and Daniel Smullen and Sebastian Zimmeck and Rohan Ramanath and Peter Story and Fei Liu and Norman Sadeh and Smith, {Noah A.}",
year = "2018",
month = "12",
day = "1",
doi = "10.1145/3230665",
language = "English (US)",
volume = "13",
journal = "ACM Transactions on the Web",
issn = "1559-1131",
publisher = "Association for Computing Machinery (ACM)",
number = "1",
}

Wilson, S, Schaub, F, Liu, F, Sathyendra, KM, Smullen, D, Zimmeck, S, Ramanath, R, Story, P, Liu, F, Sadeh, N & Smith, NA 2018, 'Analyzing privacy policies at scale: From crowdsourcing to automated annotations', ACM Transactions on the Web, vol. 13, no. 1, 1. https://doi.org/10.1145/3230665

TY - JOUR

T1 - Analyzing privacy policies at scale

T2 - From crowdsourcing to automated annotations

AU - Wilson, Shomir

AU - Schaub, Florian

AU - Liu, Frederick

AU - Sathyendra, Kanthashree Mysore

AU - Smullen, Daniel

AU - Zimmeck, Sebastian

AU - Ramanath, Rohan

AU - Story, Peter

AU - Liu, Fei

AU - Sadeh, Norman

AU - Smith, Noah A.

PY - 2018/12/1

Y1 - 2018/12/1

UR - http://www.scopus.com/inward/record.url?scp=85058278785&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85058278785&partnerID=8YFLogxK

U2 - 10.1145/3230665

DO - 10.1145/3230665

M3 - Article

AN - SCOPUS:85058278785

VL - 13

JO - ACM Transactions on the Web

JF - ACM Transactions on the Web

SN - 1559-1131

IS - 1

M1 - 1

ER -