Published on December 14, 2015.
No Encore for Encore? Ethical questions for web-based censorship measurement
Arvind Narayanan and Bendert Zevenbergen
A pair of computer scientists recently developed a clever way to measure Internet filtering and censorship worldwide, including in countries such as China and Iran. Their system, named Encore, does this by executing a snippet of code on the web browsers of people who visit certain web pages, without the consent of those individuals. It caused a minor furor over research ethics in the computer networking and Internet measurement research communities.
Results summary: We analyze this conundrum through the lens of established ethical principles, while keeping in mind the peculiarities of Internet and big data research: its global reach, large scale, and automated nature. We also comment on the unusual model that computer scientists use for ethical oversight. We hope that the questions we raise will be useful for researchers facing similar dilemmas in their own work, as well as for students of research ethics, both in technical disciplines and in fields such as law and philosophy.
Anyone who administers a web page can copy and paste a short snippet into the source code of the page. The snippet comes from the Encore project at the Network Operations and Internet Security Lab, now at Princeton and formerly at Georgia Tech. Its effect is to inject an invisible element into the page, which then instructs the visitor’s browser to download and execute a piece of code. That code performs censorship measurement: it further instructs the visitor’s browser to access content from one of various potentially filtered websites (again invisibly) and to report back to the research team’s server whether the access attempt was successful. By aggregating data from visitors to websites that deploy this measurement snippet and inferring these visitors’ locations from their IP addresses, researchers can obtain an accurate and up-to-date view into web filtering worldwide.
Figure 1. How Encore works. Reproduced from Burnett and Feamster’s paper .
The researchers, Sam Burnett and Nick Feamster, used this technique to conduct measurements for a period of seven months, as of January 2015, via installations by at least 17 volunteers. They recorded measurements from 88,260 distinct IP addresses in 170 countries, with China, India, the United Kingdom, and Brazil each reporting at least 1,000 measurements, and more than 100 measurements from Egypt, South Korea, Iran, Pakistan, Turkey, and Saudi Arabia.
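The aggregation step described above (combining per-visitor reports and grouping them by a location inferred from the IP address) can be illustrated with a minimal Python sketch. This is not the Encore team's code: the `Report` fields, the `failure_rates` function, the toy lookup table, and the placeholder country codes are all assumptions made for illustration, and a real deployment would rely on a geolocation database rather than a hand-written dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """One browser-submitted measurement (field names are hypothetical)."""
    client_ip: str   # address the report arrived from
    target: str      # potentially filtered URL the browser tried to load
    success: bool    # whether the load completed


def failure_rates(reports, country_of):
    """Return {(country, target): (failures, total, failure_rate)}.

    `country_of` is any callable mapping an IP string to a country code;
    a geolocation database is assumed but not implemented here.
    """
    counts = defaultdict(lambda: [0, 0])  # [failures, total]
    for r in reports:
        key = (country_of(r.client_ip), r.target)
        counts[key][1] += 1
        if not r.success:
            counts[key][0] += 1
    return {k: (f, n, f / n) for k, (f, n) in counts.items()}


# Toy data: documentation-range IPs and an invented lookup table.
TOY_GEO = {"192.0.2.10": "AA", "192.0.2.11": "AA", "198.51.100.7": "BB"}
TARGET = "http://example.com/favicon.ico"
reports = [
    Report("192.0.2.10", TARGET, False),
    Report("192.0.2.11", TARGET, False),
    Report("198.51.100.7", TARGET, True),
]
rates = failure_rates(reports, lambda ip: TOY_GEO.get(ip, "??"))
```

A high failure rate in one region does not by itself establish filtering, since loads also fail for ordinary reasons such as outages or slow connections; any real analysis would need to compare against such baseline failures.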
Encore revealed valuable information about the censorship activities of these governments, but it did so by altering the behavior of computers in ways that users were probably not anticipating and had not consented to. Encore is thus one example of a growing ethical conundrum for computer science and computer security researchers: should researchers be permitted to surreptitiously alter the behavior of Internet-connected devices in order to gain scientific data about the behavior of users and networks? If they should be allowed to in at least some cases, what are the criteria for determining proper and improper uses of these techniques, and who should enforce such standards? Encore also throws into sharp relief the conflict between two objectives: building automated, large-scale, globally applicable measurement tools and carefully analyzing ethical issues with consideration to all relevant stakeholders, laws, norms, and social, cultural, and political contexts.
The architecture of the Internet, and the web in particular, affords a variety of ways of observing the behavior of networked devices on a large scale without the cooperation of users. Indeed, the multi-billion-dollar online ad targeting industry is built on this idea. Invisible “third parties” track our devices as we browse the web to build profiles of our interests and behavior; the average top-50 website contains 64 tracking mechanisms . Meanwhile, analytics firms track people in physical spaces such as shopping malls based on the WiFi, cellular, and other emanations from their smartphones .
Computer science researchers also make creative use of methods that allow observing devices without affirmative user consent. These studies have led to insights on the state of computer security, the economics of online advertising and of spam campaigns, Internet censorship and filtering around the world, and more.
The most intrusive of these studies, technologically speaking, are those that exploit unpatched security flaws to turn users’ devices into observation points. A well-known one is Spamalytics, a study where researchers took over control of a botnet—a network of machines infected with malware and controlled by a single operator—to modify and study the spam campaigns that originated from the infected machines . In another instance, an anonymous researcher or researchers created a botnet named Carna by infiltrating more than 400,000 routers and other devices whose default passwords hadn’t been changed, and used the botnet to study essentially the entirety of Internet-connected devices [6, 7].
Other studies are non-intrusive: they simply eavesdrop on network traffic without interfering with devices. In computer networking, analyzing traffic data for improving performance and testing new protocols is standard practice and arguably essential. Such studies typically make use of data provided by Internet Service Providers (ISPs) that can be staggeringly large in size. For example, a 2014 study of IPv6 adoption utilized (among others) a dataset of traffic statistics that covered an estimated 33–50% of all Internet traffic for 2013 . This type of research generally looks at network traffic in the aggregate rather than the behavior of individual users or devices, but there are exceptions. One study used the traffic metadata of millions of users, including a campus network, to study the economics of online advertising . The study did this by inferring the information that advertisers collected about individuals, as well as how expensive the ads shown were, and then analyzing the relationship between the two.
Peer-to-peer networks are particularly amenable to non-intrusive study. Since such networks route information among peers rather than to and from designated servers, a researcher can simply set up one or more peers and hop on board, without needing any special privileges such as cooperation from an ISP. Researchers have used this method to study the BitTorrent file-sharing system, the Tor anonymity network, and the Bitcoin cryptocurrency network [10, 11, 12].
A burgeoning category of research lies in between these two in terms of intrusiveness: methods that use active probing of devices in some way but not exploitation of any security holes. These techniques are both technically and ethically fascinating, and include the Encore study.
An archetypal example of active probing is network scanning for security assessment of networks . The ZMap research tool allows performing fast scans on an Internet-wide scale . Network scanning has a long history, but a variety of new techniques are stealthier. Idle port scanning uses “side-channel attacks” to bounce traffic off an Internet-connected device in order to make measurements of other devices . These side-channel attacks are different from exploitation of security bugs. The researcher doesn’t take control over devices in any way; the bugs that allow side channels are common to many different implementations, and may be inherent to the protocol specification.
Encore similarly makes use of unintended effects inherent in the architecture of the web rather than a bug in any specific browser. The “same-origin policy” used in programming web browsers seeks to quarantine content from different domains even when loaded side-by-side on the same page, but there are limits to the effectiveness of this protection. Recent research at Princeton created an interesting twist to Encore’s research methodology, showing how to deploy measurements through online advertisements . The researcher simply purchases ad impressions—available cheaply by the thousands—and delivers the measurement code as part of the ad. This technique allows targeting by geography and demographics and can reach any user without relying on deployment by a website reachable by the user.
Encore is part of the small but growing research area of censorship measurement, an area that sees a handful of significant publications each year. In the United States, funding for censorship measurement comes from the National Science Foundation, indirectly from the State Department through its funding of censorship circumvention research, and from a few companies and philanthropic organizations.
The most basic objective of censorship measurement is compiling data on what is censored or filtered, when, and for which users. A prominent example is the Harvard Berkman Center’s Herdict project, which aims to crowd-source and aggregate data about web filtering. The Tor project’s Open Observatory of Network Interference (OONI) has a similar aim; it provides a downloadable script that users can run. Such data collection is sometimes straightforward but can require technical innovation and research. Encore, of course, is one example; another is ConceptDoppler, a tool that incorporates a way to efficiently probe a keyword-based blocking system to discover the set of all blacklisted keywords.
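To give a feel for why probing a keyword-based blocking system efficiently takes some care, here is a small Python sketch of one generic approach: test candidate keywords in batches and split only the batches that trigger blocking. This is a standard divide-and-conquer idea and is not claimed to be the method used by the keyword-probing tool mentioned above. The oracle `is_blocked`, the simulated filter, and the invented blacklist are all hypothetical.

```python
def find_blocked(candidates, is_blocked):
    """Return the candidates that trigger blocking, using batched probes.

    `is_blocked(words)` is a hypothetical oracle: True when a probe containing
    all of `words` is filtered, i.e. when at least one word is blacklisted.
    """
    candidates = list(candidates)
    if not candidates or not is_blocked(candidates):
        return []                  # a single probe clears the whole batch
    if len(candidates) == 1:
        return candidates          # isolated one blacklisted keyword
    mid = len(candidates) // 2
    return (find_blocked(candidates[:mid], is_blocked)
            + find_blocked(candidates[mid:], is_blocked))


# Simulated filter with an invented blacklist; probe_log records probe cost.
BLACKLIST = {"word37"}
probe_log = []


def simulated_filter(words):
    probe_log.append(len(words))
    return any(w in BLACKLIST for w in words)


words = [f"word{i}" for i in range(64)]
found = find_blocked(words, simulated_filter)
```

When blacklisted keywords are sparse, as in this toy run, the number of probes recorded in `probe_log` stays well under the 64 that one-at-a-time testing would need. When many candidates are blacklisted, batching can cost more probes than testing each word individually, so the benefit depends on how sparse the blacklist is.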
Another objective of censorship measurement is understanding the technical mechanisms by which censorship operates. Here are some questions on which researchers have been able to shed light: Do governments operate filters in a centralized way at Internet routers at the nation’s borders, or in a decentralized way closer to the users ? How quickly are censors able to remove content from microblogging sites ? Does censorship operate purely by blocking or removal of content, or are performance degradation and modification of content also part of the picture ? Do censors have the technical infrastructure to examine the entire contents of Internet traffic (“deep packet inspection”) without slowing it down, or do they only look at the metadata ? What types of collateral damage does censorship cause ? So far, the bulk of the computer science research on censorship measurement addresses this class of questions.
Moving from the realm of computers and networks to the realm of people and governments, political scientists are interested in Internet censorship in terms of the motives of censors, the impact of censorship on freedom of speech, and so on . Measurement directly or indirectly helps answer these questions. In 2013 Harvard researchers analyzed millions of social media posts to show that censorship in China allows government criticism but silences collective expression .
For obvious reasons, it’s hard to measure censorship from a vantage point outside the country or countries of interest. There are some interesting and important exceptions. When censorship of online posts happens after the publication of posts and if researchers can obtain the content before the censors do, measurement can happen from the outside . In another instance, a hacktivist group leaked 600 gigabytes of log files of Internet filtering devices used in Syria, allowing researchers to gain insights into censorship in that country . Similarly, an anonymous ISP in Pakistan provided researchers access to a trove of data that enabled analysis of Pakistani censorship .
Outside such exceptions, collecting data about censorship requires the participation of volunteers “on the ground”—volunteers who might expose themselves to some risk and whose numbers limit the scale of measurements. Encore straddles these two categories: it avoids the need for researchers to recruit volunteers and is easily scalable, but the individuals whose devices are used face potential risks. Encore also provides a geographically fine-grained view of measurement, which researchers in the field value . Further, since Encore turns regular Internet users into measurement vantage points, it avoids the problem of censors being able to detect and disable measurement units.
Ethical Oversight by Program Committees
Which of the research projects we’ve looked at should be considered human-subjects research? Questions of this sort have long been contentious in computer science. Human-subjects research at institutions that receive federal funding in the United States is subject to Institutional Review Board (IRB) oversight. IRBs approve research proposals based on investigators’ efforts to account for and mitigate risk to participants. Typically, research that poses little risk to individual human subjects is categorized as “exempt” from extensive oversight. In practice, however, much of this computer science research operates without IRB involvement. Historically, computer science and engineering have considered themselves to be researching human-less systems, and university IRBs are typically geared toward regulating biomedical and social science research. When IRBs encounter computer science research, there is often mutual confusion.
Acknowledging that there are ethical and legal questions regardless of whether their activities involve human subjects, researchers have sought an alternative way to ensure that published research is justifiable on scientific ethics grounds. Several sub-communities, including computer security, networking, and Internet measurement—which collectively encompass all of the research described above—appear to be converging on conference program committees as the oversight mechanism. What’s a program committee? The most prestigious research in computer science is published in the proceedings of conferences rather than journals; each iteration of each conference selects a program committee to carry out peer review.
The appropriateness of ethical gatekeeping by program committees is a topic of continual debate in the community . Supporters of this model argue that technical domain expertise is critical for ethical review and that each subcommunity must evolve its own norms by adapting ethical principles to the specific domain. The program committee process might help subcommunities evolve those norms because of the flux of members among committees, as opposed to IRBs that operate in relative independence from each other.
On the other hand, the system has numerous shortcomings, some inherent and others potentially fixable. First, the review happens after the research is complete. Unlike IRBs, there is no process for advance or continual review. The uncertainty induced by the potential rejection of research by program committees might lead researchers to abandon some research ideas or entire areas of research—especially research that pursues methodological innovation—even if, on balance, the research would have been found to be ethically acceptable had it proceeded. In cases where the putative harm arises from conducting the research rather than its publication, the retrospective ethical review fails to prevent that harm.
Second, program committees consist of domain experts and rarely include members with scholarly expertise in research ethics or ethics in general. Third, since they are formed and disbanded for each conference, they lack institutional memory, whether about specific research projects or about decision-making criteria and procedures. So far, they have operated without consistency in ethical standards and with ad hoc decision-making processes. Indeed, to our knowledge, among the conferences where the papers referenced above appeared, not a single one has published rules or guidelines for what qualifies as ethical research as part of the call for papers!
The Encore paper was submitted to ACM SIGCOMM 2015, a prestigious conference on computer networking. After heated debate, the committee accepted the paper for publication, but with a “signing statement” at the top of the paper, an unprecedented move .
The committee’s ethical objections stemmed from several arguments, outlined in the public review of the paper . First, third-party requests used for ad tracking, the committee held, at least notionally reflect the user’s intent, whereas Encore’s requests do not. Second, users downloading censored URLs might face repercussions if they live in a regime without due process. Third, the committee believed that most users for whom censorship is an issue would be unlikely to consent to Encore’s measurements.
Several analytical frameworks for scientific ethics and regulation have been created by U.S. government commissions. For example, the Belmont Report concerns scientific and medical research involving human subjects. This report, issued by the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research in 1979, established “respect for persons,” “beneficence” and “justice” as the guiding principles of research ethics. The Common Rule is the federal regulation that tasks IRBs with reviewing research to ensure it meets those principles. The 2012 Menlo Report builds on the Belmont Report and translates the scientific ethics research principles into the computer science and network engineering domain. There are many other frameworks and guidelines for this type of research. In our analysis here we’ll roughly follow the Menlo Report in terms of structure and the set of principles used. Here are the ways we examine the Encore project.
First, we provide an ethical inspection. Ethics guidelines typically recommend something akin to the following from the Menlo Report: “it is first necessary to perform a systematic and comprehensive stakeholder analysis.” In this study, this inspection raises questions of who the stakeholders are and whether Encore is human-subjects research.
Second, we provide a beneficence analysis. The principle of beneficence concerns the welfare of research participants and the balancing of probable harms against likely benefits. Ideally, the principle requires a systematic identification of the probability and magnitude of risks as well as benefits for the stakeholders, a subsequent iterative analysis of how the research design can minimize risk and maximize benefit, and finally plans to mitigate identified risks and any unforeseen harms that materialize. In this study, this analysis raises questions about identifying Encore’s potential benefits and harms, identifying the benefits of the research, minimizing risk of harm, and mitigating harm.
An important aspect of the Menlo Report is respect for persons, law, and the public interest. We examine these principles in terms of informed consent, transparency, and accountability, as described below.
Third, we review informed consent in the context of Encore. Research ethics requires treatment of individuals involved in a study as autonomous persons. In traditional human subjects research, the investigator (ideally) approaches participants before the data collection begins, explains the research, and seeks their consent. Seeking informed consent is not always feasible, as is frequently the case in network measurement research, and certain proxies for consent are sometimes deemed appropriate by the research community. Researchers can seek consent from a representative authority while debriefing research subjects after completion of data collection, or an IRB can, prior to the research, waive informed consent requirements completely. And according to the Common Rule, in cases where the researcher does not intervene in the life of an individual person to gather data, and there is no reasonable expectation of privacy, no form of consent is required. For example, anonymous observations of public activities, textual research, and examination of public records and other publicly accessible databases (even if access requires payment) are forms of research considered exempt from consent, even though they may reveal sensitive data about persons.
Fourth, we review transparency and accountability. Research ethics guidelines typically stress the importance of transparency of projects to serve the principles of accountability and meaningful informed consent. Additionally, guidelines recommend: “Debriefing is typically required when deception is used in order to mitigate harm resulting from loss of trust in researchers by those subjects who were deceived.”
Finally, we assess legal compliance. Laws and policies regarding censorship and accessing unlawful or undesirable content on the Internet vary widely across jurisdictions; sometimes they may not be codified into law or may be subject to interpretation by political officials.
Who are the stakeholders?
Any Internet user worldwide can stumble upon the invisible Encore script and carry out a censorship measurement. When an unsuspecting Internet user’s browser sends a request to a potentially censored website, as instructed by the Encore code, the user’s IP address may be recorded by the server hosting that website, as well as by many intermediaries and potentially unknown third parties. Most significantly, government-mandated censorship systems may also record and try to identify persons who access a censored website, although this is an assumption and may vary significantly by country. The Encore research team also records such measurements, which include the user’s IP address.
Trying to identify the stakeholders immediately reveals a conflict. As mentioned above, ethics guidelines typically recommend something akin to the following from the Menlo Report: “it is first necessary to perform a systematic and comprehensive stakeholder analysis.” Yet the worldwide scale of Encore means that analyzing all potential stakeholders individually is infeasible. Worse, the principle is inherently at odds with the goal of scalability in computer science and engineering. Scalability is a goal that the Encore authors emphasize; in this context, it means that the team can expand the set of measurement targets by simply adding more machine resources and without multiplying the researcher effort required.
The dogma of computer science (and the technology industry) enshrines scalability as a virtue. Web companies regularly boast about their ratio of users to engineers, which can be over a million to one . Similarly, all else being equal, a research project that scales better is considered superior. In contrast, in fields where research involves experimenting on people, researchers aim to minimize the number of subjects necessary to measure a given effect with statistical rigor. When research that uses automated methods affects people, even if indirectly, we see a clash between these two paradigms.
The Menlo Report briefly acknowledges the issue, stating: “Even a simple link traffic characterization study could involve millions of computers used by humans who are not themselves the direct subjects of research.” This tension is a theme to which we will repeatedly return.
Is Encore human-subjects research?
Unsuspecting Internet users across the globe generate research data for the Encore project. Does the reliance on these humans mean that the Encore project constitutes human-subjects research in the traditional sense, analogous to fields such as medical research or psychology? Although networking researchers typically see themselves as conducting research on technical systems, the Internet is more properly understood as a sociotechnical system in which humans and technology interact. Experiments on the Internet will likely also include data collection about the behavior of humans, or affect their environment.
Neither the Princeton nor the Georgia Tech IRB considered Encore to be human-subjects research. Under the operational definition from the Common Rule that IRBs use, a human subject is a living individual about whom an investigator obtains “(1) Data through intervention or interaction with the individual, or (2) Identifiable private information.” Should Encore’s collection of IP addresses classify it as human-subjects research?
The question of whether or not IP addresses constitute personally identifiable information (PII) is a well-worn debate . Buchanan et al. note:
“The Office for Human Research Protections has not issued a formal statement on whether IP addresses are considered to be personally identifiable information for purposes of the HHS protection of human subjects regulations at 45 CFR Part 46. However, for purposes of the HIPAA Privacy Rule, the HHS Office for Civil Rights has opined that an IP address is considered to be a direct identifier of an individual. Other European data regulations consider IPs as identifiers, and as such fall under the realm of the EU Data Directives (1995, 2006). This presents a challenge for international research and should be considered carefully by researchers and boards .”
Can researchers design Encore’s data collection to generalize collected IP addresses so that they are no longer personally identifying, yet still carry out their measurement and analysis objectives? This is an open question.
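One common way to generalize an IP address is to keep only its network prefix and discard the host bits. The Python sketch below shows that idea using the standard-library `ipaddress` module. The function name `generalize_ip` and the default prefix lengths (/24 for IPv4, /48 for IPv6) are illustrative assumptions and not values taken from the Encore project. Whether a truncated prefix is genuinely non-identifying, and whether it still supports the fine-grained geographic analysis the researchers want, is precisely the open question raised above.

```python
import ipaddress


def generalize_ip(addr, v4_prefix=24, v6_prefix=48):
    """Return the enclosing network of `addr` as a string, dropping host bits.

    The prefix lengths are illustrative assumptions, not values from Encore.
    """
    ip = ipaddress.ip_address(addr)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False accepts a host address and returns its containing network.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


# Documentation-range addresses, used here only as examples.
coarse_v4 = generalize_ip("192.0.2.77")
coarse_v6 = generalize_ip("2001:db8:1234:5678::1")
```

A shorter prefix hides the individual better but also blurs location inference, so any choice of prefix length trades privacy against measurement resolution.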
Under a narrow interpretation, the data that Encore collects is not about the individual but rather about the behavior of censorship systems. On the other hand, the definition of PII stems from medical and behavioral research, and probably did not anticipate the investigator’s actions causing other parties—in Encore’s case, the censor—to collect data about the individual. The Menlo Report advises the investigator to “respect individuals who are not targets of research yet are impacted” and says that “human subject research should now be considered as ‘human-harming research’—so the internet users may not be subjects per se, but they can still experience harm due to the research being conducted.”
Garfinkel is a proponent of the view that much of computer security research should be viewed as human-subjects research . He proposes what he calls the human test: “would the experiment be useful if the data were generated by a random process and not by a human?” It is not obvious how to apply this test to Encore. If it were practically possible—which, unfortunately, it is not—to replace the humans whose devices Encore uses for measurement by robots that visit websites in a random fashion, Encore would work very well, and in some ways better than it currently does since biases in measurement times and so on would be minimized.
Identifying Potential Benefits and Harms
Given Encore’s global scale, it will be tough for a small research group to adhere fully to the requirements of the beneficence principle. For example, before risks and harms can be identified, they must first be defined. However, due to the complex, dynamic, and innovative nature of the Internet, it is difficult to concretely define the harms for each Internet user, or even for regional groups of Internet users. The norms and attitudes of identified stakeholders with regard to accessing censored content differ greatly around the world, along with the type of censored content or possible enforcement actions. These are influenced by political, religious, historical and other social factors and are difficult—if not impossible—to quantify into a solid assessment of risks for each individual user.
Benefits of the research
Internet censorship measurement researchers argue that “whilst filtering and censorship can, to an extent, be open and transparent, their nature tends towards secrecy .” Measurement helps illuminate censorship—both its motivations and the technologies behind it. Understanding the motivations behind censorship yields valuable insights in political science, such as the fact that censorship in China allows government criticism but silences collective expression that may spur collective action . Illuminating censorship techniques enhances the ability to create effective censorship circumvention tools .
A view of Internet censorship as harmful to citizens subjected to it is implicit in much of censorship measurement research. Many see censorship as violating human rights—the freedom of speech and more specifically the freedom to seek, receive, and impart information .
Computer scientists and engineers also have technical concerns. Architecting networks to allow censorship and filtering by governments and intermediaries violates the “end-to-end” principle, a key design philosophy of the Internet. As far back as the year 2000, Clark and Blumenthal warned that as the end-to-end design erodes, the “Internet might lose some of its key features, in particular its ability to support new and unanticipated applications .” Internet engineers also raised this concern, among others, in response to the Stop Online Piracy Act (SOPA) and PROTECT IP Act (PIPA) bills in the United States, which proposed to allow the government to block copyright-infringing websites: “Censorship of Internet infrastructure will inevitably cause network errors and security problems. This is true in China, Iran and other countries that censor the network today; it will be just as true of American censorship .”
An unequivocally negative view of censorship is not universal. Bambauer argues that “widespread censorship on-line is not necessarily bad” and that the legitimacy of censorship should be assessed not by what is blocked but rather the transparency and accountability of decision-making regarding censorship . Chu and Cheng view the “Western” lens of individual autonomy and equality as inappropriate for Chinese society and evaluate Chinese online censorship from the perspective of “raising the moral level of both the state and society .”
At any rate, scholars have raised critical questions about data science, especially “big data,”  that apply to censorship measurement as well. For example, since some types of censorship are much easier to detect than others, measurement results may produce a biased picture of the state of censorship and the direction of its movement. Moreover, stripped of cultural and political context, data is hard to interpret. For instance, it may be easy to find correlations between Internet filtering and news events, but causal attribution is far trickier. Similar concerns have been raised in the field of international development . Censorship measurement researchers should be aware of this debate.
To conclude the discussion of benefits, let us recall the principle of justice, which entails that burdens as well as benefits be fairly and equitably distributed. For example, participants in a study who run the risk of harm should also benefit in the longer run from the research findings. From the Menlo Report: “Each person deserves equal consideration in how to be treated, and the benefits of research should be fairly distributed according to individual need, effort, societal contribution, and merit.”
Harm: does Encore present more than minimal risk?
The Encore paper itself considers harm primarily in terms of a comparison between Encore usage and regular web browsing. The Common Rule captures this type of comparison in the notion of minimal risk : “minimal risk means that the probability and magnitude of harm or discomfort anticipated in the research is not greater in and of itself than those encountered during daily life or during the performance of routine physical and psychological examinations or tests .”
The authors argue that normal web browsing exposes users to the same risks that Encore does, saying “the prevalence of malware and third-party trackers itself lends credibility to the argument that a user cannot reasonably control the traffic that their devices send” and “laws against accessing filtered content vary from country to country, and may be effectively unenforceable given the ease with which sites (like Encore) can request cross-origin resources without consent.”
It is true that the average web user today is not in a position to effectively control third-party requests that their browser makes. Tracking technologies often go to great lengths to be stealthy, and publishers are often oblivious to the tracking technologies deployed on their properties. Furthermore, online trackers make requests to yet other third parties, just as Encore does. In fact, these “chains” of trackers can be half-a-dozen (or more) deep. While trackers may not necessarily make requests to censored domains, they beget other risks such as exposing users to surveillance agencies that monitor Internet traffic [56, 57]. Finally, advertisements themselves—and not just advertising networks—can make cross-origin requests to arbitrary domains. The bar to serving ads is much lower than the bar to becoming an advertising network.
However, there are several caveats to this argument and nuances that we should note. First, current online tracking practices are deeply at odds with users’ expectations [58, 59, 60]. According to Nissenbaum’s theory of contextual integrity, “what people care most about is not simply restricting the flow of information but ensuring that it flows appropriately.” According to the Association of Internet Researchers (AoIR)’s ethical guidelines, researchers must ask “What are the ethical expectations users attach to the venue in which they are interacting, particularly around issues of privacy?” Arguably, both Encore and much third-party tracking today equally flout these expectations. In such an environment, there is a risk of an “ethical race to the bottom.” Credentialed researchers and respected academic organizations arguably should not participate in and facilitate a race to the bottom even if advertisers feel obliged to do so—their tools may be similar, but their ethical obligations need not be.
Second, the probability and magnitude of harm may depend on the type of censored website. For social media sites such as Facebook, Twitter, and YouTube, which are frequently censored, an Encore measurement might not stand out in any way, since widgets from these websites (such as Facebook’s Like button) are encountered extremely frequently in regular web browsing. The status is less clear with other websites such as news sites, also frequent targets of censorship. The nature and magnitude of harm may also depend on the reason the website was censored. A pattern of repeated access to specific religious websites deemed sensitive and censored will likely be viewed differently from accesses to Facebook or Twitter. It is difficult to generalize about the feasibility of enforcing laws across different regimes with different technological capabilities, real-world enforcement resources, and, more fundamentally, different levels of respect for the rule of law. There is little information available on the likelihood and severity of persecution for simply accessing (or attempting to access) blocked domains, although of course citizens of many countries face such risks for online writing. Tor Project leader Roger Dingledine notes that there is “little reprisal against passive consumers of information.” On the other hand, we know that the NSA monitored visits to pornography websites as part of a plan to “discredit radicalizers.”
Third, the focus on harm to individuals doesn’t account for other types of harms that might result. For example, the Encore authors argue that “more widespread measurements like Encore become, the less risky they are for users” by making cross-origin requests to censored domains a commonplace occurrence. On the other hand, the censors might conceivably respond by shutting down Internet connectivity altogether.
The Encore researchers limited the set of URLs that the script induced users to measure. All such URLs came from the list that Herdict asks its users to test. The current version of Encore tests only Twitter, Facebook, and YouTube, the rationale being that these domains are accessed regularly and automatically by most users’ web browsers in the course of normal web browsing. In the section on informed consent, transparency and accountability below, we discuss other (actually used as well as potential) methods to mitigate harm.
The need for harm mitigation should inform the research design process, especially in research areas where established norms don’t exist. From the AoIR guidelines:
“Ethical decision-making is a deliberative process, and researchers should consult as many people and resources as possible in this process, including fellow researchers, people participating in or familiar with contexts/sites being studied, research review boards, ethics guidelines, published scholarship (within one’s discipline but also in other disciplines), and, where applicable, legal precedent.”
The document further provides several questions for investigators to consider:
“How are the concepts of “vulnerability” and “harm” being defined and operationalized in the study? How are risks to the community/author/participant being assessed? How is vulnerability determined in contexts where this categorization may not be apparent? Would a mismatch between researcher and community/participant/author definitions of “harm” or “vulnerability” create an ethical dilemma? If so, how would this be addressed?”
Informed consent, transparency and accountability
The SIGCOMM program committee reviewing the Encore paper stated that the main ethical concern with the research would be mitigated if those who deployed Encore obtained informed consent from users. The authors argue against both the feasibility and desirability of obtaining informed consent and provide three related arguments. First, they say that it would be impractical since it would require teaching users “nuanced technical concepts... across language barriers” that would “dramatically reduce the scale and scope of measurements.” The challenges of communicating the necessary technical information to a global set of participants again highlight the tension between the scalability imperative and established ethical norms. Second, they argue that Encore with informed consent would be essentially equivalent to existing alternatives (such as, presumably, Herdict), forfeiting the benefits of the novel measurement architecture. Third, they say that informed consent may even increase risk to users by removing plausible deniability. However, we must consider that if there is no rule of law or guarantee of a fair trial with an independent judiciary in the censoring country, plausible deniability may not be enough to protect a user.
In terms of transparency, the Encore website contains a statement at the bottom “Visitors of this page have performed XXX measurements of Web filtering” and provides links to a page with additional information and the ability to opt out of future participation. Encore is meant to be deployed by other website operators; accordingly, the FAQ contains the question “Do I need to inform my site's visitors about Encore?” whose response begins:
“Although we cannot provide legal advice, we believe that you are not required to inform your site’s visitors about Encore or obtain their consent before collecting measurements. That said, Encore’s installation instructions explain how your [sic] can inform your visitors of Encore's presence and allow them to disable Encore entirely.” (emphasis in original)
The focus appears to be on legal compliance over ethical obligation.
There are several possibilities for strengthening notice. The notice and opt-out link provided to users could be more prominent, perhaps in the style of the EU “cookie law” notices. Encore could require website operators who deploy it to give the same type of notice. Encore’s FAQ could be expanded to include an explanation of the risks and benefits of the research as well as the technical concepts necessary to fully understand these.
Although the Encore team is based in the U.S., the measurement actions occur in browsers of Internet users worldwide, which makes the issue of jurisdiction unclear.
In terms of compliance with United States computer law, Encore appears to be in the clear. U.S. cybersecurity law expert Jonathan Mayer notes that “While the scope of computer abuse law remains deeply unsettled, courts have converged on two baseline principles. First, circumventing a security protection on a remote system is illegal. Second, when a system's owner explicitly revokes permission, remote access must cease.” He argues that both Encore and the ZMap tool discussed in Section 2 “unambiguously abide by these guidelines” since “both measurement approaches take advantage of known, intentional software functionality.” He continues, “As for respecting system owner preferences, the main deployment of both platforms is accompanied by a straightforward opt-out mechanism. If a system’s owner revokes permission, research data collection immediately terminates.”
A global study of Internet censorship law and policies—as well as other applicable bodies of law such as privacy and data protection law—would be a near-impossible task for a legal researcher, let alone a team of computer scientists. Enumerating all possible (albeit remote) legal risks to Encore users is similarly infeasible. For example, the Falun Gong organization is banned in China; perhaps visits to a Falun Gong website may be interpreted as support for their cause. Or perhaps Encore measurements are interpreted to constitute an act of espionage by helping a foreign power to map the national filter.
Since a thorough worldwide legal study is infeasible, Encore researchers cannot be certain that the measurements they induce do not constitute a violation of any local law. Ethicists would advise not putting people in a position where they could be perceived to have broken a law. In exceptional cases, however, researchers could develop an ethical justification that a law (or a type of law) is not in the public interest. The researchers must then demonstrate that they accept responsibility for their actions and the consequences, and have the necessary mitigation strategies in place.
In conclusion, Encore makes for a fascinating case study that presents a thick web of considerations and no easy answers. While the scale of today’s Internet and datasets is giddying to researchers and companies alike, the ethical responsibility that comes with it is rather sobering. Our analysis reveals a complex interplay between the technical design of the experiment and its potential risks and benefits.
As of this writing, Encore is very recent work, and there is an ongoing debate about its ethics, the broader question of norms for ethical research in network measurement, computer security, data science, and other disciplines, as well as the meta-question of how these disciplines should exercise ethical gatekeeping. We invite you to join the conversation.
Arvind Narayanan is a computer science professor at Princeton. He advised a Master's thesis, described in Section 2, that utilized a similar methodology to the Encore project.
Bendert Zevenbergen is a Ph.D. candidate and researcher at the Oxford Internet Institute, where he studies the intersection of law, ethics, social science, and the Internet. Along with a colleague at OII, he first brought certain ethical concerns to the Encore authors' attention, resulting in a significant change to the design. This case study is the result of a dialogue between us.
This case study was first written for the Council on Big Data, Ethics, and Society. Funding for this Council was provided by the National Science Foundation (#IIS-1413864). For more information on the Council, see: http://bdes.datasociety.net/
We are grateful for useful feedback from Nick Feamster, Jacob Metcalf, Matt Salganik, Stuart Schechter, Joss Wright, members of the BDES council, and anonymous reviewers.
Narayanan A, Zevenbergen B. No Encore for Encore? Ethical questions for web-based censorship measurement. Technology Science. 2015121501. December 14, 2015. https://techscience.org/a/2015121501/
Under review for data sharing classification.