Introduction
How is online privacy doing? Public policy makers regularly propose measures to give consumers more privacy rights online. These measures rely on the assumption that the web privacy landscape has become worse for consumers and that online tracking is more pervasive now than in the past. As policymakers consider different approaches for addressing Internet privacy, it is critical to understand how interventions such as negative press attention, self-regulation, Federal Trade Commission enforcement actions, and direct regulation affect tracking.
In 2011, we began taking comprehensive measurements of online privacy, which we term the Web Privacy Census. We took a Web Privacy Census in 2011 [1] and 2012 [2]. In this paper, we report on the Web Privacy Census of 2015.
The earlier Web Privacy Censuses measured how much information could be associated with a visitor to a website. The tracking technologies we measured were HTTP cookies, Flash cookies, and HTML5 local storage.
A cookie is a small message that a web browser (e.g., Internet Explorer, Safari, or Firefox) stores when a website it visits asks it to do so. The browser sends the message back to the server each time it requests a page from that server. Websites often use cookies to track visits to the same or different websites. A third-party cookie is one set in your browser when you visit a web page by a domain other than the website you visited.
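To make the mechanism concrete, the short Python sketch below (not part of our census pipeline) fetches a page and shows the cookies the server asks the browser to store; the URL is a placeholder.

# Minimal illustration (not part of the census pipeline): fetch a page and
# show the cookies the server asks the browser to store.
# "https://example.com" is a placeholder URL.
import requests

response = requests.get("https://example.com")

# The Set-Cookie response header carries each cookie's name, value, and
# attributes such as Expires, Domain, and Path.
print(response.headers.get("Set-Cookie"))

# requests also collects the cookies into a jar, one entry per cookie.
for cookie in response.cookies:
    print(cookie.name, cookie.value, cookie.domain)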
Flash cookies, more formally termed Local Shared Objects (LSOs), are like regular cookies except that they do not appear in a browser’s list of cookies, making them harder to detect and delete.
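As a rough illustration of why LSOs are easy to miss, the sketch below lists the .sol files Flash Player keeps on disk, outside the browser's cookie store; the path shown is the conventional Linux default and may differ by platform or player version.

# Rough sketch: Flash Player stores LSOs as ".sol" files under the user's
# profile rather than in the browser's cookie store. The path below is the
# conventional Linux default and may vary by platform and player version.
from pathlib import Path

lso_root = Path.home() / ".macromedia" / "Flash_Player" / "#SharedObjects"

if lso_root.exists():
    for sol_file in lso_root.rglob("*.sol"):
        # The directory structure encodes the domain that set the object.
        print(sol_file.relative_to(lso_root), sol_file.stat().st_size, "bytes")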
HTML5 is a markup language used for presenting content on web pages. Web browsers that support HTML5 also provide local storage, which lets a website store data in the browser; a site can keep cookie-like identifiers or other data there, for example. Local storage can hold far more data than a cookie: a cookie should not exceed 4,093 bytes (about 4 KB) and a Flash cookie defaults to 100 KB, while the HTML5 local storage limit is far larger (at least 5 MB).
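As an illustration of how a crawler can read a site's HTML5 local storage, the following sketch drives a browser with Selenium and enumerates the stored key/value pairs. This is a simplified stand-in, not the OpenWPM implementation we use, and the URL is a placeholder.

# Simplified stand-in (not the OpenWPM implementation): read a visited
# site's HTML5 local storage by executing JavaScript in an automated
# browser. "https://example.com" is a placeholder URL.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com")

# window.localStorage holds key/value pairs scoped to the visited origin;
# unlike cookies, it is not sent with HTTP requests and can hold far more data.
storage = driver.execute_script(
    "var items = {};"
    "for (var i = 0; i < window.localStorage.length; i++) {"
    "  var key = window.localStorage.key(i);"
    "  items[key] = window.localStorage.getItem(key);"
    "}"
    "return items;"
)
for key, value in storage.items():
    print(key, value[:40])

driver.quit()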
Our Web Privacy Census found a marked increase in HTML5 storage usage and a sharp decline in Flash cookies between 2011 and 2012. An increase in HTML5 storage use does not directly correlate with an increase in tracking, as an HTML5 storage object can hold any information that the developer needs to store locally. However, that storage can hold information used to track users and can persist. Our 2015 census found that regular cookie counts continue to increase, with larger and larger numbers of third-party cookies in use. Cookies are present on every one of the top 100 most popular websites, with approximately 34% using HTML5 storage, more than double the share we counted in 2011.
Methods
To measure tracking, we use a web crawler, a computer program that systematically browses the Internet, to crawl the top 100, 1,000, and 25,000 sites ranked by Quantcast. The crawler records the number of HTTP cookies, Flash cookies, and HTML5 local storage objects placed by each website, and we compare these counts with the results of our 2012 census.
We rank sites using Quantcast's list of the top 1 million websites in the United States as of July 2015. We collect data using two processes: 1) a shallow automated crawl of the top 100, 1,000, and 25,000 sites, which visits only the homepage of each domain, and 2) a deep automated crawl of the top 100 and 1,000 sites, which visits the homepage and two randomly selected links from the homepage. After visiting the first link, the crawler returns to the homepage before selecting the second link. Both links are on the same domain as the homepage.
The Crawler. The crawler is OpenWPM, a flexible and scalable platform written in Python [31]. OpenWPM can collect HTTP cookies, Flash cookies, and HTML5 local storage objects, and it can perform deep crawls by following links. Crawls can be run in either Firefox or Chrome, with or without add-ons.
We run all crawls using a Firefox version 39 browser with no add-ons, with Flash turned on, and in headless mode. We collect information from each crawled domain visit: HTTP cookies, HTML5 local storage objects, Flash cookies, and HTTP requests and responses (including headers). We run each crawl four times and report the average for each tracking method.
Shallow Automated Crawl. We run shallow crawls with a clean browser instance cleared of all tracking data. The crawler visits each URL homepage, waits for the page to load, and then dumps all tracking data obtained from that URL into a database. The crawler then closes that browser tab, opens a new tab, then continues this process with the next URL on the Quantcast list.
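The sketch below outlines this shallow crawl loop, using Selenium as a simplified stand-in for OpenWPM. The input file name, wait time, and database schema are assumptions; also, driver.get_cookies() only returns cookies visible to the current page, whereas OpenWPM reads the browser's cookie store directly.

# Simplified stand-in for the shallow crawl loop; the actual crawls use
# OpenWPM. The input file name, wait time, and schema are assumptions, and
# driver.get_cookies() only sees cookies visible to the current page
# (OpenWPM reads the browser's cookie store directly).
import sqlite3
import time

from selenium import webdriver

conn = sqlite3.connect("shallow_crawl.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS cookies (site TEXT, name TEXT, value TEXT, domain TEXT)"
)

driver = webdriver.Firefox()  # fresh profile with no prior tracking data

with open("quantcast_top_sites.txt") as f:  # one domain per line (assumed format)
    sites = [line.strip() for line in f if line.strip()]

for site in sites:
    driver.get("http://" + site)  # visit the homepage only
    time.sleep(10)                # allow the page and its third-party content to load
    for c in driver.get_cookies():  # dump this visit's tracking data
        conn.execute(
            "INSERT INTO cookies VALUES (?, ?, ?, ?)",
            (site, c.get("name"), c.get("value"), c.get("domain")),
        )
    conn.commit()

driver.quit()
conn.close()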
Deep Automated Crawl. We run deep crawls with a clean browser instance cleared of all tracking data. The crawler visits each URL homepage and waits for the page to load. It then randomly selects a link on the homepage and visits that page. After the linked page finishes loading, the crawler goes back to the previous page and visits a second randomly selected link. After the second link finishes loading, the crawler dumps all tracking data obtained from those three URLs into a database. The crawler then closes that browser tab, opens a new tab, then continues this process with the next URL on the Quantcast list.
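Continuing the stand-in above, the sketch below shows how the deep crawl's link selection might work: gather same-domain anchor links from the homepage, visit one at random, return to the homepage, and repeat. The helper names are illustrative, not OpenWPM's actual interface.

# Continuing the stand-in above: the deep crawl also visits two randomly
# chosen same-domain links, returning to the homepage between picks. The
# helper names are illustrative, not OpenWPM's interface.
import random
import time
from urllib.parse import urljoin, urlparse

from selenium.webdriver.common.by import By


def same_domain_links(driver, homepage_url):
    """Collect anchor-tag targets that stay on the homepage's domain."""
    home_host = urlparse(homepage_url).netloc
    links = []
    for anchor in driver.find_elements(By.TAG_NAME, "a"):
        href = anchor.get_attribute("href")
        if href and urlparse(urljoin(homepage_url, href)).netloc == home_host:
            links.append(urljoin(homepage_url, href))
    return links


def deep_visit(driver, homepage_url, n_links=2):
    driver.get(homepage_url)
    time.sleep(10)
    for _ in range(n_links):
        candidates = same_domain_links(driver, homepage_url)
        if not candidates:
            break
        driver.get(random.choice(candidates))  # follow one random internal link
        time.sleep(10)
        driver.get(homepage_url)               # return to the homepage before the next pick
        time.sleep(10)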
Results
Top 100 Websites - Shallow Crawl
In our shallow crawl, we detected cookies on 99 of the top 100 websites, in comparison with all 100 in October 2012. In total, we detected 6,280 HTTP cookies for the top 100 websites, compared to 3,152 in October 2012. In 2015, with our shallow crawl, we found 3 websites that placed 300 or more cookies.
Figure 1 shows the distribution of cookies for the top 100 sites. The x-axis is the number of cookies, and the y-axis is the number of sites.
A significant number of top sites used Flash cookies, but the biggest increase was in the use of HTML5 storage. In 2012, we found that 34 sites used HTML5 storage; in this investigation, 76 sites used it when we visited three links on each site. HTML5 storage also holds many “keys”: in our shallow crawl, we detected more than 800 keys in HTML5 storage.
Most HTTP cookies—83% of them—came from a third-party host. We detected 275 third-party hosts among the third-party cookies. This means that Internet tracking remains diffuse. A user who browses the most popular websites must vet dozens, even hundreds of policies to understand the state of data collection online.
At the same time, one player has an outsized ability to track online. Google Analytics had cookies on 15 sites; Google’s ad tracking network, doubleclick.net, had cookies on 26 sites; youtube.com, also owned by Google, had cookies on 8 sites. Overall, Google had a presence on 85 of the top 100 websites.
Facebook had a presence on 20 of the top 100 websites.
The most frequently appearing cookie keys for the top 100 sites in our shallow crawl were: uid, _ga, __qca, i, __uuid.
Figure 1. Histogram showing number of HTTP cookies (horizontal axis) found on the top 100 websites using shallow crawl. Vertical axis is the number of websites with a given number of cookies.
Top 100 Shallow Flash Cookies and HTML5 Local Storage
Figure 2 shows an increase in the number of Flash cookies from 2012 to 2015 on the 100 most popular web pages using shallow crawl. We tracked 877 HTML5 storage keys for these same sites.
Figure 2. Historical comparison of Flash cookies and HTML5 storage from 2012 to 2015 appearing on the homepages of the top 100 websites using shallow crawl.
Top 100 Websites—Deep Crawl
When we visited sites and made two clicks on the same domain, we detected cookies on all 100 top websites. In total, we detected 12,857 HTTP cookies for the top 100 websites, compared to 6,485 in October 2012. Figure 3 shows a summary of the key tracking metrics.
Figure 3. Key tracking metrics found in 2015 with comparisons to 2012.
In 2015, our deep crawl found that 11 websites placed 300 or more cookies. Figure 4 shows a summary. Google Analytics had cookies on 52 of the top sites; doubleclick.net had cookies on 73 sites; YouTube had cookies on 19 sites. Overall, Google had a presence on 92 of the top 100 websites. Facebook had a presence on 57 of the top 100 websites.
Our observations about Flash cookies and HTML5 storage in the shallow crawl were also reflected in the deep crawl, which visited three links on each site. Flash cookie use grew modestly, but sites now use HTML5 storage to hold many keys about site visitors.
The most frequently appearing cookie keys for the top 100 sites in our deep crawl were: _ga, uid, __utma, __utmz, id.
Figure 4. Histogram showing number of HTTP cookies (horizontal axis) found on the top 100 websites using deep crawl. Vertical axis is the number of websites with a given number of cookies.
Top 1,000 Websites - Shallow Crawl
In 2015, with a shallow crawl, we detected cookies on 94% of the top 1,000 websites. In total, there were 80,821 HTTP cookies for the top 1,000 websites. Forty-six sites placed 300 or more cookies.
Figure 6 shows the distribution of cookies for the top 1,000 sites. The x-axis is the number of cookies, and the y-axis is the number of sites. Most cookies—87% of them—were placed by a third-party host. We detected more than 797 third-party hosts among the third-party cookies. Google Analytics had cookies on 151 of the top sites; doubleclick.net had cookies on 212 sites; youtube.com had cookies on 65 sites. Overall, Google had a presence on 844 of the top websites.
Facebook had a presence on 182 of the top websites.
The most frequently appearing cookie keys for the top 1,000 sites in our shallow crawl were: _ga, __utma, __utmz, __qca, uid.
Figure 6. Histogram showing number of HTTP cookies (horizontal axis) found on the top 1,000 websites using shallow crawl. Vertical axis is the number of websites with a given number of cookies.
Top 1,000 Websites - Deep Crawl
In 2015, with a deep crawl, we detected cookies on 95% of the top 1,000 websites. In total, there were 134,769 HTTP cookies for the top 1,000 websites, compared to 65,381 in 2012. One hundred and thirty sites placed 300 or more cookies.
The most frequently appearing cookie keys for the top 1,000 sites in our deep crawl were: _ga, __utma, __utmz, optimizelySegments, optimizelyEndUserId.
Figure 8 shows the distribution of cookies for the top 1,000 sites. The x-axis is the number of cookies, and the y-axis is the number of sites. Most cookies—92% of them—were placed by a third-party host. We detected more than 880 third-party hosts among the third-party cookies.
Google Analytics had cookies on 581 of the top sites; doubleclick.net had cookies on 754 sites; youtube.com had cookies on 121 sites. Overall, Google had a presence on 923 of the top websites.
Facebook had a presence on 548 of the top websites.
Figure 8. Histogram showing number of HTTP cookies (horizontal axis) found on the top 1,000 websites using deep crawl. Vertical axis is the number of websites with a given number of cookies.
Top 1,000 Deep Flash Cookies and HTML5 Local Storage
Figure 9 shows an increase in the number of Flash cookies from 2012 to 2015 found on the 1,000 most popular web pages using deep crawl. We tracked 6,309 HTML5 storage keys for these same sites.
Figure 9. Historical comparison of Flash cookies and HTML5 storage from 2012 to 2015 appearing on the top 1,000 websites using deep crawl.
Top 25,000 Websites—Shallow Crawl
We detected HTTP cookies on 81% of the top 25,000 websites. In total, we detected 1,065,076 HTTP cookies on the top 25,000 websites, compared to 476,492 in October 2012. In 2015, with our shallow crawl, we found 568 sites placing 300 or more cookies.
Figure 10 shows the distribution of cookies for the top 25,000 sites. The x-axis is the number of cookies, and the y-axis is the number of sites. Most cookies—87% of them—come from a third-party host. We detected more than 8,839 third-party hosts among the third-party cookies. Google Analytics had cookies on 11,521 of the top sites; doubleclick.net had cookies on 5,883; YouTube had cookies on 1,453. Overall, Google had a presence on 18,375 of the top 25,000 websites.
Facebook had a presence on 2,123 of the top websites.
The most frequently appearing cookie keys for the top 25,000 sites in our shallow crawl were: __utma, __utmz, _ga, __utmb, __gads, __qca.
Figure 10. Histogram showing number of HTTP cookies (horizontal axis) found on the top 25,000 websites using shallow crawl. Vertical axis is the number of websites with a given number of cookies.
Top 25,000 Shallow Flash Cookies and HTML5 Local Storage
Figure 11 shows an increase in the number of Flash cookies from 2012 to 2015 on the 25,000 most popular web pages using shallow crawl. We tracked 48,949 HTML5 storage keys for these same sites.
Figure 11. Historical comparison of Flash cookies and HTML5 storage from 2012 to 2015 appearing on the homepages of the top 25,000 websites using shallow crawl.
Figure 12 lists the names of the top trackers.
Figure 12. The top trackers found in the study and the number of distinct websites on which they were found.
Table 2 summarizes the number of tracking technologies (HTTP cookies, Flash cookies, and HTML5 local storage objects) returned by the websites visited for the top 100, 1,000, and 25,000 domains. It displays the sum per category as well as the overall percentage.
Table 2. Overall summary of results for shallow and deep crawls for the top 100, 1,000 and 25,000 websites.
Limitations of crawler methods. For the October 2015 report, the crawler did not log in to any sites or bypass any modal dialogs, so our data does not record how cookies changed based on additional information provided by users logging into third-party services or requesting further access to the main site. Additionally, because the crawler automated the selection of URLs for deep crawls, we did not necessarily capture any retargeting based on a human action (e.g., adding items to a shopping cart). We limited deep crawls to links found in HTML anchor tags and did not follow links set by JavaScript. We also selected randomly from the links the deep crawler found, so the selection process did not take page layout or visual prominence into account. We ran the crawl using Firefox with no add-ons.
Limitations of data collection methods. The identification and classification of third- and first-party cookies is a complex task. Many tracking and advertising companies are owned by companies with different domain names; for example, DoubleClick is owned by Google. For consistency in categorizing third-party cookies, we used the public suffix list to group domains, consistent with previous work. We classified cookies set by the visited site's domain as first-party and cookies set by any other domain as third-party. This limits the analysis to domains that are syntactically third parties; it does not reflect any underlying agreements or connections that may exist between domains, for instance through “DNS aliasing,” where a primary domain assigns a subdomain to a tracking company. Under such an arrangement, ordinary third-party cookies are instantiated in a first-party fashion. The ranking list used was Quantcast's top 1 million sites in the United States; rankings may differ in other countries. Some websites on Quantcast's top 1 million list do not wish to be listed and are marked as “Hidden profile”; we crawled the top 100, 1,000, and 25,000 sites excluding hidden profiles.
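The sketch below illustrates this syntactic first-/third-party rule using tldextract, one Python library that bundles the public suffix list; the domains shown are examples, and our actual pipeline may differ in details.

# Sketch of the syntactic first-/third-party classification using the
# public suffix list via tldextract; the domains below are examples and
# the actual pipeline may differ in details.
import tldextract


def registered_domain(host):
    # e.g. "stats.doubleclick.net" -> "doubleclick.net"
    return tldextract.extract(host).registered_domain


def is_third_party(cookie_domain, visited_site):
    # First-party when the cookie's registered domain matches the visited
    # site's registered domain; third-party otherwise. Purely syntactic:
    # it cannot detect DNS aliasing, where a first-party subdomain is
    # delegated to a tracking company.
    return registered_domain(cookie_domain) != registered_domain(visited_site)


print(is_third_party("stats.doubleclick.net", "www.example.com"))  # True
print(is_third_party("www.example.com", "www.example.com"))        # False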
Discussion
We found that users who merely visit the homepages of the top 100 most popular sites collect over 6,000 HTTP cookies in the process, twice as many as we detected in 2012. If the user browses to just two more links, the number of HTTP cookies doubles. Third-party hosts set 83% of these cookies. Just by visiting the homepages of popular sites, users receive cookies placed by 275 third-party hosts.
Some popular websites use many cookies. In just visiting the homepage of popular sites, we found 24 websites that placed 100 or more cookies, 6 websites that placed 200 or more cookies, and 3 websites that placed 300 or more cookies.
We also found that more sites are using HTML5 storage, which enables websites to store a greater amount of information about consumers.
By just visiting three links per site, we found that Google has tracking infrastructure on 92 of the top 100 most popular websites and on 923 of the top 1,000 websites. This means that Google’s ability to track on popular websites is unparalleled and approaches the level of surveillance that only an ISP can achieve.
In comparison to 2012, tracking on the Web has increased. There has been a marked increase in HTTP cookies and HTML5 storage usage. Cookie counts continued to rise, with larger numbers of third-party cookies in use. More than half of the top cookies (_ga, __utma, __utmb, __utmz, optimizelyEndUserId) collect information on the pages visited by a user.
Google continues to be the single entity that can track individuals online more than any other company aside from a user’s Internet Service Provider. Still, hundreds of third-party hosts also track users, and under the current self-regulatory regime [32], it is up to users to investigate these companies’ privacy policies and decide whether to use the websites.