Spectrum Management and Telecommunications

Regulating content on the Internet: A new technological perspective

Approach to Content Identification: Low Level Inspection

How It Works

Beyond examination of the data contained in the packet header, it is possible to examine the contents of a data packet for alphanumeric strings. It is also possible to use methods other than string matching to identify characteristics of data or data traffic. Traffic analysis can be used to detect specific types of traffic (e.g., peer-to-peer file transfers, denial of service attacks) at the network level using network devices that detect known patterns of data behaviour. Other techniques require that the entire file be captured and key characteristics (e.g., checksums) be compared against a database of known files. If a match is found, key identifying information about the data packets comprising the file can be stored in a network device and, under certain circumstances, used to identify the file when it transits the network. There are severe limitations on what types of information can be detected and the reliability of detection at this level, which will be discussed later in this report.

How Low-level Inspection can be used to Filter Content: Page Blocking & Keyword Filtering

Page blocking provides finer-grained control than either IP address blocking or DNS tampering. Commercial providers such as Secure Computing and Websense provide software that allows administrators to block specific pages on a Web site. The authors of these types of software often provide lists of blocked content based on categories such as sex, obscenity and pornography, hate speech, criminal skills and gambling.

Administrators of this type of software can also add specific pages to be blocked, based on their own criteria. For example, the cnn.com Web site might contain an article about a political event that some government would find sensitive. Page blocking would allow that particular page to be filtered out, while still allowing access to the rest of the content on the CNN site.

Originally designed for corporate customers, technology of this type is increasingly being used by ISPs and national governments. Page blocking is generally implemented in a proxy server, and also provides the option of redirection to a blockpage or simply returning no result for a requested page. A blockpage may make it clear that access to the page has been blocked or may simply make it appear to the user as if the connection has failed.

URL keyword filtering is very similar to page blocking. However, rather than blocking a predetermined list of URLs, URL keyword filtering scans the URLs for keywords and blocks results that contain words (alphanumeric character strings) on a predetermined blacklist. This method is particularly effective for restricting the results of search engines, which tend to include the search term in the URL of the results page. Thus, for example, a user in China may be able to access the google.com search engine, but may be prevented from seeing the results if they search for a term such as "falun gong" that is subject to URL keyword filtering, because the URL of the results page contains the alphanumeric character string "falun+gong."

Screenshot: Example of message received by users when page is blocked. In this example, the user receives a connection dropped message, giving the impression that the problem is on the content provider's server.

The other advantage to this strategy, from the perspective of the administrator, is that content can be blocked without pre-knowledge of the URL of a specific page. Only a list of prohibited terms needs to be pre-defined.
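The URL keyword check described above can be sketched in a few lines. This is a minimal illustration only, not the method used by any particular product: the function name url_is_blocked and the one-entry blacklist are hypothetical, and a real filter would run inside a proxy server rather than as a standalone script.

```python
from urllib.parse import urlsplit, unquote_plus

# Hypothetical blacklist; only the list of prohibited terms is pre-defined,
# not the URLs of the pages to be blocked.
BLOCKED_TERMS = ["falun gong"]


def url_is_blocked(url, blocked_terms=BLOCKED_TERMS):
    """Return True if any prohibited term appears in the decoded URL."""
    parts = urlsplit(url)
    # Decode percent-escapes and '+' so that "falun+gong" matches "falun gong".
    haystack = unquote_plus(parts.path + "?" + parts.query).lower()
    return any(term.lower() in haystack for term in blocked_terms)


if __name__ == "__main__":
    print(url_is_blocked("http://www.google.com/search?q=falun+gong"))
    print(url_is_blocked("http://www.google.com/search?q=french+restaurants"))
```

Note that the check needs no prior knowledge of the results page itself; any URL that carries the search term is caught.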

Generally speaking, ISP routers do not inspect URLs that are contained in the packet payload. Most enterprise class routers have this capability, but enabling this feature introduces significant overhead and negatively impacts network performance.

URL keyword searches are a special case of keyword searches performed on the payload of a packet. It is technically possible to search for alphanumeric strings in the body of a page. However, the cost of doing such searches and the impact on network performance would be high. The research conducted by ONI, which is the most detailed research we have on current state-mandated filtering practices, provides no evidence that full text searches of packet contents are being conducted at the network level.

Challenges, Problems & Countermeasures to Page Blocking & URL Keyword Filtering
Impact on network performance

The network devices that are the building blocks of the Internet are highly specialized devices, designed and optimized for the purpose of routing data from source to destination as quickly and reliably as possible. Generally speaking, network devices are not concerned with the contents of a data packet, but only with the headers that tell the network device how to route the packet from its source to its destination. These routing rules are simple and the amount of data that must be parsed is a small percentage of the total size of the packet. In the case of an Ethernet network, the most widely used type of network, the packet headers make up approximately the first 64 bytes out of the total 1518 bytes in the packet, or about 4% of the total amount of data in the packet.

Routing rules are extremely simple logical operations that can be executed at very high speed. In fact, they are so simple that many high performance network devices implement these operations in hardware, rather than in software. The data required to route a packet is all contained in the packet header. The packet header is highly structured and follows a consistent layout, regardless of the type of data contained within it. The router, therefore, "knows" exactly where to look for the information it needs to route the packet. In contrast, alphanumeric strings in the relatively unstructured payload of the packet are much more difficult to detect. Scanning the payload of a packet and looking for dozens or hundreds of words in a list, or hundreds or thousands of URLs, is a much more complex operation and requires significant computing power in the network device to be diverted to filtering from its primary function. While many network devices such as routers have the ability to inspect the contents of packets, doing so significantly and negatively impacts their performance.
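The contrast between header-based routing and payload inspection can be illustrated with a small sketch. The helper names (build_ipv4_header, destination_address, payload_contains) and the sample addresses are assumptions made for this example; for simplicity, everything after the 20-byte IPv4 header is treated as payload.

```python
import socket
import struct


def build_ipv4_header(src, dst, payload_len):
    """Build a minimal 20-byte IPv4 header (no options), for illustration only."""
    version_ihl = (4 << 4) | 5           # IPv4, header length of five 32-bit words
    total_length = 20 + payload_len
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, total_length,
        0, 0,                             # identification, flags/fragment offset
        64, 6, 0,                         # TTL, protocol (TCP), checksum left at 0
        socket.inet_aton(src), socket.inet_aton(dst),
    )


def destination_address(packet):
    # Routing: the destination always sits at bytes 16-19 of the header,
    # so a single fixed-offset read is enough.
    return socket.inet_ntoa(packet[16:20])


def payload_contains(packet, terms):
    # Content inspection: every term must be searched for across the whole
    # unstructured payload, so the work grows with payload size and list length.
    header_len = (packet[0] & 0x0F) * 4
    payload = packet[header_len:].lower()
    return any(term in payload for term in terms)


if __name__ == "__main__":
    body = b"GET /news/story.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
    packet = build_ipv4_header("192.0.2.1", "198.51.100.7", len(body)) + body
    print(destination_address(packet))
    print(payload_contains(packet, [b"story", b"gambling"]))
    # Share of an Ethernet frame taken up by headers, using the figures above.
    print(f"{64 / 1518:.1%}")
```

The routing lookup touches four bytes at a known position, while the keyword scan must read the entire payload once for each term in this naive form.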

False positives/false negatives

As noted elsewhere in this report, list-based content filtering is problematic for two reasons: firstly, the amount of effort required to maintain the list is prohibitive; and secondly, previously unknown content cannot, in fact, be filtered. These two problems lead to significant numbers of false positives and false negatives. False positives are a particular problem with keyword filtering in a packet switched network, because words may be split across packets. As the president of one of the leading network device vendors said to us: How does a device installed by an Indian ISP know whether the character string "Bomb" is a reference to a terrorist plot or the first half of the word "Bombay" split across two packets?
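The "Bomb"/"Bombay" problem can be reproduced with a short sketch. It assumes a filter that matches whole words and inspects each packet on its own, without reassembling the stream; the function names and sample packets are hypothetical.

```python
import re


def word_in_text(word, text):
    """Whole-word, case-insensitive match."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def per_packet_hits(packets, word):
    # What a device sees when it inspects each packet in isolation.
    return [word_in_text(word, p) for p in packets]


def reassembled_hit(packets, word):
    # What the text actually says once the packets are joined.
    return word_in_text(word, "".join(packets))


if __name__ == "__main__":
    # False positive: "Bombay" split across two packets looks like "Bomb".
    split_city = ["Flights to Bomb", "ay leave daily"]
    print(per_packet_hits(split_city, "bomb"), reassembled_hit(split_city, "bomb"))

    # False negative: a blacklisted word split across packets is missed.
    split_word = ["online gamb", "ling site"]
    print(per_packet_hits(split_word, "gambling"), reassembled_hit(split_word, "gambling"))
```

Avoiding both errors requires the device to buffer and reassemble the stream before scanning, which adds to the performance cost discussed above.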

Cost

In order to maintain network performance while implementing a regime of packet inspection, it would be necessary to install dedicated firewalls or proxy servers to perform these functions. This would off-load the work of filtering from the core network devices, but would add significant cost and complexity to the network. According to all of the major Canadian ISPs with whom we spoke, none of them is currently using proxy servers or firewalls on their ISP networks. As noted elsewhere in this report, the hardware and software costs of network devices constitute only a small percentage of the total cost of ownership. Facilities costs, such as power and cooling, physical plant to house devices, vendor support and maintenance contracts, and administrative staff costs can add three-to-five times the initial purchase price to the cost of each device. This does not include the costs of updating and maintaining the lists of blocked terms.

Encryption

If the user wishes to circumvent page blocking or other methods that inspect the payload of a packet for alphanumeric strings, the user may request the page through a proxy server using HTTPS, which will encrypt the data stream delivered to the user. In some jurisdictions, it may be possible to block HTTPS requests. Real-time brute-force decryption47 of packet payloads encrypted using even the simplest methods of encryption is well beyond the capabilities of currently available network devices. In Canada, blocking of HTTPS requests would have a crippling effect on a wide variety of e-commerce activities, Internet banking, and many Web-based mail clients, which commonly use this method to provide secure access for their users.

How Low-level Inspection can be used to Filter Content: Digital Signatures; Content Fingerprinting

Recently the technology developed by Audible Magic to detect copyright-violating content distributed on peer-to-peer networks has received a significant amount of attention. According to testimony of Vance Ikezoye, CEO and President of Audible Magic Corporation, before the U.S. House Science and Technology Committee in June 2007, Audible Magic had at that time "over 80 customers worldwide" of whom about 70 were universities and colleges, as well as "legitimate peer-to-peer networks" such as iMesh and Kazaa, and video sharing and social community sites like MySpace and Microsoft Soapbox.

What does Audible Magic do? To quote Ikezoye, "the system matches unknown files transferred over known public peer-to-peer file sharing applications to a database of copyrighted materials that have been registered by the copyright owners. Since we focus on known public peer-to-peer file sharing applications, private communications such as email pass by unaffected."

How does Audible Magic work? As explained to the researchers by Jeremy Stern, VP Business Development for Audible Magic, the Audible Magic technology has two core aspects: (1) a very large database of "digital fingerprints" created using an algorithm that converts the "psycho-perceptual" sound characteristics of an audio file as we hear it into a "digital fingerprint"; and (2) an appliance that sits at the side of the network and creates a mirror copy of peer-to-peer traffic for analysis and comparison with its "digital fingerprint database."

Given that files transferred over the Internet are broken into packets, the Audible Magic appliance will not know whether a file it has not seen before matches the "psycho-perceptual digital fingerprint" of a piece of content registered in its database. The first time packets from that file pass by the appliance it will not recognize a match. It will have to (a) wait until it has a mirror copy of all the packets for that file, (b) create a digital fingerprint, and (c) compare that fingerprint with its database. In Ikezoye's words: "The first time our product comes across a file we have never seen before, we must go through the process of collecting and analyzing the file using our fingerprinting technology. As one might guess, on high speed networks, it may happen too fast for our technology." After Audible Magic's technology has determined whether a particular file is a match for a digital fingerprint in its database, it records in its system identifying information that can easily be detected about the packets from that file, so that the next time packets from that file are detected a quick decision can be made and action taken. Ikezoye continues: "... after this initial experience with the file, we can associate the identity of the file with an identifier, which is like an ID number for files shared over these networks. This ID number can be read from the data transmission very quickly — in more than enough time to take action. We maintain a local list of these identifiers in every system installed. Thus in most cases, this list can be used to accurately match files transferred even over high speed networks in plenty of time to react."
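The two-stage process described above (slow analysis of the complete file on first sight, then a fast identifier lookup on later transfers) can be sketched as follows. This is not Audible Magic's implementation: the class and method names are hypothetical, and a SHA-256 hash of the file bytes stands in for the proprietary psycho-perceptual fingerprint, which it does not resemble in practice.

```python
import hashlib


class FingerprintMatcher:
    """Illustrative sketch of fingerprint matching with a local identifier list."""

    def __init__(self, registered_fingerprints):
        self.registered = set(registered_fingerprints)
        # Local list: transfer identifier -> whether it matched registered content.
        self.known_ids = {}

    @staticmethod
    def fingerprint(file_bytes):
        # Stand-in only: an exact hash, not a perceptual audio fingerprint.
        return hashlib.sha256(file_bytes).hexdigest()

    def fast_lookup(self, transfer_id):
        """Quick path: return the stored verdict, or None if the file is new."""
        return self.known_ids.get(transfer_id)

    def analyze_complete_file(self, transfer_id, packets):
        """Slow path: needs every packet before a fingerprint can be computed."""
        file_bytes = b"".join(packets)
        verdict = self.fingerprint(file_bytes) in self.registered
        self.known_ids[transfer_id] = verdict
        return verdict


if __name__ == "__main__":
    song = b"bytes of a registered recording"
    matcher = FingerprintMatcher({FingerprintMatcher.fingerprint(song)})
    packets = [song[:10], song[10:]]

    print(matcher.fast_lookup("file-id-123"))                 # first sight: unknown
    print(matcher.analyze_complete_file("file-id-123", packets))
    print(matcher.fast_lookup("file-id-123"))                 # later transfers
```

The first transfer completes before any verdict exists; only subsequent transfers carrying the same identifier can be acted on in time.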

How effective is the Audible Magic solution? Ikezoye states, "No solution is or will ever be 100% effective no matter what the context." So then, in what contexts is the Audible Magic solution effective? It is effective if (and only if) all of the following are the case:

  1. The traffic is not encrypted. Again to quote Ikezoye: "The reality is that encryption technology can prevent the detection of content transfers at the file level such as that performed by our product. There are popular file sharing applications that use various levels of encryption today." In a university setting perhaps, where the institution is prepared to implement and enforce a policy prohibiting the use of encrypted peer-to-peer file sharing among users of its network, the inability to decipher encrypted content will not be an issue, as long as the system is still able to detect the protocols being used.48 However, in a public Internet setting, where there are legitimate reasons for encrypting peer-to-peer traffic, such a prohibition is not viable. Accordingly, content transferred using peer-to-peer clients such as BitTorrent, which automatically encrypt the content stream, would not be caught by the Audible Magic solution; and
  2. The traffic uses a port commonly used by peer-to-peer traffic. While most peer-to-peer traffic uses certain ports for data transfer, as discussed earlier, it is possible to tunnel peer-to-peer traffic inside of other ports, disguising it, for example, as HTTP traffic; and
  3. The offending content is being transferred as peer-to-peer traffic. Audible Magic generally examines peer-to-peer traffic to detect files that match its database of registered copyright material. While peer-to-peer is the most common method of exchanging material that violates copyrights, offending content can be and is transferred using other methods. For example, today entire HD quality commercially released videos are being broken into packets, encoded, exchanged as part of the Usenet feed, and reassembled on the recipient's computer. Moreover peer-to-peer file sharing is also able to take advantage of other protocols (e.g., tunneling inside HTTP); and finally,
  4. There is a tolerance for "false positives" or a willingness to delay taking action until checking with the user. If content is going to be blocked automatically, some legitimate transfers are also likely to be blocked. To quote the testimony before the US House Committee on Science and Technology of Charles A. Wight, Associate VP for Academic Affairs at the University of Utah, which has implemented the Audible Magic product on its campus: "It is equally important to reserve judgment in each case until after making personal contact with the user or administrator of a suspect computer to assess whether or not the use of university network resources is appropriate." Given that, as both Wight and Ikezoye state, "no system is 100% accurate," systems like Audible Magic could only be implemented at the ISP level if one were either prepared to block some legitimate transfers or prepared to confirm the violation before interfering.

It is worth noting that Audible Magic did not participate in the test of peer-to-peer filtering systems conducted by the European Advanced Networking Test Center AG (EANTC) described above. This suggests that the solution is not ready for deployment on carrier-grade networks at this time or that it could not meet the rigorous requirements of the EANTC test.

Challenges, Problems & Countermeasures to Digital Fingerprint Detection
Encryption

Encryption can be applied to the contents of a packet and/or to the header information for a packet. So, for example, the popular peer-to-peer file-sharing application BitTorrent can be set automatically to encrypt the data stream once it has been received by the client machine. Any real-time attempt to analyze the contents of the data stream as it traverses the network would be defeated by this encryption. Analysis of the data stream would require the interposition of the relevant client application to receive all the data and decrypt the file. As long as one knows the client application that is intended to receive the data and that client is publicly available, it is possible to decrypt the information using this method. Interposing a client application to receive, decrypt and analyze files before they are delivered to the end user, however, would have a significant impact on network performance and costs.

Other encryption methods require the recipient to have a "key": a code that must be entered before the data can be decrypted. This type of encryption is much more difficult to break (which is why it is used by banks and security organizations), but also much more difficult to implement (which is why it does not tend to be used for ordinary file sharing).

Cost; network impact

As discussed above in connection with protocol detection, the hardware used by ISPs and carriers is designed with extremely high performance and availability in mind. Devices such as Cisco's XR 12000 49, Juniper Networks T Series50 and Nortel's 8600 family51 routers offer the high levels of scalability and fault tolerance required in carrier and ISP networks. These devices can be configured with redundant modules that can be replaced and upgraded without an interruption of service. In typical carrier configurations, each of these devices can cost several hundred thousand dollars or more, and devices must be installed on every segment of the network that one wishes to manage. And, as was pointed out to us by the President of Narus Inc., the hardware costs associated with implementing a solution are only a fraction of the administration costs, which for a large ISP would be in the millions of dollars.

According to Vance Ikezoye, CEO and President of Audible Magic Corporation, in his testimony before the U.S. House Science and Technology Committee, the cost of their product on a large university campus network would be approximately $100,000 "depending on the bandwidth managed." (Note that these costs do not include any network management, infrastructure or operational costs associated with implementing the solution.) The University Technology Officer for Arizona State University, "one of the nation's largest universities, with over 65,000 students," installed the Audible Magic product. He notes that "the list price for the product at ASU's scale is just over $200,000, but ASU expects its costs this year, as a Pioneer Reference Account, to be closer to 1/2 that price." It is worth noting in this context that, in its campus implementations, Audible Magic does not recommend that the university implement the product on every segment on the network. "In order to detect 100% of the traffic, it would require us to be installed on every segment and device on the network — this could dramatically increase costs." It is also worth noting again that product costs are often a small part of the total costs of implementing any kind of network monitoring technology.52 The expert opinion accepted by the Belgian Court of First Instance in the case of Sabam v Tiscali (currently under appeal) suggested that the cost for implementation of the Audible Magic product by an ISP would be €0.50 per month per user. For a major Canadian ISP with 1 million subscribers, this would equate to approximately $9 million per year.
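The per-user estimate can be worked through as a short calculation. The report does not state the exchange rate behind the dollar figure; the rate of about 1.5 dollars per euro used here is an assumption, chosen because it reproduces the figure quoted above.

```python
# Figures taken from the text above.
monthly_cost_eur_per_user = 0.50
subscribers = 1_000_000

# Assumption (not stated in the report): roughly 1.5 dollars per euro.
ASSUMED_DOLLARS_PER_EUR = 1.5

annual_cost_eur = monthly_cost_eur_per_user * 12 * subscribers
annual_cost_dollars = annual_cost_eur * ASSUMED_DOLLARS_PER_EUR

if __name__ == "__main__":
    print(f"Annual cost: {annual_cost_eur:,.0f} euros")
    print(f"Annual cost at assumed rate: ${annual_cost_dollars:,.0f}")
```

At €0.50 per user per month, one million subscribers cost €6 million per year, which is about $9 million at the assumed rate.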

After-the-Fact Analysis

Up to this point, we have discussed approaches to identifying content delivered over the Internet that can be implemented in real time. Real-time approaches are necessary if the goal is to restrict or promote access to content through the interruption of the data transfer. If the goal is simply to identify content, other approaches are possible. If entire data files are available for asynchronous analysis, more detailed inspection of their contents is possible. For example, as reported in an episode of the PBS program Frontline, "AT&T technician Mark Klein inadvertently discovered that the whole flow of Internet traffic in several AT&T operations centers was being regularly diverted to the NSA."53 Large scale network surveillance of this type involves the installation of network devices that watch a mirror stream of the data traversing the data pipe (for example, the traffic at an ISP's peering points) and, based on sets of rules, send copies of network traffic to large server installations. At these server installations, a combination of technical and human intervention can be used to investigate potentially suspicious or illegal exchanges.

While we are not in a position to describe the techniques being used by national security agencies to analyze the actual contents of files, the task is similar to that faced by law enforcement authorities when attempting to determine whether evidence of criminal activity exists on an individual's computer hard drive when that computer has been seized as part of a criminal investigation. With proper authority, police may search the hard drive for evidence. To do this, they may use some automated tools that will help them categorize files by type, match files against databases of known content, or search for keywords (specific alphanumeric content) in files. Once a file that may contain potential evidence of criminal activity is identified by these means, it must be examined by a human being to determine if it may constitute evidence of wrongdoing.54 Finally, even if the police find files that they believe constitute evidence of criminal activity, the ultimate decision (at least in Canada) is up to the judicial system.55

Challenges, Problems & Countermeasures to After-the-Fact Analysis
Not Real Time

By definition, after-the-fact analysis does not restrict or promote real-time access to content. By the time the content is identified, it has already reached the recipient. After-the-fact analysis is therefore only useful to: (a) identify offending content so as to block sources in the future (e.g., add IP addresses to block lists); or (b) impose penalties on users for having accessed illegal content.

False Negatives/False Positives

As with the techniques described above for automatic identification of content transiting the Internet, any automated technique for scanning the content of files stored for after-the-fact analysis depends on pre-identification of criteria, whether these are alphanumeric character strings, checksums, or other unique characteristics of files or elements of files. If a precise match is required in order to catch offending content, a small variation (e.g., a single character) will cause the match to fail. This is the reason email spam continues to make its way through spam filters.
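The fragility of exact matching can be shown with a checksum comparison. This is a minimal sketch: SHA-256 is used as one example of a checksum, and the sample strings and the is_known helper are hypothetical.

```python
import hashlib


def checksum(data):
    """Checksum of a file's bytes; SHA-256 is used here as an example."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical database of checksums of previously identified content.
KNOWN_CHECKSUMS = {checksum(b"Buy cheap watches now")}


def is_known(data):
    return checksum(data) in KNOWN_CHECKSUMS


if __name__ == "__main__":
    print(is_known(b"Buy cheap watches now"))   # exact copy of known content
    print(is_known(b"Buy cheap watches n0w"))   # a single character changed
```

Changing one character produces an entirely different checksum, so the altered copy no longer matches anything in the database.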

The larger the set of criteria specified, the more likely a piece of content identified will be relevant. For example, the character string "blow up" might appear in documents discussing photography, balloons, political scandals, or terrorist plots (among others). On the other hand, the character string "we are going to blow up the Parliament buildings in Ottawa" is more likely to refer to a terrorist plot (although it could also refer to a photographic treatment, or be part of a report on regulating the Internet). Of course, the larger the set of criteria, the fewer items will provide a precise match.

Cost

Given the volume of data traversing the networks of major ISPs56, an attempt to comprehensively capture and analyze all Internet traffic would require massive amounts of computing power and storage. We were informed by a senior executive at Narus, Inc. that one of its largest clients (a major carrier) has invested hundreds of millions of dollars in implementing a system that is primarily focused on detecting and identifying "anomalous" IP traffic (e.g., viruses, worms, and denial of service attacks). When anomalous traffic is identified, a copy is streamed to other applications for processing and a decision as to what action should be taken. For the purposes of maintaining network health, it is generally not necessary to analyze the payload of any given data packet, so automated analysis by recognition of traffic patterns and inspection of packet headers can usually be performed quickly and action initiated in time to prevent damage to the network.

Determining the engineering involved in creating the specifications for a system that would be capable of capturing all data for an ISP with over one million users is beyond the scope of this document. It is safe to say, however, that the costs of implementing such a system for all (or even a majority) of the traffic traversing the Canadian Internet backbone would be significantly more than the hundreds of millions spent by the above mentioned carrier on its system.57

And finally, as indicated above, after-the-fact analysis ultimately relies on the human review of any potentially offending content that is identified using this technological approach. For any significant amount of content, this implies the costs of a data centre staffed by an army of reviewers.58

Search Engine Indexing

How It Works

There are numerous search engines that can be used to locate content of interest on the Internet. Each of these search engines uses its own methodology for indexing Web pages and matching the search terms entered by users to the indexed items. While the exact methods used by most search engines are proprietary, search engines tend to use programs called Web "crawlers" to scan the contents of Web pages and then store some or all of the text contained in those pages in an index. When a user enters a search term, the search engine examines its index to find a list of pages that best match the query, and a link to the original Web page is provided. A range of other criteria can influence the priority in which the search results are provided. For instance, for certain search terms geo-location (see discussion above) may be used to prioritize results for Web pages. For example, a search on the term "French restaurants" conducted using google.ca (the Canadian version of the Google search engine) will return a large number of Canadian entries in the first ten results, while the same search conducted using google.com.au (the Australian version of the Google search engine) will return a large number of Australian entries in the first ten results. Content is thus identified by how similar it is to the user's search terms and other criteria determined by the search engine provider.
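The crawl-index-query cycle can be sketched with a toy inverted index. Actual search engine methods are proprietary, as noted above, so this shows only the general idea: the page URLs and text are made up, ranking is a simple count of matching query terms, and criteria such as geo-location are left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs ordered by how many distinct query terms each page contains."""
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))


if __name__ == "__main__":
    # Hypothetical pages, standing in for text gathered by a crawler.
    pages = {
        "http://example.com/paris-bistro": "A French restaurant serving bistro classics",
        "http://example.com/pizza": "An Italian restaurant with wood-fired pizza",
        "http://example.com/french-lessons": "Learn French online",
    }
    index = build_index(pages)
    for url in search(index, "french restaurant"):
        print(url)
```

Removing a page from the index, or from the list returned by the search function, is all that is needed for that page to disappear from users' results.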

How Search Engine Indexing can be used to Filter Content: Search Results Tampering

Search results removal or search results tampering is not, strictly speaking, a method for preventing access, in that it does not, in and of itself, block users' access to content. However, it does have the potential to reduce access by impeding users' awareness of the existence of certain content: if the user doesn't know the content exists, it might as well be blocked.

Industry reports have commented on the agreements reached by Google and other search engine providers to remove search results for sensitive topics in accordance with the requirements of various governments. For example, it has been widely reported that the Chinese government has provided Google with a list of restrictions that must be imposed on users of google.cn (the local domain version of the Google search engine for China). For purposes of comparison, the authors of this report typed the term "falun gong" into the English language versions of every "local domain" Google search site.59

All 146 searches, except for the search conducted on google.cn, returned the Wikipedia article on Falun Gong and the Web site for the Falun Dafa organization within the top five results. The table below shows the ranking of the Wikipedia article and the Falun Dafa site on all searches except for that conducted on google.cn. Neither site appeared within the results returned by google.cn.

Results of Search on "falun gong"
Using Google Localized English Language Search Sites
(not including results from google.cn)

                                  Number of Results   Number of Results   Number of Results
                                  Ranking #1          Ranking #2          Rankings #3–5
Wikipedia article on Falun Gong   139                 6                   0
Falun Dafa organization site      6                   125                 14

While Google localizes content for the purpose of making it more relevant to specific regions, it is clear that more than localization is taking place in the case of google.cn. Results are being removed from those provided to users of the English language version of google.cn. Not only are the specific Web links provided completely different from those provided by the other search sites, but the number of results provided is also significantly lower (for example, approximately 20,800 for the English language version of google.cn, compared to approximately 122,000 for the English language version of google.com.tw).60 While it is impossible to know the exact form of the agreement between Google and the government of China, a perusal of the results suggests that users are only receiving results from Chinese government sites or sites within China, over which the Chinese government can exercise control.

Below are screenshots of the first page of search results returned for the search term "falun gong" by the English language versions of the google.ca (Canada), google.com.tw (Taiwan), and google.cn (China) Web sites, as well as the Mandarin version of the google.cn Web site.

Screenshot of search results for "falun gong" on google.ca, 19/02/08

Screenshot of search results for "falun gong" on google.com.tw, 19/02/08

Screenshot of English search results for "falun gong" on google.cn, 19/02/08

Screenshot of Mandarin search results for "falun gong" on google.cn, 19/02/08

Challenges, Problems & Countermeasures to Search Engine Results Removal
Overblocking

As noted, no public information is available about the exact criteria on which search results are being removed from search engines in compliance with state requirements. However, our testing indicates that search results are being removed based on pages containing specified alphanumeric character strings and/or IP addresses. As with URL keyword filtering, discussed above, restricting content based on alphanumeric character strings contained in a page or page name may result in pages being blocked that contain the proscribed strings but not, in fact, any offending content. For example, the word "beaver" may appear on a large number of pornography sites, but may also refer to a student's Canadian heritage school project.

Screenshots: Canadian school projects featuring beavers.

Similarly, excluding IP addresses (or ranges of IP addresses) from search results means that significant amounts of content that may not offend the state's sensitivities will be blocked. For example, it appears that in the Mandarin version of google.cn, results for the search term "falun gong" are only displayed if the IP address is registered in China (or is on a "white list"). There may be sites outside of China that hold views that do not contradict the sanctioned official Chinese view on Falun Gong. Nonetheless, these would be blocked along with any sites holding offending or neutral views, if location blocking is one of the techniques being used to remove search results.

Underblocking

As with URL keyword filtering, search results removal relies on pre-identification of either alphanumeric strings or IP addresses to determine which results should be removed. Unless the net is cast very widely (resulting in overblocking), users will be able to find content by using terms other than the blocked terms. Content providers wishing to avoid search results removal may choose to use alternative, non-offending terms on their main pages, allowing users to link to other pages that contain the desired content. For example, while results for the search term "falun gong" are restricted in the English language version of google.cn, the term "world religions" is not. Among the results returned by that search term is www.religionfacts.com, a site that provides "facts on the world's religions." One click away, under religions D-M, is a page on Falun Gong with a brief description and links to a number of other sites, including "Human Rights Abuse in China" with the description: "Falun Gong practitioners persecuted in China. Make a change before Beijing 2008."

Requires Cooperation of Search Engine Provider

Any scheme of search results removal requires the cooperation of the search engine provider. Implementing this technology imposes both monetary and reputational costs on the provider, and the more extensive the scheme, the higher those costs. There are many search engines available to users. To be effective, it would be necessary to reach agreements for search results removal with all search engine providers. In a free and democratic society, any scheme that lacked validity in the eyes of users would damage the reputation of currently popular search engine providers. Users would turn to other search engines, nullifying the effect of any agreement reached with the original providers and harming those providers' advertising business.


47 Decryption of an encrypted message without access to the encryption keys.

48 Some peer-to-peer file sharing systems, however, automatically use traffic encryption, or obfuscation, the purpose of which is to disguise the protocols being used and prevent traffic detection and regulation. Some peer-to-peer filtering systems are able to detect peer-to-peer traffic that is using obfuscation. These systems do not break the encryption, but are able to detect patterns in the traffic that reveal peer-to-peer sharing.

49 http://www.cisco.com/en/US/products/ps6342/index.html.

50 http://www.juniper.net/products_and_services/t_series_core_platforms/index.html.

51 http://products.nortel.com/go/product_content.jsp?segId=0&catId=null&parId=0&prod_id=44781&locale=en-US#.

52 For example, Greg Oslan, President of Narus Technologies Inc., advised us that the product cost of the Narus traffic management and security solution is generally about 15% of the total cost of implementing and operating the system.

53 http://www.pbs.org/wgbh/pages/frontline/homefront/view/ In recent media reports, American President George W. Bush has suggested that he would like the U.S. Congress to approve measures that would capture the traffic from suspect telephone numbers or IP addresses for the purposes of detecting terrorist activity.

54 In this regard, Kevin McQuiggan from the Vancouver Police Department advised the researchers that the police are strictly limited in the scope of what they may look at on a suspect's computer by the terms of the search warrant issued.

55 We would like to thank Kevin McQuiggan of the Vancouver Police Department and his team for their valuable insights.

56 Greg Oslan, President of Narus, Inc., whose system provides network operators with real-time traffic intelligence to maintain the health of their networks by, among other things, detecting worms, viruses, spam and denial of service attacks, informed us that one of the company's largest customers analyzes over 6 petabytes of data per day. (A petabyte is one million gigabytes.)

57 Another useful point of reference, although implemented for very different purposes, is the infrastructure established by Google to allow it to index Web pages. Google indexes 20 petabytes of data per day. In order to do this, it has created a cluster of 1800 machines, each machine with 2 × 2 GHz processors, 4 GB of memory, 2 × 160 GB hard disks, and a gigabit Ethernet connection. For an enterprise class server with these specifications, a large customer would typically pay approximately $3,000 Canadian. These are, of course, just the hardware costs associated with the system. http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0026.html

58 According to recent estimates, there are at least 10,000 people in China actively involved in policing the Internet, looking primarily for a limited number of terms of concern: in particular, Taiwan, Tibet, Tiananmen, and Falun Gong.

59 As of February 21, 2008, Google had localized its search engine for 146 countries, including Hong Kong, which has a distinct site and returns results comparable to those for countries other than China. Each localized version of the search engine is available in 117 languages (including over 100 languages in actual use, as well as Pig Latin, Klingon and Esperanto, among other languages not commonly used). Thus it is possible to search the google.cn site in Chinese (simplified or traditional), English, Italian or any of 114 other languages.

60 Note that a search on "falun gong" using the Mandarin version of google.cn returns many more results. However, perusal of the URLs of the first 10 result pages reveals that the sites are all located in China. It is also interesting to note that a search on "falun gong" using the Italian version of google.cn returns similar results to a search using google.ca.