Processing Artificial Intelligence: Annexes
From: Canadian Intellectual Property Office
On this page
- Annex A - Methodology
- Annex B - Data cleaning
- Annex C - Intellectual property concentration index
- Annex D - Revealed specialization index
Annex A - Methodology
The term "patented inventions" in our report refers to patent families. A patent family is a collection of similar patent applications filed across multiple jurisdictions. Even though there are multiple patent family indices developed by various organizations, the one considered in our report is the DOCDB patent family index. The earliest patent filed in every patent family is known as the priority patent application. For the purpose of this analysis, priority applications that were filed between 1998 and 2017 were considered.
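The family and priority-date rules above can be sketched in a few lines of Python. This is an illustration only: the record layout, the family identifiers, and the helper names `priority_applications` and `in_study_window` are hypothetical and do not reflect the DOCDB schema or the code used for the report.

```python
from datetime import date

# Hypothetical records: (family_id, application_number, filing_date).
applications = [
    ("F1", "CA-001", date(1999, 5, 3)),
    ("F1", "US-104", date(2000, 4, 28)),
    ("F2", "GB-220", date(1997, 11, 12)),
    ("F2", "EP-731", date(1998, 10, 30)),
    ("F3", "JP-412", date(2017, 12, 31)),
]

def priority_applications(records):
    """Return the earliest-filed application of each family as (number, date)."""
    earliest = {}
    for family_id, number, filed in records:
        if family_id not in earliest or filed < earliest[family_id][1]:
            earliest[family_id] = (number, filed)
    return earliest

def in_study_window(filed, first_year=1998, last_year=2017):
    """True when a filing date falls within the 1998 to 2017 study period."""
    return first_year <= filed.year <= last_year

priority = priority_applications(applications)

# Keep only families whose priority application was filed in the window.
selected = {fam: rec for fam, rec in priority.items() if in_study_window(rec[1])}
```

In this toy data, family F2 is dropped because its earliest filing predates 1998, even though a later family member falls inside the window.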
As briefly mentioned in the introductory section of the report, defining AI from a patent perspective is challenging because of its constantly evolving nature. WIPO was the first to step forward and attempt to define AI in terms of international patent activity. In its Technology Trends 2019 report, WIPO takes a generalized approach, defining AI patent activity through a combination of International Patent Classification (IPC) codes, Cooperative Patent Classification (CPC) codes, File Index and File Forming Terms (FI/F-terms) classes, and AI-specific keywords. In conjunction with WIPO's efforts, the Organisation for Economic Co-operation and Development (OECD) created a working group to establish a commonly accepted definition of AI. The working group involved representatives from IP Australia, the Canadian Intellectual Property Office (CIPO), the European Patent Office (EPO), the Israel Patent Office (ILPO), the Italian Patent and Trademark Office (UIBM), the National Institute for Industrial Property of Chile (INAPI), the United Kingdom Intellectual Property Office (UKIPO) and the United States Patent and Trademark Office (USPTO).
Inspired by the work carried out by WIPO and the findings from the OECD working group discussions, the UKIPO subsequently published a report titled Artificial Intelligence - a worldwide overview of AI patents [Footnote iv], which focused on the patenting trends in the U.K. for AI. To reduce the number of records incorrectly captured by the patent search strategy, the UKIPO adopted a narrow definition of AI and considered a time span of 20 years (1998–2017). The complete search strategy can be found in Appendix 1 of its report, and the list of patent applications identified through this search strategy can be found on the UKIPO's website [Footnote v]. The underlying raw dataset for our report is the same as the one used by the UKIPO for its analysis. However, because the two offices cleaned the data differently, some statistics may differ between the two reports.
Annex B - Data cleaning
To account for the inconsistencies and spelling errors commonly found in IP datasets, CIPO devotes a significant amount of time to ensuring that the dataset underlying the analysis has as few inconsistencies as possible. Previously, this was done entirely by hand, by grouping variants of the same name together in a software tool known as VantagePoint. This process was highly inefficient and took around 10 business days to complete.
To reduce this manual intervention, a Python script that uses machine learning (ML) techniques to clean researcher information was developed. One of the attributes fed into the ML model is a string comparison metric known as the Jaro–Winkler score, which compares the last names and first names of the two researchers under consideration. Another attribute is the difference between the application dates associated with the two researchers being compared. The ML model also takes into account the number of assignees the two researchers share.
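To make the three attributes concrete, the sketch below computes them for one pair of researcher records. It is a minimal illustration under stated assumptions, not CIPO's script: the record fields, the `pair_features` helper, the choice to score first and last names separately, and the lower-casing of names are all hypothetical, and the ML model that consumes these features is not shown because the report does not describe it. The Jaro–Winkler function follows the standard definition with a prefix scale of 0.1 and a prefix cap of four characters.

```python
from datetime import date

def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

def pair_features(a, b):
    """Build the three kinds of attributes for a pair of records (hypothetical layout)."""
    return {
        # Assumption: names are lower-cased and scored separately.
        "last_name_jw": jaro_winkler(a["last"].lower(), b["last"].lower()),
        "first_name_jw": jaro_winkler(a["first"].lower(), b["first"].lower()),
        "days_between_applications": abs((a["application_date"] - b["application_date"]).days),
        "shared_assignees": len(a["assignees"] & b["assignees"]),
    }

# Hypothetical records for two possibly identical researchers.
smith_a = {"first": "John", "last": "Smith",
           "application_date": date(2010, 3, 1), "assignees": {"Acme Corp", "Beta Ltd"}}
smith_b = {"first": "Jon", "last": "Smyth",
           "application_date": date(2010, 3, 31), "assignees": {"Acme Corp"}}

features = pair_features(smith_a, smith_b)
```

A feature dictionary like this would be computed for each candidate pair and passed to the classifier, which decides whether the two records refer to the same researcher.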
Within a group of records identified as the same researcher, the script replaces every record with the one that contains the most information. For illustration, the first two records in the following example will be replaced by the third record:
- John Smith
- John Smith, CA
- John Smith, Ottawa, ON, CA
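The replacement step in this example can be sketched as follows. How the script measures "most information" is not stated in the report; this sketch assumes it can be approximated by the number of comma-separated fields, with text length as a tie-breaker. The `most_informative` helper is hypothetical.

```python
def most_informative(records):
    """Pick the record with the most non-empty comma-separated fields.

    Assumption: field count stands in for "most information";
    ties are broken by the longer text.
    """
    def richness(record):
        fields = [f.strip() for f in record.split(",") if f.strip()]
        return (len(fields), len(record))
    return max(records, key=richness)

# The three records from the example, already grouped as one researcher.
cluster = ["John Smith", "John Smith, CA", "John Smith, Ottawa, ON, CA"]

canonical = most_informative(cluster)

# Every record in the group is replaced by the canonical one.
cleaned = [canonical for _ in cluster]
```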
We are in the process of further improving the performance of the script with short names and are also trying to leverage the geographical information of researchers as an additional attribute for the ML model.
Annex C - Intellectual property concentration index
The Intellectual Property Concentration Index (IPCI) introduced in this report follows a long history of concentration indices applied in many disciplines, such as the Herfindahl-Hirschman Index, the Simpson index, the Shannon diversity index, and the effective number of parties index. The formula used to calculate the IPCI is as follows:
$$\mathrm{IPCI} = s_1^2 + s_2^2 + s_3^2 + \dots + s_n^2$$
where $s_n$ is the share of patented inventions held by participant $n$, expressed as a fraction. Note that a fractional counting approach was used to calculate patented invention totals for each participant.
The value of the index ranges between 1/n and 1, where n is the number of participants. Index values closer to the lower bound of 1/n (which approaches 0 when there are many participants) indicate that an industry or technology field is more competitive, consisting of a large number of less-active participants, whereas index values closer to 1 indicate that an industry or technology field is more concentrated, consisting of a few dominant players.
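As a worked illustration of the index, the sketch below computes it from a small, made-up set of inventions. The report does not spell out its fractional counting rule; this sketch assumes each invention's credit is split equally among its participants. The helper names and the data are hypothetical.

```python
def fractional_counts(inventions):
    """Total credit per participant.

    Assumption: each invention is worth 1 and is split equally
    among the participants listed on it.
    """
    totals = {}
    for participants in inventions:
        credit = 1.0 / len(participants)
        for p in participants:
            totals[p] = totals.get(p, 0.0) + credit
    return totals

def ipci(totals):
    """Sum of squared shares, where each share is a fraction of the grand total."""
    grand_total = sum(totals.values())
    return sum((v / grand_total) ** 2 for v in totals.values())

# Four hypothetical inventions; the second is shared by A and B.
inventions = [["A"], ["A", "B"], ["B"], ["C"]]

totals = fractional_counts(inventions)   # A: 1.5, B: 1.5, C: 1.0
index = ipci(totals)                     # 0.375^2 + 0.375^2 + 0.25^2 = 0.34375
```

With three participants the lower bound is 1/3, so a value of 0.34375 indicates a field that is close to evenly split. A single participant holding every invention would give an index of 1.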
Annex D - Revealed specialization index
In order to better understand a country's strengths in AI, the Relative Specialization Index (RSI) was used. The formula used to calculate the RSI for a particular country is as follows:
where P represents patented inventions.
Numerator: the total number of patented inventions in AI assigned to a particular country's applicants, divided by the total number of patented inventions in AI identified globally.
Denominator: the total number of patented inventions assigned to a particular country's applicants across all technology sectors, divided by the total number of patented inventions identified globally across all technology sectors. The data for the denominator was obtained from EPO-PATSTAT.
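The two shares described above can be combined as a ratio, as in the following sketch. It covers only what the text describes: the country's share of AI patented inventions divided by its share of patented inventions overall. Any further transformation applied in the report's formula (a logarithm, for example) is not described in the text and is not included here. The function name `specialization_ratio` and all figures are hypothetical.

```python
def specialization_ratio(country_ai, world_ai, country_all, world_all):
    """Country's share of AI inventions divided by its share of all inventions."""
    ai_share = country_ai / world_ai            # share of global AI inventions
    overall_share = country_all / world_all     # share of all inventions globally
    return ai_share / overall_share

# Hypothetical counts of patented inventions (P).
ratio = specialization_ratio(
    country_ai=200, world_ai=10_000,
    country_all=5_000, world_all=500_000,
)
```

In this made-up case the country holds 2% of AI inventions but only 1% of all inventions, so the ratio is about 2. A ratio above 1 means the country is more active in AI than its overall patenting would suggest, and a ratio below 1 means it is less active.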