Key Concepts

This page highlights key concepts and terminology used throughout the package.

Patent Number

Patent numbers are assigned by the USPTO in the format described at <https://www.uspto.gov/patents/apply/applying-online/patent-number>. A patent number consists of two parts: a prefix, which identifies the type of patent, and a number, which is unique within the prefix. Utility patent numbers consist of six, seven, or eight digits. The patent number is entered without commas or spaces and without leading zeroes.

Note that other countries have different systems for assigning patent numbers. As such, we typically prefix the patent number with the code “US” to identify United States patents. This is done in the representation of mention IDs below.
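The entry rules above (no commas, no spaces, no leading zeroes, plus the “US” code) can be sketched as a small normalization function. This is a minimal illustration, not part of the package: the name normalize_patent_number is hypothetical, and the handling of letter prefixes for non-utility patents (kept as-is in front of the digits) is an assumption.

```python
import re


def normalize_patent_number(raw: str) -> str:
    """Return a 'US'-prefixed patent number without commas, spaces, or leading zeroes.

    Hypothetical helper for illustration only; it is not part of the package.
    """
    text = raw.strip().upper()
    # Drop an existing country code so that it is not duplicated.
    if text.startswith("US"):
        text = text[2:]
    # Remove commas and any whitespace.
    text = re.sub(r"[,\s]", "", text)
    # Separate an optional letter prefix (patent type) from the digits.
    # Assumption: the prefix is made of letters only and precedes the number.
    match = re.fullmatch(r"([A-Z]*)(\d+)", text)
    if match is None:
        raise ValueError(f"Unrecognized patent number: {raw!r}")
    prefix, digits = match.groups()
    # Omit leading zeroes, keeping a single zero if nothing else is left.
    return f"US{prefix}{digits.lstrip('0') or '0'}"


print(normalize_patent_number("3,858,246"))
print(normalize_patent_number(" 0012345 "))
```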

Mention ID

An inventor’s mention ID is a reference to a specific author on a specific patent. It takes the format US<patent_number>-<sequence_number>, such as US12345-0, where patent_number is the patent number and sequence_number is the zero-based authorship number (0 for the first author, 1 for the second author, etc.).

Each granted patent is assigned a unique patent number by the USPTO, and each inventor or author is assigned a unique sequence number on that patent. Combining the patent number and the sequence number gives a unique mention ID for each inventor on each patent, which provides a way to refer to a specific inventor on a specific patent in the disambiguation process.
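As a minimal sketch of this format, the two helpers below build a mention ID from its parts and split one back apart. The names Mention, make_mention_id, and parse_mention_id are hypothetical and are not part of the package; the patent number is assumed to be given without the “US” code.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    """Parts of a mention ID (hypothetical container for illustration)."""

    patent_number: str
    sequence_number: int


def make_mention_id(patent_number: str, sequence_number: int) -> str:
    """Combine a patent number and a zero-based sequence number."""
    if sequence_number < 0:
        raise ValueError("sequence_number must be zero or greater")
    return f"US{patent_number}-{sequence_number}"


def parse_mention_id(mention_id: str) -> Mention:
    """Split a mention ID such as 'US12345-0' into its two parts."""
    # Split on the last hyphen: everything after it is the sequence number.
    body, sep, seq = mention_id.rpartition("-")
    if not sep or not body.startswith("US") or len(body) <= 2 or not seq.isdecimal():
        raise ValueError(f"Not a valid mention ID: {mention_id!r}")
    return Mention(patent_number=body[2:], sequence_number=int(seq))


print(make_mention_id("12345", 0))
print(parse_mention_id("US6009390-1"))
```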

Note that this mention ID format is specific to the USPTO. Other countries may use different formats for patent numbers and sequence numbers, and therefore for mention IDs.

The mention IDs are used as input to the disambiguation algorithm, which attempts to group together multiple mention IDs that correspond to the same inventor.

Clusters

An inventor cluster is a set of mention IDs thought to refer to the same person. There are two types of clusters used in evaluating disambiguation algorithms:

Predicted clusters are generated by the algorithm being evaluated. These clusters group together mention IDs that the algorithm has determined to refer to the same inventor.
True clusters are the ground-truth sets of mention IDs that correspond to the same inventor. These clusters are usually derived from manual annotation or other expert knowledge.

The goal of the evaluation process is to compare the predicted clusters to the true clusters, in order to assess the accuracy and performance of the disambiguation algorithm.
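To make the comparison concrete, here is a small sketch that scores predicted clusters against true clusters using pairwise precision and recall. This is one common way to compare clusterings and is shown only as an illustration; the text above does not say which metrics the package computes. The mention IDs are made up, and the function name coclustered_pairs is hypothetical.

```python
from itertools import combinations


def coclustered_pairs(clusters):
    """Return the set of unordered mention ID pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        # Sorting gives each pair a canonical order, so (a, b) == (a, b).
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


# Made-up example: four mentions, clustered in two different ways.
predicted = [{"US1-0", "US2-0", "US3-0"}, {"US4-0"}]
true = [{"US1-0", "US2-0"}, {"US3-0", "US4-0"}]

pred_pairs = coclustered_pairs(predicted)
true_pairs = coclustered_pairs(true)
common = pred_pairs & true_pairs

# Precision: share of predicted links that are correct.
precision = len(common) / len(pred_pairs) if pred_pairs else 1.0
# Recall: share of true links that were recovered.
recall = len(common) / len(true_pairs) if true_pairs else 1.0

print(f"pairwise precision = {precision:.3f}, pairwise recall = {recall:.3f}")
```

Here the predicted clustering links three pairs of mentions, only one of which is also linked in the true clustering, while the true clustering contains two linked pairs in total.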

Membership Vector

A clustering is typically represented as a membership vector. This is a mapping from each mention ID to the cluster it belongs to.

In this package, membership vectors are represented using pandas Series with mention IDs as the index and cluster IDs as the values. All clusterings and disambiguation results follow this format.

Below is an example of a membership vector for a subset of inventor mention IDs. The values appearing in the right column (cluster IDs) are arbitrary; the only convention is that mention IDs corresponding to the same inventor (belonging to the same cluster) should have the same cluster ID.

from pv_evaluation.benchmark import load_israeli_inventors_benchmark

load_israeli_inventors_benchmark()

mention_id
US3858246-0    11797
US3858578-0    11797
US3858674-0    16606
US3859165-0    13384
US3859616-0     9865
               ...  
US6009346-0    12734
US6009390-1     7694
US6009409-2    11416
US6009543-0    19168
US6009552-0      650
Name: unique_id, Length: 9156, dtype: int64
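A membership vector in the same format can also be built by hand. The sketch below reuses the first three rows of the output above, groups the mention IDs back into clusters, and checks that relabeling the cluster IDs leaves the clustering unchanged. The helper name partition and the new labels 0 and 1 are chosen only for illustration.

```python
import pandas as pd

# Mention IDs as the index, cluster IDs as the values.
membership = pd.Series(
    [11797, 11797, 16606],
    index=pd.Index(["US3858246-0", "US3858578-0", "US3858674-0"], name="mention_id"),
    name="unique_id",
)

# Group mention IDs by cluster ID to recover the clusters as sets.
clusters = {
    cluster_id: set(group.index)
    for cluster_id, group in membership.groupby(membership)
}
print(clusters)


def partition(vector: pd.Series) -> set:
    """Return the clustering as a set of frozensets, ignoring cluster labels."""
    return {frozenset(group.index) for _, group in vector.groupby(vector)}


# Cluster IDs are arbitrary: relabeling them describes the same clustering.
relabeled = membership.map({11797: 0, 16606: 1})
print(partition(membership) == partition(relabeled))
```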