✍️ Creating Inventors Benchmark Datasets by Hand#

This notebook describes the practical procedure used at the American Institutes for Research to construct hand-disambiguated benchmark datasets of inventor mentions.

The procedure has three steps:

  1. We sample inventor mentions uniformly at random.

  2. For each sampled mention, given its associated predicted cluster, we identify mentions that should be removed from the predicted cluster.

  3. For each sampled mention, given its associated predicted cluster, we identify mentions that should be added to the predicted cluster.

This provides a set of ground-truth clusters that have been sampled with probability proportional to their size. Note that the procedure depends on a baseline disambiguation algorithm, typically the current PatentsView disambiguation. In cases where no errors are found, predicted clusters are assumed to be correct.
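To see why sampling mentions uniformly yields clusters with probability proportional to their size, here is a minimal simulation sketch. The toy clustering, the function name `sample_clusters`, and sampling with replacement are all assumptions made for illustration; they are not taken from the actual sampling code.

```python
import random
from collections import Counter

# Hypothetical toy clustering: mention ID -> predicted inventor ID.
predicted = {
    "US1-0": "a", "US2-0": "a", "US3-0": "a",  # cluster "a" has 3 mentions
    "US4-0": "b",                              # cluster "b" has 1 mention
}


def sample_clusters(predicted, n, seed=0):
    """Draw n mentions uniformly at random (with replacement, an assumption)
    and return the predicted cluster of each sampled mention."""
    rng = random.Random(seed)
    mentions = sorted(predicted)
    return [predicted[rng.choice(mentions)] for _ in range(n)]


n_draws = 10_000
counts = Counter(sample_clusters(predicted, n=n_draws))

# Cluster "a" holds 3 of the 4 mentions, so it should be drawn
# roughly three times as often as cluster "b".
share_a = counts["a"] / n_draws
print(f"share of draws landing in cluster 'a': {share_a:.3f}")
```

Because larger clusters are over-represented in such a sample, any summary computed from it would need to account for this size bias.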

To find mentions that should be removed in step (2), we use PatentsView.org, which provides a convenient interface for browsing inventor clusters. To find mentions that should be added in step (3), we use PatentsView.org’s search tools to review mentions of similarly named inventors.

Practical Implementation#

From a practical standpoint, staff reviewing inventor clusters keep track of mentions to be added and removed in an Excel spreadsheet. This spreadsheet contains one row for each sampled inventor mention, with columns for the patent number, the predicted inventor identifier, and the sampled inventor mention’s name. As part of the review process, a column named “add” is appended to hold, for each row, a comma-separated list of inventor mentions to add to the predicted cluster. A column named “remove” is likewise appended to hold a comma-separated list of inventor mentions to remove from the predicted cluster.

Note that inventor mentions take the standard form “US<patent_number>-<sequence_number>”, where <patent_number> is the patent number of the inventor mention and <sequence_number> is the 0-indexed inventor sequence number.
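As an illustration of this identifier format, the following sketch builds and parses mention IDs. The helper names `make_mention_id` and `parse_mention_id` are hypothetical and are not part of pv_evaluation. The regular expression also accepts letters in the patent number (for example, reissue patents such as RE46143), which is an assumption about how such patents are encoded.

```python
import re

# "US", then the patent number, a hyphen, and the 0-indexed sequence number.
# Allowing letters in the patent number is an assumption (e.g. "RE46143").
MENTION_RE = re.compile(r"^US(?P<patent_id>[A-Za-z0-9]+)-(?P<sequence>\d+)$")


def make_mention_id(patent_id, sequence):
    """Build a mention ID from a patent number and an inventor sequence number."""
    return f"US{patent_id}-{int(sequence)}"


def parse_mention_id(mention_id):
    """Split a mention ID into (patent_id, sequence); raise ValueError if malformed."""
    match = MENTION_RE.match(mention_id.strip())
    if match is None:
        raise ValueError(f"Malformed inventor mention: {mention_id!r}")
    return match["patent_id"], int(match["sequence"])


mention = make_mention_id("6267035", 3)
patent_id, sequence = parse_mention_id("US8561841-0")
print(mention, patent_id, sequence)
```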

An example of a reviewed set of inventor mentions is shown below.

import pandas as pd

pd.read_excel("2022-07-25-Emma-patent-samples-part-2.xlsx").head(10)
patent_id inventor_id name_first name_last sequence add remove correct notes Unnamed: 9
0 6267035 fl:ca_ln:santizo-1 Carlos Gilberto Santizo 3 NaN NaN yes NaN NaN
1 4690644 fl:ma_ln:flanders-2 Marguerita E. Flanders 1 NaN NaN yes NaN NaN
2 10120759 fl:ar_ln:gv-1 Aravind Gv 0 NaN NaN yes NaN NaN
3 5290082 fl:th_ln:mealey-1 Thomas P. Mealey 2 NaN NaN yes NaN NaN
4 RE46143 fl:p._ln:erman-1 P. Gregory Erman 0 US8561841-0, US7475795-3, US9296603-1,US9738507-1 NaN NaN NaN NaN
5 10223669 fl:to_ln:geniesse-1 Tom Geniesse 0 NaN NaN yes NaN NaN
6 6387460 fl:hi_ln:yoshizawa-11 Hideo Yoshizawa 1 US6993267-5, US10895827-2 NaN NaN not sure about this one- seems like glass and ... NaN
7 5928343 fl:ma_ln:horowitz-3 Mark Horowitz 1 NaN US7736282-0 NaN a few with abnormal assignees NaN
8 11009520 fl:st_ln:bowers-3 Stewart V. Bowers, III 1 NaN NaN yes NaN NaN
9 5467579 fl:si_ln:boriani-1 Silvano Boriani 0 NaN NaN yes NaN NaN
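The “add” and “remove” cells are free-form text typed by reviewers, so spacing around commas varies (see row 4 above), and empty cells are read as NaN. A minimal sketch of how such a cell could be turned into a clean list follows. The function `parse_mention_list` is a hypothetical helper, not the parsing used by process-inventors-hand-disambiguation.py.

```python
import math


def parse_mention_list(cell):
    """Split a comma-separated spreadsheet cell into a list of mention IDs.

    Empty cells (None or NaN) give an empty list; whitespace around each
    item is stripped and empty items are dropped.
    """
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return []
    return [item.strip() for item in str(cell).split(",") if item.strip()]


# The "add" cell of row 4 above, with its inconsistent spacing.
added = parse_mention_list("US8561841-0, US7475795-3, US9296603-1,US9738507-1")
empty = parse_mention_list(float("nan"))
print(added, empty)
```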

Validation#

Using a reference set of inventor mentions together with the predicted clustering (i.e., the “rawinventor.tsv” file from PatentsView’s bulk data downloads), we look for inventor mentions that do not exist in the data and for mentions listed for removal that are not part of the sampled mention’s cluster. Additionally, we print out a sheet comparing the names of sampled inventors with the names of inventors added to predicted clusters. This way, obvious errors in the review process can be flagged and corrected.
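The two checks described above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the code of the pv_evaluation script: the function `find_review_errors`, the dictionary-based reference, and the toy data are all hypothetical.

```python
def find_review_errors(sampled_inventor_id, add, remove, reference):
    """Flag suspicious entries in one reviewed row.

    reference maps each known mention ID to its predicted inventor ID.
    Returns (add_errors, remove_errors):
      - add_errors: mentions to add that do not exist in the reference data;
      - remove_errors: mentions to remove that are not in the sampled
        mention's predicted cluster (including unknown mentions).
    """
    add_errors = [m for m in add if m not in reference]
    remove_errors = [m for m in remove if reference.get(m) != sampled_inventor_id]
    return add_errors, remove_errors


# Hypothetical reference clustering.
reference = {"US1-0": "x-1", "US2-0": "x-1", "US3-1": "y-1"}

add_errors, remove_errors = find_review_errors(
    sampled_inventor_id="x-1",
    add=["US3-1", "US9-0"],     # "US9-0" does not exist in the reference
    remove=["US2-0", "US3-1"],  # "US3-1" is not part of cluster "x-1"
    reference=reference,
)
print(add_errors, remove_errors)
```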

Validation is done using the process-inventors-hand-disambiguation.py script provided by the pv_evaluation package, as follows. First, we install pv_evaluation and download the rawinventor.tsv file. Next, we run process-inventors-hand-disambiguation.py in debug mode to produce an Excel spreadsheet containing one page for review errors and one page for the comparison of sampled names with added inventor names.

%%bash

pip install -q git+https://github.com/PatentsView/PatentsView-Evaluation.git@release
wget -nc -q -nv https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip -O rawinventor.tsv.zip
unzip -n rawinventor.tsv.zip
process-inventors-hand-disambiguation.py --debug 2022-07-25-Emma-patent-samples-part-2.xlsx rawinventor.tsv
Archive:  rawinventor.tsv.zip

The debugging pages are shown below:

pd.read_excel("true_clusters.csv.debug.xlsx", sheet_name="Cluster Errors").drop(columns=["sequence", "Unnamed: 9", "correct", "notes"])
patent_id inventor_id name_first name_last add remove remove_errors add_errors
0 6267035 fl:ca_ln:santizo-1 Carlos Gilberto Santizo NaN NaN [] []
1 4690644 fl:ma_ln:flanders-2 Marguerita E. Flanders NaN NaN [] []
2 10120759 fl:ar_ln:gv-1 Aravind Gv NaN NaN [] []
3 5290082 fl:th_ln:mealey-1 Thomas P. Mealey NaN NaN [] []
4 RE46143 fl:p._ln:erman-1 P. Gregory Erman US8561841-0, US7475795-3, US9296603-1,US9738507-1 NaN [] []
... ... ... ... ... ... ... ... ...
195 10455291 fl:jo_ln:bernstein-1 Joseph Harold Bernstein NaN NaN [] []
196 5057055 fl:mi_ln:presseau-1 Michel Presseau NaN NaN [] []
197 6810399 fl:an_ln:osborn-5 Andrew Osborn NaN NaN [] []
198 9730177 fl:st_ln:toth-4 Stefan Karl Toth NaN NaN [] []
199 4614424 fl:sh_ln:watanabe-201 Shunji Watanabe NaN NaN [] []

200 rows × 8 columns

pd.read_excel("true_clusters.csv.debug.xlsx", sheet_name="Validation of Added Mentions").drop(columns=["sequence", "inventor_id"])
patent_id name_first name_last added name_first_added name_last_added
0 RE46143 P. Gregory Erman US7475795-3 Gregory Erman
1 RE46143 P. Gregory Erman US8561841-0 Gregory P. Erman
2 RE46143 P. Gregory Erman US9296603-1 Paul Gregory Erman
3 RE46143 P. Gregory Erman US9738507-1 Paul Gregory Erman
4 6387460 Hideo Yoshizawa US10895827-2 Hideo Yoshizawa
... ... ... ... ... ... ...
227 5792879 Thomas Gessner US8963898-4 Thomas Gessner
228 5792879 Thomas Gessner US9005705-2 Thomas Gessner
229 5792879 Thomas Gessner US9291285-2 Thomas Gessner
230 10474574 Seung-Beom Lee US10275371-1 Seungbeom Lee
231 9979878 Sapna A Shroff US11042034-2 Sapna Shroff

232 rows × 6 columns
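A comparison sheet like the one above could be assembled roughly as follows. This is a sketch under assumptions: `compare_added_names`, the `same_last_name` flag, and the toy names are hypothetical, and the actual script only lists the names side by side for manual review.

```python
def compare_added_names(sampled_mention, added_mentions, names):
    """Pair the sampled inventor's name with the name of each added mention.

    names maps mention IDs to (first_name, last_name) tuples. The
    'same_last_name' flag is a hypothetical convenience for spotting
    likely review errors; it is not a column of the debug spreadsheet.
    """
    first, last = names[sampled_mention]
    rows = []
    for mention_id in added_mentions:
        # Unknown mentions get (None, None) so they stand out in the output.
        first_added, last_added = names.get(mention_id, (None, None))
        rows.append({
            "name_first": first,
            "name_last": last,
            "added": mention_id,
            "name_first_added": first_added,
            "name_last_added": last_added,
            "same_last_name": last_added is not None
            and last_added.casefold() == last.casefold(),
        })
    return rows


# Hypothetical names keyed by mention ID.
names = {
    "US100-0": ("Jane Q.", "Doe"),
    "US200-1": ("Jane", "Doe"),
    "US300-2": ("John", "Smith"),
}
comparison = compare_added_names("US100-0", ["US200-1", "US300-2"], names)
flags = [row["same_last_name"] for row in comparison]
print(flags)
```

Rows where the flag is false are not necessarily errors (names change and are sometimes misspelled), so they are best treated as prompts for a second look.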

Transformation into Benchmark Dataset#

Once errors in the review process have been corrected, the working Excel sheet can be transformed into a CSV file containing the hand-disambiguation results in the standard format of a membership vector. This is done by running process-inventors-hand-disambiguation.py as follows. Note that the name of the output file can be changed using the “--output” argument.

%%bash

process-inventors-hand-disambiguation.py "2022-07-25-Emma-patent-samples-part-2.xlsx" "rawinventor.tsv"

The result (saved by default to “true_clusters.csv”) is shown below:

pd.read_csv("true_clusters.csv")
mention_id inventor_id
0 US7152514-3 fl:ca_ln:santizo-1
1 US6564684-3 fl:ca_ln:santizo-1
2 US7832315-3 fl:ca_ln:santizo-1
3 US6267035-3 fl:ca_ln:santizo-1
4 US6708592-3 fl:ca_ln:santizo-1
... ... ...
6733 US5679889-0 fl:sh_ln:watanabe-201
6734 US7169506-0 fl:sh_ln:watanabe-201
6735 US6459564-0 fl:sh_ln:watanabe-201
6736 US7749649-0 fl:sh_ln:watanabe-201
6737 US8553392-3 fl:sh_ln:watanabe-201

6738 rows × 2 columns
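Conceptually, each ground-truth cluster is the predicted cluster of a sampled mention, minus the mentions marked for removal, plus the mentions marked for addition. The sketch below illustrates that set arithmetic. It assumes this is how the corrections are combined, and `build_true_clusters` and the toy data are hypothetical rather than taken from pv_evaluation.

```python
def build_true_clusters(rows, reference):
    """Build a membership vector (mention ID -> inventor ID) from reviewed rows.

    rows is a list of dicts with keys 'inventor_id', 'add', and 'remove'
    (the last two being lists of mention IDs). reference maps every known
    mention ID to its predicted inventor ID.
    """
    # Invert the reference: inventor ID -> set of mentions in its predicted cluster.
    clusters = {}
    for mention_id, inventor_id in reference.items():
        clusters.setdefault(inventor_id, set()).add(mention_id)

    membership = {}
    for row in rows:
        members = set(clusters.get(row["inventor_id"], set()))
        members -= set(row["remove"])  # drop mentions wrongly clustered together
        members |= set(row["add"])     # add mentions that were missed
        for mention_id in sorted(members):
            membership[mention_id] = row["inventor_id"]
    return membership


# Hypothetical reference clustering and one reviewed row.
reference = {"US1-0": "a-1", "US2-0": "a-1", "US3-0": "a-1", "US4-1": "b-1"}
rows = [{"inventor_id": "a-1", "add": ["US4-1"], "remove": ["US3-0"]}]

true_clusters = build_true_clusters(rows, reference)
print(true_clusters)
```

Only clusters of sampled mentions appear in the result, which matches the idea of a benchmark built from a sample rather than from the full data.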

More Information#

For more information, please refer to the help file of process-inventors-hand-disambiguation.py:

%%bash

process-inventors-hand-disambiguation.py --help
usage: process-inventors-hand-disambiguation.py [-h] [-o OUTPUT] [-d]
                                                hand_disambiguation
                                                rawinventor

Process inventors hand-disambiguation files: validate data and produce
benchmark dataset.

positional arguments:
  hand_disambiguation   Excel spreadsheet with sampled inventor mentions, the
                        corresponding predicted cluster, and lists of inventor
                        mentions to add to and remove from the predicted
                        clusters. This spreadsheet should contain the columns
                        'patent_id', 'sequence', 'inventor_id', 'add', and
                        'remove'. The 'add' and 'remove' columns should
                        contain comma-separated inventor mentions in the
                        format US<patent_number>-<sequence_number>.
  rawinventor           File with reference inventor mentions and predicted
                        clusters. It should contain the columns 'patent_id',
                        'sequence', and 'inventor_id'.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        CSV file where to save the resulting hand-
                        disambiguated membership vector.
  -d, --debug           Save debugging spreadsheet to
                        '<hand_disambiguation>.csv.debug.xlsx'. This
                        spreadsheet has two pages. The first shows inventor
                        mentions to remove that were not found in the
                        reference predicted clusters. The second shows the
                        name of inventors added to predicted clusters.