✍️ Creating Inventors Benchmark Datasets by Hand#
This notebook describes the practical procedure used at the American Institutes for Research to construct hand-disambiguated benchmark datasets of inventor mentions.
The procedure has three steps:
1. We sample inventor mentions uniformly at random.
2. For each sampled mention, given its associated predicted cluster, we identify mentions that should be removed from the predicted cluster.
3. For each sampled mention, given its associated predicted cluster, we identify mentions that should be added to the predicted cluster.
This provides a set of ground-truth clusters that have been sampled with probability proportional to their size. Note that the procedure depends on a baseline disambiguation algorithm, typically the current PatentsView disambiguation. When no errors are found, the predicted cluster is assumed to be correct.
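To see why uniform sampling of mentions selects clusters with probability proportional to their size, consider the following minimal sketch. The toy `membership` dictionary and both helper functions are hypothetical and only illustrate the idea; they are not part of pv_evaluation.

```python
import random
from collections import Counter

# Hypothetical toy data: mention ID -> predicted cluster ID.
membership = {
    "US1-0": "a",
    "US2-0": "a",
    "US3-1": "a",
    "US4-0": "b",
}


def sample_mentions(membership, n, seed=0):
    """Sample n distinct mentions uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(sorted(membership), n)


def cluster_selection_probabilities(membership):
    """Probability that one uniformly drawn mention falls in each cluster."""
    sizes = Counter(membership.values())
    total = sum(sizes.values())
    return {cluster: size / total for cluster, size in sizes.items()}


sampled = sample_mentions(membership, n=2)
probs = cluster_selection_probabilities(membership)
print(sampled)
print(probs)
```

In this toy example, cluster "a" holds three of the four mentions, so a single uniform draw lands in it with probability 3/4.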
To find mentions that should be removed in step (2), we use PatentsView.org, which provides a convenient interface for browsing inventor clusters. To find mentions that should be added in step (3), we use PatentsView.org’s search tools to review mentions of similarly named inventors.
Practical Implementation#
In practice, staff reviewing inventor clusters keep track of the mentions to add and to remove in an Excel spreadsheet. This spreadsheet contains one row for each sampled inventor mention, with columns for the patent number, the predicted inventor identifier, and the sampled inventor mention’s name. As part of the review process, a column named “add” is appended to hold, for each row, a comma-separated list of inventor mentions to add to the predicted cluster. A column named “remove” is appended to hold, for each row, a comma-separated list of inventor mentions to remove from the predicted cluster.
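Note that inventor mentions take the standard form “US<patent_number>-<sequence_number>”, where <patent_number> is the patent number of the inventor mention and <sequence_number> is the inventor’s sequence number on that patent (the “sequence” column of the spreadsheet). For example, the first row of the table below corresponds to the mention “US6267035-3”.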
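The mention format and the comma-separated “add” and “remove” cells can be handled with a couple of small helpers. This is a sketch only: `mention_id` and `parse_mention_list` are hypothetical names, not functions from pv_evaluation, and reissue patents such as RE46143 are deliberately left out because their mention format is not described here.

```python
def mention_id(patent_id, sequence):
    """Build a mention ID of the form US<patent_number>-<sequence_number>."""
    return f"US{patent_id}-{sequence}"


def parse_mention_list(cell):
    """Split a comma-separated 'add' or 'remove' cell into mention IDs.

    Cells that are not strings (for example NaN for an empty cell) and
    blank strings yield an empty list.
    """
    if not isinstance(cell, str):
        return []
    # Strip whitespace so "A, B" and "A,B" are parsed the same way.
    return [item.strip() for item in cell.split(",") if item.strip()]


print(mention_id(6267035, 3))
print(parse_mention_list("US8561841-0, US7475795-3, US9296603-1,US9738507-1"))
```

Stripping whitespace matters in practice, since reviewers do not always type a space after each comma.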
An example of a reviewed set of inventor mentions is shown below.
import pandas as pd
pd.read_excel("2022-07-25-Emma-patent-samples-part-2.xlsx").head(10)
| | patent_id | inventor_id | name_first | name_last | sequence | add | remove | correct | notes | Unnamed: 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 6267035 | fl:ca_ln:santizo-1 | Carlos Gilberto | Santizo | 3 | NaN | NaN | yes | NaN | NaN |
1 | 4690644 | fl:ma_ln:flanders-2 | Marguerita E. | Flanders | 1 | NaN | NaN | yes | NaN | NaN |
2 | 10120759 | fl:ar_ln:gv-1 | Aravind | Gv | 0 | NaN | NaN | yes | NaN | NaN |
3 | 5290082 | fl:th_ln:mealey-1 | Thomas P. | Mealey | 2 | NaN | NaN | yes | NaN | NaN |
4 | RE46143 | fl:p._ln:erman-1 | P. Gregory | Erman | 0 | US8561841-0, US7475795-3, US9296603-1,US9738507-1 | NaN | NaN | NaN | NaN |
5 | 10223669 | fl:to_ln:geniesse-1 | Tom | Geniesse | 0 | NaN | NaN | yes | NaN | NaN |
6 | 6387460 | fl:hi_ln:yoshizawa-11 | Hideo | Yoshizawa | 1 | US6993267-5, US10895827-2 | NaN | NaN | not sure about this one- seems like glass and ... | NaN |
7 | 5928343 | fl:ma_ln:horowitz-3 | Mark | Horowitz | 1 | NaN | US7736282-0 | NaN | a few with abnormal assignees | NaN |
8 | 11009520 | fl:st_ln:bowers-3 | Stewart V. | Bowers, III | 1 | NaN | NaN | yes | NaN | NaN |
9 | 5467579 | fl:si_ln:boriani-1 | Silvano | Boriani | 0 | NaN | NaN | yes | NaN | NaN |
Validation#
Using a reference set of inventor mentions together with the predicted clustering (i.e., the “rawinventor.tsv” file from PatentsView’s bulk data downloads), we check for two kinds of review errors: inventor mentions that do not exist in the data, and mentions marked for removal that are not actually part of the sampled mention’s cluster. Additionally, we print out a sheet that compares the name of each sampled inventor with the names of the inventors added to its predicted cluster. This way, obvious errors in the review process can be flagged and corrected.
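The two checks described above can be sketched as follows. This is an illustration under stated assumptions, not the logic of the actual script: `check_row` is a hypothetical helper, and the `reference` dictionary stands in for the mention-to-cluster mapping that would be derived from rawinventor.tsv.

```python
def check_row(sampled_cluster, add, remove, reference):
    """Return (add_errors, remove_errors) for one reviewed row.

    sampled_cluster: predicted cluster ID of the sampled mention.
    add, remove: lists of mention IDs entered by the reviewer.
    reference: dict mapping each known mention ID to its predicted cluster ID.
    """
    # Mentions to add must exist in the reference data.
    add_errors = [m for m in add if m not in reference]
    # Mentions to remove must exist and belong to the sampled mention's cluster.
    remove_errors = [m for m in remove if reference.get(m) != sampled_cluster]
    return add_errors, remove_errors


# Hypothetical reference data, for illustration only.
reference = {"US1-0": "a", "US2-0": "a", "US3-1": "b"}

add_errors, remove_errors = check_row(
    "a",
    add=["US3-1", "US9-0"],
    remove=["US2-0", "US3-1"],
    reference=reference,
)
print(add_errors, remove_errors)
```

Here "US9-0" is flagged because it does not exist in the reference data, and "US3-1" is flagged as a removal error because it belongs to cluster "b" rather than to the sampled cluster "a".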
This validation is done using the process-inventors-hand-disambiguation.py script provided by the pv_evaluation package, as follows. First, we install pv_evaluation and download the rawinventor.tsv file. Next, we run process-inventors-hand-disambiguation.py in debug mode to produce an Excel spreadsheet containing one sheet for review errors and one sheet for the comparison of sampled names with added inventor names.
%%bash
pip install -q git+https://github.com/PatentsView/PatentsView-Evaluation.git@release
wget -nc -q -nv https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip -O rawinventor.tsv.zip
unzip -n rawinventor.tsv.zip
process-inventors-hand-disambiguation.py --debug 2022-07-25-Emma-patent-samples-part-2.xlsx rawinventor.tsv
Archive: rawinventor.tsv.zip
The debugging pages are shown below:
pd.read_excel("true_clusters.csv.debug.xlsx", sheet_name="Cluster Errors").drop(columns=["sequence", "Unnamed: 9", "correct", "notes"])
| | patent_id | inventor_id | name_first | name_last | add | remove | remove_errors | add_errors |
---|---|---|---|---|---|---|---|---|
0 | 6267035 | fl:ca_ln:santizo-1 | Carlos Gilberto | Santizo | NaN | NaN | [] | [] |
1 | 4690644 | fl:ma_ln:flanders-2 | Marguerita E. | Flanders | NaN | NaN | [] | [] |
2 | 10120759 | fl:ar_ln:gv-1 | Aravind | Gv | NaN | NaN | [] | [] |
3 | 5290082 | fl:th_ln:mealey-1 | Thomas P. | Mealey | NaN | NaN | [] | [] |
4 | RE46143 | fl:p._ln:erman-1 | P. Gregory | Erman | US8561841-0, US7475795-3, US9296603-1,US9738507-1 | NaN | [] | [] |
... | ... | ... | ... | ... | ... | ... | ... | ... |
195 | 10455291 | fl:jo_ln:bernstein-1 | Joseph Harold | Bernstein | NaN | NaN | [] | [] |
196 | 5057055 | fl:mi_ln:presseau-1 | Michel | Presseau | NaN | NaN | [] | [] |
197 | 6810399 | fl:an_ln:osborn-5 | Andrew | Osborn | NaN | NaN | [] | [] |
198 | 9730177 | fl:st_ln:toth-4 | Stefan Karl | Toth | NaN | NaN | [] | [] |
199 | 4614424 | fl:sh_ln:watanabe-201 | Shunji | Watanabe | NaN | NaN | [] | [] |
200 rows × 8 columns
pd.read_excel("true_clusters.csv.debug.xlsx", sheet_name="Validation of Added Mentions").drop(columns=["sequence", "inventor_id"])
| | patent_id | name_first | name_last | added | name_first_added | name_last_added |
---|---|---|---|---|---|---|
0 | RE46143 | P. Gregory | Erman | US7475795-3 | Gregory | Erman |
1 | RE46143 | P. Gregory | Erman | US8561841-0 | Gregory P. | Erman |
2 | RE46143 | P. Gregory | Erman | US9296603-1 | Paul Gregory | Erman |
3 | RE46143 | P. Gregory | Erman | US9738507-1 | Paul Gregory | Erman |
4 | 6387460 | Hideo | Yoshizawa | US10895827-2 | Hideo | Yoshizawa |
... | ... | ... | ... | ... | ... | ... |
227 | 5792879 | Thomas | Gessner | US8963898-4 | Thomas | Gessner |
228 | 5792879 | Thomas | Gessner | US9005705-2 | Thomas | Gessner |
229 | 5792879 | Thomas | Gessner | US9291285-2 | Thomas | Gessner |
230 | 10474574 | Seung-Beom | Lee | US10275371-1 | Seungbeom | Lee |
231 | 9979878 | Sapna A | Shroff | US11042034-2 | Sapna | Shroff |
232 rows × 6 columns
Transformation into Benchmark Dataset#
Once errors in the review process have been corrected, the working Excel sheet can be transformed into a CSV file containing the hand-disambiguation results in the standard format of a membership vector. This is done by running process-inventors-hand-disambiguation.py as follows. Note that the name of the output file can be changed using the “--output” argument.
%%bash
process-inventors-hand-disambiguation.py "2022-07-25-Emma-patent-samples-part-2.xlsx" "rawinventor.tsv"
The result (saved by default to “true_clusters.csv”) is shown below:
pd.read_csv("true_clusters.csv")
| | mention_id | inventor_id |
---|---|---|
0 | US7152514-3 | fl:ca_ln:santizo-1 |
1 | US6564684-3 | fl:ca_ln:santizo-1 |
2 | US7832315-3 | fl:ca_ln:santizo-1 |
3 | US6267035-3 | fl:ca_ln:santizo-1 |
4 | US6708592-3 | fl:ca_ln:santizo-1 |
... | ... | ... |
6733 | US5679889-0 | fl:sh_ln:watanabe-201 |
6734 | US7169506-0 | fl:sh_ln:watanabe-201 |
6735 | US6459564-0 | fl:sh_ln:watanabe-201 |
6736 | US7749649-0 | fl:sh_ln:watanabe-201 |
6737 | US8553392-3 | fl:sh_ln:watanabe-201 |
6738 rows × 2 columns
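Conceptually, each ground-truth cluster is the predicted cluster with the “remove” mentions taken out and the “add” mentions put in. The sketch below illustrates that set arithmetic and the resulting membership rows. The function names and the example identifiers are hypothetical, and labeling each corrected cluster with the predicted inventor identifier is an assumption based on the output shown above, not a statement about how the script is implemented.

```python
def true_cluster(predicted_members, add, remove):
    """Corrected cluster: predicted members, minus removals, plus additions."""
    return (set(predicted_members) - set(remove)) | set(add)


def membership_rows(cluster_id, members):
    """One (mention_id, inventor_id) pair per member, sorted by mention ID."""
    return [(m, cluster_id) for m in sorted(members)]


# Hypothetical predicted cluster and reviewer corrections.
predicted = {"US1-0", "US2-0", "US3-1"}
members = true_cluster(predicted, add=["US4-2"], remove=["US3-1"])
rows = membership_rows("fl:xx_ln:example-1", members)

for mention, inventor in rows:
    print(mention, inventor)
```

When both “add” and “remove” are empty, the function returns the predicted cluster unchanged, which matches the assumption that clusters without flagged errors are correct.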
More Information#
For more information, please refer to the help message of process-inventors-hand-disambiguation.py:
%%bash
process-inventors-hand-disambiguation.py --help
usage: process-inventors-hand-disambiguation.py [-h] [-o OUTPUT] [-d]
hand_disambiguation
rawinventor
Process inventors hand-disambiguation files: validate data and produce
benchmark dataset.
positional arguments:
hand_disambiguation Excel spreadsheet with sampled inventor mentions, the
corresponding predicted cluster, and lists of inventor
mentions to add to and remove from the predicted
clusters. This spreadsheet should contain the columns
'patent_id', 'sequence', 'inventor_id', 'add', and
'remove'. The 'add' and 'remove' columns should
contain comma-separated inventor mentions in the
format US<patent_number>-<sequence_number>.
rawinventor File with reference inventor mentions and predicted
clusters. It should contain the columns 'patent_id',
'sequence', and 'inventor_id'.
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
CSV file where to save the resulting hand-
disambiguated membership vector.
-d, --debug Save debugging spreadsheet to
'<hand_disambiguation>.csv.debug.xlsx'. This
spreadsheet has two pages. The first shows inventor
mentions to remove that were not found in the
reference predicted clusters. The second shows the
name of inventors added to predicted clusters.