✍️ Creating Inventors Benchmark Datasets by Hand#

This notebook describes the practical procedure used at the American Institutes for Research to construct hand-disambiguated benchmark datasets of inventor mentions.

The procedure has three steps:

1. We sample inventor mentions uniformly at random.
2. For each sampled mention, we review its predicted cluster and identify the mentions that should be removed from it.
3. For each sampled mention, we review its predicted cluster and identify the mentions that should be added to it.

This provides a set of ground-truth clusters that have been sampled with probability proportional to their size, since a cluster containing more mentions is more likely to contain a sampled mention. Note that the procedure depends on a baseline disambiguation algorithm, typically the current PatentsView disambiguation. When no errors are found, the predicted cluster is assumed to be correct.
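The size-biased sampling described above can be illustrated with a small simulation. This is only a sketch: the toy clustering and the helper name `sample_clusters` are hypothetical, draws are made with replacement for simplicity, and none of this is part of pv_evaluation.

```python
import random
from collections import Counter

# Hypothetical toy clustering: mention ID -> predicted cluster ID.
# Cluster "a" has 4 mentions, "b" has 2, and "c" has 1.
membership = {
    "US1-0": "a", "US2-0": "a", "US3-0": "a", "US4-0": "a",
    "US5-0": "b", "US6-0": "b",
    "US7-0": "c",
}


def sample_clusters(membership, n_draws, seed=0):
    """Draw mentions uniformly at random and return the cluster of each draw."""
    rng = random.Random(seed)
    mentions = sorted(membership)
    return [membership[rng.choice(mentions)] for _ in range(n_draws)]


draws = sample_clusters(membership, n_draws=70_000)
frequencies = {cluster: count / len(draws) for cluster, count in Counter(draws).items()}

# Each cluster should appear with frequency close to (cluster size) / (number of mentions),
# that is, roughly 4/7, 2/7 and 1/7 for clusters "a", "b" and "c".
for cluster in sorted(frequencies):
    print(cluster, round(frequencies[cluster], 3))
```

Any estimator computed from the resulting benchmark should account for this size bias, because large clusters are over-represented relative to a uniform sample of clusters.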

To find mentions that should be removed in step (2), we use PatentsView.org, which provides a convenient interface for browsing inventor clusters. To find mentions that should be added in step (3), we use PatentsView.org’s search tools to review mentions of similarly named inventors.

Practical Implementation#

In practice, staff reviewing inventor clusters keep track of the mentions to be added and removed in an Excel spreadsheet. This spreadsheet contains one row for each sampled inventor mention, with columns for the patent number, the predicted inventor identifier, and the sampled inventor mention’s name. During the review, a column named “add” is appended to hold, for each row, a comma-separated list of inventor mentions to add to the predicted cluster. Similarly, a column named “remove” is appended to hold a comma-separated list of inventor mentions to remove from the predicted cluster.

Note that inventor mentions take the standard form “US<patent_number>-<sequence_number>”, where <patent_number> is the patent number of the inventor mention and <sequence_number> is the 0-indexed inventor sequence number.
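As a minimal sketch of this format, the helpers below build and parse mention identifiers. The function names and the regular expression are hypothetical and not taken from pv_evaluation. In particular, the pattern assumes that a patent number is an optional run of capital letters followed by digits, which is a guess made to accommodate reissue numbers such as RE46143.

```python
import re

# Assumed pattern: "US", then a patent number (optional letters + digits),
# then "-", then the 0-indexed inventor sequence number.
MENTION_PATTERN = re.compile(r"^US(?P<patent>[A-Z]*\d+)-(?P<sequence>\d+)$")


def make_mention_id(patent_id, sequence):
    """Build a mention ID of the form US<patent_number>-<sequence_number>."""
    return f"US{patent_id}-{int(sequence)}"


def parse_mention_id(mention_id):
    """Split a mention ID into (patent_number, sequence_number)."""
    match = MENTION_PATTERN.match(mention_id.strip())
    if match is None:
        raise ValueError(f"Not a valid inventor mention ID: {mention_id!r}")
    return match["patent"], int(match["sequence"])


print(make_mention_id(6267035, 3))
print(parse_mention_id("US8561841-0"))
```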

An example of a reviewed set of inventor mentions is shown below.

import pandas as pd

pd.read_excel("2022-07-25-Emma-patent-samples-part-2.xlsx").head(10)

	patent_id	inventor_id	name_first	name_last	sequence	add	remove	correct	notes	Unnamed: 9
0	6267035	fl:ca_ln:santizo-1	Carlos Gilberto	Santizo	3	NaN	NaN	yes	NaN	NaN
1	4690644	fl:ma_ln:flanders-2	Marguerita E.	Flanders	1	NaN	NaN	yes	NaN	NaN
2	10120759	fl:ar_ln:gv-1	Aravind	Gv	0	NaN	NaN	yes	NaN	NaN
3	5290082	fl:th_ln:mealey-1	Thomas P.	Mealey	2	NaN	NaN	yes	NaN	NaN
4	RE46143	fl:p._ln:erman-1	P. Gregory	Erman	0	US8561841-0, US7475795-3, US9296603-1,US9738507-1	NaN	NaN	NaN	NaN
5	10223669	fl:to_ln:geniesse-1	Tom	Geniesse	0	NaN	NaN	yes	NaN	NaN
6	6387460	fl:hi_ln:yoshizawa-11	Hideo	Yoshizawa	1	US6993267-5, US10895827-2	NaN	NaN	not sure about this one- seems like glass and ...	NaN
7	5928343	fl:ma_ln:horowitz-3	Mark	Horowitz	1	NaN	US7736282-0	NaN	a few with abnormal assignees	NaN
8	11009520	fl:st_ln:bowers-3	Stewart V.	Bowers, III	1	NaN	NaN	yes	NaN	NaN
9	5467579	fl:si_ln:boriani-1	Silvano	Boriani	0	NaN	NaN	yes	NaN	NaN
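The “add” and “remove” cells in the spreadsheet above are free-form text: they may be empty (read by pandas as NaN) and the spacing after commas is not always consistent. A small parsing helper along the following lines can turn a cell into a clean list of mention IDs. The name `parse_mention_list` is hypothetical, and this is a sketch rather than the parsing used by pv_evaluation.

```python
import math


def parse_mention_list(cell):
    """Split a comma-separated spreadsheet cell into a list of mention IDs.

    Empty cells (None or NaN) give an empty list, and whitespace around
    each item is ignored.
    """
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return []
    return [item.strip() for item in str(cell).split(",") if item.strip()]


# Cell contents copied from the "add" column of row 4 above.
print(parse_mention_list("US8561841-0, US7475795-3, US9296603-1,US9738507-1"))
# An empty cell, as pandas would represent it.
print(parse_mention_list(float("nan")))
```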

Validation#

Using a reference set of inventor mentions together with the predicted clustering (i.e., the “rawinventor.tsv” file from PatentsView’s bulk data downloads), we look for two kinds of errors: inventor mentions that do not exist in the data, and mentions listed for removal that are not part of the sampled mention’s predicted cluster. Additionally, we print out a sheet comparing the names of sampled inventors with the names of inventors added to their predicted clusters. This way, obvious errors in the review process can be flagged and corrected.
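The two checks described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in pv_evaluation: the function name `find_review_errors` is hypothetical, the reference clustering is represented as a plain dictionary from mention ID to predicted inventor ID, and the toy identifiers are made up.

```python
def find_review_errors(sampled_mention, add, remove, reference):
    """Flag suspicious entries in the "add" and "remove" lists of one reviewed row.

    reference maps every known mention ID to its predicted inventor ID.
    Returns (add_errors, remove_errors), where
    - add_errors lists added mentions that do not exist in the reference data;
    - remove_errors lists mentions to remove that are not in the sampled
      mention's predicted cluster (including mentions that do not exist at all).
    """
    cluster_id = reference[sampled_mention]
    add_errors = [m for m in add if m not in reference]
    remove_errors = [m for m in remove if reference.get(m) != cluster_id]
    return add_errors, remove_errors


# Hypothetical reference data: mention ID -> predicted inventor ID.
reference = {
    "US1000001-0": "inventor-a",
    "US1000002-1": "inventor-a",
    "US1000003-0": "inventor-b",
}

add_errors, remove_errors = find_review_errors(
    sampled_mention="US1000001-0",
    add=["US1000003-0", "US9999999-9"],     # the second mention does not exist
    remove=["US1000002-1", "US1000003-0"],  # the second is not in cluster "inventor-a"
    reference=reference,
)
print(add_errors)
print(remove_errors)
```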

This validation is done using the process-inventors-hand-disambiguation.py script provided by the pv_evaluation package, as follows. First, we install pv_evaluation and download the rawinventor.tsv file. Next, we run process-inventors-hand-disambiguation.py in debug mode to produce an Excel spreadsheet containing one page for review errors and one page for the comparison of sampled names with added inventor names.

%%bash

pip install -q git+https://github.com/PatentsView/PatentsView-Evaluation.git@release
wget -nc -q -nv https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip -O rawinventor.tsv.zip
unzip -n rawinventor.tsv.zip
process-inventors-hand-disambiguation.py --debug 2022-07-25-Emma-patent-samples-part-2.xlsx rawinventor.tsv

Archive:  rawinventor.tsv.zip

The debugging pages are shown below:

pd.read_excel("true_clusters.csv.debug.xlsx", sheet_name="Cluster Errors").drop(columns=["sequence", "Unnamed: 9", "correct", "notes"])

	patent_id	inventor_id	name_first	name_last	add	remove	remove_errors	add_errors
0	6267035	fl:ca_ln:santizo-1	Carlos Gilberto	Santizo	NaN	NaN	[]	[]
1	4690644	fl:ma_ln:flanders-2	Marguerita E.	Flanders	NaN	NaN	[]	[]
2	10120759	fl:ar_ln:gv-1	Aravind	Gv	NaN	NaN	[]	[]
3	5290082	fl:th_ln:mealey-1	Thomas P.	Mealey	NaN	NaN	[]	[]
4	RE46143	fl:p._ln:erman-1	P. Gregory	Erman	US8561841-0, US7475795-3, US9296603-1,US9738507-1	NaN	[]	[]
...	...	...	...	...	...	...	...	...
195	10455291	fl:jo_ln:bernstein-1	Joseph Harold	Bernstein	NaN	NaN	[]	[]
196	5057055	fl:mi_ln:presseau-1	Michel	Presseau	NaN	NaN	[]	[]
197	6810399	fl:an_ln:osborn-5	Andrew	Osborn	NaN	NaN	[]	[]
198	9730177	fl:st_ln:toth-4	Stefan Karl	Toth	NaN	NaN	[]	[]
199	4614424	fl:sh_ln:watanabe-201	Shunji	Watanabe	NaN	NaN	[]	[]

200 rows × 8 columns

pd.read_excel("true_clusters.csv.debug.xlsx", sheet_name="Validation of Added Mentions").drop(columns=["sequence", "inventor_id"])

	patent_id	name_first	name_last	added	name_first_added	name_last_added
0	RE46143	P. Gregory	Erman	US7475795-3	Gregory	Erman
1	RE46143	P. Gregory	Erman	US8561841-0	Gregory P.	Erman
2	RE46143	P. Gregory	Erman	US9296603-1	Paul Gregory	Erman
3	RE46143	P. Gregory	Erman	US9738507-1	Paul Gregory	Erman
4	6387460	Hideo	Yoshizawa	US10895827-2	Hideo	Yoshizawa
...	...	...	...	...	...	...
227	5792879	Thomas	Gessner	US8963898-4	Thomas	Gessner
228	5792879	Thomas	Gessner	US9005705-2	Thomas	Gessner
229	5792879	Thomas	Gessner	US9291285-2	Thomas	Gessner
230	10474574	Seung-Beom	Lee	US10275371-1	Seungbeom	Lee
231	9979878	Sapna A	Shroff	US11042034-2	Sapna	Shroff

232 rows × 6 columns
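With a few hundred added mentions, it can help to sort the name comparison by how different the two names are, so that the least similar pairs are reviewed first. The sketch below uses Python's standard `difflib.SequenceMatcher` for this. The helper names and the normalization rule are hypothetical and not part of pv_evaluation, and the normalization drops every character other than ASCII letters and spaces, so it would need adjusting for names with accented characters.

```python
import re
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase a name and drop everything except ASCII letters and spaces."""
    return re.sub(r"[^a-z ]", "", name.lower()).strip()


def name_similarity(name_a, name_b):
    """Return a similarity score in [0, 1]; 1 means identical after normalization."""
    return SequenceMatcher(None, normalize_name(name_a), normalize_name(name_b)).ratio()


# Name pairs taken from the comparison sheet above.
pairs = [
    ("P. Gregory Erman", "Paul Gregory Erman"),
    ("Seung-Beom Lee", "Seungbeom Lee"),
    ("Thomas Gessner", "Thomas Gessner"),
]

# Review the least similar pairs first.
for sampled, added in sorted(pairs, key=lambda pair: name_similarity(*pair)):
    print(f"{name_similarity(sampled, added):.2f}  {sampled!r} vs {added!r}")
```

A low score does not prove an error, and a high score does not prove a match; the score is only a way to prioritize manual review.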

Transformation into Benchmark Dataset#

Once errors in the review process have been corrected, the working Excel sheet can be transformed into a CSV file containing the hand-disambiguation results in the standard format of a membership vector. This is done by running process-inventors-hand-disambiguation.py as follows. Note that the name of the output file can be changed using the “--output” argument.

%%bash

process-inventors-hand-disambiguation.py "2022-07-25-Emma-patent-samples-part-2.xlsx" "rawinventor.tsv"

The result (saved by default to “true_clusters.csv”) is shown below:

pd.read_csv("true_clusters.csv")

	mention_id	inventor_id
0	US7152514-3	fl:ca_ln:santizo-1
1	US6564684-3	fl:ca_ln:santizo-1
2	US7832315-3	fl:ca_ln:santizo-1
3	US6267035-3	fl:ca_ln:santizo-1
4	US6708592-3	fl:ca_ln:santizo-1
...	...	...
6733	US5679889-0	fl:sh_ln:watanabe-201
6734	US7169506-0	fl:sh_ln:watanabe-201
6735	US6459564-0	fl:sh_ln:watanabe-201
6736	US7749649-0	fl:sh_ln:watanabe-201
6737	US8553392-3	fl:sh_ln:watanabe-201

6738 rows × 2 columns
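Conceptually, each hand-disambiguated cluster is the predicted cluster of the sampled mention, plus the mentions listed in “add”, minus the mentions listed in “remove”. The following sketch shows that set arithmetic on made-up data. It assumes that every mention of the corrected cluster is labeled with the sampled mention's predicted inventor ID, which is consistent with the output above; the function name `build_true_cluster` and the toy identifiers are hypothetical, and the actual script may differ in its details.

```python
def build_true_cluster(sampled_mention, add, remove, reference):
    """Return the corrected cluster of a sampled mention as {mention_id: inventor_id}.

    reference maps every known mention ID to its predicted inventor ID.
    """
    cluster_id = reference[sampled_mention]
    predicted = {m for m, c in reference.items() if c == cluster_id}
    true_cluster = (predicted | set(add)) - set(remove)
    return {mention: cluster_id for mention in sorted(true_cluster)}


# Hypothetical reference data: mention ID -> predicted inventor ID.
reference = {
    "US1000001-0": "inventor-a",
    "US1000002-1": "inventor-a",
    "US1000003-0": "inventor-a",
    "US1000004-2": "inventor-b",
}

membership = build_true_cluster(
    sampled_mention="US1000001-0",
    add=["US1000004-2"],     # wrongly split off by the baseline algorithm
    remove=["US1000003-0"],  # wrongly merged by the baseline algorithm
    reference=reference,
)
for mention_id, inventor_id in membership.items():
    print(mention_id, inventor_id)
```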

More Information#

For more information, please refer to the help file of process-inventors-hand-disambiguation.py:

%%bash

process-inventors-hand-disambiguation.py --help

usage: process-inventors-hand-disambiguation.py [-h] [-o OUTPUT] [-d]
                                                hand_disambiguation
                                                rawinventor

Process inventors hand-disambiguation files: validate data and produce
benchmark dataset.

positional arguments:
  hand_disambiguation   Excel spreadsheet with sampled inventor mentions, the
                        corresponding predicted cluster, and lists of inventor
                        mentions to add to and remove from the predicted
                        clusters. This spreadsheet should contain the columns
                        'patent_id', 'sequence', 'inventor_id', 'add', and
                        'remove'. The 'add' and 'remove' columns should
                        contain comma-separated inventor mentions in the
                        format US<patent_number>-<sequence_number>.
  rawinventor           File with reference inventor mentions and predicted
                        clusters. It should contain the columns 'patent_id',
                        'sequence', and 'inventor_id'.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        CSV file where to save the resulting hand-
                        disambiguated membership vector.
  -d, --debug           Save debugging spreadsheet to
                        '<hand_disambiguation>.csv.debug.xlsx'. This
                        spreadsheet has two pages. The first shows inventor
                        mentions to remove that were not found in the
                        reference predicted clusters. The second shows the
                        name of inventors added to predicted clusters.