{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ✍️ Creating Inventors Benchmark Datasets by Hand\n", "\n", "This notebook describes the practical procedure used at the American Institutes for Research to construct hand-disambiguated benchmark datasets of inventor mentions.\n", "\n", "The procedure has three steps:\n", "1. We sample inventor mentions uniformly at random.\n", "2. For each sampled mention and its associated predicted cluster, we identify mentions that should be **removed** from the predicted cluster.\n", "3. For each sampled mention and its associated predicted cluster, we identify mentions that should be **added** to the predicted cluster.\n", "\n", "This provides a set of ground truth clusters that have been sampled with probability proportional to their size, since a cluster is selected whenever any one of its mentions is sampled. Note that the procedure depends on a baseline disambiguation algorithm, typically taken to be the current PatentsView disambiguation. In cases where no errors are found, predicted clusters are assumed to be correct.\n", "\n", "To find mentions that should be removed in step (2), we use PatentsView.org, which provides a convenient interface to browse inventor clusters. To find mentions that should be added in step (3), we use PatentsView.org's search tools to review mentions of similarly named inventors.\n", "\n", "## Practical Implementation\n", "\n", "From a practical standpoint, staff reviewing inventor clusters keep track of mentions to be added and to be removed in an Excel spreadsheet. This spreadsheet contains one row for each sampled inventor mention, as well as columns for the patent number, the predicted inventor identifier, and the sampled inventor mention's name. As part of the review process, a column named \"add\" is appended to contain comma-separated lists of inventor mentions to add to each row. 
Similarly, a column named \"remove\" is appended to contain comma-separated lists of inventor mentions to remove from each row.\n", "\n", "Note that inventor mentions take the standard form `US<patent_id>-<sequence>`, where `<patent_id>` is the patent number of the inventor mention and `<sequence>` is the 0-indexed inventor sequence number (for example, `US8561841-0`).\n", "\n", "An example of a reviewed set of inventor mentions is shown below." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>patent_id</th><th>inventor_id</th><th>name_first</th><th>name_last</th><th>sequence</th><th>add</th><th>remove</th><th>correct</th><th>notes</th><th>Unnamed: 9</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>6267035</td><td>fl:ca_ln:santizo-1</td><td>Carlos Gilberto</td><td>Santizo</td><td>3</td><td>NaN</td><td>NaN</td><td>yes</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>1</th><td>4690644</td><td>fl:ma_ln:flanders-2</td><td>Marguerita E.</td><td>Flanders</td><td>1</td><td>NaN</td><td>NaN</td><td>yes</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>2</th><td>10120759</td><td>fl:ar_ln:gv-1</td><td>Aravind</td><td>Gv</td><td>0</td><td>NaN</td><td>NaN</td><td>yes</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>5290082</td><td>fl:th_ln:mealey-1</td><td>Thomas P.</td><td>Mealey</td><td>2</td><td>NaN</td><td>NaN</td><td>yes</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>4</th><td>RE46143</td><td>fl:p._ln:erman-1</td><td>P. Gregory</td><td>Erman</td><td>0</td><td>US8561841-0, US7475795-3, US9296603-1,US9738507-1</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>5</th><td>10223669</td><td>fl:to_ln:geniesse-1</td><td>Tom</td><td>Geniesse</td><td>0</td><td>NaN</td><td>NaN</td><td>yes</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>6</th><td>6387460</td><td>fl:hi_ln:yoshizawa-11</td><td>Hideo</td><td>Yoshizawa</td><td>1</td><td>US6993267-5, US10895827-2</td><td>NaN</td><td>NaN</td><td>not sure about this one- seems like glass and ...</td><td>NaN</td></tr>\n",
"    <tr><th>7</th><td>5928343</td><td>fl:ma_ln:horowitz-3</td><td>Mark</td><td>Horowitz</td><td>1</td><td>NaN</td><td>US7736282-0</td><td>NaN</td><td>a few with abnormal assignees</td><td>NaN</td></tr>\n",
"    <tr><th>8</th><td>11009520</td><td>fl:st_ln:bowers-3</td><td>Stewart V.</td><td>Bowers, III</td><td>1</td><td>NaN</td><td>NaN</td><td>yes</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>9</th><td>5467579</td><td>fl:si_ln:boriani-1</td><td>Silvano</td><td>Boriani</td><td>0</td><td>NaN</td><td>NaN</td><td>yes</td><td>NaN</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " patent_id inventor_id name_first name_last sequence \\\n", "0 6267035 fl:ca_ln:santizo-1 Carlos Gilberto Santizo 3 \n", "1 4690644 fl:ma_ln:flanders-2 Marguerita E. Flanders 1 \n", "2 10120759 fl:ar_ln:gv-1 Aravind Gv 0 \n", "3 5290082 fl:th_ln:mealey-1 Thomas P. Mealey 2 \n", "4 RE46143 fl:p._ln:erman-1 P. Gregory Erman 0 \n", "5 10223669 fl:to_ln:geniesse-1 Tom Geniesse 0 \n", "6 6387460 fl:hi_ln:yoshizawa-11 Hideo Yoshizawa 1 \n", "7 5928343 fl:ma_ln:horowitz-3 Mark Horowitz 1 \n", "8 11009520 fl:st_ln:bowers-3 Stewart V. Bowers, III 1 \n", "9 5467579 fl:si_ln:boriani-1 Silvano Boriani 0 \n", "\n", " add remove correct \\\n", "0 NaN NaN yes \n", "1 NaN NaN yes \n", "2 NaN NaN yes \n", "3 NaN NaN yes \n", "4 US8561841-0, US7475795-3, US9296603-1,US9738507-1 NaN NaN \n", "5 NaN NaN yes \n", "6 US6993267-5, US10895827-2 NaN NaN \n", "7 NaN US7736282-0 NaN \n", "8 NaN NaN yes \n", "9 NaN NaN yes \n", "\n", " notes Unnamed: 9 \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "5 NaN NaN \n", "6 not sure about this one- seems like glass and ... NaN \n", "7 a few with abnormal assignees NaN \n", "8 NaN NaN \n", "9 NaN NaN " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "pd.read_excel(\"2022-07-25-Emma-patent-samples-part-2.xlsx\").head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validation\n", "\n", "Using a reference set of inventor mentions together with the predicted clustering (i.e., the \"rawinventor.tsv\" file from [PatentsView's bulk data downloads](https://patentsview.org/download/data-download-tables)), we look for inventor mentions that do not exist in the data and for mentions listed to be removed but that are not part of the sampled mention's cluster. 
Additionally, we print out a sheet comparing the names of sampled inventors with the names of inventors **added** to predicted clusters. This way, obvious errors in the review process can be flagged and corrected.\n", "\n", "This validation is performed with the `process-inventors-hand-disambiguation.py` script provided by the **pv_evaluation** package, as follows. First, we install **pv_evaluation** and download the rawinventor.tsv file. Next, we run `process-inventors-hand-disambiguation.py` in debug mode to produce an Excel spreadsheet containing one page for review errors and one page for the comparison of sampled names with added inventor names." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Archive: rawinventor.tsv.zip\n" ] } ], "source": [ "%%bash\n", "\n", "# Install the pv_evaluation package, which provides the processing script.\n", "pip install -q git+https://github.com/PatentsView/PatentsView-Evaluation.git@release\n", "# Download and extract the reference inventor mentions (skipped if already present).\n", "wget -nc -q -nv https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip -O rawinventor.tsv.zip\n", "unzip -n rawinventor.tsv.zip\n", "# Validate the reviewed spreadsheet and write the debugging spreadsheet.\n", "process-inventors-hand-disambiguation.py --debug 2022-07-25-Emma-patent-samples-part-2.xlsx rawinventor.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The debugging pages are shown below:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>patent_id</th><th>inventor_id</th><th>name_first</th><th>name_last</th><th>add</th><th>remove</th><th>remove_errors</th><th>add_errors</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>6267035</td><td>fl:ca_ln:santizo-1</td><td>Carlos Gilberto</td><td>Santizo</td><td>NaN</td><td>NaN</td><td>[]</td><td>[]</td></tr>\n",
"    <tr><th>1</th><td>4690644</td><td>fl:ma_ln:flanders-2</td><td>Marguerita E.</td><td>Flanders</td><td>NaN</td><td>NaN</td><td>[]</td><td>[]</td></tr>\n",
"    <tr><th>2</th><td>10120759</td><td>fl:ar_ln:gv-1</td><td>Aravind</td><td>Gv</td><td>NaN</td><td>NaN</td><td>[]</td><td>[]</td></tr>\n",
"    <tr><th>3</th><td>5290082</td><td>fl:th_ln:mealey-1</td><td>Thomas P.</td><td>Mealey</td><td>NaN</td><td>NaN</td><td>[]</td><td>[]</td></tr>\n",
"    <tr><th>4</th><td>RE46143</td><td>fl:p._ln:erman-1</td><td>P. Gregory</td><td>Erman</td><td>US8561841-0, US7475795-3, US9296603-1,US9738507-1</td><td>NaN</td><td>[]</td><td>[]</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>195</th><td>10455291</td><td>fl:jo_ln:bernstein-1</td><td>Joseph Harold</td><td>Bernstein</td><td>NaN</td><td>NaN</td><td>[]</td><td>[]</td></tr>\n",
"    <tr><th>196</th><td>5057055</td><td>fl:mi_ln:presseau-1</td><td>Michel</td><td>Presseau</td><td>NaN</td><td>NaN</td><td>[]</td><td>[]</td></tr>\n",
"    <tr><th>197</th><td>6810399</td><td>fl:an_ln:osborn-5</td><td>Andrew</td><td>Osborn</td><td>NaN</td><td>NaN</td><td>[]</td><td>[]</td></tr>\n",
"    <tr><th>198</th><td>9730177</td><td>fl:st_ln:toth-4</td><td>Stefan Karl</td><td>Toth</td><td>NaN</td><td>NaN</td><td>[]</td><td>[]</td></tr>\n",
"    <tr><th>199</th><td>4614424</td><td>fl:sh_ln:watanabe-201</td><td>Shunji</td><td>Watanabe</td><td>NaN</td><td>NaN</td><td>[]</td><td>[]</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>200 rows × 8 columns</p>\n",
"</div>
" ], "text/plain": [ " patent_id inventor_id name_first name_last \\\n", "0 6267035 fl:ca_ln:santizo-1 Carlos Gilberto Santizo \n", "1 4690644 fl:ma_ln:flanders-2 Marguerita E. Flanders \n", "2 10120759 fl:ar_ln:gv-1 Aravind Gv \n", "3 5290082 fl:th_ln:mealey-1 Thomas P. Mealey \n", "4 RE46143 fl:p._ln:erman-1 P. Gregory Erman \n", ".. ... ... ... ... \n", "195 10455291 fl:jo_ln:bernstein-1 Joseph Harold Bernstein \n", "196 5057055 fl:mi_ln:presseau-1 Michel Presseau \n", "197 6810399 fl:an_ln:osborn-5 Andrew Osborn \n", "198 9730177 fl:st_ln:toth-4 Stefan Karl Toth \n", "199 4614424 fl:sh_ln:watanabe-201 Shunji Watanabe \n", "\n", " add remove remove_errors \\\n", "0 NaN NaN [] \n", "1 NaN NaN [] \n", "2 NaN NaN [] \n", "3 NaN NaN [] \n", "4 US8561841-0, US7475795-3, US9296603-1,US9738507-1 NaN [] \n", ".. ... ... ... \n", "195 NaN NaN [] \n", "196 NaN NaN [] \n", "197 NaN NaN [] \n", "198 NaN NaN [] \n", "199 NaN NaN [] \n", "\n", " add_errors \n", "0 [] \n", "1 [] \n", "2 [] \n", "3 [] \n", "4 [] \n", ".. ... \n", "195 [] \n", "196 [] \n", "197 [] \n", "198 [] \n", "199 [] \n", "\n", "[200 rows x 8 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_excel(\"true_clusters.csv.debug.xlsx\", sheet_name=\"Cluster Errors\").drop(columns=[\"sequence\", \"Unnamed: 9\", \"correct\", \"notes\"])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>patent_id</th><th>name_first</th><th>name_last</th><th>added</th><th>name_first_added</th><th>name_last_added</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>RE46143</td><td>P. Gregory</td><td>Erman</td><td>US7475795-3</td><td>Gregory</td><td>Erman</td></tr>\n",
"    <tr><th>1</th><td>RE46143</td><td>P. Gregory</td><td>Erman</td><td>US8561841-0</td><td>Gregory P.</td><td>Erman</td></tr>\n",
"    <tr><th>2</th><td>RE46143</td><td>P. Gregory</td><td>Erman</td><td>US9296603-1</td><td>Paul Gregory</td><td>Erman</td></tr>\n",
"    <tr><th>3</th><td>RE46143</td><td>P. Gregory</td><td>Erman</td><td>US9738507-1</td><td>Paul Gregory</td><td>Erman</td></tr>\n",
"    <tr><th>4</th><td>6387460</td><td>Hideo</td><td>Yoshizawa</td><td>US10895827-2</td><td>Hideo</td><td>Yoshizawa</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>227</th><td>5792879</td><td>Thomas</td><td>Gessner</td><td>US8963898-4</td><td>Thomas</td><td>Gessner</td></tr>\n",
"    <tr><th>228</th><td>5792879</td><td>Thomas</td><td>Gessner</td><td>US9005705-2</td><td>Thomas</td><td>Gessner</td></tr>\n",
"    <tr><th>229</th><td>5792879</td><td>Thomas</td><td>Gessner</td><td>US9291285-2</td><td>Thomas</td><td>Gessner</td></tr>\n",
"    <tr><th>230</th><td>10474574</td><td>Seung-Beom</td><td>Lee</td><td>US10275371-1</td><td>Seungbeom</td><td>Lee</td></tr>\n",
"    <tr><th>231</th><td>9979878</td><td>Sapna A</td><td>Shroff</td><td>US11042034-2</td><td>Sapna</td><td>Shroff</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>232 rows × 6 columns</p>\n",
"</div>
" ], "text/plain": [ " patent_id name_first name_last added name_first_added \\\n", "0 RE46143 P. Gregory Erman US7475795-3 Gregory \n", "1 RE46143 P. Gregory Erman US8561841-0 Gregory P. \n", "2 RE46143 P. Gregory Erman US9296603-1 Paul Gregory \n", "3 RE46143 P. Gregory Erman US9738507-1 Paul Gregory \n", "4 6387460 Hideo Yoshizawa US10895827-2 Hideo \n", ".. ... ... ... ... ... \n", "227 5792879 Thomas Gessner US8963898-4 Thomas \n", "228 5792879 Thomas Gessner US9005705-2 Thomas \n", "229 5792879 Thomas Gessner US9291285-2 Thomas \n", "230 10474574 Seung-Beom Lee US10275371-1 Seungbeom \n", "231 9979878 Sapna A Shroff US11042034-2 Sapna \n", "\n", " name_last_added \n", "0 Erman \n", "1 Erman \n", "2 Erman \n", "3 Erman \n", "4 Yoshizawa \n", ".. ... \n", "227 Gessner \n", "228 Gessner \n", "229 Gessner \n", "230 Lee \n", "231 Shroff \n", "\n", "[232 rows x 6 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_excel(\"true_clusters.csv.debug.xlsx\", sheet_name=\"Validation of Added Mentions\").drop(columns=[\"sequence\", \"inventor_id\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transformation into Benchmark Dataset\n", "\n", "Once errors in the review process have been corrected, the working Excel sheet can be converted to a CSV file containing the hand-disambiguation results in the standard format of a membership vector, that is, a table mapping each mention identifier to its inventor cluster identifier. This is done by running `process-inventors-hand-disambiguation.py` as follows. Note that the name of the output file can be changed using the \"--output\" argument."
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "process-inventors-hand-disambiguation.py \"2022-07-25-Emma-patent-samples-part-2.xlsx\" \"rawinventor.tsv\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result (saved by default to \"true_clusters.csv\") is shown below:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>mention_id</th><th>inventor_id</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>US7152514-3</td><td>fl:ca_ln:santizo-1</td></tr>\n",
"    <tr><th>1</th><td>US6564684-3</td><td>fl:ca_ln:santizo-1</td></tr>\n",
"    <tr><th>2</th><td>US7832315-3</td><td>fl:ca_ln:santizo-1</td></tr>\n",
"    <tr><th>3</th><td>US6267035-3</td><td>fl:ca_ln:santizo-1</td></tr>\n",
"    <tr><th>4</th><td>US6708592-3</td><td>fl:ca_ln:santizo-1</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td></tr>\n",
"    <tr><th>6733</th><td>US5679889-0</td><td>fl:sh_ln:watanabe-201</td></tr>\n",
"    <tr><th>6734</th><td>US7169506-0</td><td>fl:sh_ln:watanabe-201</td></tr>\n",
"    <tr><th>6735</th><td>US6459564-0</td><td>fl:sh_ln:watanabe-201</td></tr>\n",
"    <tr><th>6736</th><td>US7749649-0</td><td>fl:sh_ln:watanabe-201</td></tr>\n",
"    <tr><th>6737</th><td>US8553392-3</td><td>fl:sh_ln:watanabe-201</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>6738 rows × 2 columns</p>\n",
"</div>
" ], "text/plain": [ " mention_id inventor_id\n", "0 US7152514-3 fl:ca_ln:santizo-1\n", "1 US6564684-3 fl:ca_ln:santizo-1\n", "2 US7832315-3 fl:ca_ln:santizo-1\n", "3 US6267035-3 fl:ca_ln:santizo-1\n", "4 US6708592-3 fl:ca_ln:santizo-1\n", "... ... ...\n", "6733 US5679889-0 fl:sh_ln:watanabe-201\n", "6734 US7169506-0 fl:sh_ln:watanabe-201\n", "6735 US6459564-0 fl:sh_ln:watanabe-201\n", "6736 US7749649-0 fl:sh_ln:watanabe-201\n", "6737 US8553392-3 fl:sh_ln:watanabe-201\n", "\n", "[6738 rows x 2 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"true_clusters.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More Information\n", "\n", "For more information, please refer to the help file of `process-inventors-hand-disambiguation.py`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: process-inventors-hand-disambiguation.py [-h] [-o OUTPUT] [-d]\n", " hand_disambiguation\n", " rawinventor\n", "\n", "Process inventors hand-disambiguation files: validate data and produce\n", "benchmark dataset.\n", "\n", "positional arguments:\n", " hand_disambiguation Excel spreadsheet with sampled inventor mentions, the\n", " corresponding predicted cluster, and lists of inventor\n", " mentions to add to and remove from the predicted\n", " clusters. This spreadsheet should contain the columns\n", " 'patent_id', 'sequence', 'inventor_id', 'add', and\n", " 'remove'. The 'add' and 'remove' columns should\n", " contain comma-separated inventor mentions in the\n", " format US-.\n", " rawinventor File with reference inventor mentions and predicted\n", " clusters. 
It should contain the columns 'patent_id',\n", " 'sequence', and 'inventor_id'.\n", "\n", "optional arguments:\n", " -h, --help show this help message and exit\n", " -o OUTPUT, --output OUTPUT\n", " CSV file where to save the resulting hand-\n", " disambiguated membership vector.\n", " -d, --debug Save debugging spreadsheet to\n", " '.csv.debug.xlsx'. This\n", " spreadsheet has two pages. The first shows inventor\n", " mentions to remove that were not found in the\n", " reference predicted clusters. The second shows the\n", " name of inventors added to predicted clusters.\n" ] } ], "source": [ "%%bash\n", "\n", "process-inventors-hand-disambiguation.py --help" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 ('base': conda)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "7a2c4b191d1ae843dde5cb5f4d1f62fa892f6b79b0f9392a84691e890e33c5a4" } } }, "nbformat": 4, "nbformat_minor": 2 }