{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ✍️ Creating Inventors Benchmark Datasets by Hand\n",
    "\n",
    "This notebook describes the practical procedure used at the American Institutes for Research to construct hand-disambiguated benchmark datasets of inventor mentions.\n",
    "\n",
    "The procedure has three steps:\n",
    "1. We sample inventor mentions uniformly at random.\n",
    "2. For each sampled mention and given an associated predicted cluster, we identify mentions that should be **removed** from the predicted cluster.\n",
    "3. For each sampled mention and given an associated predicted cluster, we identify mentions that should be **added** to the predicted cluster.\n",
    "\n",
    "This provides a set of ground true clusters which have been sampled with probability proportional to their size. Note that the procedure is dependent on a baseline disambiguation algorithm, typically taken as the current PatentsView disambiguation. In cases where no errors are found, predicted clusters are assumed to be correct.\n",
    "\n",
    "In order to find mentions that should be removed in step (2), we use PatentsView.org as it provides a convenient interface to browse inventor clusters. In order to find mentions that should be added in step (3), we use PatentsView.org's search tools to review mentions to similarly-named inventors. \n",
    "\n",
    "## Practical Implementation\n",
    "\n",
    "From a practical standpoint, staff reviewing inventor clusters keeps track of mentions to be added and to be removed in an excel spreadsheet. This spreadsheet contains one row for each sampled inventor mention, as well as columns for the patent number, the predicted inventor identifier, and the sampled inventor mention's name. As part of the review process, a column named \"add\" is appended to contain comma-separated lists of inventor mentions to add to each row. A column named \"remove\" is appended to contain comma-separated lists of inventor mentions to remove from each row.\n",
    "\n",
    "Note that inventor mentions take the standard form \"US<patent_number>-<sequence_number>\" where <patent_number> is the patent number of the inventor mention and <sequence> is the 0-indexed inventor sequence number.\n",
    "\n",
    "An example of a reviewed set of inventor mentions is shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>patent_id</th>\n",
       "      <th>inventor_id</th>\n",
       "      <th>name_first</th>\n",
       "      <th>name_last</th>\n",
       "      <th>sequence</th>\n",
       "      <th>add</th>\n",
       "      <th>remove</th>\n",
       "      <th>correct</th>\n",
       "      <th>notes</th>\n",
       "      <th>Unnamed: 9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6267035</td>\n",
       "      <td>fl:ca_ln:santizo-1</td>\n",
       "      <td>Carlos Gilberto</td>\n",
       "      <td>Santizo</td>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4690644</td>\n",
       "      <td>fl:ma_ln:flanders-2</td>\n",
       "      <td>Marguerita E.</td>\n",
       "      <td>Flanders</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10120759</td>\n",
       "      <td>fl:ar_ln:gv-1</td>\n",
       "      <td>Aravind</td>\n",
       "      <td>Gv</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5290082</td>\n",
       "      <td>fl:th_ln:mealey-1</td>\n",
       "      <td>Thomas P.</td>\n",
       "      <td>Mealey</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>RE46143</td>\n",
       "      <td>fl:p._ln:erman-1</td>\n",
       "      <td>P. Gregory</td>\n",
       "      <td>Erman</td>\n",
       "      <td>0</td>\n",
       "      <td>US8561841-0, US7475795-3, US9296603-1,US9738507-1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>10223669</td>\n",
       "      <td>fl:to_ln:geniesse-1</td>\n",
       "      <td>Tom</td>\n",
       "      <td>Geniesse</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6387460</td>\n",
       "      <td>fl:hi_ln:yoshizawa-11</td>\n",
       "      <td>Hideo</td>\n",
       "      <td>Yoshizawa</td>\n",
       "      <td>1</td>\n",
       "      <td>US6993267-5, US10895827-2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>not sure about this one- seems like glass and ...</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>5928343</td>\n",
       "      <td>fl:ma_ln:horowitz-3</td>\n",
       "      <td>Mark</td>\n",
       "      <td>Horowitz</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>US7736282-0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>a few with abnormal assignees</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>11009520</td>\n",
       "      <td>fl:st_ln:bowers-3</td>\n",
       "      <td>Stewart V.</td>\n",
       "      <td>Bowers, III</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>5467579</td>\n",
       "      <td>fl:si_ln:boriani-1</td>\n",
       "      <td>Silvano</td>\n",
       "      <td>Boriani</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  patent_id            inventor_id       name_first    name_last  sequence  \\\n",
       "0   6267035     fl:ca_ln:santizo-1  Carlos Gilberto      Santizo         3   \n",
       "1   4690644    fl:ma_ln:flanders-2    Marguerita E.     Flanders         1   \n",
       "2  10120759          fl:ar_ln:gv-1          Aravind           Gv         0   \n",
       "3   5290082      fl:th_ln:mealey-1        Thomas P.       Mealey         2   \n",
       "4   RE46143       fl:p._ln:erman-1       P. Gregory        Erman         0   \n",
       "5  10223669    fl:to_ln:geniesse-1              Tom     Geniesse         0   \n",
       "6   6387460  fl:hi_ln:yoshizawa-11            Hideo    Yoshizawa         1   \n",
       "7   5928343    fl:ma_ln:horowitz-3             Mark     Horowitz         1   \n",
       "8  11009520      fl:st_ln:bowers-3       Stewart V.  Bowers, III         1   \n",
       "9   5467579     fl:si_ln:boriani-1          Silvano      Boriani         0   \n",
       "\n",
       "                                                 add       remove correct  \\\n",
       "0                                                NaN          NaN     yes   \n",
       "1                                                NaN          NaN     yes   \n",
       "2                                                NaN          NaN     yes   \n",
       "3                                                NaN          NaN     yes   \n",
       "4  US8561841-0, US7475795-3, US9296603-1,US9738507-1          NaN     NaN   \n",
       "5                                                NaN          NaN     yes   \n",
       "6                          US6993267-5, US10895827-2          NaN     NaN   \n",
       "7                                                NaN  US7736282-0     NaN   \n",
       "8                                                NaN          NaN     yes   \n",
       "9                                                NaN          NaN     yes   \n",
       "\n",
       "                                               notes Unnamed: 9  \n",
       "0                                                NaN        NaN  \n",
       "1                                                NaN        NaN  \n",
       "2                                                NaN        NaN  \n",
       "3                                                NaN        NaN  \n",
       "4                                                NaN        NaN  \n",
       "5                                                NaN        NaN  \n",
       "6  not sure about this one- seems like glass and ...        NaN  \n",
       "7                      a few with abnormal assignees        NaN  \n",
       "8                                                NaN        NaN  \n",
       "9                                                NaN        NaN  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pd.read_excel(\"2022-07-25-Emma-patent-samples-part-2.xlsx\").head(10)"
   ]
  },
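  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mention format and the comma-separated `add` and `remove` cells described above can be parsed with a few lines of Python. The sketch below is illustrative only and is not the code used by **pv_evaluation**; the helper names `parse_mention` and `split_mentions` are hypothetical, as is the assumption that a reissue patent such as RE46143 yields the mention `USRE46143-0`.\n",
    "\n",
    "```python\n",
    "def parse_mention(mention_id):\n",
    "    # Split 'US<patent_number>-<sequence_number>' into its two parts.\n",
    "    if not mention_id.startswith('US'):\n",
    "        raise ValueError('mention ID must start with US: ' + mention_id)\n",
    "    patent_number, sep, sequence = mention_id[2:].rpartition('-')\n",
    "    if not sep or not patent_number or not sequence.isdigit():\n",
    "        raise ValueError('malformed mention ID: ' + mention_id)\n",
    "    return patent_number, int(sequence)\n",
    "\n",
    "\n",
    "def split_mentions(cell):\n",
    "    # Missing cells (NaN or any other non-string value) give an empty list.\n",
    "    if not isinstance(cell, str):\n",
    "        return []\n",
    "    return [m.strip() for m in cell.split(',') if m.strip()]\n",
    "\n",
    "\n",
    "# The 'add' cell of the RE46143 row in the table above.\n",
    "mentions = split_mentions('US8561841-0, US7475795-3, US9296603-1,US9738507-1')\n",
    "parsed = [parse_mention(m) for m in mentions]\n",
    "\n",
    "# Assumed form of a mention on a reissue patent (hypothetical).\n",
    "reissue = parse_mention('USRE46143-0')\n",
    "```\n",
    "\n",
    "Splitting on commas and then stripping whitespace tolerates entries typed without a space after the comma, as in the reviewed spreadsheet above."
   ]
  },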
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Validation\n",
    "\n",
    "Using a reference set of inventor mentions together with the predicted clustering (i.e., the \"rawinventor.tsv\" file from [PatentsView's bulk data downloads](https://patentsview.org/download/data-download-tables)), we look for inventor mentions that do not exist in the data and for mentions listed to be removed but that are not part of the sampled mention's cluster. Additionally, we print out a sheet containing a comparison between the name of sampled inventors and the names of inventors **added** to predicted clusters. This way, obvious errors in the review process can be flagged and corrected.\n",
    "\n",
    "This validation process is done using the `process-inventors-hand-disambiguation.py` script provided by the **pv_evaluation** package as follows. First, we install **pv_evaluation** and download the rawinventor.tsv file. Next, we run `process-inventors-hand-disambiguation.py` in debug mode to produce an excel spreadsheet containing one page for review errors and one page for the comparison of sampled names with added inventor names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Archive:  rawinventor.tsv.zip\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "pip install -q git+https://github.com/PatentsView/PatentsView-Evaluation.git@release\n",
    "wget -nc -q -nv https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip -O rawinventor.tsv.zip\n",
    "unzip -n rawinventor.tsv.zip\n",
    "process-inventors-hand-disambiguation.py --debug 2022-07-25-Emma-patent-samples-part-2.xlsx rawinventor.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The debugging pages are shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>patent_id</th>\n",
       "      <th>inventor_id</th>\n",
       "      <th>name_first</th>\n",
       "      <th>name_last</th>\n",
       "      <th>add</th>\n",
       "      <th>remove</th>\n",
       "      <th>remove_errors</th>\n",
       "      <th>add_errors</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6267035</td>\n",
       "      <td>fl:ca_ln:santizo-1</td>\n",
       "      <td>Carlos Gilberto</td>\n",
       "      <td>Santizo</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4690644</td>\n",
       "      <td>fl:ma_ln:flanders-2</td>\n",
       "      <td>Marguerita E.</td>\n",
       "      <td>Flanders</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10120759</td>\n",
       "      <td>fl:ar_ln:gv-1</td>\n",
       "      <td>Aravind</td>\n",
       "      <td>Gv</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5290082</td>\n",
       "      <td>fl:th_ln:mealey-1</td>\n",
       "      <td>Thomas P.</td>\n",
       "      <td>Mealey</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>RE46143</td>\n",
       "      <td>fl:p._ln:erman-1</td>\n",
       "      <td>P. Gregory</td>\n",
       "      <td>Erman</td>\n",
       "      <td>US8561841-0, US7475795-3, US9296603-1,US9738507-1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>10455291</td>\n",
       "      <td>fl:jo_ln:bernstein-1</td>\n",
       "      <td>Joseph Harold</td>\n",
       "      <td>Bernstein</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>5057055</td>\n",
       "      <td>fl:mi_ln:presseau-1</td>\n",
       "      <td>Michel</td>\n",
       "      <td>Presseau</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>6810399</td>\n",
       "      <td>fl:an_ln:osborn-5</td>\n",
       "      <td>Andrew</td>\n",
       "      <td>Osborn</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>9730177</td>\n",
       "      <td>fl:st_ln:toth-4</td>\n",
       "      <td>Stefan Karl</td>\n",
       "      <td>Toth</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>4614424</td>\n",
       "      <td>fl:sh_ln:watanabe-201</td>\n",
       "      <td>Shunji</td>\n",
       "      <td>Watanabe</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>200 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    patent_id            inventor_id       name_first  name_last  \\\n",
       "0     6267035     fl:ca_ln:santizo-1  Carlos Gilberto    Santizo   \n",
       "1     4690644    fl:ma_ln:flanders-2    Marguerita E.   Flanders   \n",
       "2    10120759          fl:ar_ln:gv-1          Aravind         Gv   \n",
       "3     5290082      fl:th_ln:mealey-1        Thomas P.     Mealey   \n",
       "4     RE46143       fl:p._ln:erman-1       P. Gregory      Erman   \n",
       "..        ...                    ...              ...        ...   \n",
       "195  10455291   fl:jo_ln:bernstein-1    Joseph Harold  Bernstein   \n",
       "196   5057055    fl:mi_ln:presseau-1           Michel   Presseau   \n",
       "197   6810399      fl:an_ln:osborn-5           Andrew     Osborn   \n",
       "198   9730177        fl:st_ln:toth-4      Stefan Karl       Toth   \n",
       "199   4614424  fl:sh_ln:watanabe-201           Shunji   Watanabe   \n",
       "\n",
       "                                                   add remove remove_errors  \\\n",
       "0                                                  NaN    NaN            []   \n",
       "1                                                  NaN    NaN            []   \n",
       "2                                                  NaN    NaN            []   \n",
       "3                                                  NaN    NaN            []   \n",
       "4    US8561841-0, US7475795-3, US9296603-1,US9738507-1    NaN            []   \n",
       "..                                                 ...    ...           ...   \n",
       "195                                                NaN    NaN            []   \n",
       "196                                                NaN    NaN            []   \n",
       "197                                                NaN    NaN            []   \n",
       "198                                                NaN    NaN            []   \n",
       "199                                                NaN    NaN            []   \n",
       "\n",
       "    add_errors  \n",
       "0           []  \n",
       "1           []  \n",
       "2           []  \n",
       "3           []  \n",
       "4           []  \n",
       "..         ...  \n",
       "195         []  \n",
       "196         []  \n",
       "197         []  \n",
       "198         []  \n",
       "199         []  \n",
       "\n",
       "[200 rows x 8 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_excel(\"true_clusters.csv.debug.xlsx\", sheet_name=\"Cluster Errors\").drop(columns=[\"sequence\", \"Unnamed: 9\", \"correct\", \"notes\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>patent_id</th>\n",
       "      <th>name_first</th>\n",
       "      <th>name_last</th>\n",
       "      <th>added</th>\n",
       "      <th>name_first_added</th>\n",
       "      <th>name_last_added</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>RE46143</td>\n",
       "      <td>P. Gregory</td>\n",
       "      <td>Erman</td>\n",
       "      <td>US7475795-3</td>\n",
       "      <td>Gregory</td>\n",
       "      <td>Erman</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>RE46143</td>\n",
       "      <td>P. Gregory</td>\n",
       "      <td>Erman</td>\n",
       "      <td>US8561841-0</td>\n",
       "      <td>Gregory P.</td>\n",
       "      <td>Erman</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>RE46143</td>\n",
       "      <td>P. Gregory</td>\n",
       "      <td>Erman</td>\n",
       "      <td>US9296603-1</td>\n",
       "      <td>Paul Gregory</td>\n",
       "      <td>Erman</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>RE46143</td>\n",
       "      <td>P. Gregory</td>\n",
       "      <td>Erman</td>\n",
       "      <td>US9738507-1</td>\n",
       "      <td>Paul Gregory</td>\n",
       "      <td>Erman</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6387460</td>\n",
       "      <td>Hideo</td>\n",
       "      <td>Yoshizawa</td>\n",
       "      <td>US10895827-2</td>\n",
       "      <td>Hideo</td>\n",
       "      <td>Yoshizawa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227</th>\n",
       "      <td>5792879</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Gessner</td>\n",
       "      <td>US8963898-4</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Gessner</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>228</th>\n",
       "      <td>5792879</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Gessner</td>\n",
       "      <td>US9005705-2</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Gessner</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>229</th>\n",
       "      <td>5792879</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Gessner</td>\n",
       "      <td>US9291285-2</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Gessner</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>230</th>\n",
       "      <td>10474574</td>\n",
       "      <td>Seung-Beom</td>\n",
       "      <td>Lee</td>\n",
       "      <td>US10275371-1</td>\n",
       "      <td>Seungbeom</td>\n",
       "      <td>Lee</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>231</th>\n",
       "      <td>9979878</td>\n",
       "      <td>Sapna A</td>\n",
       "      <td>Shroff</td>\n",
       "      <td>US11042034-2</td>\n",
       "      <td>Sapna</td>\n",
       "      <td>Shroff</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>232 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    patent_id  name_first  name_last         added name_first_added  \\\n",
       "0     RE46143  P. Gregory      Erman   US7475795-3          Gregory   \n",
       "1     RE46143  P. Gregory      Erman   US8561841-0       Gregory P.   \n",
       "2     RE46143  P. Gregory      Erman   US9296603-1     Paul Gregory   \n",
       "3     RE46143  P. Gregory      Erman   US9738507-1     Paul Gregory   \n",
       "4     6387460       Hideo  Yoshizawa  US10895827-2            Hideo   \n",
       "..        ...         ...        ...           ...              ...   \n",
       "227   5792879      Thomas    Gessner   US8963898-4           Thomas   \n",
       "228   5792879      Thomas    Gessner   US9005705-2           Thomas   \n",
       "229   5792879      Thomas    Gessner   US9291285-2           Thomas   \n",
       "230  10474574  Seung-Beom        Lee  US10275371-1        Seungbeom   \n",
       "231   9979878     Sapna A     Shroff  US11042034-2            Sapna   \n",
       "\n",
       "    name_last_added  \n",
       "0             Erman  \n",
       "1             Erman  \n",
       "2             Erman  \n",
       "3             Erman  \n",
       "4         Yoshizawa  \n",
       "..              ...  \n",
       "227         Gessner  \n",
       "228         Gessner  \n",
       "229         Gessner  \n",
       "230             Lee  \n",
       "231          Shroff  \n",
       "\n",
       "[232 rows x 6 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_excel(\"true_clusters.csv.debug.xlsx\", sheet_name=\"Validation of Added Mentions\").drop(columns=[\"sequence\", \"inventor_id\"])"
   ]
  },
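  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two checks described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used by `process-inventors-hand-disambiguation.py`: the reference rows are made-up toy values standing in for rawinventor.tsv, and the names `reference_rows`, `cluster_of`, and `check_row` are hypothetical.\n",
    "\n",
    "```python\n",
    "# Toy stand-in for rawinventor.tsv rows: (patent_id, sequence, inventor_id).\n",
    "# All values are made up for illustration.\n",
    "reference_rows = [\n",
    "    ('100', 0, 'a-1'),\n",
    "    ('100', 1, 'b-1'),\n",
    "    ('200', 0, 'a-1'),\n",
    "    ('300', 0, 'c-1'),\n",
    "]\n",
    "\n",
    "# Map each mention ID to its predicted cluster.\n",
    "cluster_of = {\n",
    "    'US{}-{}'.format(patent_id, sequence): inventor_id\n",
    "    for patent_id, sequence, inventor_id in reference_rows\n",
    "}\n",
    "\n",
    "\n",
    "def check_row(inventor_id, add, remove):\n",
    "    # Added mentions must exist in the reference data.\n",
    "    add_errors = [m for m in add if m not in cluster_of]\n",
    "    # Removed mentions must belong to the sampled mention's predicted cluster.\n",
    "    remove_errors = [m for m in remove if cluster_of.get(m) != inventor_id]\n",
    "    return add_errors, remove_errors\n",
    "\n",
    "\n",
    "add_errors, remove_errors = check_row(\n",
    "    'a-1',\n",
    "    add=['US300-0', 'US999-0'],\n",
    "    remove=['US200-0', 'US100-1'],\n",
    ")\n",
    "```\n",
    "\n",
    "Here `US999-0` is flagged because it does not appear in the reference data, and `US100-1` is flagged because it belongs to a different predicted cluster than the sampled mention."
   ]
  },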
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transformation into Benchmark Dataset\n",
    "\n",
    "Once errors in the review process have been corrected, the working excel sheet can be transformed to a csv file containing the hand-disambiguation results in the standard format of a membership vector. This is done by running `process-inventors-hand-disambiguation.py` as follows. Note that the name of the output file can be changed using the \"--output\" argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "process-inventors-hand-disambiguation.py \"2022-07-25-Emma-patent-samples-part-2.xlsx\" \"rawinventor.tsv\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result (saved by default to \"true_clusters.csv\") is shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mention_id</th>\n",
       "      <th>inventor_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>US7152514-3</td>\n",
       "      <td>fl:ca_ln:santizo-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US6564684-3</td>\n",
       "      <td>fl:ca_ln:santizo-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>US7832315-3</td>\n",
       "      <td>fl:ca_ln:santizo-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>US6267035-3</td>\n",
       "      <td>fl:ca_ln:santizo-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>US6708592-3</td>\n",
       "      <td>fl:ca_ln:santizo-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6733</th>\n",
       "      <td>US5679889-0</td>\n",
       "      <td>fl:sh_ln:watanabe-201</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6734</th>\n",
       "      <td>US7169506-0</td>\n",
       "      <td>fl:sh_ln:watanabe-201</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6735</th>\n",
       "      <td>US6459564-0</td>\n",
       "      <td>fl:sh_ln:watanabe-201</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6736</th>\n",
       "      <td>US7749649-0</td>\n",
       "      <td>fl:sh_ln:watanabe-201</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6737</th>\n",
       "      <td>US8553392-3</td>\n",
       "      <td>fl:sh_ln:watanabe-201</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6738 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       mention_id            inventor_id\n",
       "0     US7152514-3     fl:ca_ln:santizo-1\n",
       "1     US6564684-3     fl:ca_ln:santizo-1\n",
       "2     US7832315-3     fl:ca_ln:santizo-1\n",
       "3     US6267035-3     fl:ca_ln:santizo-1\n",
       "4     US6708592-3     fl:ca_ln:santizo-1\n",
       "...           ...                    ...\n",
       "6733  US5679889-0  fl:sh_ln:watanabe-201\n",
       "6734  US7169506-0  fl:sh_ln:watanabe-201\n",
       "6735  US6459564-0  fl:sh_ln:watanabe-201\n",
       "6736  US7749649-0  fl:sh_ln:watanabe-201\n",
       "6737  US8553392-3  fl:sh_ln:watanabe-201\n",
       "\n",
       "[6738 rows x 2 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\"true_clusters.csv\")"
   ]
  },
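  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conceptually, each ground-truth cluster is the predicted cluster of the sampled mention, plus the mentions to add, minus the mentions to remove. The sketch below illustrates this under stated assumptions and is not the script's actual implementation: the `predicted` mapping holds made-up toy values, the function name `true_cluster` is hypothetical, and labelling the corrected cluster with the predicted inventor ID is an assumption based on the output shown above.\n",
    "\n",
    "```python\n",
    "# Toy predicted clustering: mention ID -> predicted inventor ID (made-up values).\n",
    "predicted = {\n",
    "    'US100-0': 'a-1',\n",
    "    'US200-0': 'a-1',\n",
    "    'US300-0': 'a-1',\n",
    "    'US400-1': 'a-2',\n",
    "}\n",
    "\n",
    "\n",
    "def true_cluster(sampled_mention, add, remove):\n",
    "    # Start from the predicted cluster that contains the sampled mention.\n",
    "    inventor_id = predicted[sampled_mention]\n",
    "    members = {m for m, c in predicted.items() if c == inventor_id}\n",
    "    # Apply the reviewer's corrections: add first, then remove.\n",
    "    members = (members | set(add)) - set(remove)\n",
    "    return inventor_id, sorted(members)\n",
    "\n",
    "\n",
    "inventor_id, members = true_cluster('US100-0', add=['US400-1'], remove=['US300-0'])\n",
    "\n",
    "# Membership vector: one (mention_id, inventor_id) pair per mention.\n",
    "membership = [(mention_id, inventor_id) for mention_id in members]\n",
    "```\n",
    "\n",
    "When both lists are empty, the result is the predicted cluster unchanged, which matches the assumption that clusters with no reported errors are correct."
   ]
  },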
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More Information\n",
    "\n",
    "For more information, please refer to the help file of `process-inventors-hand-disambiguation.py`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usage: process-inventors-hand-disambiguation.py [-h] [-o OUTPUT] [-d]\n",
      "                                                hand_disambiguation\n",
      "                                                rawinventor\n",
      "\n",
      "Process inventors hand-disambiguation files: validate data and produce\n",
      "benchmark dataset.\n",
      "\n",
      "positional arguments:\n",
      "  hand_disambiguation   Excel spreadsheet with sampled inventor mentions, the\n",
      "                        corresponding predicted cluster, and lists of inventor\n",
      "                        mentions to add to and remove from the predicted\n",
      "                        clusters. This spreadsheet should contain the columns\n",
      "                        'patent_id', 'sequence', 'inventor_id', 'add', and\n",
      "                        'remove'. The 'add' and 'remove' columns should\n",
      "                        contain comma-separated inventor mentions in the\n",
      "                        format US<patent_number>-<sequence_number>.\n",
      "  rawinventor           File with reference inventor mentions and predicted\n",
      "                        clusters. It should contain the columns 'patent_id',\n",
      "                        'sequence', and 'inventor_id'.\n",
      "\n",
      "optional arguments:\n",
      "  -h, --help            show this help message and exit\n",
      "  -o OUTPUT, --output OUTPUT\n",
      "                        CSV file where to save the resulting hand-\n",
      "                        disambiguated membership vector.\n",
      "  -d, --debug           Save debugging spreadsheet to\n",
      "                        '<hand_disambiguation>.csv.debug.xlsx'. This\n",
      "                        spreadsheet has two pages. The first shows inventor\n",
      "                        mentions to remove that were not found in the\n",
      "                        reference predicted clusters. The second shows the\n",
      "                        name of inventors added to predicted clusters.\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "process-inventors-hand-disambiguation.py --help"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.12 ('base': conda)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "7a2c4b191d1ae843dde5cb5f4d1f62fa892f6b79b0f9392a84691e890e33c5a4"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}