{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🎯 Performance Estimates for Binette's 2022 Benchmark"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook applies our precision and recall estimators to Binette's 2022 benchmark dataset.\n",
    "\n",
    "Note that Binette's 2022 dataset only covers patents granted before 2022. As such, we can only estimate the performance of the current disambiguation algorithm for this time period.\n",
    "\n",
    "The sampling process assumed for Binette's 2022 benchmark is sampling with probability proportional to cluster size. This is because inventors in this benchmark were identified by sampling inventor mentions uniformly at random."
   ]
  },
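  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The link between this sampling process and the estimator weights can be illustrated with a minimal sketch on hypothetical toy data (the series `toy` and its identifiers are made up for illustration and are not part of the benchmark). Drawing one mention uniformly at random selects an inventor cluster with probability proportional to its size, so the matching weights are proportional to `1 / size`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy disambiguation: mention id -> inventor (cluster) id.\n",
    "toy = pd.Series(\n",
    "    ['a', 'a', 'a', 'b', 'b', 'c'],\n",
    "    index=['m1', 'm2', 'm3', 'm4', 'm5', 'm6'],\n",
    ")\n",
    "\n",
    "# Number of mentions in each cluster.\n",
    "sizes = toy.value_counts()\n",
    "\n",
    "# Probability that a uniformly drawn mention belongs to each cluster.\n",
    "selection_prob = sizes / len(toy)\n",
    "\n",
    "# Weights inversely proportional to the selection probability.\n",
    "weights = 1 / sizes\n",
    "\n",
    "print(selection_prob.sort_index())\n",
    "print(weights.sort_index())\n",
    "```\n",
    "\n",
    "In this toy example, the three-mention cluster is three times as likely to be selected as the singleton cluster, and it receives one third of the singleton's weight."
   ]
  },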
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Preparation\n",
    "\n",
    "First, we import the required modules and recover the current disambiguation from `rawinventor.tsv`. The current disambiguation is filtered to only contain inventor mentions for patents granted between 1975 and 2022."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import wget\n",
    "import zipfile\n",
    "import os\n",
    "\n",
    "if not os.path.isfile(\"rawinventor.tsv\"):\n",
    "    wget.download(\"https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip\")\n",
    "    with zipfile.ZipFile(\"rawinventor.tsv.zip\", 'r') as zip_ref:\n",
    "        zip_ref.extractall(\".\")\n",
    "    os.remove(\"rawinventor.tsv.zip\")\n",
    "\n",
    "if not os.path.isfile(\"patent.tsv\"):\n",
    "    wget.download(\"https://s3.amazonaws.com/data.patentsview.org/download/patent.tsv.zip\")\n",
    "    with zipfile.ZipFile(\"patent.tsv.zip\", 'r') as zip_ref:\n",
    "        zip_ref.extractall(\".\")\n",
    "    os.remove(\"patent.tsv.zip\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load only the columns needed to link inventor mentions to patent dates.\n",
    "patent = pd.read_csv(\"patent.tsv\", sep=\"\\t\", dtype=str, usecols=[\"id\", \"date\"])\n",
    "rawinventor = pd.read_csv(\"rawinventor.tsv\", sep=\"\\t\", dtype=str, usecols=[\"patent_id\", \"sequence\", \"inventor_id\"])\n",
    "\n",
    "# Keep only the year of each patent date, then attach it to every inventor mention.\n",
    "date = pd.DatetimeIndex(patent.date)\n",
    "patent[\"date\"] = date.year.astype(int)\n",
    "joined = rawinventor.merge(patent, left_on=\"patent_id\", right_on=\"id\", how=\"left\")"
   ]
  },
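  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build mention IDs of the form US<patent_id>-<sequence>.\n",
    "joined[\"mention_id\"] = \"US\" + joined.patent_id + \"-\" + joined.sequence\n",
    "# Keep mentions on patents dated 1975 through 2022 (inclusive).\n",
    "joined = joined.query('date >= 1975 and date <= 2022')\n",
    "# Series mapping each mention ID to its disambiguated inventor ID.\n",
    "current_disambiguation = joined.set_index(\"mention_id\")[\"inventor_id\"]"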
   ]
  },
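  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Precision and Recall Estimates\n",
    "\n",
    "We can now estimate precision and recall. Since benchmark clusters are assumed to be sampled with probability proportional to their size, each cluster is weighted by the inverse of its size."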
   ]
  },
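  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder of what is being estimated, the sketch below computes pairwise precision and recall exactly on hypothetical toy data, where the full reference clustering is known. The helper `linked_pairs` and the two toy series are illustrative assumptions; this is the plain definition of the metrics, not the design-based estimators used in this notebook, which work from a weighted sample of benchmark clusters.\n",
    "\n",
    "```python\n",
    "from itertools import combinations\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def linked_pairs(membership):\n",
    "    # Set of unordered mention pairs that share a cluster.\n",
    "    pairs = set()\n",
    "    for _, members in membership.groupby(membership):\n",
    "        pairs.update(combinations(sorted(members.index), 2))\n",
    "    return pairs\n",
    "\n",
    "\n",
    "# Hypothetical toy data: mention id -> cluster id.\n",
    "predicted = pd.Series({'m1': 'x', 'm2': 'x', 'm3': 'x', 'm4': 'y'})\n",
    "reference = pd.Series({'m1': 'a', 'm2': 'a', 'm3': 'b', 'm4': 'b'})\n",
    "\n",
    "pred_pairs = linked_pairs(predicted)\n",
    "true_pairs = linked_pairs(reference)\n",
    "correct_pairs = pred_pairs & true_pairs\n",
    "\n",
    "# Precision: share of predicted links that are correct.\n",
    "precision = len(correct_pairs) / len(pred_pairs)\n",
    "# Recall: share of true links that were predicted.\n",
    "recall = len(correct_pairs) / len(true_pairs)\n",
    "\n",
    "print(precision, recall)\n",
    "```\n",
    "\n",
    "In this toy example, the prediction links three pairs of mentions, only one of which is a true link, and it recovers one of the two true links."
   ]
  },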
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from er_evaluation.estimators import pairwise_precision_design_estimate, pairwise_recall_design_estimate\n",
    "from er_evaluation.summary import cluster_sizes\n",
    "from pv_evaluation.benchmark import load_binette_2022_inventors_benchmark"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Precision estimate and standard deviation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.9138044762074496, 0.018549986866583854)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
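   ],
   "source": [
    "benchmark = load_binette_2022_inventors_benchmark()\n",
    "# Weights inversely proportional to cluster size, matching the size-proportional sampling of the benchmark.\n",
    "weights = 1 / cluster_sizes(benchmark)\n",
    "pairwise_precision_design_estimate(current_disambiguation, benchmark, weights=weights)"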
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall estimate and standard deviation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.9637111046011154, 0.008180601394371729)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
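   ],
   "source": [
    "benchmark = load_binette_2022_inventors_benchmark()\n",
    "# Weights inversely proportional to cluster size, matching the size-proportional sampling of the benchmark.\n",
    "weights = 1 / cluster_sizes(benchmark)\n",
    "pairwise_recall_design_estimate(current_disambiguation, benchmark, weights=weights)"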
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.7.15 ('pv-evaluation': conda)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.15"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "135eb778a123b23717215bebe642ebc480e0ab0e1bc583cf4971f84281f0b229"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}