{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Key Concepts\n",
    "\n",
    "This page highlights key concepts and terminology used throughout the package.\n",
    "\n",
    "### Patent Number\n",
    "\n",
    "Patent numbers are assigned by the USPTO following the format described `here <https://www.uspto.gov/patents/apply/applying-online/patent-number>`_. Patent numbers consist of two parts: a prefix, which identifies the type of patent, and a number, which is unique within the prefix. Utility patents numbers consist of six, seven or eight digits. The patent number is entered excluding commas and spaces and omitting leading zeroes.\n",
    "\n",
    "Note that other countries have different systems for assigning patent numbers. As such, we typically prefix the patent number by the code \"US\" to identify United States patents. This is done in the representation of mention IDs below.\n",
    "\n",
    "### Mention ID\n",
    "\n",
    "An inventor's mention ID is a reference to a specific author on a specific patent. It takes the format ``US<patent_number>-<sequence_number>``, such as US12345-0, where ``patent_number`` is the patent number and where ``sequence_number`` is the authorship number (0 for the first author, 1 for the second author, etc).\n",
    "\n",
    "Each patent application is assigned a unique patent number by USPTO, and each inventor or author is assigned a unique sequence number on that patent. These two pieces of information, the patent number and the sequence number, are combined to create a unique mention ID for each inventor on each patent. The mention ID provides a way to refer to a specific inventor on a specific patent in the disambiguation process.\n",
    "\n",
    "Note that mention ID is specific to the USPTO and its format, other countries may have different format for patent number and sequence number and different format for mention ID.\n",
    "\n",
    "The mention IDs are used as input to the disambiguation algorithm, which attempts to group together multiple mention IDs that correspond to the same inventor.\n",
    "\n",
    "### Clusters\n",
    "\n",
    "An inventor cluster is a set of mention IDs thought to refer to the same person. There are *predicted* clusters which are provided by disambiguation algorithms, and there are *true* clusters which are *ground-truth* sets of mentions for inventors.\n",
    "\n",
    "An inventor cluster is a set of mention IDs thought to refer to the same person. There are two types of clusters used in evaluating disambiguation algorithms:\n",
    "\n",
    "- **Predicted clusters** are generated by the algorithm being evaluated. These clusters group together mention IDs that the algorithm has determined to refer to the same inventor.\n",
    "- **True clusters** are the ground-truth sets of mention IDs that correspond to the same inventor. These clusters are usually derived from manual annotation or other expert knowledge.\n",
    "\n",
    "The goal of the evaluation process is to compare the predicted clusters to the true clusters, in order to assess the accuracy and performance of the disambiguation algorithm.\n",
    "\n",
    "### Membership Vector\n",
    "\n",
    "A clustering is typically represented as a *membership vector*. This is a map between mention IDs and the clusters to which they are associated. \n",
    "\n",
    "In this package, membership vectors are represented using pandas Series with mention IDs as the index and cluster IDs as the values. All clusterings and disambiguation results follow this format.\n",
    "\n",
    "Below is an example of a membership vector for a subset of inventor mention IDs. The values appearing in the right column (cluster IDs) are arbitrary; the only convention is that mention IDs corresponding to the same inventor (belonging to the same cluster) should have the same cluster ID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "mention_id\n",
       "US3858246-0    11797\n",
       "US3858578-0    11797\n",
       "US3858674-0    16606\n",
       "US3859165-0    13384\n",
       "US3859616-0     9865\n",
       "               ...  \n",
       "US6009346-0    12734\n",
       "US6009390-1     7694\n",
       "US6009409-2    11416\n",
       "US6009543-0    19168\n",
       "US6009552-0      650\n",
       "Name: unique_id, Length: 9156, dtype: int64"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pv_evaluation.benchmark import load_israeli_inventors_benchmark\n",
    "\n",
    "load_israeli_inventors_benchmark()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}