🎯 Performance Estimates for Binette’s 2022 Benchmark#

This notebook showcases the use of our precision and recall performance estimators on Binette’s 2022 benchmark dataset.

Note that Binette’s 2022 dataset only covers patents granted before 2022. As such, we can only estimate the performance of the current disambiguation algorithm for this time period.

The sampling process assumed for Binette’s 2022 benchmark is sampling with probability proportional to cluster size. This is because inventors in this benchmark were identified by sampling inventor mentions uniformly at random: larger inventor clusters contain more mentions and are therefore more likely to be selected.
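
To make this concrete, here is a small illustration on made-up data (the mention and inventor IDs below are hypothetical): under uniform sampling of mentions, a cluster’s chance of being picked up is proportional to its size, so the design estimators used below weight each sampled cluster by the inverse of its size.

import pandas as pd

# Hypothetical disambiguation with three inventor clusters of sizes 3, 2, and 1.
toy = pd.Series(
    ["inv-a", "inv-a", "inv-a", "inv-b", "inv-b", "inv-c"],
    index=["m1", "m2", "m3", "m4", "m5", "m6"],
)

# Sampling one mention uniformly at random selects cluster "inv-a" with
# probability 3/6, "inv-b" with probability 2/6, and "inv-c" with probability
# 1/6, i.e., with probability proportional to cluster size.
print(toy.value_counts() / len(toy))

# The design estimators therefore use sampling weights proportional to the
# inverse of the cluster sizes to correct for this over-representation.
print(1 / toy.value_counts())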

Data Preparation#

First, we import the required modules and recover the current disambiguation from rawinventor.tsv. The current disambiguation is then filtered to only contain inventor mentions on patents granted between 1975 and 2022.

import pandas as pd
import numpy as np
import wget
import zipfile
import os

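# Download and extract the PatentsView data files if they are not already present.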
if not os.path.isfile("rawinventor.tsv"):
    wget.download("https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip")
    with zipfile.ZipFile("rawinventor.tsv.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("rawinventor.tsv.zip")

if not os.path.isfile("patent.tsv"):
    wget.download("https://s3.amazonaws.com/data.patentsview.org/download/patent.tsv.zip")
    with zipfile.ZipFile("patent.tsv.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("patent.tsv.zip")
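
# Read only the columns we need, keeping all identifiers as strings.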
patent = pd.read_csv("patent.tsv", sep="\t", dtype=str, usecols=["id", "date"])
rawinventor = pd.read_csv("rawinventor.tsv", sep="\t", dtype=str, usecols=["patent_id", "sequence", "inventor_id"])

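# Extract the grant year for each patent, attach it to the inventor mentions,
# and construct unique mention IDs of the form "US<patent_id>-<sequence>".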
date = pd.DatetimeIndex(patent.date)
patent["date"] = date.year.astype(int)
joined = rawinventor.merge(patent, left_on="patent_id", right_on="id", how="left")
joined["mention_id"] = "US" + joined.patent_id + "-" + joined.sequence
joined = joined.query('date >= 1975 and date <= 2022')
current_disambiguation = joined.set_index("mention_id")["inventor_id"]
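
The result is a pandas Series indexed by mention ID, with the assigned inventor_id as its value; this is the membership-vector format the estimators below operate on. As a quick sanity check (output not shown here), you can inspect it directly:

# Membership vector: index = mention ID, value = assigned inventor_id.
print(current_disambiguation.head())
print(current_disambiguation.index.is_unique)  # each mention should appear exactly once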

Precision and Recall Estimates#

We can now estimate precision and recall, using sampling weights inversely proportional to cluster size to account for the probability-proportional-to-size sampling design described above.

from er_evaluation.estimators import pairwise_precision_design_estimate, pairwise_recall_design_estimate
from er_evaluation.summary import cluster_sizes
from pv_evaluation.benchmark import load_binette_2022_inventors_benchmark

Precision estimate and standard deviation:

pairwise_precision_design_estimate(
    current_disambiguation,
    load_binette_2022_inventors_benchmark(),
    weights=1 / cluster_sizes(load_binette_2022_inventors_benchmark()),
)
(0.9138044762074496, 0.018549986866583854)

Recall estimate and standard deviation:

pairwise_recall_design_estimate(
    current_disambiguation,
    load_binette_2022_inventors_benchmark(),
    weights=1 / cluster_sizes(load_binette_2022_inventors_benchmark()),
)
(0.9637111046011154, 0.008180601394371729)
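
Each estimator returns a point estimate together with its estimated standard deviation. Under a normal approximation (an assumption we make here, not a guarantee of the estimators), these can be turned into a rough 95% confidence interval:

benchmark = load_binette_2022_inventors_benchmark()
weights = 1 / cluster_sizes(benchmark)

# Rough 95% confidence intervals under a normal approximation.
est, std = pairwise_precision_design_estimate(current_disambiguation, benchmark, weights=weights)
print(f"precision: {est:.3f} +/- {1.96 * std:.3f}")

est, std = pairwise_recall_design_estimate(current_disambiguation, benchmark, weights=weights)
print(f"recall: {est:.3f} +/- {1.96 * std:.3f}")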