🎯 Performance Estimates for Lai’s 2011 Benchmark

This notebook applies our precision and recall performance estimators to Lai’s 2011 benchmark dataset.

Note that Lai’s 2011 dataset only covers patents granted before 2010. As such, we can only estimate the performance of the current disambiguation algorithm for this time period.

The sampling process assumed for Lai’s 2011 benchmark is a uniform sample of inventors. This is because the inventors in this benchmark were identified from a set of CVs rather than by sampling individual patents, which would bias the sample toward inventors with many patents (large clusters).
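To illustrate why the sampling unit matters, here is a minimal simulation sketch. The population below (900 inventors with 1 patent, 100 inventors with 50 patents) is entirely hypothetical and is not drawn from the benchmark; it only shows that picking a random patent mention selects clusters in proportion to their size, whereas picking a random inventor does not.

```python
import random


def sample_cluster_sizes(cluster_sizes, n_draws, by_mention, seed=0):
    """Draw clusters and return the size of each drawn cluster.

    by_mention=True mimics sampling a random patent mention: a cluster is
    selected with probability proportional to its size.
    by_mention=False mimics sampling a random inventor: every cluster is
    equally likely.
    """
    rng = random.Random(seed)
    weights = cluster_sizes if by_mention else [1] * len(cluster_sizes)
    return rng.choices(cluster_sizes, weights=weights, k=n_draws)


# Hypothetical population, for illustration only.
cluster_sizes = [1] * 900 + [50] * 100

uniform = sample_cluster_sizes(cluster_sizes, 10_000, by_mention=False)
biased = sample_cluster_sizes(cluster_sizes, 10_000, by_mention=True)

mean_uniform = sum(uniform) / len(uniform)
mean_biased = sum(biased) / len(biased)
print(f"mean cluster size, inventor sampling: {mean_uniform:.1f}")
print(f"mean cluster size, mention sampling:  {mean_biased:.1f}")
```

In this toy population the true mean cluster size is 5.9, which inventor-level sampling recovers on average, while mention-level sampling lands far higher because the large clusters dominate the draws.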

Data Preparation

First, we import the required modules and load the current disambiguation from rawinventor.tsv, using patent.tsv to obtain grant dates. The current disambiguation is then filtered to keep only inventor mentions on patents granted from 1975 through 2010 (inclusive).

import pandas as pd
import numpy as np
import wget
import zipfile
import os

if not os.path.isfile("rawinventor.tsv"):
    wget.download("https://s3.amazonaws.com/data.patentsview.org/download/rawinventor.tsv.zip")
    with zipfile.ZipFile("rawinventor.tsv.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("rawinventor.tsv.zip")

if not os.path.isfile("patent.tsv"):
    wget.download("https://s3.amazonaws.com/data.patentsview.org/download/patent.tsv.zip")
    with zipfile.ZipFile("patent.tsv.zip", 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove("patent.tsv.zip")

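# Load only the columns we need; keep everything as strings so IDs are not altered.
patent = pd.read_csv("patent.tsv", sep="\t", dtype=str, usecols=["id", "date"])
rawinventor = pd.read_csv("rawinventor.tsv", sep="\t", dtype=str, usecols=["patent_id", "sequence", "inventor_id"])

# Replace the grant date with the grant year, then attach it to each inventor mention.
date = pd.DatetimeIndex(patent.date)
patent["date"] = date.year.astype(int)
joined = rawinventor.merge(patent, left_on="patent_id", right_on="id", how="left")

# Mention IDs have the form "US<patent_id>-<sequence>".
joined["mention_id"] = "US" + joined.patent_id + "-" + joined.sequence
# Keep grant years 1975 through 2010 (inclusive).
joined = joined.query('date >= 1975 and date <= 2010')
# Membership vector: mention ID -> predicted inventor cluster.
current_disambiguation = joined.set_index("mention_id")["inventor_id"]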
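As a quick illustration of the same preparation steps, the sketch below runs them on a tiny hand-made table. All patent numbers, dates, and inventor IDs here are made up for the example; only the column names and the "US<patent_id>-<sequence>" mention format follow the code above.

```python
import pandas as pd

# Hypothetical toy inputs mirroring the columns read from the two TSV files.
toy_rawinventor = pd.DataFrame({
    "patent_id": ["4000001", "4000001", "9000002"],
    "sequence": ["0", "1", "0"],
    "inventor_id": ["inv-a", "inv-b", "inv-a"],
})
toy_patent = pd.DataFrame({
    "id": ["4000001", "9000002"],
    "date": ["1976-12-28", "2015-04-07"],
})

# Extract the grant year and attach it to each inventor mention.
toy_patent["year"] = pd.DatetimeIndex(toy_patent["date"]).year
toy = toy_rawinventor.merge(toy_patent, left_on="patent_id", right_on="id", how="left")

# Build mention IDs and keep only grant years 1975 through 2010 (inclusive).
toy["mention_id"] = "US" + toy["patent_id"] + "-" + toy["sequence"]
toy = toy[toy["year"].between(1975, 2010)]

# Resulting membership vector: mention ID -> inventor cluster.
toy_disambiguation = toy.set_index("mention_id")["inventor_id"]
print(toy_disambiguation)
```

The mention from the 2015 patent is dropped by the year filter, leaving one entry per remaining (patent, sequence) pair. Checking that the index is unique, for example with `toy_disambiguation.index.is_unique`, is a cheap sanity check worth running on the real data as well.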

Precision and Recall Estimates

We can now estimate precision and recall with uniform probability weights.
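For readers unfamiliar with the pairwise metrics being estimated, the sketch below computes plain pairwise precision and recall on a fully labelled toy example. It is not the design-based estimator from er_evaluation, which extrapolates from a sample of benchmark clusters to the whole population; the helper `linked_pairs` and the toy data are hypothetical and only illustrate the quantities that the estimators target.

```python
from itertools import combinations

import pandas as pd


def linked_pairs(membership):
    """Return the set of unordered mention pairs placed in the same cluster."""
    pairs = set()
    for _, members in membership.groupby(membership):
        pairs.update(frozenset(p) for p in combinations(sorted(members.index), 2))
    return pairs


# Hypothetical toy data: mention ID -> cluster label.
predicted = pd.Series({"m1": "A", "m2": "A", "m3": "A", "m4": "B"})
reference = pd.Series({"m1": "x", "m2": "x", "m3": "y", "m4": "y"})

predicted_pairs = linked_pairs(predicted)
reference_pairs = linked_pairs(reference)
correct_pairs = predicted_pairs & reference_pairs

# Precision: share of predicted links that are true links.
toy_precision = len(correct_pairs) / len(predicted_pairs)
# Recall: share of true links that were predicted.
toy_recall = len(correct_pairs) / len(reference_pairs)
print(toy_precision, toy_recall)
```

Here the prediction links three pairs, of which only (m1, m2) is a true link, so precision is 1/3; the reference contains two true links, of which one is recovered, so recall is 1/2.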

from er_evaluation.estimators import pairwise_precision_design_estimate, pairwise_recall_design_estimate
from pv_evaluation.benchmark import load_lai_2011_inventors_benchmark

Precision estimate and standard deviation:

benchmark = load_lai_2011_inventors_benchmark()
pairwise_precision_design_estimate(current_disambiguation, benchmark, weights=pd.Series(1, index=benchmark.unique()))

(0.9061700591403344, 0.02694415809739732)

Recall estimate and standard deviation:

benchmark = load_lai_2011_inventors_benchmark()
pairwise_recall_design_estimate(current_disambiguation, benchmark, weights=pd.Series(1, index=benchmark.unique()))

(0.9096034933749487, 0.05017639288406865)
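One way to read these (estimate, standard deviation) pairs is through an approximate interval. The sketch below assumes a normal approximation, which the notebook itself does not state, and clips the bounds to [0, 1]; the helper `normal_interval` is hypothetical and not part of er_evaluation.

```python
def normal_interval(estimate, std, z=1.96):
    """Approximate 95% interval under an assumed normal approximation.

    Bounds are clipped to [0, 1] because precision and recall are proportions.
    """
    lower = max(0.0, estimate - z * std)
    upper = min(1.0, estimate + z * std)
    return lower, upper


# Values copied from the outputs reported above.
precision_ci = normal_interval(0.9061700591403344, 0.02694415809739732)
recall_ci = normal_interval(0.9096034933749487, 0.05017639288406865)

print("precision:", precision_ci)
print("recall:   ", recall_ci)
```

Under this assumption, the precision interval is roughly 0.85 to 0.96, and the recall interval runs from roughly 0.81 up to the clipped bound of 1.0. The recall estimate is therefore noticeably less certain than the precision estimate.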