pv_evaluation.benchmark#

Contents#

inspect_clusters_to_merge

Get table to inspect missing cluster links given a benchmark dataset.

inspect_clusters_to_split

Get table of cluster assignment errors on the given benchmark.

inventor_benchmark_plot

Bar plot of performance evaluation metrics on benchmark datasets.

inventor_estimates_plot

Plot performance estimates for given cluster samples.

inventor_estimates_trend_plot

Plot performance estimates over time.

inventor_summary_trend_plot

Plot key performance metrics over time.

load_israeli_inventors_benchmark

Loads the Israeli inventors benchmark dataset.

load_patentsview_inventors_benchmark

Loads the PatentsView hand-disambiguated inventors benchmark dataset.

load_lai_2011_inventors_benchmark

Loads Lai's 2011 Inventors Benchmark dataset.

load_als_inventors_benchmark

Loads the Academic Life Sciences inventors benchmark dataset.

load_ens_inventors_benchmark

Engineering and Sciences inventors benchmark.

load_binette_2022_inventors_benchmark

Loads Binette's 2022 inventors benchmark dataset.

load_air_umass_assignees_benchmark

AIR-UMASS assignees benchmark.

load_nber_subset_assignees_benchmark

NBER subset assignees benchmark.

style_cluster_inspection

Style table to highlight groups with alternating colors.

top_inventors

Table of most prolific inventors.

plot_entropy_curves

Plot entropy curves for a set of disambiguations.

plot_cluster_sizes

Plot cluster sizes for a set of disambiguations.

plot_name_variation_rates

Plot name variation rates for a set of given disambiguations.

plot_homonimy_rates

Plot homonymy rates for a set of given disambiguations.

Documentation#

Evaluation datasets and standardized benchmarks

pv_evaluation.benchmark.inspect_clusters_to_merge(disambiguation, benchmark, join_with=None, links=False)[source]#

Get table to inspect missing cluster links given a benchmark dataset.

Parameters:
  • disambiguation (Series) – disambiguation result Series (disambiguation results are pandas Series with “mention_id” index and cluster assignment values).

  • benchmark (Series) – reference disambiguation Series.

  • join_with (DataFrame, optional) – DataFrame indexed by “mention_id”. Defaults to None.

Returns:

DataFrame containing missing cluster links according to the given benchmark.

Return type:

DataFrame

pv_evaluation.benchmark.inspect_clusters_to_split(disambiguation, benchmark, join_with=None, links=False)[source]#

Get table of cluster assignment errors on the given benchmark.

Parameters:
  • disambiguation (Series) – disambiguation result Series (disambiguation results are pandas Series with “mention_id” index and cluster assignment values).

  • benchmark (Series) – reference disambiguation Series.

  • join_with (DataFrame, optional) – DataFrame indexed by “mention_id”. Defaults to None.

Returns:

DataFrame containing erroneous cluster assignments according to the given benchmark.

Return type:

DataFrame
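
Example: a minimal sketch of both inspection helpers; the mention IDs and cluster labels below are hypothetical toy data, not part of any benchmark.

```python
import pandas as pd

from pv_evaluation.benchmark import (
    inspect_clusters_to_merge,
    inspect_clusters_to_split,
)

# Toy predicted disambiguation: "mention_id" index, cluster ID values.
prediction = pd.Series(
    {"m1": "c1", "m2": "c2", "m3": "c1", "m4": "c3"}
).rename_axis("mention_id")

# Toy reference (benchmark) disambiguation over the same mentions.
reference = pd.Series(
    {"m1": "r1", "m2": "r1", "m3": "r2", "m4": "r2"}
).rename_axis("mention_id")

# Mentions the benchmark places together but the prediction splits:
print(inspect_clusters_to_merge(prediction, reference))

# Mentions the prediction lumps together against the benchmark:
print(inspect_clusters_to_split(prediction, reference))
```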

pv_evaluation.benchmark.inventor_benchmark_plot(predictions, references=None, metrics=None, facet_col_wrap=2, **kwargs)[source]#

Bar plot of performance evaluation metrics on benchmark datasets.

Parameters:
  • predictions (dict) – dictionary of disambiguation results (disambiguation results are pandas Series with “mention_id” index and cluster assignment values).

  • references (dict, optional) – benchmark dataset loading functions (from the benchmark submodule) to use. Defaults to DEFAULT_BENCHMARK.

  • metrics (dict, optional) – dictionary of metrics (from the metrics submodule) to compute. Defaults to DEFAULT_METRICS.

Returns:

plotly graph object
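
Example: a minimal sketch in which a benchmark dataset stands in as its own “prediction” (which should score perfectly under the defaults); substitute your own disambiguation Series in practice.

```python
from pv_evaluation.benchmark import (
    inventor_benchmark_plot,
    load_israeli_inventors_benchmark,
)

# The benchmark used as its own "prediction" scores perfectly;
# pass your own disambiguation Series instead.
fig = inventor_benchmark_plot(
    predictions={"oracle": load_israeli_inventors_benchmark()}
)
fig.show()
```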

pv_evaluation.benchmark.inventor_estimates_plot(disambiguations, samples_weights=None, estimators=None, facet_col_wrap=2, **kwargs)[source]#

Plot performance estimates for given cluster samples.

Note

The timeframe for the disambiguation should match the timeframe considered by the reference sample.

Parameters:
  • disambiguations (dict) – dictionary of disambiguation results (disambiguation results are pandas Series with “mention_id” index and cluster assignment values). Note that the disambiguated population should match the population from which samples have been drawn. For instance, if using the Israeli benchmark dataset, which covers patents granted between 1963 and 1999, disambiguations should be subset to the same time period.

  • samples_weights (dict, optional) – Dictionary of tuples (A, B), where A is a function to load a dataset and B is a dictionary of parameters to pass to estimator functions. See INVENTORS_SAMPLES for an example.

  • estimators (dict, optional) – Dictionary of tuples (A, B) where A is a point estimator and B is a standard deviation estimator. Defaults to DEFAULT_ESTIMATORS.

Returns:

Plotly bar chart
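
Example: a sketch assuming a hypothetical “my_disambiguation.csv” file holding a full-population disambiguation with “mention_id” and “inventor_id” columns; with the default arguments, the default samples and estimators are used.

```python
import pandas as pd

from pv_evaluation.benchmark import inventor_estimates_plot

# Hypothetical file: one row per inventor mention in the full
# population covered by the reference samples (see the note above).
disambiguation = (
    pd.read_csv("my_disambiguation.csv", dtype=str)
    .set_index("mention_id")["inventor_id"]
)

fig = inventor_estimates_plot({"current": disambiguation})
fig.show()
```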

pv_evaluation.benchmark.inventor_estimates_trend_plot(persistent_inventor, samples_weights=None, estimators=None, **kwargs)[source]#

Plot performance estimates over time.

Note

The timeframe for the disambiguation should match the timeframe considered by the reference sample.

Parameters:
  • persistent_inventor (DataFrame) – String-valued DataFrame in the format of PatentsView’s bulk data download file “g_persistent_inventor.tsv”. This should contain the columns “patent_id”, “sequence”, as well as columns with names of the form “disamb_inventor_id_YYYYMMDD” for inventor IDs corresponding to the given disambiguation date.

  • samples_weights (dict) – Dictionary of tuples (A, B), where A is a function to load a dataset and B is a dictionary of parameters to pass to estimator functions. See INVENTORS_SAMPLES for an example.

  • estimators (dict, optional) – Dictionary of tuples (A, B) where A is a point estimator and B is a standard deviation estimator. Defaults to DEFAULT_ESTIMATORS.

Returns:

Plotly scatter plot
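
Example: a minimal sketch that reads PatentsView’s bulk download file from the working directory and plots estimates with the default samples and estimators.

```python
import pandas as pd

from pv_evaluation.benchmark import inventor_estimates_trend_plot

# PatentsView bulk data download file (string-valued; one
# "disamb_inventor_id_YYYYMMDD" column per disambiguation date).
persistent_inventor = pd.read_csv(
    "g_persistent_inventor.tsv", sep="\t", dtype=str
)

fig = inventor_estimates_trend_plot(persistent_inventor)
fig.show()
```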

pv_evaluation.benchmark.inventor_summary_trend_plot(persistent_inventor, names)[source]#

Plot key performance metrics over time.

Parameters:
  • persistent_inventor (DataFrame) – String-valued DataFrame in the format of PatentsView’s bulk data download file “g_persistent_inventor.tsv”. This should contain the columns “patent_id”, “sequence”, as well as columns with names of the form “disamb_inventor_id_YYYYMMDD” for inventor IDs corresponding to the given disambiguation date.

  • names (Series) – pandas Series indexed by mention IDs and with values corresponding to mentioned inventor name.

Returns:

Plotly scatter plot of the matching rate, homonymy rate, and name variation rate.
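
Example: a sketch of building the names Series from a raw-name file. The mention ID convention “US&lt;patent_id&gt;-&lt;sequence&gt;” and the raw-name file and column names are assumptions based on PatentsView’s bulk-download conventions; adjust them to your data.

```python
import pandas as pd

from pv_evaluation.benchmark import inventor_summary_trend_plot

persistent_inventor = pd.read_csv(
    "g_persistent_inventor.tsv", sep="\t", dtype=str
)

# Assumed raw-name source and column names; adjust to your data.
raw = pd.read_csv("g_inventor_not_disambiguated.tsv", sep="\t", dtype=str)
names = pd.Series(
    (raw["raw_inventor_name_first"] + " " + raw["raw_inventor_name_last"]).values,
    index="US" + raw["patent_id"] + "-" + raw["sequence"],  # assumed format
).rename_axis("mention_id")

fig = inventor_summary_trend_plot(persistent_inventor, names)
fig.show()
```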

pv_evaluation.benchmark.load_air_umass_assignees_benchmark()[source]#

AIR-UMASS assignees benchmark.

The dataset is described as follows in the paper referenced below:

‘The PatentsView team created a hand-labeled set of disambiguated assignee records. The data were created by sampling records of each assignee type (universities, federal government entities, private companies, states, and local government agencies). We used those records as queries for annotators to find all other records referring to the same assignee. Team members annotated the labeled records according to string similarity. In cases where an identity could not be confirmed or was uncertain, annotators did not create a link. We intended this dataset to have a larger coverage of name varieties of the entities than the NBER dataset, which was important for us to evaluate the more difficult-to-disambiguate cases. Annotators attempted to label parent companies separately from subsidiaries, but the process was more likely to associate similarly named child and parent companies than more distinctive ones.’

The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.

Returns:

pandas Series with the benchmark data as a membership vector.

Return type:

Series

References

pv_evaluation.benchmark.load_als_inventors_benchmark()[source]#

Loads the Academic Life Sciences inventors benchmark dataset.

This dataset contains a set of disambiguated inventor mentions derived from Pierre Azoulay’s Academic Life Sciences dataset, which covers US patents granted between 1970 and 2005.

Note that inventor sequence numbers were obtained using a computer matching procedure which may have introduced errors. Rows with unresolved sequence numbers were removed.

The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.

See:

Azoulay, P., J. S. Graff Zivin, and G. Manso (2011). Incentives and creativity: evidence from the academic life sciences. The RAND Journal of Economics 42(3), 527-554.

Returns:

pandas Series with the benchmark data as a membership vector.

Return type:

Series

pv_evaluation.benchmark.load_binette_2022_inventors_benchmark()[source]#

Loads Binette’s 2022 inventors benchmark dataset.

The 2022 Binette inventors benchmark is a hand-disambiguated dataset of inventor mentions on granted patents for a sample of inventors from PatentsView.org. The inventors were selected indirectly by sampling inventor mentions uniformly at random, resulting in inventors sampled with probability proportional to their number of granted patents.

The time period considered is from 1976 to December 31, 2021. This corresponds to the disambiguation labeled “disamb_inventor_id_20211230” in PatentsView’s bulk data downloads ([“g_persistent_inventor.tsv” file](https://patentsview.org/download/data-download-tables)).

The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.

Returns:

pandas Series with the benchmark data as a membership vector.

Return type:

Series

References

  • [Binette, Olivier, Sokhna A York, Emma Hickerson, Youngsoo Baek, Sarvo Madhavan, Christina Jones. (2022). Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org. arXiv e-prints: arxiv:2210.01230](https://arxiv.org/abs/2210.01230)

Notes

  • The methodology used for the hand-disambiguation is described in the reference.

  • The hand-disambiguation process was done by experts, but it should be expected to contain errors due to the ambiguous nature of inventor disambiguation.

  • The benchmark contains a few extraneous mentions of patents granted outside the considered time period; these should be ignored for evaluation purposes.

  • Given the use of the December 30, 2021, disambiguation from PatentsView as a starting point of the hand-labeling, a bias towards this disambiguation should be expected.

pv_evaluation.benchmark.load_ens_inventors_benchmark()[source]#

Engineering and Sciences inventors benchmark.

This is a set of disambiguated inventor mentions derived from Png’s LinkedIn-Patent Inventors Dataset for the 2015 PatentsView Disambiguation Workshop.

The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.

See:

Ge, Chunmian, Ke-wei Huang, and Ivan P.L. Png, “Engineer/Scientist Careers: Patents, Online Profiles, and Misclassification”, Strategic Management Journal, Vol 37 No 1, January 2016, 232-253.

Returns:

pandas Series with the benchmark data as a membership vector.

Return type:

Series

pv_evaluation.benchmark.load_israeli_inventors_benchmark()[source]#

Loads the Israeli inventors benchmark dataset.

This benchmark dataset is adapted from Trajtenberg and Shiff (2008), which covers U.S. patents granted to Israeli inventors between 1963 and 1999.

The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.

See:

Trajtenberg, M., & Shiff, G. (2008). Identification and mobility of Israeli patenting inventors. Pinhas Sapir Center for Development.

Returns:

pandas Series with the benchmark data as a membership vector.

Return type:

Series
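
Example: each of the load_* functions in this module shares this zero-argument interface and return type.

```python
from pv_evaluation.benchmark import load_israeli_inventors_benchmark

benchmark = load_israeli_inventors_benchmark()

# Membership vector: "mention_id" index, cluster assignment values.
print(benchmark.head())
print(f"{len(benchmark)} mentions across {benchmark.nunique()} inventors")
```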

pv_evaluation.benchmark.load_lai_2011_inventors_benchmark()[source]#

Loads Lai’s 2011 Inventors Benchmark dataset.

This benchmark dataset is adapted from the dataset reported in Li et al. (2014), which was used to evaluate the disambiguation of the U.S. Patent Inventor Database (1975-2010).

The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.

See:

Li, G. C., Lai, R., D’Amour, A., Doolin, D. M., Sun, Y., Torvik, V. I., … & Fleming, L. (2014). Disambiguation and co-authorship networks of the US patent inventor database (1975-2010). Research Policy, 43(6), 941-955.

Returns:

pandas Series with the benchmark data as a membership vector.

Return type:

Series

Notes

  • A number of patent IDs which could not be found were removed from Lai’s original dataset.

  • Inventor sequence numbers were assigned through automatic matching and manual review. There could be some errors.

pv_evaluation.benchmark.load_nber_subset_assignees_benchmark()[source]#

NBER subset assignees benchmark.

The dataset is described as follows in the paper referenced below:

‘The National Bureau of Economic Research provides disambiguated assignee data. These data are created semiautomatically with manual correction and labeling of assignee coreference decisions produced by string similarity. We grouped the assignee mentions by four-letter prefixes and focused on five prefix groups {Moto, Amer, Gene, Solu, Airc} that were both common and ambiguous.’

The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.

Returns:

pandas Series with the benchmark data as a membership vector.

Return type:

Series

References

pv_evaluation.benchmark.load_patentsview_inventors_benchmark()[source]#

Loads the PatentsView hand-disambiguated inventors benchmark dataset.

This dataset contains the hand-disambiguation of a set of particularly ambiguous inventor names. The disambiguation was done manually by experts so that it could be used as a benchmark for evaluating disambiguation algorithms.

The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.

See:

Monath, N., Jones, C., & Madhavan, S. Disambiguating Patent Inventors, Assignees, and their Locations in PatentsView. https://s3.amazonaws.com/data.patentsview.org/documents/PatentsView_Disambiguation_Methods_Documentation.pdf

Returns:

pandas Series with the benchmark data as a membership vector.

Return type:

Series

pv_evaluation.benchmark.plot_cluster_sizes(disambiguations)[source]#

Plot cluster sizes for a set of disambiguations.

Parameters:

disambiguations (Dict) – Dictionary of membership vectors representing given disambiguations.

Returns:

Plotly figure.

pv_evaluation.benchmark.plot_entropy_curves(disambiguations)[source]#

Plot entropy curves for a set of disambiguations.

Parameters:

disambiguations (Dict) – Dictionary of membership vectors representing given disambiguations

Returns:

Plotly figure.
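
Example: a minimal sketch covering both plot_cluster_sizes and plot_entropy_curves, using two benchmark datasets as stand-ins for disambiguation results.

```python
from pv_evaluation.benchmark import (
    load_israeli_inventors_benchmark,
    load_lai_2011_inventors_benchmark,
    plot_cluster_sizes,
    plot_entropy_curves,
)

# Any dict of membership vectors works; benchmarks are convenient demos.
disambiguations = {
    "israeli": load_israeli_inventors_benchmark(),
    "lai_2011": load_lai_2011_inventors_benchmark(),
}

plot_cluster_sizes(disambiguations).show()
plot_entropy_curves(disambiguations).show()
```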

pv_evaluation.benchmark.plot_homonimy_rates(disambiguations, names)[source]#

Plot homonymy rates for a set of given disambiguations.

Parameters:
  • disambiguations (Dict) – Dictionary of membership vectors representing given disambiguations.

  • names (Series) – Pandas Series indexed by mention IDs and with values corresponding to inventor name.

Returns:

Plotly figure.

pv_evaluation.benchmark.plot_name_variation_rates(disambiguations, names)[source]#

Plot name variation rates for a set of given disambiguations.

Parameters:
  • disambiguations (Dict) – Dictionary of membership vectors representing given disambiguations.

  • names (Series) – Pandas Series indexed by mention IDs and with values corresponding to inventor name.

Returns:

Plotly figure.
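
Example: a sketch covering both plot_homonimy_rates and plot_name_variation_rates; the mention IDs and names below are hypothetical toy data.

```python
import pandas as pd

from pv_evaluation.benchmark import (
    plot_homonimy_rates,
    plot_name_variation_rates,
)

# Two toy disambiguations of the same three mentions.
disambiguations = {
    "v1": pd.Series({"m1": 1, "m2": 1, "m3": 2}).rename_axis("mention_id"),
    "v2": pd.Series({"m1": 1, "m2": 2, "m3": 2}).rename_axis("mention_id"),
}
names = pd.Series(
    {"m1": "J. Smith", "m2": "John Smith", "m3": "J. Smith"}
).rename_axis("mention_id")

plot_homonimy_rates(disambiguations, names).show()
plot_name_variation_rates(disambiguations, names).show()
```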

pv_evaluation.benchmark.style_cluster_inspection(table, by='prediction')[source]#

Style table to highlight groups with alternating colors.

Parameters:
  • table (dataframe) – DataFrame to style.

  • by (str, optional) – column to color by. Defaults to “prediction”.
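
Example: a sketch that perturbs a benchmark to fabricate erroneous cluster assignments, then highlights the resulting inspection table (this assumes the table carries the default “prediction” column).

```python
from pv_evaluation.benchmark import (
    inspect_clusters_to_split,
    load_israeli_inventors_benchmark,
    style_cluster_inspection,
)

benchmark = load_israeli_inventors_benchmark()

# Lump the two largest true clusters into one to create errors.
top2 = benchmark.value_counts().index[:2]
prediction = benchmark.replace({top2[1]: top2[0]})

table = inspect_clusters_to_split(prediction, benchmark)
style_cluster_inspection(table)  # alternating colors per group
```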

pv_evaluation.benchmark.top_inventors(disambiguation, names, n=10)[source]#

Table of most prolific inventors.

Parameters:
  • disambiguation (Series) – Membership vector, indexed by mention IDs, representing a given disambiguation.

  • names (Series) – Pandas Series indexed by mention IDs and with values corresponding to inventor name.

  • n (int, optional) – Number of rows to display. Defaults to 10.

Returns:

Table with top n most prolific inventors.

Return type:

DataFrame
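
Example: a minimal sketch on hypothetical toy data; in practice, pass a full disambiguation and the corresponding raw inventor names.

```python
import pandas as pd

from pv_evaluation.benchmark import top_inventors

# Toy membership vector and names, both indexed by mention ID.
disambiguation = pd.Series(
    {"m1": "inv-1", "m2": "inv-1", "m3": "inv-2"}
).rename_axis("mention_id")
names = pd.Series(
    {"m1": "Ada Lovelace", "m2": "A. Lovelace", "m3": "Alan Turing"}
).rename_axis("mention_id")

print(top_inventors(disambiguation, names, n=2))
```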