pv_evaluation.benchmark#
Contents#
| Function | Description |
| --- | --- |
| inspect_clusters_to_merge | Get table to inspect missing cluster links given a benchmark dataset. |
| inspect_clusters_to_split | Get table of cluster assignment errors on the given benchmark. |
| inventor_benchmark_plot | Bar plot of performance evaluation metrics on benchmark datasets. |
| inventor_estimates_plot | Plot performance estimates for given cluster samples. |
| inventor_estimates_trend_plot | Plot performance estimates over time. |
| inventor_summary_trend_plot | Plot key performance metrics over time. |
| load_israeli_inventors_benchmark | Loads the Israeli inventors benchmark dataset. |
| load_patentsview_inventors_benchmark | Loads the PatentsView hand-disambiguated inventors benchmark dataset. |
| load_lai_2011_inventors_benchmark | Loads Lai's 2011 Inventors Benchmark dataset. |
| load_als_inventors_benchmark | Loads the Academic Life Sciences inventors benchmark dataset. |
| load_ens_inventors_benchmark | Engineering and Sciences inventors benchmark. |
| load_binette_2022_inventors_benchmark | Loads Binette's 2022 inventors benchmark dataset. |
| load_air_umass_assignees_benchmark | AIR-UMASS assignees benchmark. |
| load_nber_subset_assignees_benchmark | NBER subset assignees benchmark. |
| style_cluster_inspection | Style table to highlight groups with alternating colors. |
| top_inventors | Table of most prolific inventors. |
| plot_entropy_curves | Plot entropy curves for a set of disambiguations. |
| plot_cluster_sizes | Plot cluster sizes for a set of disambiguations. |
| plot_name_variation_rates | Plot name variation rates for a set of given disambiguations. |
| plot_homonimy_rates | Plot homonymy rates for a set of given disambiguations. |
Documentation#
Evaluation datasets and standardized benchmarks
- pv_evaluation.benchmark.inspect_clusters_to_merge(disambiguation, benchmark, join_with=None, links=False)[source]#
Get table to inspect missing cluster links given a benchmark dataset.
- Parameters:
disambiguation (Series) – disambiguation result Series (disambiguation results are pandas Series with “mention_id” index and cluster assignment values).
benchmark (Series) – reference disambiguation Series.
join_with (DataFrame, optional) – DataFrame indexed by “mention_id”. Defaults to None.
- Returns:
DataFrame containing missing cluster links according to the given benchmark.
- Return type:
DataFrame
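For illustration, here is a minimal sketch with made-up mention IDs and cluster labels (the toy data is hypothetical and only meant to show the call pattern):

```python
import pandas as pd

from pv_evaluation.benchmark import inspect_clusters_to_merge

index = pd.Index(["US1-0", "US1-1", "US2-0", "US3-0"], name="mention_id")

# Toy prediction: four inventor mentions assigned to three clusters.
prediction = pd.Series(["c1", "c1", "c2", "c3"], index=index)

# Toy benchmark: the last two mentions refer to the same inventor,
# so clusters "c2" and "c3" are missing a link.
benchmark = pd.Series(["i1", "i1", "i2", "i2"], index=index)

print(inspect_clusters_to_merge(prediction, benchmark))
```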
- pv_evaluation.benchmark.inspect_clusters_to_split(disambiguation, benchmark, join_with=None, links=False)[source]#
Get table of cluster assignment errors on the given benchmark.
- Parameters:
disambiguation (Series) – disambiguation result Series (disambiguation results are pandas Series with “mention_id” index and cluster assignment values).
benchmark (Series) – reference disambiguation Series.
join_with (DataFrame, optional) – DataFrame indexed by “mention_id”. Defaults to None.
- Returns:
DataFrame containing erroneous cluster assignments according to the given benchmark.
- Return type:
DataFrame
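The call pattern mirrors inspect_clusters_to_merge; in this hypothetical sketch, the prediction lumps two distinct inventors into a single cluster that should be split:

```python
import pandas as pd

from pv_evaluation.benchmark import inspect_clusters_to_split

index = pd.Index(["US1-0", "US1-1", "US2-0"], name="mention_id")
prediction = pd.Series(["c1", "c1", "c1"], index=index)  # one big cluster
benchmark = pd.Series(["i1", "i1", "i2"], index=index)   # two true inventors

print(inspect_clusters_to_split(prediction, benchmark))
```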
- pv_evaluation.benchmark.inventor_benchmark_plot(predictions, references=None, metrics=None, facet_col_wrap=2, **kwargs)[source]#
Bar plot of performance evaluation metrics on benchmark datasets.
- Parameters:
predictions (dict) – dictionary of disambiguation results (disambiguation results are pandas Series with “mention_id” index and cluster assignment values).
references (dict, optional) – benchmark dataset loading functions (from the benchmark submodule) to use. Defaults to DEFAULT_BENCHMARK.
metrics (dict, optional) – dictionary of metrics (from the metrics submodule) to compute. Defaults to DEFAULT_METRICS.
- Returns:
Plotly graph object
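A hedged usage sketch, assuming a hypothetical disambiguation Series `prediction` indexed by PatentsView mention IDs and covering the benchmark datasets’ mentions:

```python
from pv_evaluation.benchmark import inventor_benchmark_plot

# With the defaults, metrics from DEFAULT_METRICS are computed against
# each benchmark in DEFAULT_BENCHMARK and shown as faceted bar charts.
fig = inventor_benchmark_plot({"my-disambiguation": prediction})
fig.show()
```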
- pv_evaluation.benchmark.inventor_estimates_plot(disambiguations, samples_weights=None, estimators=None, facet_col_wrap=2, **kwargs)[source]#
Plot performance estimates for given cluster samples.
Note
The timeframe for the disambiguation should match the timeframe considered by the reference sample.
- Parameters:
disambiguations (dict) – dictionary of disambiguation results (disambiguation results are pandas Series with “mention_id” index and cluster assignment values). Note that the disambiguated population should match the population from which samples have been drawn. For instance, if using the Israeli benchmark dataset, which covers U.S. patents granted between 1963 and 1999, disambiguations should be restricted to the same time period.
samples_weights (dict) – Dictionary of tuples (A, B), where A is a function to load a dataset and B is a dictionary of parameters to pass to estimator functions. See INVENTORS_SAMPLES for an example.
estimators (dict, optional) – Dictionary of tuples (A, B) where A is a point estimator and B is a standard deviation estimator. Defaults to DEFAULT_ESTIMATORS.
- Returns:
Plotly bar chart
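A hedged sketch under the same assumption of a hypothetical `prediction` Series, restricted to the time period covered by the default reference samples:

```python
from pv_evaluation.benchmark import inventor_estimates_plot

# Point estimates and standard deviations come from DEFAULT_ESTIMATORS
# applied to the default cluster samples.
fig = inventor_estimates_plot({"my-disambiguation": prediction})
fig.show()
```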
- pv_evaluation.benchmark.inventor_estimates_trend_plot(persistent_inventor, samples_weights=None, estimators=None, **kwargs)[source]#
Plot performance estimates over time.
Note
The timeframe for the disambiguation should match the timeframe considered by the reference sample.
- Parameters:
persistent_inventor (DataFrame) – String-valued DataFrame in the format of PatentsView’s bulk data download file “g_persistent_inventor.tsv”. This should contain the columns “patent_id” and “sequence”, as well as columns with names of the form “disamb_inventor_id_YYYYMMDD” for inventor IDs corresponding to the given disambiguation date.
samples_weights (dict) – Dictionary of tuples (A, B), where A is a function to load a dataset and B is a dictionary of parameters to pass to estimator functions. See INVENTORS_SAMPLES for an example.
estimators (dict, optional) – Dictionary of tuples (A, B) where A is a point estimator and B is a standard deviation estimator. Defaults to DEFAULT_ESTIMATORS.
- Returns:
Plotly scatter plot
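A sketch assuming a local copy of the bulk download file (the path below is hypothetical):

```python
import pandas as pd

from pv_evaluation.benchmark import inventor_estimates_trend_plot

# Read IDs as strings to preserve their exact formatting.
persistent_inventor = pd.read_csv("g_persistent_inventor.tsv", sep="\t", dtype=str)

fig = inventor_estimates_trend_plot(persistent_inventor)
fig.show()
```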
- pv_evaluation.benchmark.inventor_summary_trend_plot(persistent_inventor, names)[source]#
Plot key performance metrics over time.
- Parameters:
persistent_inventor (DataFrame) – String-valued DataFrame in the format of PatentsView’s bulk data download file “g_persistent_inventor.tsv”. This should contain the columns “patent_id” and “sequence”, as well as columns with names of the form “disamb_inventor_id_YYYYMMDD” for inventor IDs corresponding to the given disambiguation date.
names (Series) – pandas Series indexed by mention IDs and with values corresponding to the mentioned inventor’s name.
- Returns:
Plotly scatter plot of the matching rate, homonymy rate, and name variation rate.
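A sketch assuming `persistent_inventor` has been read as above and that a `names` Series (hypothetical here) has been assembled from raw inventor name fields:

```python
from pv_evaluation.benchmark import inventor_summary_trend_plot

# `names` maps each mention ID to the raw inventor name for that mention.
fig = inventor_summary_trend_plot(persistent_inventor, names)
fig.show()
```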
- pv_evaluation.benchmark.load_air_umass_assignees_benchmark()[source]#
AIR-UMASS assignees benchmark.
The dataset is described as follows in the paper referenced below:
‘The PatentsView team created a hand-labeled set of disambiguated assignee records. The data were created by sampling records of each assignee type (universities, federal government entities, private companies, states, and local government agencies). We used those records as queries for annotators to find all other records referring to the same assignee. Team members annotated the labeled records according to string similarity. In cases where an identity could not be confirmed or was uncertain, annotators did not create a link. We intended this dataset to have a larger coverage of name varieties of the entities than the NBER dataset, which was important for us to evaluate the more difficult-to-disambiguate cases. Annotators attempted to label parent companies separately from subsidiaries, but the process was more likely to associate similarly named child and parent companies than more distinctive ones.’
The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.
- Returns:
pandas Series with the benchmark data as a membership vector.
- Return type:
Series
References
Monath, N., Jones, C., & Madhavan, S. Disambiguating Patent Inventors, Assignees, and their Locations in PatentsView. https://s3.amazonaws.com/data.patentsview.org/documents/PatentsView_Disambiguation_Methods_Documentation.pdf
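All load_* functions in this module take no arguments and return a membership vector, so usage is uniform; for example:

```python
from pv_evaluation.benchmark import load_air_umass_assignees_benchmark

benchmark = load_air_umass_assignees_benchmark()
print(benchmark.head())     # mention_id -> assignee cluster assignment
print(benchmark.nunique())  # number of distinct assignee clusters
```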
- pv_evaluation.benchmark.load_als_inventors_benchmark()[source]#
Loads the Academic Life Sciences inventors benchmark dataset.
This dataset contains a set of disambiguated inventor mentions derived from Pierre Azoulay’s Academic Life Sciences dataset, which covers US patents granted between 1970 and 2005.
Note that inventor sequence numbers were obtained using a computer matching procedure which may have introduced errors. Rows with unresolved sequence numbers were removed.
The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.
- See:
Azoulay, P., J. S. Graff Zivin, and G. Manso (2011). Incentives and creativity: evidence from the academic life sciences. The RAND Journal of Economics 42(3), 527-554.
- Returns:
pandas Series with the benchmark data as a membership vector.
- Return type:
Series
- pv_evaluation.benchmark.load_binette_2022_inventors_benchmark()[source]#
Loads Binette’s 2022 inventors benchmark dataset.
The 2022 Binette inventors benchmark is a hand-disambiguated dataset of inventor mentions on granted patents for a sample of inventors from PatentsView.org. The inventors were selected indirectly by sampling inventor mentions uniformly at random, resulting in inventors sampled with probability proportional to their number of granted patents.
The time period considered is from 1976 to December 31, 2021. This corresponds to the disambiguation labeled “disamb_inventor_id_20211230” in the [“g_persistent_inventor.tsv” file](https://patentsview.org/download/data-download-tables) of PatentsView’s bulk data downloads.
The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.
- Returns:
pandas Series with the benchmark data as a membership vector.
- Return type:
Series
References
[Binette, Olivier, Sokhna A York, Emma Hickerson, Youngsoo Baek, Sarvo Madhavan, Christina Jones. (2022). Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org. arXiv e-prints: arxiv:2210.01230](https://arxiv.org/abs/2210.01230)
Notes
The methodology used for the hand-disambiguation is described in the reference.
The hand-disambiguation process was done by experts, but it should be expected to contain errors due to the ambiguous nature of inventor disambiguation.
The benchmark contains a few extraneous mentions of patents granted outside the considered time period; these should be ignored for evaluation purposes (see the alignment sketch below).
Given the use of the December 30, 2021, disambiguation from PatentsView as a starting point of the hand-labeling, a bias towards this disambiguation should be expected.
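A hypothetical alignment step before evaluation, assuming a disambiguation Series `prediction`:

```python
from pv_evaluation.benchmark import load_binette_2022_inventors_benchmark

benchmark = load_binette_2022_inventors_benchmark()

# Keep only mention IDs present in both Series; this also drops the
# extraneous out-of-period mentions noted above.
common = benchmark.index.intersection(prediction.index)
benchmark, prediction = benchmark.loc[common], prediction.loc[common]
```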
- pv_evaluation.benchmark.load_ens_inventors_benchmark()[source]#
Engineering and Sciences inventors benchmark.
This is a set of disambiguated inventor mentions derived from Png’s LinkedIn-Patent Inventors Dataset for the 2015 PatentsView Disambiguation Workshop.
The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.
- See:
Ge, Chunmian, Ke-wei Huang, and Ivan P.L. Png, “Engineer/Scientist Careers: Patents, Online Profiles, and Misclassification”, Strategic Management Journal, Vol 37 No 1, January 2016, 232-253.
- Returns:
pandas Series with the benchmark data as a membership vector.
- Return type:
Series
- pv_evaluation.benchmark.load_israeli_inventors_benchmark()[source]#
Loads the Israeli inventors benchmark dataset.
This benchmark dataset is adapted from Trajtenberg and Shiff (2008), which covers U.S. patents granted to Israeli inventors between 1963 and 1999.
The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.
- See:
Trajtenberg, M., & Shiff, G. (2008). Identification and mobility of Israeli patenting inventors. Pinhas Sapir Center for Development.
- Returns:
pandas Series with the benchmark data as a membership vector.
- Return type:
Series
- pv_evaluation.benchmark.load_lai_2011_inventors_benchmark()[source]#
Loads Lai’s 2011 Inventors Benchmark dataset.
This benchmark dataset is adapted from the dataset reported in Li et al. (2014), which was used to evaluate the disambiguation of the U.S. Patent Inventor Database (1975-2010).
The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.
- See:
Li, G. C., Lai, R., D’Amour, A., Doolin, D. M., Sun, Y., Torvik, V. I., … & Fleming, L. (2014). Disambiguation and co-authorship networks of the US patent inventor database (1975-2010). Research Policy, 43(6), 941-955.
- Returns:
pandas Series with the benchmark data as a membership vector.
- Return type:
Series
Notes
A number of patent IDs which could not be found were removed from Lai’s original dataset.
Inventor sequence numbers were assigned through automatic matching and manual review, which may have introduced some errors.
- pv_evaluation.benchmark.load_nber_subset_assignees_benchmark()[source]#
NBER subset assignees benchmark.
The dataset is described as follows in the paper referenced below:
‘The National Bureau of Economic Research provides disambiguated assignee data. These data are created semiautomatically with manual correction and labeling of assignee coreference decisions produced by string similarity. We grouped the assignee mentions by four-letter prefixes and focused on five prefix groups {Moto, Amer, Gene, Solu, Airc} that were both common and ambiguous.’
The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.
- Returns:
pandas Series with the benchmark data as a membership vector.
- Return type:
Series
References
Monath, N., Jones, C., & Madhavan, S. Disambiguating Patent Inventors, Assignees, and their Locations in PatentsView. https://s3.amazonaws.com/data.patentsview.org/documents/PatentsView_Disambiguation_Methods_Documentation.pdf
- pv_evaluation.benchmark.load_patentsview_inventors_benchmark()[source]#
Loads the PatentsView hand-disambiguated inventors benchmark dataset.
This dataset contains the hand-disambiguation of a set of particularly ambiguous inventor names. The disambiguation process was done manually by experts, to be used as a benchmark for evaluating disambiguation algorithms.
The dataset is provided in the form of a pandas Series, where the index represents the mention ID and the value represents the cluster assignment.
- See:
Monath, N., Jones, C., & Madhavan, S. Disambiguating Patent Inventors, Assignees, and their Locations in PatentsView. https://s3.amazonaws.com/data.patentsview.org/documents/PatentsView_Disambiguation_Methods_Documentation.pdf
- Returns:
pandas Series with the benchmark data as a membership vector.
- Return type:
Series
- pv_evaluation.benchmark.plot_cluster_sizes(disambiguations)[source]#
Plot cluster sizes for a set of disambiguations.
- Parameters:
disambiguations (Dict) – Dictionary of membership vectors representing given disambiguations.
- Returns:
Plotly figure.
- pv_evaluation.benchmark.plot_entropy_curves(disambiguations)[source]#
Plot entropy curves for a set of disambiguations.
- Parameters:
disambiguations (Dict) – Dictionary of membership vectors representing given disambiguations.
- Returns:
Plotly figure.
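Since plot_entropy_curves and plot_cluster_sizes take the same dictionary of membership vectors, one sketch covers both; here two benchmark datasets stand in for competing disambiguations:

```python
from pv_evaluation.benchmark import (
    load_binette_2022_inventors_benchmark,
    load_israeli_inventors_benchmark,
    plot_cluster_sizes,
    plot_entropy_curves,
)

disambiguations = {
    "binette-2022": load_binette_2022_inventors_benchmark(),
    "israeli": load_israeli_inventors_benchmark(),
}
plot_entropy_curves(disambiguations).show()
plot_cluster_sizes(disambiguations).show()
```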
- pv_evaluation.benchmark.plot_homonimy_rates(disambiguations, names)[source]#
Plot homonymy rates for a set of given disambiguations.
- Parameters:
disambiguations (Dict) – Dictionary of membership vectors representing given disambiguations.
names (Series) – Pandas Series indexed by mention IDs and with values corresponding to inventor names.
- Returns:
Plotly figure.
- pv_evaluation.benchmark.plot_name_variation_rates(disambiguations, names)[source]#
Plot name variation rates for a set of given disambiguations.
- Parameters:
disambiguations (Dict) – Dictionary of membership vectors representing given disambiguations.
names (Series) – Pandas Series indexed by mention IDs and with values corresponding to inventor names.
- Returns:
Plotly figure.
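plot_homonimy_rates and plot_name_variation_rates share an interface; a sketch with toy mentions and names (all hypothetical):

```python
import pandas as pd

from pv_evaluation.benchmark import plot_homonimy_rates, plot_name_variation_rates

index = pd.Index(["US1-0", "US1-1", "US2-0"], name="mention_id")
disambiguations = {
    "baseline": pd.Series(["c1", "c1", "c2"], index=index),
    "alternative": pd.Series(["c1", "c2", "c2"], index=index),
}
names = pd.Series(["Jane Doe", "J. Doe", "Jane Doe"], index=index)

plot_homonimy_rates(disambiguations, names).show()
plot_name_variation_rates(disambiguations, names).show()
```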
- pv_evaluation.benchmark.style_cluster_inspection(table, by='prediction')[source]#
Style table to highlight groups with alternating colors.
- Parameters:
table (DataFrame) – DataFrame to style.
by (str, optional) – column to color by. Defaults to “prediction”.
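A sketch chaining this with inspect_clusters_to_split on toy data; the styled table renders in a Jupyter notebook:

```python
import pandas as pd

from pv_evaluation.benchmark import inspect_clusters_to_split, style_cluster_inspection

index = pd.Index(["US1-0", "US1-1", "US2-0"], name="mention_id")
prediction = pd.Series(["c1", "c1", "c1"], index=index)
benchmark = pd.Series(["i1", "i1", "i2"], index=index)

errors = inspect_clusters_to_split(prediction, benchmark)
style_cluster_inspection(errors)  # alternating colors per predicted cluster
```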
- pv_evaluation.benchmark.top_inventors(disambiguation, names, n=10)[source]#
Table of most prolific inventors.
- Parameters:
disambiguation (Series) – Membership vector, indexed by mention IDs, representing a given disambiguation.
names (Series) – Pandas Series indexed by mention IDs and with values corresponding to inventor name.
n (int, optional) – Number of rows to display. Defaults to 10.
- Returns:
Table with top n most prolific inventors.
- Return type:
DataFrame
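A toy sketch (data hypothetical); the cluster with the most mentions should appear first:

```python
import pandas as pd

from pv_evaluation.benchmark import top_inventors

index = pd.Index(["US1-0", "US1-1", "US2-0"], name="mention_id")
disambiguation = pd.Series(["c1", "c1", "c2"], index=index)
names = pd.Series(["Jane Doe", "J. Doe", "John Smith"], index=index)

top_inventors(disambiguation, names, n=2)
```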