{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 📑 HTML Report Template" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This document shows the use of **pv_evaluation** to automatically report on a disambiguation's performance using the `pv_evaluation.templates.render_inventor_disambiguation_report()` function.\n", "\n", "This function requires:\n", "- A list of disambiguations saved to file (tables with a \"mention_id\" column and a second column representing cluster ID assignment).\n", "- A \"inventor_not_disambiguated\" file with the columns \"patent_id\", \"inventor_sequence\", \"raw_inventor_name_first\", and \"raw_inventor_name_last\". For granted patents, this should be the \"g_inventor_not_disambiguated.tsv\" file from PatentsView's bulk data downloads.\n", "\n", "Below, we download \"g_inventor_not_disambiguated.tsv\" and prepare a set of disambiguations to evaluate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation\n", "\n", "Downloading \"g_inventor_not_disambiguated.tsv\" and the file containing persistent inventor disambiguations:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import wget\n", "import zipfile\n", "import os\n", "\n", "if not os.path.isfile(\"g_inventor_not_disambiguated.tsv\"):\n", " wget.download(\"https://s3.amazonaws.com/data.patentsview.org/download/g_inventor_not_disambiguated.tsv.zip\")\n", " with zipfile.ZipFile(\"g_inventor_not_disambiguated.tsv.zip\", 'r') as zip_ref:\n", " zip_ref.extractall(\".\")\n", " os.remove(\"g_inventor_not_disambiguated.tsv.zip\")\n", "\n", "if not os.path.isfile(\"g_persistent_inventor.tsv\"):\n", " wget.download(\"https://s3.amazonaws.com/data.patentsview.org/download/g_persistent_inventor.tsv.zip\")\n", " with zipfile.ZipFile(\"g_persistent_inventor.tsv.zip\", 'r') as zip_ref:\n", " zip_ref.extractall(\".\")\n", " os.remove(\"g_persistent_inventor.tsv.zip\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Preparing a set of distinct disambiguations saved to file:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "if not os.path.isfile(\"disambiguation_20211230.tsv\") or not os.path.isfile(\"disambiguation_20220630.tsv\"):\n", " g_persistent_inventor = pd.read_csv(\"g_persistent_inventor.tsv\", sep=\"\\t\", dtype=str)\n", " g_persistent_inventor[\"mention_id\"] = \"US\" + g_persistent_inventor.patent_id + \"-\" + g_persistent_inventor.sequence\n", "\n", " g_persistent_inventor.set_index(\"mention_id\").disamb_inventor_id_20211230.to_csv(\"disambiguation_20211230.tsv\", sep=\"\\t\")\n", " g_persistent_inventor.set_index(\"mention_id\").disamb_inventor_id_20220630.to_csv(\"disambiguation_20220630.tsv\", sep=\"\\t\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rendering Report\n", "\n", "We can now generate the report using the `render_inventor_disambiguation_report()` function. The results are saved to the current folder \".\".\n", "\n", "Note that, if we wish to compare more disambiguations, then we can add more files to the list `disambiguation_files`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "Starting python3 kernel...Done\n", "\n", "Executing 'index.ipynb'\n", " Cell 1/30...Done\n", " Cell 2/30...Done\n", " Cell 3/30...Done\n", " Cell 4/30...Done\n", " Cell 5/30...Done\n", " Cell 6/30...Done\n", " Cell 7/30...Done\n", " Cell 8/30...Done\n", " Cell 9/30...Done\n", " Cell 10/30...Done\n", " Cell 11/30...Done\n", " Cell 12/30...Done\n", " Cell 13/30...Done\n", " Cell 14/30...Done\n", " Cell 15/30...Done\n", " Cell 16/30...Done\n", " Cell 17/30...Done\n", " Cell 18/30...Done\n", " Cell 19/30...Done\n", " Cell 20/30...Done\n", " Cell 21/30...Done\n", " Cell 22/30...Done\n", " Cell 23/30...Done\n", " Cell 24/30...Done\n", " Cell 25/30...Done\n", " Cell 26/30...Done\n", " Cell 27/30...Done\n", " Cell 28/30...Done\n", " Cell 29/30...Done\n", " Cell 30/30...Done\n", "\n", "\u001b[33mWARNING: Warning: diff of engine output timed out. No source lines will be available.\u001b[39m\n", "\u001b[1mpandoc \u001b[22m\n", " to: html\n", " output-file: index.html\n", " standalone: true\n", " self-contained: true\n", " section-divs: true\n", " html-math-method: mathjax\n", " wrap: none\n", " default-image-extension: png\n", " toc: true\n", " toc-depth: 3\n", " \n", "\u001b[1mmetadata\u001b[22m\n", " document-css: false\n", " link-citations: true\n", " date-format: long\n", " lang: en\n", " title: Inventor Disambiguation Report\n", " date: today\n", " author: PatentsView-Evaluation\n", " toc-location: left\n", " jupyter: python3\n", " theme: cosmo\n", " fig-cap-location: margin\n", " code-copy: true\n", " code-block-border-left: '#31BAE9'\n", " \n", "Output created: index.html\n", "\n" ] } ], "source": [ "from pv_evaluation.templates import render_inventor_disambiguation_report\n", "\n", "render_inventor_disambiguation_report(\".\", disambiguation_files=[\"disambiguation_20211230.tsv\", \"disambiguation_20220630.tsv\"],\n", "inventor_not_disambiguated_file=\"g_inventor_not_disambiguated.tsv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Output\n", "\n", "The result can be seen at [https://patentsview.github.io/PatentsView-Evaluation/source/examples/templates/index.html](https://patentsview.github.io/PatentsView-Evaluation/source/examples/templates/index.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.7.15 ('pv-evaluation': conda)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.15" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "135eb778a123b23717215bebe642ebc480e0ab0e1bc583cf4971f84281f0b229" } } }, "nbformat": 4, "nbformat_minor": 2 }