Patent Citation Similarity

From: Patent Citations Reexamined: New Data and Methods
Joint work with Jeffrey Kuhn and Alan Marco

Many studies of innovation rely on patent citations to measure intellectual lineage and impact. To create this dataset, we use a vector space model (VSM) of patent similarity to compute the technological similarity between each pair of citing and cited patents. The VSM analyzes the full text of each document to position it as a vector in a space of more than 700,000 dimensions and then calculates the angular distance between the two vectors. The dataset includes similarity values for all citations made by patents issued between 1976 and 2017 to issued patents or published patent applications.
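The similarity calculation described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw term counts as vector weights; the term weighting used to build the actual dataset is not described here, and the function names are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Map a document to a sparse term-count vector.

    Assumption: lowercase whitespace tokens with raw counts. The real
    model may weight terms differently.
    """
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def angular_distance(u, v):
    """Angle in radians between two vectors, clamped against rounding error."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(cos)


citing = to_vector("a rotor blade assembly for a wind turbine")
cited = to_vector("a wind turbine with a pitch controlled rotor blade")
print(cosine_similarity(citing, cited), angular_distance(citing, cited))
```

Documents that share no terms get a cosine similarity of 0 and an angular distance of pi/2, while documents with identical term counts get a similarity of 1 and a distance of 0.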

Download (819 MB)


Patent Citation Timing and Source

From: Patent Citations Reexamined: New Data and Methods
Joint work with Jeffrey Kuhn and Alan Marco

Innovation studies frequently distinguish between patent citations submitted by the patent examiner and those submitted by the patent applicant. However, publicly available citation data is often misleading, for instance by attributing a patent citation to the patent examiner when it was in fact first submitted by the patent applicant. This dataset uses internal USPTO data to identify the date on which each citation was first submitted as well as the party (examiner or applicant) who first submitted it. The dataset includes observations for citations made by patents issued between 2001 and 2014, although some left truncation is evident due to limitations in internal data availability at the USPTO.
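The idea of attributing each citation to whoever submitted it first can be sketched as follows. The record layout (citing patent, cited patent, submission date, party) and the sample values are hypothetical; the structure of the internal USPTO data is not described here.

```python
from datetime import date


def first_submissions(events):
    """Return the earliest (date, party) for each citing-cited pair.

    Assumption: each event is a tuple
    (citing_id, cited_id, submitted_on, party). If two events share the
    same date, the one seen first is kept.
    """
    first = {}
    for citing, cited, submitted_on, party in events:
        key = (citing, cited)
        if key not in first or submitted_on < first[key][0]:
            first[key] = (submitted_on, party)
    return first


# Illustrative records only, not real data.
events = [
    ("P1", "P9", date(2010, 5, 1), "examiner"),
    ("P1", "P9", date(2009, 11, 3), "applicant"),
    ("P2", "P9", date(2011, 2, 14), "examiner"),
]
for pair, (when, who) in first_submissions(events).items():
    print(pair, when.isoformat(), who)
```

In this example the citation from P1 to P9 is attributed to the applicant, because the applicant's submission predates the examiner's.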

Download (292 MB)


Patent Families

From: Patent-to-Patent Similarity: A Vector Space Model
Joint work with Jeffrey Kuhn

Patent applicants frequently file groups of patent applications linked together by priority claims. These priority claims create families of patent applications that share features such as inventors, priority dates, and technical descriptions. By analyzing these linkages, each patent can be assigned a family identifier that it shares with other patents in the same family. This dataset includes two levels of family identifiers (clone for near copies and extended for more attenuated linkages) for each patent issued between 2005 and 2014.
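One way to picture family assignment is as finding connected components in a graph whose edges are priority claims. The sketch below uses a union-find structure for that purpose. It is an illustration under assumed inputs, not the procedure used to build the dataset: the identifiers are hypothetical, and it does not model the distinction between clone and extended families.

```python
def assign_families(patents, priority_claims):
    """Give every patent the identifier of its connected component.

    Assumption: priority_claims is a list of (later_filing, earlier_filing)
    pairs. The family identifier is the smallest member id in the family.
    """
    parent = {p: p for p in patents}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for later, earlier in priority_claims:
        parent.setdefault(later, later)
        parent.setdefault(earlier, earlier)
        root_a, root_b = find(later), find(earlier)
        if root_a != root_b:
            # Keep the smaller id as the root so labels are deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {p: find(p) for p in parent}


families = assign_families(
    ["A", "B", "C", "D"],
    [("B", "A"), ("C", "B")],
)
print(families)
```

Here B claims priority to A and C claims priority to B, so A, B, and C share one family identifier, while D has no linkages and forms its own family.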

Download (18 MB)


Copyright 2017-2019. All datasets are copyrighted and provided for non-commercial use, subject to the Creative Commons Attribution-NonCommercial-NoDerivatives license. No co-authorship is required to use the datasets in academic research; please just cite the author and source.