Bell Patent Dataset & Parsing Code

In 2013, I wrote Python code for my senior thesis to download and clean information on all US patents issued since 1980. That information had been made publicly accessible a few years earlier through a bulk data-sharing agreement between the USPTO and Google.

On this page, you can find the resulting "flat" files. The main file is basic_bib.dta. It is the only file that contains exactly one line per patent grant, and it holds basic bibliographic information about the patent, such as filing and grant dates. The other files contain information derived from the patent grant for which the number of rows varies by patent. For instance, the inventor.dta table contains one line per patent per inventor, and each line describes the inventor (such as name and address).
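To illustrate the one-to-many layout described above, here is a minimal sketch in plain Python. The rows, the patent identifiers, and the column names (`patent`, `name`, `city`, and so on) are made up for illustration and are not the actual variable names in the files; the `attach_rows` helper is likewise hypothetical.

```python
from collections import defaultdict

# Made-up rows standing in for basic_bib.dta (one row per patent).
# Column names are assumptions, not the real variable names.
basic_bib = [
    {"patent": "P0001", "file_date": "1981-03-02", "grant_date": "1983-06-14"},
    {"patent": "P0002", "file_date": "1982-01-15", "grant_date": "1984-02-07"},
]

# Made-up rows standing in for inventor.dta (one row per patent per inventor).
inventor = [
    {"patent": "P0001", "name": "A. Example", "city": "Boston"},
    {"patent": "P0001", "name": "B. Example", "city": "Chicago"},
    {"patent": "P0002", "name": "C. Example", "city": "Denver"},
]


def attach_rows(main_rows, child_rows, key="patent", field="inventors"):
    """Group child rows by patent and attach them to the matching main row."""
    grouped = defaultdict(list)
    for row in child_rows:
        grouped[row[key]].append(row)
    # Copy each main row so the inputs are left unchanged; a patent with
    # no child rows gets an empty list.
    return [{**row, field: grouped.get(row[key], [])} for row in main_rows]


patents = attach_rows(basic_bib, inventor)
inventor_counts = {p["patent"]: len(p["inventors"]) for p in patents}
print(inventor_counts)
```

With the real files, the same join can be done after loading each .dta file with a reader such as `pandas.read_stata`, using whatever patent identifier column the files actually contain.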

In these datasets, I have made no attempt to link individuals or assignees across patent grants. For work on that topic, see the Fung Institute Inventor Database, with which I am not affiliated. An earlier source of patent data with some firm name disambiguation was the NBER Patent Data Project. Their working paper nicely surveys several methodological issues that arise when working with patent data.

You can also find the data and code that I used, which should help you extend your analysis to the present day.

Sample Data Files (small -- 1000 records each)

Main bibliographic file (1 line per patent) dta

Inventors dta

Assignees dta

Citations to other patents dta

Citations to non-patent literature dta

Field of search dta

Parent-child linkages (continuations, etc.) dta

Non-primary classes dta

Full Data Files

Main bibliographic file (1 line per patent) dta

Inventors dta

Assignees dta

Citations to other patents dta

Citations to non-patent literature dta

Field of search dta

Parent-child linkages (continuations, etc.) dta

Non-primary classes dta

Code to Replicate & Extend Datasets

You can download the Python and Stata scripts that I used here. I have also written a README. In its second section, I make a few remarks on extending the code to future years' data, as the data format provided by the USPTO sometimes changes (for instance, between 1980 and 2013 they used three distinct structures). There is further information about the structure of my data in Section 3 of my undergraduate thesis.
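Because the bulk format changes over time, code that extends the data needs some way to choose a parser for each file. The following is a minimal sketch of one way to do that, not a description of how the scripts above work. The marker strings and the three placeholder parser functions are assumptions for illustration; check them against the README and the actual bulk files before relying on them.

```python
def parse_fixed_field_text(sample):
    """Placeholder for a parser of an older fixed-field text structure."""
    return "fixed-field text"


def parse_older_markup(sample):
    """Placeholder for a parser of an intermediate markup structure."""
    return "older markup"


def parse_newer_markup(sample):
    """Placeholder for a parser of a more recent markup structure."""
    return "newer markup"


# (marker, parser) pairs, checked in order. The marker strings are
# assumptions about what distinguishes each structure, not verified facts.
FORMATS = [
    ("<us-patent-grant", parse_newer_markup),
    ("<PATDOC", parse_older_markup),
    ("PATN", parse_fixed_field_text),
]


def pick_parser(sample):
    """Return the first parser whose marker appears in the text sample."""
    for marker, parser in FORMATS:
        if marker in sample:
            return parser
    # Fail loudly so that a new, unknown structure is noticed rather than
    # silently parsed with the wrong code.
    raise ValueError("unrecognized bulk file structure")


parser = pick_parser('<us-patent-grant lang="EN">')
print(parser.__name__)
```

Detecting the structure from the first few kilobytes of each file, rather than from the year in the file name, means that a future format change shows up as an explicit error instead of as quietly malformed rows.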

If you extend these datasets to the present day, please let me know so that I can include more up-to-date information for other researchers to use.

