Texthero is a Python package for analyzing text-based datasets. It is fully integrated with pandas, easy to learn, and follows the conventions of the NumPy/SciPy API. It is free, open source, and well documented. Texthero was developed by Jonathan Besomi, a member of the TIS lab.

Texthero includes tools for:

    • Preprocessing text data
    • NLP for keyphrase and keyword extraction
    • NLP for named entity recognition
    • Text representations: TF, TF-IDF, and neural embeddings
    • Vector space analysis and topic modeling
    • Clustering (K-means, Mean-shift, DBSCAN, Hierarchical)
    • Text and vector space visualizations
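
To give a sense of what a TF-IDF representation contains, here is a minimal standard-library sketch. It is not Texthero's implementation: the function names `preprocess` and `tfidf` are hypothetical, and the weighting is the plain textbook form tf × log(N / df), whereas a library may apply different smoothing or normalization.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text and keep only alphabetic tokens (illustrative cleaning step)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append(
            {t: count * math.log(n_docs / doc_freq[t]) for t, count in term_freq.items()}
        )
    return vectors


docs = ["The cat sat.", "The dog sat.", "The bird flew."]
vectors = tfidf(docs)
print(vectors[0])
```

A term that appears in every document (here "the") receives a weight of zero, while a term unique to one document receives the largest weight.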

Installation:  pip install texthero

Open Source Project

PyPI Distribution




Passcode provides a simple way to encrypt Python modules on a development machine, pass protected code through a public repository, pull protected code back into production, automatically decrypt the code at runtime on a production machine, and reference/call the protected code from other components at runtime. You can do all of that with just two lines of code!

Installation:  pip install passcode

PyPI Distribution




Researchers often have to match business names between a focal list of company names that they care about and an “alter list” of potential matches (perhaps one-to-one, or perhaps one-to-many). Sometimes that match is straightforward (e.g., the match is perfect), but often the match requires a judgment call (i.e., the match is not perfect, the match depends on the year in consideration, or other factors), and managing that process across lists of tens of thousands of companies is inefficient and prone to mistakes or researcher bias.

bizmatch provides a rigorous and consistent approach to the matching problem. Users can run bizmatch iteratively to find direct matches, and they can also run it to find candidate matches to examine in greater detail. After inspecting the candidate matches, users can move the matches they find acceptable from the candidate list to the matched list, and/or tweak the configuration files (which define synonyms, abbreviations, brand names, special cases, etc.) to refine the matching process for their context. Detailed instructions are provided in the GitHub repository.
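
The direct-match and candidate-match workflow described above can be illustrated with a short standard-library sketch. This is not bizmatch's code or API: the names `normalize` and `match`, the `ABBREVIATIONS` and `LEGAL_SUFFIXES` tables (standing in for configuration files), the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stand-ins for configuration files (abbreviations, suffixes to ignore).
ABBREVIATIONS = {"intl": "international", "mfg": "manufacturing", "&": "and"}
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "company", "ltd", "llc"}


def normalize(name):
    """Lowercase, strip punctuation, expand abbreviations, and drop legal suffixes."""
    tokens = re.sub(r"[^\w&\s]", " ", name.lower()).split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def match(focal, alter, threshold=0.8):
    """Return (direct, candidates).

    direct: focal names whose normalized form equals that of an alter name.
    candidates: remaining focal names with alter names scoring >= threshold,
    listed best first, for manual inspection.
    """
    alter_by_norm = {}
    for a in alter:
        alter_by_norm.setdefault(normalize(a), []).append(a)

    direct, candidates = {}, {}
    for f in focal:
        key = normalize(f)
        if key in alter_by_norm:
            direct[f] = alter_by_norm[key]
            continue
        scored = []
        for norm, originals in alter_by_norm.items():
            score = SequenceMatcher(None, key, norm).ratio()
            if score >= threshold:
                scored.extend((round(score, 2), o) for o in originals)
        if scored:
            candidates[f] = sorted(scored, reverse=True)
    return direct, candidates


focal = ["Acme Intl. Mfg., Inc.", "Globex Corp", "Initech LLC"]
alter = ["ACME International Manufacturing Corporation", "Globax Corporation", "Umbrella Co."]
direct, candidates = match(focal, alter)
print(direct)
print(candidates)
```

In this sketch, a reviewer who accepts a candidate would move it into `direct` by hand, or add an entry to the abbreviation table and rerun, which mirrors the iterative refinement described above.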

Installation:  git clone




A collection of Python tools and utilities used in our research.

Installation:  pip install qKit

PyPI Distribution



Copyright 2017-2020. All software is provided for non-commercial use, subject to a Creative Commons Attribution-NonCommercial-NoDerivatives license. No co‐authorship is required to use the software in academic research – please just cite author and source.