Text Analysis
Example Projects
Methods
- Topic Modeling: an unsupervised machine learning technique that
surfaces the abstract “topics” in a corpus by grouping documents that
share similar patterns of word use. LDA (Latent Dirichlet Allocation) is a popular method for fitting
topic models.
- Information Retrieval: full text search to return documents with high similarity
scores based on user input, often using a search engine such as Lucene, Solr, or Elasticsearch
- Text Classification: a class of supervised machine learning methods which
assign predefined classes or categories to text – either binary or multiclass classification.
Examples of classifiers and classification algorithms include logistic regression, Naive
Bayes, k-Nearest Neighbors (kNN), decision trees, and Support Vector Machines (SVM). Deep
Learning and Convolutional Neural Networks are also proving popular and powerful.
- Sentiment Analysis: classification based on opinion polarity (usually a
positive / negative spectrum, but sometimes more nuanced)
- Word Frequency Analysis: measures the most frequently occurring words using
TF-IDF (term frequency-inverse document frequency) or other methods
- Concordancing: also known as KWIC (KeyWord In Context) – each occurrence of a word in a
text is presented in its immediate context, e.g. with n words of
context on either side
- Named Entity Recognition: identifying named things such as people, places,
organizations, time periods, and more
- Collocation: determining which words commonly appear near each other
- Word Embeddings: word meanings are embedded within high-dimensional numerical
vectors, where semantically similar words have similar vectors. Compared to one-hot encodings
or bag-of-words models, word embeddings better capture word context within documents. Vector
math allows comparisons and operations between word vectors, e.g. `king – man + woman
≈ queen`. Dimensionality reduction techniques such as principal component analysis (PCA) can
reduce the vectors from several hundred dimensions to two or three for visual representation in 2D
or 3D space. Popular word vector algorithms include Word2Vec, Doc2Vec, and GloVe.
- Transformer Models: transformers are an increasingly popular choice for
modern NLP and computer vision tasks. While word embeddings embed information about word
context into a single vector per word, transformer models do a better job at disambiguating
word sense because their embeddings are contextual: each word is represented using the entire
sequence of words around it (a sentence or paragraph). Transformer models such as GPT, GPT-2,
GPT-3, and BERT (Bidirectional Encoder Representations from Transformers) use the transfer
learning approach: they are pretrained on massive corpora, and then fine-tuned for specific
tasks. Transformer models are useful for classification and text generation.
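Two of the methods above, concordancing and TF-IDF word weighting, are simple enough to sketch in plain Python. The example below is a minimal illustration rather than a reference implementation: the function names (`tokenize`, `kwic`, `tf_idf`), the three-sentence toy corpus, and the unsmoothed `tf * log(N / df)` weighting are all assumptions made for this sketch. Libraries such as NLTK or Scikit-Learn use more careful tokenization and different TF-IDF variants.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def kwic(tokens, keyword, n=2):
    """Return (left context, keyword, right context) for each occurrence,
    with up to n tokens of context on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - n):i])
            right = " ".join(tokens[i + 1:i + 1 + n])
            lines.append((left, token, right))
    return lines


def tf_idf(documents):
    """Score every term in every document as tf * idf, where
    tf  = occurrences of the term / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(len(documents) / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

# Concordance lines for "the" in the first document.
for left, word, right in kwic(tokenize(corpus[0]), "the", n=2):
    print(f"{left:>12} [{word}] {right}")

# Highest-scoring term in each document.
for doc_scores in tf_idf(corpus):
    top = max(doc_scores, key=doc_scores.get)
    print(top, round(doc_scores[top], 3))
```

Because "the" occurs in every document of the toy corpus, its inverse document frequency is zero, so it drops out of the ranking even though it is the most frequent word. That is the main reason TF-IDF is often preferred over raw counts for word frequency analysis.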
Popular Tools
- Voyant Tools: web-based
reading and analysis environment for digital texts
- Mallet: a
Java-based package for statistical natural language processing, document classification,
clustering, topic modeling, information extraction, and other machine learning
applications to text
- WordSeer 4:
a text analysis environment that combines visualization, information retrieval,
sensemaking and natural language processing to make the contents of text navigable,
accessible, and useful
- AntConc: a freeware
corpus analysis toolkit for concordancing and text analysis
Popular Programming Languages and Packages
- Python
- NLTK (Natural Language Toolkit): a leading platform for building Python programs to work with human language
data. It provides easy-to-use interfaces to over 50 corpora and lexical
resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reasoning
- spaCy:
industrial-strength NLP written in Cython for speed; supports multiple languages,
pretrained transformer models (BERT, etc.), custom models from PyTorch and TensorFlow
(common ML libraries), POS and NER tagging, text classification, and more
- Gensim: free open-source Python library for representing documents as semantic
vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible.
- R
- Quanteda: a
package for the Quantitative Analysis of Textual Data, designed for R users needing
to apply natural language processing to texts, from documents to final analysis
- Tidytext: using tidy data principles can make many text mining tasks easier, more
effective, and consistent with tools already in wide use. Tidytext provides
functions and supporting data sets to allow conversion of text to and from tidy
formats, and to switch seamlessly between tidy tools and existing text mining
packages
- spaCyR: an R wrapper for the Python spaCy library which can be integrated with Quanteda
or Tidytext
Resources
Visual Presentation and Analysis
Example Projects
Popular Tools, Platforms, and Standards
- IIIF, the International Image Interoperability Framework, provides a
series of API specifications which various image servers and viewers implement. IIIF
servers provide easily manipulated tiled images, and IIIF viewers present JSON manifests
which include one or more canvases and may also include annotations.
- IIIF Awesome Resources
- Harvard IIIF Website
- Image servers
- Image viewers
- Mirador 3
- Universal Viewer
- Annotation servers
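As a rough illustration of how the IIIF APIs fit together, the sketch below builds an Image API request URL following the `{identifier}/{region}/{size}/{rotation}/{quality}.{format}` pattern and lists the canvas labels in a manifest. It is a hedged example only: the helper names, the `example.org` server, the identifier, and the tiny manifest are hypothetical, and the manifest is assumed to follow the Presentation API 3 layout (canvases under `items`). Older version 2 manifests are structured differently.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an Image API request URL of the form
    {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    The identifier is percent-encoded so it is safe inside a URL path.
    """
    encoded = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded}/{region}/{size}/{rotation}/{quality}.{fmt}"


def canvas_labels(manifest, lang="en"):
    """List the labels of the canvases in a Presentation API 3 style manifest."""
    labels = []
    for item in manifest.get("items", []):
        if item.get("type") == "Canvas":
            labels.extend(item.get("label", {}).get(lang, []))
    return labels


# Hypothetical manifest, trimmed to the fields used here.
manifest = {
    "type": "Manifest",
    "label": {"en": ["Example book"]},
    "items": [
        {"type": "Canvas", "label": {"en": ["Front cover"]}},
        {"type": "Canvas", "label": {"en": ["Page 1"]}},
    ],
}

# Request the whole image, scaled to 200 pixels wide.
print(iiif_image_url("https://example.org/iiif", "page 1", size="200,"))
print(canvas_labels(manifest))
```

Changing only the `region` or `size` segment of the URL is what lets viewers request tiles and thumbnails from the same source image.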
Spatial Analysis and Web Mapping
Example Projects
Popular Tools and Platforms
- Esri ArcGIS: powerful desktop GIS software suite for mapping, geoprocessing,
cartography, and spatial analysis
- QGIS: FOSS alternative to ArcGIS
- Palladio: a Stanford Humanities + Design Lab online tool for visualizing complex
historical data
- CARTO (previously CartoDB): paid Software as a Service platform for spatial analysis and
GIS
- Esri ArcGIS Online: cloud-based GIS platform for creating and sharing interactive maps
and analyzing spatial data
- Neatline: a suite of
add-on tools for Omeka designed to help tell stories with maps, images, and timelines
Popular Software Packages
- Leaflet: open-source JavaScript library for interactive web maps
- D3.js: open-source JavaScript library for general data visualization, including web
mapping
- Google Maps API: one of the most popular general purpose mapping libraries
- OpenLayers: JavaScript library for displaying maps and analyzing geographical data
- Mapbox GL: SDK for web maps powered by Mapbox, which allows users to design and publish
beautiful maps
Network Analysis
Example Projects
Popular Tools
- Gephi: a FOSS
interactive visualization and exploration platform for all kinds of networks and complex
systems, dynamic and hierarchical graphs
- Palladio: a Stanford Humanities + Design Lab online tool for visualizing complex historical
data
- Cytoscape:
originally designed for biological research, Cytoscape is now a general, open source
software platform for complex network analysis and visualization
- NodeXL: a network analysis and visualization plugin
for Excel
Popular Software Packages
- NetworkX (Python): a package
for the creation, manipulation, and study of the structure, dynamics, and functions of complex
networks
- igraph (R and Python): a
collection of network analysis tools with an emphasis on efficiency, portability, and ease
of use
- visNetwork (R): an R package for network visualization using vis.js
- D3.js (JavaScript): more for network visualization and presentation than network
analysis
Timelines and Temporal Analysis
Popular Timeline Creation Tools
- TimelineJS:
an open-source tool that enables anyone to build visually rich, interactive timelines
using nothing more than a Google Sheet
- Chronos Timeline: designed specifically for needs in the humanities and social sciences to represent
time-based data
- Neatline: a suite of
add-on tools for Omeka designed to help tell stories with maps, images, and timelines
Machine Learning
Popular Platforms
- Amazon SageMaker (AWS)
- IBM Watson Studio
- Google Cloud AI
- H2O.ai
- KNIME
Popular Software Packages
- Python
- Keras: deep learning and neural networks
- PyTorch: deep learning, computer vision
- Scikit-Learn: data preprocessing, text vectorization, classification,
clustering
- TensorFlow: deep learning
- R
- caret:
this package (short for Classification And REgression Training) is a set of
functions that attempt to streamline the process for creating predictive models.
Includes tools for data splitting, pre-processing, feature selection, model tuning
using resampling, and variable importance estimation
Resources
Database Development
Popular Databases
- Relational
- Document / NoSQL
- MongoDB
- Elasticsearch
- Solr
- Key-value Store
- Graph
Popular Database Tools and Database Management Systems (DBMS)
- DbVisualizer
- DataGrip
- Postico
- SQL Server Management Studio
Data Cleaning
Popular Software
- Python
- Pandas: a
fast, powerful, flexible and easy to use open source data analysis and manipulation
tool
- NumPy: the
fundamental package for scientific computing with Python
- Jupyter Notebooks / JupyterLab: a web-based interactive development environment for Jupyter notebooks, code, and
data
- R
- Tidyverse:
an opinionated collection of R packages designed for data science. All packages
share an underlying design philosophy, grammar, and data structures.
- Language-agnostic (mostly): Regular Expressions: special text strings which specify a
search pattern to match other strings
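To make the regular expression entry concrete, here is a small sketch using Python's built-in `re` module for two common cleaning tasks: normalizing whitespace and pulling four-digit years out of free text. The patterns, helper names, and sample strings are illustrative assumptions; real data usually needs patterns tuned to its own quirks.

```python
import re

# One or more whitespace characters (spaces, tabs, newlines).
WHITESPACE = re.compile(r"\s+")
# A standalone four-digit year from 1000 to 2099.
YEAR = re.compile(r"\b(1\d{3}|20\d{2})\b")


def clean(text):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    return WHITESPACE.sub(" ", text).strip()


def find_years(text):
    """Return every standalone four-digit year in the text as an integer."""
    return [int(year) for year in YEAR.findall(text)]


# Sample strings invented for illustration.
print(clean("  Boston,\tMass.\n 1851 "))
print(find_years("Printed 1851; reissued in 2003 (item 12003)."))
```

The `\b` word boundaries keep the year pattern from matching inside longer numbers such as an item number. The same patterns work with little or no change in most other regex engines, which is why the skill transfers across languages and tools like OpenRefine.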
Popular Tools
- Google Sheets
- OpenRefine: a
powerful tool for working with messy data: cleaning it; transforming it from one format
into another; and extending it with web services and external data.
Research Data Management
Popular Software Packages
- Git: a free and open
source distributed version control system designed to handle everything from small to
very large projects with speed and efficiency. Git allows you to track changes to files,
collaborate with other developers or researchers, understand how changes are occurring
and revert them, and create branches to test new work.
Popular Tools and Platforms
- Dataverse: an open
source research data repository developed by Harvard's Institute for Quantitative Social Science (IQSS)
- Git desktop clients, such as GitHub Desktop or GitKraken
- Tropy: free, open-source
software that allows you to organize and describe photographs of research material. Once
you have imported your photos into Tropy, you can combine photos into items (e.g.,
photos of the three pages of a letter into a single item), and group photos into lists.
You can export Tropy projects to JSON-LD and to Omeka.
Project Management
Popular Tools and Platforms
- Trello
- Jira
- GitHub Projects
- Asana
Citation Management
Popular Tools and Platforms
Digital Collections
Example Projects
Popular Platforms and Frameworks
- Omeka: open source web
publishing platform for sharing digital collections with robust metadata and creating
media-rich online exhibits
- Scalar: a free,
open source authoring and publishing platform designed to make it easy for authors to
write long-form, born digital scholarship online
- Drupal: a free
PHP content management system, with robust and flexible plugins for intense data
modeling
- WordPress: free PHP
content management system, originally designed for blogs but flexible enough for general
web publishing
Digital Editions
Example Projects
Popular Tools and Platforms
- Juxta: an open-source tool for comparing and
collating multiple witnesses to a single textual work
- Oxygen: a
comprehensive suite of XML authoring and development tools
- Manifold Scholarship: a robust digital publishing platform
Data Visualization
Popular Tools and Platforms
- Tableau: a
business intelligence, data analytics, and data visualization platform which allows
users to create dashboards to get insight into their data. Paid and free (data must be
publicly available to share) options
- Flourish: no-code
online platform which helps you explore and explain your data with beautiful
visualizations and stories. Like Tableau, free (public data) or paid options
- Datawrapper:
no-code online platform which lets you show your data as beautiful charts, maps or
tables with a few clicks
- Google Data Studio: turn your data into compelling stories of data visualization art, or quickly build
interactive reports and dashboards with Data Studio’s web-based reporting tools. Allows
more control over who sees your data and reports than the free flavors of Flourish,
Tableau, or Datawrapper.
Popular Software Packages
- JavaScript
- D3.js: a JavaScript
library for manipulating documents based on data
- Plotly: interactive charts and maps
- R
- ggplot2:
a system for declaratively creating graphics, based on The Grammar of Graphics. You
provide the data, tell ggplot2 how to map variables to aesthetics, what graphical
primitives to use, and it takes care of the details
- Plotly: interactive charts and maps
- Python
- Matplotlib: a
comprehensive library for creating static, animated, and interactive visualizations
in Python
- Seaborn: a Python data visualization library based on Matplotlib which provides a
high-level interface for drawing attractive and informative statistical graphics
- Plotly: interactive charts and maps
Annotation
Popular Tools and Platforms
- IIIF viewers such as Mirador provide ways for users to annotate images
- Scalar
Web Development
Popular Stacks, Frameworks, and Infrastructures
- Backend
- Python: Django (heavier, more fully featured, requires relational database) or Flask
(lighter, good for APIs, can bring any database)
- PHP
- Many of the most popular Content Management Systems (WordPress, Drupal, Omeka,
Scalar) are PHP-based
- PHP frameworks include Laravel and CodeIgniter
- Ruby: Rails
- Node.js: Meteor, Express, Nest, Hapi
- Frontend
- JavaScript: React, Vue, Angular, Svelte, jQuery
- CSS: Bootstrap, Tailwind, Bulma, SemanticUI, Materialize
- Databases
- Infrastructure
- Static Sites