Choosing Digital Methods and Tools

Text Analysis

Example Projects

Methods

  • Topic Modeling: an unsupervised machine learning technique which surfaces the abstract “topics” that occur in a corpus by grouping words that tend to appear together and describing each document as a mixture of those topics. LDA (Latent Dirichlet Allocation) is a popular method for fitting topic models.
  • Information Retrieval: full text search to return documents with high similarity scores based on user input, often using a search engine such as Lucene, Solr, or Elasticsearch
  • Text Classification: a class of supervised machine learning methods which assign predefined classes or categories to text, either as binary classification or multiclass classification. Examples of classifiers and classification algorithms include logistic regression, Naive Bayes, k-Nearest Neighbors (kNN), decision trees, and Support Vector Machines (SVM). Deep learning models, including Convolutional Neural Networks, are also proving popular and powerful.
  • Sentiment Analysis: classification based on opinion polarity (usually a positive / negative spectrum, but sometimes more nuanced)
  • Word Frequency Analysis: measures which words occur most often, using raw counts or weighted measures such as TF-IDF (term frequency-inverse document frequency), which highlights words that are frequent in one document but rare across the rest of the corpus
  • Concordancing: also known as KWIC (KeyWord In Context), in which each occurrence of a word in a text is presented in its immediate context, e.g. with a few words on either side
  • Named Entity Recognition: identifying named things such as people, places, organizations, time periods, and more
  • Collocation: determining which words commonly appear near each other
  • Word Embeddings: word meanings are embedded in high-dimensional numerical vectors, where semantically similar words have similar vectors. Compared to one-hot encodings or bag-of-words models, word embeddings better capture word context within documents. Vector math allows comparisons and operations between word vectors, e.g. `king - man + woman ≈ queen`. Dimensionality reduction techniques such as principal component analysis (PCA) can project the vectors from several hundred dimensions down to two or three for visualization in 2D or 3D space. Popular word vector algorithms include Word2Vec and GloVe; Doc2Vec extends the approach to whole documents.
  • Transformer Models: transformers are an increasingly popular choice for modern NLP and computer vision tasks. While word embeddings store information about word context in a single fixed vector per word, transformer models do a better job of disambiguating word sense because their embeddings are contextual: the vector for a word depends on the entire sequence of words around it (a sentence or paragraph). Transformer models such as GPT, GPT-2, GPT-3, and BERT (Bidirectional Encoder Representations from Transformers) use the transfer learning approach: they are pretrained on massive corpora, and then fine-tuned for specific tasks. Transformer models are useful for classification and text generation.
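
Three of the methods above (concordancing, TF-IDF weighting, and word vector arithmetic) are small enough to sketch with the Python standard library alone. The following is an illustrative sketch, not how the tools listed below implement them: the function names, the sample sentences, and the three-dimensional TOY_VECTORS are all invented for this example, and real embeddings have hundreds of learned dimensions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, window=2):
    """List each occurrence of `keyword` with `window` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


def tf_idf(docs):
    """Score each term per document: term frequency times log inverse document frequency."""
    tokenized = [tokenize(doc) for doc in docs]
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(len(docs) / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def analogy(vectors, a, b, c):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


# Hypothetical hand-made vectors (dimensions: royalty, male, female).
TOY_VECTORS = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}

print(kwic(tokenize("The whale surfaced near the ship"), "whale"))
print(analogy(TOY_VECTORS, "king", "man", "woman"))  # queen
```

Note that a word appearing in every document (such as "the" in a small English corpus) receives a TF-IDF score of zero under this formula, which is why the measure is useful for finding distinctive rather than merely frequent words.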

Popular Tools

  • Voyant Tools: web based reading and analysis environment for digital texts
  • Mallet: a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
  • WordSeer 4: a text analysis environment that combines visualization, information retrieval, sensemaking and natural language processing to make the contents of text navigable, accessible, and useful
  • AntConc: a freeware corpus analysis toolkit for concordancing and text analysis

Popular Programming Languages and Packages

  • Python
    • NLTK (Natural Language Toolkit): a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
    • spaCy: industrial-strength NLP written in Cython for speed; supports multiple languages, pretrained transformer models (BERT, etc), custom models from PyTorch and TensorFlow (common ML libraries), POS and NER tagging, text classification, and more
    • Gensim: free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible.
  • R
    • Quanteda: a package for the Quantitative Analysis of Textual Data, designed for R users needing to apply natural language processing to texts, from documents to final analysis
    • Tidytext: using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Tidytext provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages
    • spacyr: an R wrapper for the Python spaCy library which can be integrated with Quanteda or Tidytext

Resources

Visual Presentation and Analysis

Example Projects

Popular Tools, Platforms, and Standards

  • IIIF, the International Image Interoperability Framework, provides a series of API specifications which various image servers and viewers implement. IIIF servers provide easily manipulated tiled images, and IIIF viewers present JSON manifests which include one or more canvases and may also include annotations.
    • IIIF Awesome Resources
    • Harvard IIIF Website
    • Image servers
      • Loris
      • Cantaloupe
    • Image viewers
      • Mirador 3
      • Universal Viewer
    • Annotation servers
      • CatchPy
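
Part of what makes IIIF images "easily manipulated" is that the Image API encodes the requested region, size, rotation, quality, and format directly in the URL path. The sketch below builds such URLs following that path pattern. The helper names and the `https://example.org/iiif` base URL are placeholders invented for illustration; the region and size values a given server accepts depend on the API version and compliance level it implements.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    Path pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    # Identifiers are URL-encoded, including any slashes they contain.
    encoded_id = quote(identifier, safe="")
    base = base_url.rstrip("/")
    return f"{base}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


def iiif_info_url(base_url, identifier):
    """Build the URL of the image's info.json description document."""
    return f"{base_url.rstrip('/')}/{quote(identifier, safe='')}/info.json"


# Hypothetical server and identifier, used only to show the URL shapes.
BASE = "https://example.org/iiif"
print(iiif_image_url(BASE, "page-001"))
print(iiif_image_url(BASE, "page-001", region="0,0,512,512", size="256,"))
print(iiif_info_url(BASE, "page-001"))
```

The second call asks for the top-left 512 by 512 pixel region, scaled to 256 pixels wide. This is the kind of request a tiled viewer issues repeatedly as a user pans and zooms.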

Spatial Analysis and Web Mapping

Example Projects

Popular Tools and Platforms

  • Esri ArcGIS: powerful desktop GIS software suite for mapping, geoprocessing, cartography, and spatial analysis
  • QGIS: FOSS alternative to ArcGIS
  • Palladio: a Stanford Humanities + Design Lab online tool for visualizing complex historical data
  • CARTO (previously CartoDB): paid Software as a Service platform for spatial analysis and GIS
  • Esri ArcGIS Online: cloud-based GIS platform for creating and sharing interactive maps and analyzing spatial data
  • Neatline: a suite of add-on tools for Omeka designed to help tell stories with maps, images, and timelines

Popular Software Packages

  • Leaflet: open-source JavaScript library for interactive web maps
  • D3.js: open-source JavaScript library for general data visualization, including web mapping
  • Google Maps API: one of the most popular general purpose mapping libraries
  • OpenLayers: JavaScript library for displaying maps and analyzing geographical data
  • Mapbox GL: SDK for web maps powered by Mapbox, which allows users to design and publish beautiful maps

Network Analysis

Example Projects

Popular Tools

  • Gephi: a FOSS interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs
  • Palladio: a Stanford Humanities + Design Lab online tool for visualizing complex historical data
  • Cytoscape: originally designed for biological research, Cytoscape is now a general, open source software platform for complex network analysis and visualization
  • NodeXL: a network analysis and visualization plugin for Excel

Popular Software Packages

  • NetworkX (Python): a package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
  • igraph (R and Python): a collection of network analysis tools with an emphasis on efficiency, portability, and ease of use
  • visNetwork (R): an R package for network visualization using vis.js
  • D3.js (JavaScript): more for network visualization and presentation than network analysis
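
To give a sense of what these packages compute, here is a minimal standard-library sketch of one common network measure, degree centrality: the share of other nodes each node is directly connected to. The edge list is an invented correspondence network, and the function names are placeholders for this example. Packages such as NetworkX and igraph provide this and many other measures ready-made, so this is for illustration only.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as a mapping of node -> set of neighbors."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Degree of each node divided by the number of other nodes (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


# Hypothetical edge list: who exchanged letters with whom.
letters = [("Ada", "Ben"), ("Ada", "Cy"), ("Ada", "Dee"), ("Ben", "Cy")]
centrality = degree_centrality(build_graph(letters))
for person, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{person}: {score:.2f}")
```

In this toy network Ada corresponds with everyone else and so scores 1.0, while Dee, who is linked only to Ada, scores lowest.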

Timelines and Temporal Analysis

Popular Timeline Creation Tools

  • TimelineJS: an open-source tool that enables anyone to build visually rich, interactive timelines using nothing more than a Google Sheet
  • Chronos Timeline: designed specifically for needs in the humanities and social sciences to represent time-based data
  • Neatline: a suite of add-on tools for Omeka designed to help tell stories with maps, images, and timelines

Machine Learning

Popular Platforms

  • Amazon SageMaker (AWS)
  • IBM Watson Studio
  • Google Cloud AI
  • H2O.ai
  • KNIME

Popular Software Packages

  • Python
    • Keras: deep learning and neural networks
    • PyTorch: deep learning, computer vision
    • Scikit-Learn: data preprocessing, text vectorization, classification, clustering
    • TensorFlow: deep learning
  • R
    • caret: this package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. Includes tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation

Resources

Database Development

Popular Databases

  • Relational
    • PostgreSQL
    • MySQL
  • Document / NoSQL
    • MongoDB
    • Elasticsearch
    • Solr
  • Key-value Store
    • Redis
    • AWS DynamoDB
  • Graph
    • Neo4j

Popular Database Tools and Database Management Systems (DBMS)

  • DbVisualizer
  • DataGrip
  • Postico
  • SQL Server Management Studio

Data Cleaning

Popular Software

  • Python
    • Pandas: a fast, powerful, flexible and easy to use open source data analysis and manipulation tool
    • NumPy: the fundamental package for scientific computing with Python
    • Jupyter Notebooks / Jupyter Lab: a web-based interactive development environment for Jupyter notebooks, code, and data
  • R
    • Tidyverse: an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
  • Language-agnostic (mostly)
    • Regular Expressions: special text strings which specify a search pattern to match other strings
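
As a small illustration of regular expressions in a data cleaning task, the sketch below uses Python's built-in `re` module to normalize whitespace and to pull four-digit years out of free-text fields. The function names and sample strings are invented for this example, and the year pattern assumes years between 1000 and 2099; a real dataset will likely need its own patterns.

```python
import re


def clean_whitespace(value):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", value).strip()


def extract_years(value):
    """Find standalone four-digit years from 1000 to 2099 in a string."""
    return [int(year) for year in re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", value)]


print(clean_whitespace("  Boston,\n Mass.  "))                  # Boston, Mass.
print(extract_years("Letters, 1861-1865; reprinted 2004"))      # [1861, 1865, 2004]
```

The `\b` word boundaries keep the pattern from matching four digits inside a longer number, such as a catalog identifier.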

Popular Tools

  • Google Sheets
  • OpenRefine: a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

Research Data Management

Popular Software Packages

  • Git: a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. Git allows you to track changes to files, collaborate with other developers or researchers, understand how changes are occurring and revert them, and create branches to test new work.

Popular Tools and Platforms

  • Dataverse: an open source research data repository developed by Harvard's Institute for Quantitative Social Science (IQSS)
  • Git desktop clients, such as GitHub Desktop or GitKraken
  • Tropy: free, open-source software that allows you to organize and describe photographs of research material. Once you have imported your photos into Tropy, you can combine photos into items (e.g., photos of the three pages of a letter into a single item), and group photos into lists. You can export Tropy projects to JSON-LD and to Omeka.

Project Management

Popular Tools and Platforms

  • Trello
  • Jira
  • GitHub Projects
  • Asana

Citation Management

Popular Tools and Platforms

  • Zotero
  • EndNote
  • Mendeley

Digital Collections

Example Projects

Popular Platforms and Frameworks

  • Omeka: open source web publishing platform for sharing digital collections with robust metadata and creating media-rich online exhibits
  • Scalar: a free, open source authoring and publishing platform designed to make it easy for authors to write long-form, born digital scholarship online
  • Drupal: a free PHP content management system, with robust and flexible plugins for intense data modeling
  • WordPress: free PHP content management system, originally designed for blogs but flexible enough for general web publishing

Digital Editions

Example Projects

Popular Tools and Platforms

  • Juxta: an open-source tool for comparing and collating multiple witnesses to a single textual work
  • Oxygen: a comprehensive suite of XML authoring and development tools
  • Manifold Scholarship: a robust digital publishing platform

Data Visualization

Popular Tools and Platforms

  • Tableau: a business intelligence, data analytics, and data visualization platform which allows users to create dashboards to get insight into their data. Paid and free options are available; the free option requires shared data to be publicly accessible
  • Flourish: no-code online platform which helps you explore and explain your data with beautiful visualizations and stories. Like Tableau, free (public data) or paid options
  • Datawrapper: no-code online platform which lets you show your data as beautiful charts, maps or tables with a few clicks
  • Google Data Studio: turn your data into compelling stories of data visualization art, or quickly build interactive reports and dashboards with Data Studio’s web based reporting tools. Allows more control over who sees your data and reports than the free flavors of Flourish, Tableau, or Datawrapper.

Popular Software Packages

  • JavaScript
    • D3.js: a JavaScript library for manipulating documents based on data
    • Plotly: interactive charts and maps
  • R
    • ggplot2: a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details
    • Plotly: interactive charts and maps
  • Python
    • Matplotlib: a comprehensive library for creating static, animated, and interactive visualizations in Python
    • Seaborn: a Python data visualization library based on Matplotlib which provides a high-level interface for drawing attractive and informative statistical graphics
    • Plotly: interactive charts and maps

Annotation

Popular Tools and Platforms

  • IIIF viewers such as Mirador provide ways for users to annotate images
  • Scalar

Web Development

Popular Stacks, Frameworks, and Infrastructures

  • Backend
    • Python: Django (heavier, more fully featured, requires relational database) or Flask (lighter, good for APIs, can bring any database)
    • PHP
      • Many of the most popular Content Management Systems (WordPress, Drupal, Omeka, Scalar) are PHP-based
      • PHP frameworks include Laravel and CodeIgniter
    • Ruby: Rails
    • NodeJS: Meteor, Express, Nest, Hapi
  • Frontend
    • JavaScript: React, Vue, Angular, Svelte, jQuery
    • CSS: Bootstrap, Tailwind, Bulma, SemanticUI, Materialize
  • Databases
  • Infrastructure
  • Static Sites
    • GitHub Pages
    • Jekyll