Research Computing – Arts and Humanities Research Computing

Barajas Dean’s Innovation Fund 2020 Applications Now Open

The Dean of Arts and Humanities is pleased to announce continued funding available for initiatives in the Digital Arts and Humanities, thanks to the generosity of the Barajas Dean’s Innovation Fund for Digital Arts and Humanities. The application guidelines are below.

This fund is intended to encourage innovation in the arts and humanities by supporting small and medium scale projects that will move these fields to the center of the digital revolution. Proposals may include (but are by no means limited to) course development and support, interfaculty collaborations, technology and training, experiential learning opportunities, and undergraduate, graduate, or faculty research.

All ladder faculty and senior lecturers, including those without a previous history of digital innovation, are encouraged to apply. New applicants will be favored; earlier recipients of Barajas grants will be considered for extension of funding or for funding of new projects. A report on the activities and spending of the prior award is required with submission. While proposals may include some funding for digitization of materials, this should not be the primary goal of the project.

Please submit a one-paragraph Statement of Intent by Friday, March 13, 2020 to artshum@fas.harvard.edu. Full applications are due Friday, March 27, 2020. The maximum amount to be awarded is $20,000 but proposals with a more modest budget are encouraged. Proposals received after the deadline will not be considered.

Applicants may want to consult with the Academic Technology Group (contact: Annie Rota, rota@fas.harvard.edu), the Harvard College Library (contact: Marty Schreiner, schrein@fas.harvard.edu), and/or Arts and Humanities Research Computing (contact: Rashmi Singhal, rashmi_singhal@harvard.edu) on technology issues and currently available resources.

Proposals will be favored that:

use digital media and techniques to expand the reach of scholarly inquiry in the arts and/or humanities
show creativity and will have significant impact on a field of research or on university teaching
take advantage of existing (digital) resources at the University and will make them available to a wider audience
indicate how the project will benefit undergraduate and/or graduate students

Proposals must include:

A 1-3 page account of the project with the following information
- Goals of the project
- Technical support needs, either from staff/students to be hired with money awarded, or from staff already in place
- A one-sentence abstract for possible publication on the Arts and Humanities website, should the project be funded
Anticipated beginning and end date of the project
A detailed budget for the entire project, indicating how the innovation funds will be allocated and how the project will be sustained past the grant period. N.B. Applicants should make explicit what other sources of funding have been or will be requested
A list of collaborators with ranks and affiliations

If funded, recipients will be required to:

Submit a report on their activities, including the expenditure of all funds
Return any unused funds. If the funds awarded are for an event to be held in the following fiscal year or in the upcoming academic year, unused funds need only be returned after the event is complete and all expenses covered by the award have been paid. If there is a question about timing for the detailed report or return of funds, please have your financial administrator contact our office

Awardees will also be strongly encouraged to publicize their projects on the web. Arts and Humanities Research Computing is prepared to help in this effort if faculty do not want to do it through their own channels (see prior years’ projects at http://darthcrimson.org/barajas).

Proposals should be submitted as an electronic attachment to arts-hum@fas.harvard.edu, Subject: Barajas Innovation Fund Proposal.

Research Databases and the Future of Digital Humanities Applications

Introduction: Next Generation Research Computing

Research computing within the arts and humanities has evolved in tandem with rapidly advancing digital methodologies, nuanced datasets, and increasingly robust web programming environments. More than ever, scholars are engaging with shared, scalable research ecosystems that often include content annotation, text/network analysis, data visualization, and crowdsourcing functionalities, among others.

Despite these major shifts within the digital research landscape, commonly adopted databases for content storage and retrieval do not always prioritize the needs of an increasingly sophisticated user base, nor have they been optimized for the immediacy and scalability that modern research applications demand.

The following document surveys modern databases and theories of data modeling in order to compare and contrast differing approaches to database-driven digital research. All examples within this overview represent a selection of projects designed by Harvard University faculty in collaboration with Arts & Humanities Research Computing. Each makes use of varying database technologies, both common and cutting-edge, and was designed with long-term, flexible, and sustainable research applications in mind.

Relational Databases as a Point of Departure

Relational databases are the most commonly used database technology. They were originally developed to keep track of large numbers of interrelated people, objects, and processes, such as patients in a hospital, students at a university, or books that Amazon sells. The software has become so standard and cost effective that relational databases have become the default technology, even when they may not be the best tool for the job.

There are proprietary variations among relational database applications, but generally speaking all share the same fundamental data model: organizing information in one or more tables of rows and columns. Tables contain people, objects, events, etc. of a single type (patients, students, books). Rows describe information pertaining to one instance (single patient, particular student, specific book). Columns contain fields, a type of data that describes aspects of an item within a table (patient name, student campus address, book title). Typically, one of the fields will be a unique identifier that can label a specific row in the table (patient ID, student ID, ISBN).

The most basic example of this tabular model is the Excel spreadsheet and its related forms (comma/tab separated values). Relational databases have been in use for decades, and remain a common choice for data storage and retrieval because many are open source and have large communities of practice. MySQL is perhaps the most popular open source database for digital research projects, blogs, and other web applications.

Good & Bad Relations: Design Thinking for an Opera Social Network

In order to illustrate relational database fundamentals, the following example will walk through the process of creating a simple social network of operas and their performances around the world. This demonstration draws on data from Operabase, an online archive containing information about performances, artists, opera companies, and more.


Fidelio, the only opera written by Ludwig van Beethoven, tells the story of Leonore, who disguises herself as a prison guard to free her husband Florestan from certain death within a political prison. It narrates a heroic story about liberty and justice, and has multiple performances playing around the world in 2017.

Beethoven’s opera is one of many within the Operabase archives. In a relational database model, these entries would exist within a table called operas. Each row would represent one opera and each column would represent a particular type of data about the opera such as NAME, ID, COMPOSER, FIRST_PERFORMANCE, etc. Operas are only one type of table. Other tables may include artists, opera companies, theaters, and more. Here is an example of what the operas table may look like:


In order to connect performance dates, locations, and other information with each opera, the user must create new tables to represent these additional entities. The first table will be called theaters, and the second table, called performances, will combine information from operas and theaters into a single item: a performance of a particular opera at a specific opera house. First, here is an example of the theaters table:

Each theater has a unique THEATER_ID along with a NAME, COUNTRY, CITY, CAPACITY, SEASON, and WEBSITE. According to Operabase, each theater above will host a production of Fidelio during the 2017 season. Furthermore, each performance will have its own beginning and closing dates.

The performances table will illustrate the fundamental behavior of the relational database model: creating relationships between tables. This new table will store performance begin/end dates, and will also draw from the operas table and the theaters table to create instances of connected, queryable information. Connecting data in this manner is called joining. A join can either produce a permanent new table within the database, or connect data on the fly for a single query, after which the result is discarded from memory. In this example, the joined information will live in one table that becomes a permanent part of the underlying database.


Within this table structure, PERFORMANCE_BEGIN and PERFORMANCE_END connect to a respective opera and theater, referenced by OPERA_ID and THEATER_ID. In the original operas table, the unique id representing each opera, also called a primary key, is what other tables use to refer to that opera. Establishing a primary key within a table allows the relational database to distinguish between individual entries in the table.

When an item from one table is referenced within another table, the primary key in the original table links to the secondary table in the form of a foreign key. In other words, in the operas table, Fidelio has a primary key (id number) of 1. In the performances table it is necessary to use that same unique id number to instruct the database to pull opera number 1 from the original table; however, within the new table, the primary key of 1 shifts to become a foreign key, which provides instructions to the database that the key originally comes from the operas table, and that is where it can look to find the original values.
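The primary-key and foreign-key relationship described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration rather than the schema used for the Operabase example: the reduced set of columns, the theater name, and the performance dates are placeholders invented for the sketch, and only Fidelio and its composer come from the text.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE operas (
    opera_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    composer TEXT NOT NULL
);
CREATE TABLE theaters (
    theater_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    city       TEXT
);
CREATE TABLE performances (
    performance_id    INTEGER PRIMARY KEY,
    opera_id          INTEGER NOT NULL REFERENCES operas(opera_id),
    theater_id        INTEGER NOT NULL REFERENCES theaters(theater_id),
    performance_begin TEXT,
    performance_end   TEXT
);
""")

conn.execute("INSERT INTO operas VALUES (1, 'Fidelio', 'Ludwig van Beethoven')")
# Hypothetical theater and dates, for illustration only.
conn.execute("INSERT INTO theaters VALUES (1, 'Example Opera House', 'Example City')")
conn.execute("INSERT INTO performances VALUES (1, 1, 1, '2017-03-01', '2017-03-15')")

# Join the three tables through the foreign keys stored in performances.
rows = conn.execute("""
    SELECT o.name, t.name, p.performance_begin, p.performance_end
    FROM performances AS p
    JOIN operas   AS o ON o.opera_id   = p.opera_id
    JOIN theaters AS t ON t.theater_id = p.theater_id
""").fetchall()
print(rows)

# A performance that points at a non-existent opera is rejected.
try:
    conn.execute("INSERT INTO performances VALUES (2, 99, 1, '2017-04-01', '2017-04-02')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The join here is computed on the fly for one query; the performances table itself is the permanent record of which opera was staged where.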

This is a basic example of a growing network of operas, theaters, and performances. For a more robust social network, it would be necessary to create separate tables for performers, composers, opera companies, and more. This would dramatically increase the complexities of the database as new relationships would have to be mapped to performance instances. Beyond performances alone, the complexity of the data would also sharply increase if relationships between performers, composers, and other artists were described.

Mapping data to a specific structure within a database is called data modeling, and with relational databases the data model is extremely important for the integrity of the database. One of the most important factors in the use of relational databases is the time spent planning before any data can be entered. This planning must account for potential changes or augmentations to the original data in advance, because increasingly complex datasets can be challenging to append to pre-structured tables. Graph Databases: New Opportunities for Connected Data asserts “whereas relational databases were initially designed to codify paper forms and tabular structures—something they do exceedingly well—they struggle when attempting to model the ad hoc, exceptional relationships that crop up in the real world. Ironically, relational databases deal poorly with relationships” (Robinson, Webber & Eifrem, 11).

For arts and humanities research computing projects, creating meaning among data is as important as, if not more important than, storing the data in the first place. Scholars increasingly look to database technologies with the express purpose of distant reading their data, finding patterns, and testing queries. Thus, the fundamental goal is always to create and expose relationships among data, and the manner of approaching this problem differs depending on the type of project in question.

The following use cases will demonstrate how newer types of databases, known collectively as NoSQL, can more elegantly accommodate arts and humanities data. This new generation of technologies breaks the dominance of relational tables and fields to allow for a more intuitive and robust environment to explore and ask questions of data.

Faculty Case Study: The Giza Project

The Giza Project at Harvard University provides access to “the largest collection of information, media, and research materials ever assembled about the Pyramids and related sites on the Giza Plateau.” The project is led by Peter Der Manuelian, Philip J. King Professor of Egyptology and Director of the Harvard Semitic Museum. Rashmi Singhal of Arts & Humanities Research Computing oversees data architecture and technical development for the project’s website Digital Giza.

Digital Giza is a veritable research environment consisting of highly structured scholarly big data: photos, bibliographies, dates, dig site findings, drawings, documents (published and unpublished), diary entries, ancient people (e.g. identified bodies within a tomb), modern people (e.g. archaeologists, photographers, scholars, etc.), and a host of other archeological information. The goal of the integrated research platform, featuring unique pages for each entity within the database and associated information, is illustrated in the following image of the tomb of Meresankh III:


The underlying database that informs Digital Giza is called The Museum System (TMS), a relational database optimized for museums and archival collections. TMS serves the needs of the Giza team, allowing for complex, systematic data entry and manipulation. In addition, the team receives support from Harvard University because TMS is in use elsewhere on campus. There was, however, one major drawback: extremely complex queries against TMS were required in order to build the unique website displays on the Digital Giza web application.

In the above example of the tomb of Meresankh III, there are hundreds of photos, finds, videos, and people associated with the object. A typical query to pull together all of this information involves multiple data joins and easily surpasses one hundred lines of code. In addition, future iterations of the dataset may include new types of objects and relationships not previously included within the data model. This poses a challenge for the efficiency and flexibility of the web environment.

It became clear while working on the project that the cost and time required to switch the Giza team to a completely new workflow and database technology would be prohibitive. The solution therefore changed: a NoSQL database called ElasticSearch would sit atop TMS and act as a layer between TMS and the web application.

NoSQL relaxes the rigidity of the relational data model by offering a number of alternative data storage models. The term covers a wide variety of database types with differing data models, and for Digital Giza this specifically meant a document model.

This type of data model creates a unique document for each object within the database. This data format is self-contained, meaning the manner of parsing the data within each document is self-described by that document. This is in direct contrast to relational models where tables have universal, well-defined structures that correspond to data entities. In a document model, no two documents need to have the exact same schema. Below are two examples of documents within the database (in JSON format).


In addition, documents are typically encoded using commonly accepted web language standards: XML, JSON, BSON, etc. This makes it easy for developers to incorporate the results of database queries into web applications directly. Document models solve a structural problem created by relational databases; they support a flexible data model that can withstand the demands of a rapidly evolving project with polymorphic data and multiple users logging in from around the world.
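As a rough illustration of the document model, the sketch below builds two documents with different fields and round-trips them through JSON using Python's standard library. The documents are hypothetical: the identifiers, field names, and values are invented for this example and are not taken from the Digital Giza index.

```python
import json

# Two hypothetical documents; the field names are invented for illustration.
tomb = {
    "id": "site-001",
    "type": "tomb",
    "title": "Tomb of Meresankh III",
    "related": {"photos": ["photo-001"], "people": ["person-001"]},
}
photo = {
    "id": "photo-001",
    "type": "photo",
    "caption": "Example caption",
    "date": "unknown",
}

# Each document serializes to self-describing JSON ...
encoded = [json.dumps(doc, ensure_ascii=False) for doc in (tomb, photo)]

# ... and a consumer reads whatever fields a given document happens to have.
for text in encoded:
    doc = json.loads(text)
    label = doc.get("title") or doc.get("caption")
    print(doc["type"], "->", label, sorted(doc.keys()))
```

Because no table definition sits above the documents, a new kind of object can be added later without altering the ones already stored.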

In this use case, there is a translation of data from one system (TMS) to another (ElasticSearch). Each database solves a different problem, the former serving as a tool for the storage and long-term upkeep of the data using universally accepted museum standards, and the latter to efficiently populate a user interface for navigating the various archeological data on a research platform built for the web.

Faculty Case Study: Russian Modules

Russian Modules, led by Steven Clancy, Senior Lecturer in Slavic Languages and Literatures and Director of the Slavic Language Program, is a Russian linguistics application designed to support curriculum building, language learning, and related research on the structure of the Russian language. Christopher Morse of Arts & Humanities Research Computing oversees application development, and the project uses a NoSQL graph database called Neo4j.

In its current iteration, users can type Russian text into a form that will automatically parse the information into a variety of categories. These categories range from part of speech, to word inflection, to clusters of common meaning, also called domains. In addition, the tool provides word difficulty levels based on the Russian language curriculum within the Slavic Department in order to gauge whether or not a particular text is too challenging (or not challenging enough) for students to undertake.

Data modeling within a graph database environment is quite different from modeling within a tabular structure. In fact, graph databases do not have a set data modeling style because they were designed to emphasize the relationships between data more than the data itself. They are typically used to model social networks, families, organizations, epidemiological contagion paths, and other interconnected ideas.

The graph structure is straightforward: data must take the form of a node, relationship, or property. The resulting structure is reminiscent of a mind map, a series of circles connected by lines that represent some kind of relationship. Each circle within a graph database is referred to as a node or vertex, and each line that connects two nodes is referred to as a link or edge.

Here is an example view of the word “book” in the Russian Modules database:

Lemma Search: книга (book)

MATCH (l:Lemma)-[r:INFLECTS_TO]->(f:Form) WHERE l.label = 'книга' RETURN l,r,f;

Within the Russian Modules database, all word forms are nodes. Dictionary forms of the word are labeled lemmas, and grammatical inflections of the lemma are labeled forms (e.g. lemma: книга; form: книги [genitive, ‘of the book’]). Encoded within the relationship are additional properties such as what part of speech a word is inflecting to, or what difficulty level the word has. In the example above, the central word is the lemma, and the connections fanning out from the center each represent different forms of that lemma. The relationship between the lemma and its forms also has a unique name: INFLECTS_TO. This name exhibits the relationship between the nodes very clearly.

The querying language used to interface with Neo4j, Cypher, encourages modeling data in plain English. This philosophy echoes the distant yet perennial wisdom of Abelson and Sussman who wrote in their preface to Structure and Interpretation of Computer Programs that “programs must be written for people to read, and only incidentally for machines to execute.” The Neo4j back end includes built-in graphical functionality powered by D3js, and the Cypher query language allows users to intuitively work with their data. The above example can be simplified into plain English without much of a jump:

MATCH (l:Lemma)-[r:INFLECTS_TO]->(f:Form) WHERE l.label = 'книга' RETURN l,r,f;

The query searches for all lemmas whose label matches “книга”, and then follows every INFLECTS_TO relationship from those lemmas to their forms. The relationship also has a directional component: the pattern (Lemma)-[INFLECTS_TO]->(Form) contains the arrow “->”, a graphical way of describing which way the relationship points.
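To make the node, relationship, and property vocabulary concrete, here is a small in-memory sketch in plain Python that imitates the lemma-to-form pattern. It does not talk to Neo4j. The class names, the match_forms helper, and the "case" property are assumptions made for this illustration, and the two inflected forms are ordinary Russian case forms chosen as sample data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    kind: str    # node label, e.g. "Lemma" or "Form"
    label: str   # the word itself

@dataclass(frozen=True)
class Relationship:
    start: Node
    name: str
    end: Node
    properties: tuple = ()   # e.g. (("case", "genitive"),)

book = Node("Lemma", "книга")
graph = [
    Relationship(book, "INFLECTS_TO", Node("Form", "книги"), (("case", "genitive"),)),
    Relationship(book, "INFLECTS_TO", Node("Form", "книгу"), (("case", "accusative"),)),
]

def match_forms(relationships, lemma_label):
    """Rough analogue of (l:Lemma)-[r:INFLECTS_TO]->(f:Form) WHERE l.label = ..."""
    return [
        (r.start, r, r.end)
        for r in relationships
        if r.name == "INFLECTS_TO"
        and r.start.kind == "Lemma"
        and r.start.label == lemma_label
        and r.end.kind == "Form"
    ]

for lemma, rel, form in match_forms(graph, "книга"):
    print(lemma.label, "-[" + rel.name + "]->", form.label, dict(rel.properties))
```

The direction of the relationship is carried by the start and end fields, which play the role of the arrow in the Cypher pattern.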

This flexible data model allows users to create and remove nodes and connections on the fly with very little code. Traversing the various relationships across the graph is also simplified in comparison to extremely complex relational databases with multiple intermediary tables that connect data. Take for example the following query that searches for the word “sister” (сестра), all of its inflected forms, the domain of words it belongs to, and the other words that also make up that domain:

MATCH (l:Lemma {label:'сестра'})-[r:INFLECTS_TO]->(f:Form), (l)-[r1:HAS_DOMAIN]->(d)<-[r2:HAS_DOMAIN]-(o)
RETURN l,r,f,r1,r2,o
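The traversal described above (a word, its inflected forms, its domain, and the other words sharing that domain) can be imitated over a plain list of triples. This is only a sketch of the idea, not how Neo4j executes the query. The domain name "family", the neighbouring words, and the helper function names are assumptions chosen for the example.

```python
# (start, relationship, end) triples; the domain name "family" and the choice
# of neighbouring words are assumptions made for this sketch.
triples = [
    ("сестра", "INFLECTS_TO", "сестры"),
    ("сестра", "INFLECTS_TO", "сестру"),
    ("сестра", "HAS_DOMAIN", "family"),
    ("брат", "HAS_DOMAIN", "family"),
    ("мать", "HAS_DOMAIN", "family"),
]

def targets(start, relationship):
    """Follow one outgoing relationship type from a node."""
    return [e for s, r, e in triples if s == start and r == relationship]

def sources(end, relationship):
    """Follow one relationship type backwards to the nodes pointing at `end`."""
    return [s for s, r, e in triples if e == end and r == relationship]

def forms_and_domain_neighbours(lemma):
    """Return the lemma's forms and the other words in each of its domains."""
    forms = targets(lemma, "INFLECTS_TO")
    neighbours = [
        other
        for domain in targets(lemma, "HAS_DOMAIN")
        for other in sources(domain, "HAS_DOMAIN")
        if other != lemma
    ]
    return forms, neighbours

print(forms_and_domain_neighbours("сестра"))
```

Each hop is a single filter over the relationships, with no intermediary tables to join.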

Future iterations of the project endeavor to include more robust natural language processing functionality, interactive word visualizations, and the integration of the database into an eBook for Russian language study.


Conclusion: Looking Forward

The technologies discussed herein comprise a limited selection of database options, each of which varies in complexity, cost, and community size. At Harvard, successful iterations of digital humanities projects making use of both SQL and NoSQL databases have contributed to local communities of practice that continuously push the envelope to explore and describe modern (and emerging) practices for architecting data.

Looking forward, researchers and scholars should always expect at least a minimal learning curve with any new technology. In the end, however, they should not be completely overwhelmed by their digital tools, otherwise the technology has failed to serve its intended purpose. That being said, new does not always mean better, and sometimes the traditional model or approach is the way to go. And finally, sustainability and the projected lifespan of a project are important considerations for any research application built for the web. Be aware of the required upkeep (yearly hosting costs, software upgrades, deprecations, general maintenance, etc.), and use caution with any technology that lacks compatibility with accepted open standards.

Digital Japanese Literature: Aozora Bunko

Introduction

Higuchi Ichiyō, featured at Aozora

Aozora Bunko (青空文庫) is a digital archive of Japanese literature in the public domain. In addition to its web presence, the corpus is also available on GitHub where it can be downloaded in its entirety. This makes it possible to perform a distant reading of the collection, and the following information serves as a general introduction for data analysis and parsing.

First and foremost, you will need to clone the GitHub repository to your computer. A warning: the download is quite large (~4 GB) and therefore may take some time. To install the repository locally, access the command line and input the following:

git clone https://github.com/aozorabunko/aozorabunko.git

If you have not yet installed git on your computer you will first need to follow these directions for your respective operating system.

Aozora Bunko Schema

A large majority of Aozora Bunko story entries maintain a shared structure: header information, main text, and bibliographic information. Please see the following example of Higuchi Ichiyō’s short story Tsuki no Yo.

It is possible to isolate the elements that make up each story thanks to the standardized HTML output format of the archive; however, one particular challenge is filtering out all furigana in order to properly mine the text. Furigana is used as a reading aid in Japanese—syllabic characters can be appended to ideographic characters (kanji), especially for kanji that are rare or have a special pronunciation. Within the first few lines of Tsuki no Yo there are several instances of furigana usage.

Behind the scenes this looks a bit messy, and it can be difficult to parse. Furigana entries use ruby markup, a web-standard set of HTML elements that attaches a reading annotation to logographic characters. For more information on the ruby specification, please see the W3C documentation page. The following is an example of Tsuki no Yo viewed as HTML.
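To show what ruby markup carries, the following sketch uses Python's standard html.parser module to pull out each base text together with its reading. The sample string is a hand-written fragment in the general shape of ruby markup (rb for the base text, rt for the reading, rp for fallback parentheses); it is an assumption for illustration and is not copied from the Tsuki no Yo file. The RubyPairs class name is likewise hypothetical.

```python
from html.parser import HTMLParser

class RubyPairs(HTMLParser):
    """Collect (base text, reading) pairs from <ruby> markup."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._stack = []      # names of the tags that are currently open
        self._base = ""
        self._reading = ""

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        if tag == "ruby":
            self._base, self._reading = "", ""

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop back to (and including) the matching open tag.
            while self._stack.pop() != tag:
                pass
        if tag == "ruby":
            self.pairs.append((self._base, self._reading))

    def handle_data(self, data):
        if "ruby" not in self._stack:
            return                      # ordinary text outside any annotation
        if "rt" in self._stack:
            self._reading += data       # the furigana reading
        elif "rp" not in self._stack:
            self._base += data          # the annotated base characters

sample = (
    "<ruby><rb>月</rb><rp>（</rp><rt>つき</rt><rp>）</rp></ruby>の"
    "<ruby><rb>夜</rb><rp>（</rp><rt>よ</rt><rp>）</rp></ruby>"
)
parser = RubyPairs()
parser.feed(sample)
parser.close()
print(parser.pairs)
```

Text mining usually wants only the base text, which is why the readings and the fallback parentheses have to be removed before analysis.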

Parsing Aozora Texts

Molly Des Jardin, Japanese Studies Librarian at the University of Pennsylvania, has written a script in Python that will automatically strip the ruby annotations from a text. This script and additional resources for Japanese language analysis can be found on her Japanese Text Analysis library guide. In order to run this script you will need Python installed on your computer, and in addition you will need to install the following dependencies: BeautifulSoup and TinySegmenter. If you are new to Python, make sure to install the pip tool; thereafter you can install the two libraries from the terminal as follows:

pip install beautifulsoup4

pip install tinysegmenter

# -*- coding: utf-8 -*-
import glob
import re
from bs4 import BeautifulSoup
from tinysegmenter import TinySegmenter

for filename in glob.iglob('*.html'):

    # Read the raw HTML of one story
    with open(filename, 'r') as f:
        html = f.read()
    print filename

    # Remove the ruby annotation tags <rt> and <rp>, including their contents
    soup = BeautifulSoup(html, 'html.parser')
    tagname = 'rt'
    for tag in soup.findAll(tagname):
        tag.extract()

    tagname = 'rp'
    for tag in soup.findAll(tagname):
        tag.extract()

    # Remove <span> elements (e.g. usage notes embedded in the text)
    tagname = 'span'
    for tag in soup.findAll(tagname):
        tag.extract()
    nonruby = unicode(soup)

    # Remove all remaining HTML tags and attributes
    nonruby = re.sub('<[^<]+?>', '', nonruby)

    # Tokenize, then keep only the tokens that precede the colophon marker
    segmenter = TinySegmenter()
    tokenized = segmenter.tokenize(nonruby)
    tokenized = tokenized[0:(tokenized.index(u'底本'))-1]

    tokenized = ' '.join(tokenized)

    # Write the result to (filename).txt
    with open(filename + '.txt', 'w') as out:
        out.write(tokenized.encode('utf-8'))

Running Molly’s script is quite simple: place the file into a folder containing the HTML files of each story you would like to parse. From the terminal, simply execute the following command:

python rubydetokenize.py

Please note that this script was designed for Python 2, but it can be converted for Python 3 with small changes to the code (notably the print statement, the unicode() call, and the way the output file is written). The script iterates through the HTML files in the folder, removes all HTML tags and ruby annotations, and writes each result as a text file containing only the text of the story.
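The ruby-stripping step can also be sketched in Python 3 without any third-party libraries. The example below is not Molly's script: it swaps BeautifulSoup for the standard library's html.parser, skips tokenization entirely, and runs on a short hypothetical fragment rather than an actual Aozora file. The names RubyStripper and strip_ruby are invented for illustration.

```python
from html.parser import HTMLParser


class RubyStripper(HTMLParser):
    """Collect page text while skipping the contents of <rt> and <rp> elements."""

    SKIPPED = {"rt", "rp"}  # furigana readings and their fallback parentheses

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # greater than 0 while inside a skipped element
        self.parts = []  # text fragments that survive the filter

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_ruby(html):
    """Return the text of an HTML string with ruby annotations removed."""
    parser = RubyStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical fragment (not copied from the archive): kanji with furigana readings.
sample = (
    "<ruby><rb>月</rb><rp>（</rp><rt>つき</rt><rp>）</rp></ruby>の"
    "<ruby><rb>夜</rb><rp>（</rp><rt>よ</rt><rp>）</rp></ruby>"
)
print(strip_ruby(sample))  # prints 月の夜
```

This sketch assumes every rt and rp element is explicitly closed; markup that omits closing tags would need a more forgiving parser such as BeautifulSoup.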

Before (tsuki_no_yo.html):

After (tsuki_no_yo.txt):

In order for this script to work correctly on Tsuki no Yo, it was first necessary to add a few lines of code. The text itself contains a language usage note, represented as an HTML <span> element that the original script does not scan for:

The following code was added to the original script to target any HTML elements called “span,” thereby removing the language usage note entirely. While working with various stories you may discover internal inconsistencies that require you to target specific HTML elements that cause the script to either break or parse incorrectly.

tagname = 'span'
for tag in soup.findAll(tagname):
    tag.extract()

You may also notice that the script tokenizes words, meaning it attempts to group words based on common lexical patterns in Japanese. This work is done by TinySegmenter, one of many parsing tools for East Asian languages. Another useful parsing tool is MeCab, which also works with Python. No parser is 100% accurate (at least not yet), especially for stories within the Aozora database, which may contain antiquated morphological patterns that are no longer in use.

Header Image: Harvard Art Museum, 1977.202. “The Former Deeds of Bodhisattva Medicine King,” Chapter 23 of the Lotus Sutra (Hokekyô) Calligraphy.

Visualisation des Billets Vendus

Visualizing Theater History

Odéon Theater seating layout

Visualisation des Billets Vendus, a data interactive created by Christophe Schuwey, Lecturer at Université de Fribourg (Switzerland), and Christopher Morse, Senior Research Computing Specialist at DARTH, reveals ticket sales at performances during the 1784-1785 season at the Odéon-Théâtre de l’Europe in Paris. The project was conceived during the May 2016 Pratiques théâtrales & archives numérisées Projet des Registres de la Comédie-Française (1680-1793) conference cohosted by Harvard and MIT.

Inspired by the work of conference presenters Pannill Camp, Associate Professor of Drama at Washington University in St. Louis, Juliette Cherbuliez, Associate Professor of French at the University of Minnesota, and Derek Miller, Assistant Professor of English at Harvard University, Christophe and Christopher deliberated over possible methods of representing an interactive theater space within a browser. Although the project is still in its nascent stages, it has already revealed interesting perspectives on performance attendance.

Thanks to the meticulous recordkeeping of the Comédie-Française, it is possible to reconstruct with some certainty how full or empty the theater was during each performance between 1680 and 1793. In addition to cast lists, show dates, and other relevant performance information, the Comédie-Française Registers Project database contains digitized receipts of daily ticket sales. Visualisation des Billets Vendus uses this data to reveal how crowded the theater was on a given night in the form of a heat map: the hotter the color, the busier the performance.

Ticket sales during the opening night of The Marriage of Figaro (April 27, 1784), source.

Designing a Theater

Odéon theater layout, source.

D3.js is a data visualization library built in JavaScript for use in web programming. It was an obvious choice for this particular project because D3 simplifies the task of rendering interactive shapes within the browser and associating them with dynamic data. Using various theater layout diagrams it was possible to abstract the general shape of the theater into a seating chart.

The Odéon consists of several seating areas for which there is recorded data: the parterre assis, galerie, première loge, deuxième loge, troisième loge, and at the top of the theater, paradis. As in other theaters, each seating area is divided into additional sections. For now, the individual sections on each floor are combined into a single total count per floor.
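The per-floor totals described above could be computed along these lines. This is a minimal sketch in Python rather than the project's own D3 code; the floor names follow the list above, while the section names, the ticket counts, and the function totals_by_floor are invented for illustration and do not come from the registers.

```python
from collections import defaultdict

# Hypothetical ticket counts for one performance: (floor, section) -> tickets sold.
# These numbers are placeholders, not data from the Comédie-Française registers.
sales = {
    ("parterre assis", "left"): 120,
    ("parterre assis", "right"): 115,
    ("galerie", "centre"): 80,
    ("paradis", "centre"): 60,
    ("paradis", "side"): 25,
}


def totals_by_floor(section_sales):
    """Sum the ticket counts of every section that belongs to the same floor."""
    totals = defaultdict(int)
    for (floor, _section), tickets in section_sales.items():
        totals[floor] += tickets
    return dict(totals)


floor_totals = totals_by_floor(sales)
print(floor_totals)
```

Keeping the section-level records and summing them on demand leaves room to draw individual sections later without changing the underlying data.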

Each floor of the theater was represented as an arc and assigned its own unique color. A slider at the bottom of the page allows users to cycle through each performance date, and in the top left there is a legend that helps to distinguish how busy a particular floor was in relation to the others. With every movement of the slider, lines representing each floor will appear on the legend to show how that night’s attendance compares with the entire season.
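One way to turn a floor's attendance into a heat map color is to compare that night's total with the busiest night of the season and interpolate between a cool and a hot color. The sketch below makes that idea concrete in Python; the pale yellow to red scale, the linear interpolation, and the function name heat_color are assumptions for illustration, not the scale the project actually uses.

```python
def heat_color(tickets, season_max):
    """Return an '#rrggbb' color running from pale yellow (empty) to red (busiest night)."""
    cold = (255, 255, 204)  # assumed color for an empty floor
    hot = (255, 0, 0)       # assumed color for the season's busiest night
    if season_max <= 0:
        ratio = 0.0
    else:
        # Clamp to the range 0..1 so unexpected values still yield a valid color.
        ratio = max(0.0, min(1.0, tickets / season_max))
    rgb = [round(c + (h - c) * ratio) for c, h in zip(cold, hot)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


print(heat_color(0, 400))    # an empty floor
print(heat_color(400, 400))  # a full floor
```

Normalizing against the season maximum for each floor, rather than against a fixed capacity, matches the comparison shown in the legend: a color then describes how a night ranks within the season.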

When The Marriage of Figaro opened on April 27, 1784, after years of rewrites and censorship, the Odéon was completely packed. The floor plan is bright red, save for one row: the troisième étage. Could this be an error in the data, or perhaps attributable to special seating arrangements, or season passes purchased in advance? With a visual guide to each performance it becomes far easier to discover and query these inconsistencies.

Seating chart during the opening of The Marriage of Figaro (April 27, 1784)

Continuing the Tradition

Visualisation des Billets Vendus, while still in its early stages, has been an interesting thought experiment in theater representation and history, and presents a number of unique challenges. For example, how should one visualize a theater? Does it suffice to abstract a theater into shapes like a seating chart one might see on a website like Ticketmaster? What can be learned (or not) by specificity, that is to say, by attempting to recreate each individual seating area, or even each seat?

Afterlife of the 1680 Comédie-Française Repertoire, by Derek Miller

Moving forward, the visualization seeks to encompass the entirety of the Comédie-Française registers collection, totaling over one hundred years of ticket sales, and various user interface improvements over time will make it easier for users to work with the heat map in more detail.

In addition to this project, the conference hackathon also inspired a number of other data visualizations and digital presentations, including one by Derek Miller, which can be read here: Four Perspectives on the Comédie-Française Repertoire.

Harvard IIIF

DARTH has just released its newest project in collaboration with the Harvard Library, the Harvard Art Museums, HarvardX, Harvard University Information Technology (Arts & Humanities Research Computing, the FAS Academic Technology Group, Library Technology Services), and various academic departments: Harvard IIIF. The Harvard University International Image Interoperability Framework (IIIF) website is a centralized resource for documentation, development, and use case scenarios regarding the display and sharing of cultural heritage materials stored within the various Harvard University collections.

Harvard University has adopted the International Image Interoperability Framework standards as developed by the IIIF Consortium for describing and sharing digital assets, and has co-developed the IIIF image viewing software Mirador.

Visitors can explore the site to learn more about Harvard’s work with IIIF, the Mirador viewer, and exciting innovations happening around campus that are made possible by these new standards and technologies.