About: Part II - Methods

The Stainforth Library of Women’s Writing is a collaborative digital humanities project that, since 2012, has involved the work of thirteen members employed as faculty, staff, and students at the University of Colorado Boulder and Dartmouth College. The aim of the project is to create a digital edition of Stainforth’s library catalog manuscript and, down the road, a digital version of the library. The 740-page catalog lists approximately 8,804 volumes written by or containing writings by 3,721 authors who published in North America, Europe, Asia, and Australia. 

Statement of Our Collaborative Ethos

The Stainforth Library of Women’s Writing would not exist without the contributions of many editors since the project's inception. The project team remains committed to collaboration with core team members, proper attribution for work on the project (where the definition of “work” is fluid), and the preservation of the scholarly integrity of our data and the project as a whole.

Our collaborative methods include working together in person when feasible, but because the team is split among multiple institutions, we usually work remotely. To raise questions and discuss solutions in real time, we use Slack for internal communication. We use Twitter to pose questions to a larger public audience of intellectuals who can approach our queries with fresh eyes and an array of expertise. We also invite responses to our blog posts and hope that in the future they will become sites for lively discussion.

The project began in 2012 as a collaboration between Kirstyn Leuner, Deborah Hollis, and Holley Long in University Libraries at the University of Colorado Boulder. Collaborations among CU-Boulder, Dartmouth College, and, beginning in Fall 2017, Santa Clara University (SCU) arose because the project director was employed full-time at those institutions after completing her PhD at CU-Boulder: first as a postdoctoral fellow in the Neukom Institute for Computational Science at Dartmouth and, subsequently, as Assistant Professor of English at SCU.

Generous funding and support of many varieties have been provided by: an Innovative Seed Grant (CU-Boulder), The President’s Fund for the Humanities (CU-Boulder), Special Collections and Archives (CU-Boulder), The Neukom Institute for Computational Science (Dartmouth College), and the Digital Humanities working group at Dartmouth College. For a list of our current and former contributors, see our About Team Stainforth page.

We adopted a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. We hope that you will use our project to explore the Stainforth library and learn more about this important nineteenth-century private library of women’s writing and its contents. We also encourage scholars to conduct their own research and publish digital or non-digital scholarship on the Stainforth library. According to our CC license, you are free to “copy and redistribute the material in any medium or format” and “remix, transform, and build upon the material” as long as you:

  • Give appropriate credit and cite the Stainforth project. See our “how to cite” page for help.

  • Use the same license we do (CC BY-NC-SA 4.0) and provide a link to this license, so that subsequent work will again use the same license.

  • Use the material for non-commercial purposes.

Transcription

Before we began transcribing, we assigned each line in the catalog a unique identifier so that we could track transcriptions. To do this, we scanned the catalog and created a PDF file for each page, labeled with page and line numbers. We access them here.

Once the PDF pages were created, the team set about transcribing the entire catalog page by page, line by line. It was a painstaking process that required the work of nine editors over three years. Before the transcription process started and during the work, a complex set of Guidelines took shape in the team’s Google Drive folder that established mutually agreed-upon editorial rules for how the team would consistently treat the data. For example, our guidelines specify that where data is unreadable due to damage (torn pages, spilled ink, etc.), we place tags around the damaged text, use x’s to denote missing characters, and record the type of damage in the notes field for the line. These guidelines also cover how to deal with incomplete data, unreadable data, deleted entries, and entries with data added above or below the line. The guidelines further include rules for which data types to enclose in specified tags that accord with the Text Encoding Initiative P5 Guidelines. All of this was done to ensure transcriptional fidelity to Stainforth’s original entries and consistency among an evolving group of collaborators.

For each page of the manuscript, editors created a new Google Sheet from a template and transcribed the manuscript line by line according to our guidelines. This process could be as easy as noting that every line was blank or could be an involved process requiring extensive research and consultation with other editors. Some transcription puzzles were even solved with the help of crowdsourcing solutions on Twitter. The resources we used most often for transcription research include the Sotheby’s auction catalog, worldcat.org, The Orlando Project, Google Books, Internet Archive, and HathiTrust.

Before an editor could begin work on a page, they would “sign” that page out on a tracking sheet maintained in Google Drive. Most editors signed out blocks of approximately ten pages at a time. This same tracking sheet template was used during the editorial process as we double-checked each other’s work.

Editing

Once each line of the manuscript was transcribed, the team began editing the data in stages. Raw transcriptions were saved in a separate Google Drive folder, and edits were made to our transcriptions in new files so that we (a) had backups of our data and (b) could compare the raw transcription with the edited version if needed. The complete editorial process involved comparing our transcription to the manuscript in three different phases. If we found a discrepancy, we would communicate as a team via email to get confirmation from at least one other team member on our proposed change. Our first phase confirmed the shelfmark transcriptions and edited for interlines, that is, places where Stainforth wrote an entry between two lines. Then we looked solely at content to ensure no information was missing from the body of the transcription. Finally, we reviewed our TEI tags to check for any errors. The purpose of editing in multiple phases was to focus our attention on one aspect of the data at a time and thereby create a more accurate and streamlined editorial process. (We discovered that trying to do all of these editorial tasks at once led to too many human errors.) After these three phases were completed, we spot-checked every 10th line (about 2 lines per page) and revisited challenging areas. In an effort to ensure that our data is as clean as possible, we plan to include feedback forms on our site to crowdsource corrections of human errors that we did not catch in our review despite our best efforts.
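The spot-check described above can be pictured with a short Python sketch. It is an illustration only: the team did this work by hand in Google Sheets, and the function name `spot_check_lines`, the list-of-strings page representation, and the 22-line sample page are all assumptions made for the example.

```python
def spot_check_lines(lines, step=10):
    """Return (line_number, text) pairs for every `step`-th line, counting from 1."""
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if number % step == 0
    ]


# A hypothetical 22-line page; real pages hold manuscript transcriptions.
page = [f"line {number}" for number in range(1, 23)]
sample = spot_check_lines(page)
print(sample)  # prints [(10, 'line 10'), (20, 'line 20')]
```

On a page of this length the sample contains two lines, which matches the "about 2 lines per page" figure given above.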

After every page had been thoroughly edited, we compiled all of our data (each ms page had its own file) into two “master” spreadsheets with all pages combined: one “master” for the acquisitions side of the catalog (Stainforth’s holdings) and another for the “Wants” catalog, or wish-list in the back of the catalog, which was transcribed during the same initial phase. At this stage, we needed to further edit our data in preparation for parsing it. This involved assigning an Entry Type (such as “AuthorOnly,” “TitleOnly,” “Blank”) to each line of data. For example, most of Stainforth’s catalog entries include an author and a title, and these lines would therefore be labeled with an “AuthorTitle” entry type. We also located entries that contained multiple editions of a work and added rows so that each individual work and edition was represented by its own line of data. For instance, Fisher’s Drawing Room Scrap Book was a literary annual for which Stainforth had acquired a complete run of editions. He originally listed this collection in one row, as “F8 Fisher's Drawing Room Scrap Book 1832-49.” To account for each edition, we split this title up so that each year had its own row reflecting a single publication date, though of course we did not alter the transcription itself. The parser then separated the data in each transcription into distinct columns: shelfmark, author, title, publication place, publication year, and book format.
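The row-splitting step for a run of editions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual procedure: the dictionary keys (`shelfmark`, `transcription`, `pub_year`) and the function names are hypothetical, and the sketch assumes one edition per year, as in the Fisher example.

```python
import re

# A four-digit start year followed by a two- or four-digit end year, e.g. "1832-49".
YEAR_RANGE = re.compile(r"\b(\d{4})-(\d{4}|\d{2})\b")


def expand_year_range(text):
    """Return every year covered by a range such as '1832-49', or [] if none is found."""
    match = YEAR_RANGE.search(text)
    if match is None:
        return []
    start_text, end_text = match.groups()
    if len(end_text) == 2:
        # An abbreviated end year borrows its century from the start year.
        end_text = start_text[:2] + end_text
    start, end = int(start_text), int(end_text)
    return list(range(start, end + 1)) if end >= start else []


def split_editions(row):
    """Return one row per year; the transcription field is copied unchanged."""
    years = expand_year_range(row["transcription"])
    if not years:
        return [dict(row)]
    return [dict(row, pub_year=year) for year in years]


row = {"shelfmark": "F8", "transcription": "F8 Fisher's Drawing Room Scrap Book 1832-49"}
rows = split_editions(row)
print(len(rows), rows[0]["pub_year"], rows[-1]["pub_year"])  # prints 18 1832 1849
```

Each of the eighteen resulting rows carries its own publication year while keeping the original transcription string intact, which mirrors the rule that the transcription itself is never altered.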

Once the parser sorted our data into the necessary columns that would ultimately form the basis for our database, we needed to edit it once again to eliminate parser errors. For example, where lines of data were irregular, the parser added, subtracted, or otherwise misrepresented information from some columns. We also had to ensure that data was represented consistently throughout. For instance, Stainforth’s primary method of listing authors is by last name, then first name or initials in parentheses, e.g., “Bronte (Charlotte)”. If an irregular line had brackets around a name instead of parentheses, or if Stainforth listed the surname last instead of first, we changed it manually in the AuthorText column but left the transcription untouched so that it accurately reflects the manuscript.
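The consistency rule for author names can be illustrated with a small Python sketch. The team made these changes manually; the code below only shows the rule itself, and the function name `normalize_author` and the regular expression are assumptions made for the example. Names with the surname listed last cannot be reordered reliably by pattern alone, so the sketch flags them for manual review rather than guessing.

```python
import re

# "Surname (Forename)" or "Surname [Forename]"; either delimiter style is accepted.
NAME = re.compile(r"^(?P<surname>[^()\[\]]+?)\s*[(\[](?P<forename>[^()\[\]]+)[)\]]$")


def normalize_author(text):
    """Return 'Surname (Forename)', or None when the entry needs manual review."""
    match = NAME.match(text.strip())
    if match is None:
        return None
    surname = match.group("surname").strip()
    forename = match.group("forename").strip()
    return f"{surname} ({forename})"


print(normalize_author("Bronte (Charlotte)"))  # prints Bronte (Charlotte)
print(normalize_author("Bronte [Charlotte]"))  # prints Bronte (Charlotte)
print(normalize_author("Charlotte Bronte"))    # prints None
```

In a workflow like the one described above, the normalized value would go in the AuthorText column only, leaving the transcription column as it appears in the manuscript.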

Database

Finally, we developed a custom MySQL database with back-end user interfaces for project editors and administrators. We assigned each field in the database a TEI tag set. From the editors’ console or the public interface, users will be able to export a well-formed XML document containing all of the transcription data as well as person and title authority records. The team is currently at work on completing person and title authority records.
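To show what mapping database fields to TEI tags and exporting well-formed XML might look like, here is a minimal Python sketch using the standard library. The field names and the `FIELD_TO_TEI` mapping are assumptions chosen for illustration; the element names (`bibl`, `idno`, `author`, `title`, `pubPlace`, `date`) exist in TEI P5, but the project's actual tag assignments are not documented here and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from database fields to TEI element names.
FIELD_TO_TEI = {
    "shelfmark": "idno",
    "author": "author",
    "title": "title",
    "pub_place": "pubPlace",
    "pub_year": "date",
}


def row_to_bibl(row):
    """Build a <bibl> element from one database row, skipping empty fields."""
    bibl = ET.Element("bibl")
    for field, tag in FIELD_TO_TEI.items():
        value = row.get(field)
        if value:
            ET.SubElement(bibl, tag).text = str(value)
    return bibl


row = {"shelfmark": "F8", "title": "Fisher's Drawing Room Scrap Book", "pub_year": 1832}
xml_text = ET.tostring(row_to_bibl(row), encoding="unicode")

# Parsing the output back raises xml.etree.ElementTree.ParseError if it is not well-formed.
parsed = ET.fromstring(xml_text)
print(parsed.find("title").text)  # prints Fisher's Drawing Room Scrap Book
```

Building the document through an XML library, rather than by concatenating strings, means special characters in the transcriptions are escaped automatically.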

To create these records, editors use the Virtual International Authority File (VIAF.org), The Orlando Project, The Oxford Dictionary of National Biography, and WorldCat.org as resources. Our team began the process of verifying the identity of each person listed in the catalog and creating “person records,” or authority records, for each author in our database. Each record contains, for example, the person’s name (verified by two of the sources listed above, if possible), role (e.g., editor, author, typesetter, co-editor), birth and death years, and nationality. We also include links to their authority file in VIAF, their Wikipedia page, and their ODNB page, if available. As in our previous stages, we pose any questions or points of confusion to the entire team for discussion and resolution via email or Slack, a real-time messaging app that our team has come to rely on as our primary method of communication. We also created guidelines for our authority record process, which are saved in Google Drive. Each member has editorial privileges in our guidelines document, and we update this document as new editorial situations arise. Authorities research will take quite some time to complete given the extent of the data and the obscurity of many of the writers and titles listed.
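The shape of a person record can be sketched as a small Python data structure. This is an illustration of the fields listed above, not the project's database schema: the class name `PersonRecord`, the attribute names, and the `is_name_verified` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersonRecord:
    """One authority record; field names are illustrative, not the real schema."""
    name: str
    roles: List[str] = field(default_factory=list)         # e.g. "author", "editor"
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    nationality: Optional[str] = None
    name_sources: List[str] = field(default_factory=list)  # resources confirming the name
    viaf_url: Optional[str] = None
    wikipedia_url: Optional[str] = None
    odnb_url: Optional[str] = None

    def is_name_verified(self):
        """True when at least two distinct sources confirm the name."""
        return len(set(self.name_sources)) >= 2


record = PersonRecord(
    name="Charlotte Brontë",
    roles=["author"],
    birth_year=1816,
    death_year=1855,
    nationality="English",
    name_sources=["VIAF", "ODNB"],
)
print(record.is_name_verified())  # prints True
```

Optional fields default to `None` because, as noted above, links and biographical details are recorded only when they are available.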

As part of our authority work, we are also keeping a list of authors who have little to no Internet presence, with the intent to edit or create a Wikipedia page for each. This list is available on our blog to encourage other scholars and enthusiasts to join our efforts. The Stainforth Library project is ever-expanding and ever-evolving, and our hope is to expand the membership in “Team Stainforth” and involve scholars and students alike in the process of discussing and recovering the work of the authors, editors, printers, and translators featured in the library.