"

7 Workflows and Pipelines

7.1 Introduction

In this chapter, I walk through the steps I typically take while executing a research project, highlighting some useful AI-enabled tools and places where this technology might add value.  Unlike other chapters in this book, this one is autobiographical in nature.  Every scholar has their own set of unique working methods.  These are mine.

7.2 Initiating a Research Project

Before we get into the nitty-gritty, let’s step back and take a broader look at the research process.  The purpose of this move is to identify those points where AI-enabled tools might add value to the historian’s work.

The starting point of any project is to answer the question, “What am I interested in?”  The scholar’s level of interest is an important consideration as projects can last for days, weeks, or even years.  One therefore needs to be passionate about what they are studying.  Many a project has been started and then abandoned during those long stretches of time devoted to humdrum research activities.  Only a strong inner desire and curiosity will get one through those inevitable research deserts.

With a research focus established, the background reading and bibliographic work now begins.  The first order of business is to find quality sources on the topic of interest.  Of course, the nature of those sources should reflect one’s level of analysis.  Am I working at the macro, meso, or micro level?  A civilizational inquiry will differ dramatically from the study of a nation or a single individual or community.

The library’s online public access catalog (OPAC) still plays a vital role in the discovery process.  Even so, AI-enabled search and summary tools have become extremely useful.  Google recently announced Deep Research, a new research tool that addresses a vexing AI problem: ChatGPT and other large language models (LLMs) often pull citations out of thin air.  In a word, they hallucinate.  This new generation of AI-powered search tools grounds its answers in retrieved documents, so fabricated citations largely disappear.  The tools list the sources used to generate each response, and with a single click the scholar can jump straight to them.

My favorite AI search tool is Liner.  To illustrate how Liner works, consider the case of a scholar who has developed an interest in the sumptuary laws of Venice during the Renaissance.  (Actually, I’m the one who has this interest, though I don’t claim to be a professional historian.)  After arriving at the Liner site, I prompt it as follows:

What has been written about sumptuary laws in Renaissance Venice?

Liner responded with many excellent academic resources, one of the best being Sumptuary Law in Italy: 1200 – 1500 by Catherine Killerby.  It also found a set of great articles in English, a few in Italian, and some specific to Venice.  I was pleased with the results, though it was clear that many had already researched this topic.  That’s the problem with Venice.  Everybody loves her, including the hundreds of historians who’ve studied almost every facet of the city’s existence.  My background reading, though, suggested that there might be topics that have not yet been investigated.  While reading Samuel Huntington’s excellent book – The Clash of Civilizations – I asked myself, “Has anyone ever studied how Venice managed its relationships with its Mediterranean and European competitors?  Has anyone written a diplomatic history of the Venetian republic?”  Venice was and is unique in that almost its entire livelihood was based on trade.  Might the story of how it negotiated its way to trading success, first with the Byzantine and then the Ottoman civilizations, be of any value today?  Given the current talk about global trade and tariffs, this seemed to be a thread worth exploring.  I headed off to Liner and prompted it as follows:

Has anything been written about the Venetian empire’s diplomatic history?

Once again, I was pleased with the model’s response.  However, this response was qualitatively different from the earlier sumptuary conversation.  The resources, in this case, were primarily websites.  Only a single item, a book written in 1944 by Mary Shay, was diplomatic in nature, a study of the dispatches from the Venetian Baili (ambassadors) to the Ottoman court from 1720 to 1734.  This was good news in that it indicated that Venetian diplomacy and its international relations with the known world at that time had yet to be studied in depth.  Of course, I need to investigate this further to verify that this is indeed the case.

7.3 A First Look at the Research Workflow

We begin this section with a view of the overall research process.

 

Figure 7.1. The Top-Level Workflow

For search and bibliography tasks, the last section covered traditional library tools alongside Liner and Google Deep Research, two recently launched AI search and summary services.  Of course, Google Scholar is an excellent resource as well.  After reviewing Liner’s summary of academic sources on Venetian diplomacy, I searched Scholar using “Venetian Diplomacy” as the search phrase.  It found a lot of excellent material, including a dissertation by Tessa Beverly, a scholar who studied the Venetian diplomatic corps from 1454 to 1494.  Surprisingly, these items were not included in the results from Liner or Google Deep Research.  Unfortunately for me, it looks like this topic has already been well researched.

I use EndNote (Clarivate) for bibliographic management.  There are a variety of citation tools on the market, Zotero and Mendeley being two popular alternatives.  Citation and bibliographic management software has been around for a while.  And though these tools are not AI-powered – at least not yet – they play a vital role in any research project.  The work of correctly formatting citations alone makes these tools worth having.  As an initial reading list takes shape, reading commences, and so does note-taking.  About 20 years ago, I tried Dragon, one of the first AI-enabled speech-to-text products to hit the market.  It was terrible.  That is no longer the case.  I use Google’s voice typing all the time now.  It’s free (Dragon retails for $1,000) and accurate.  Yes, it still requires some manual intervention.  Even so, it has saved me a lot of typing time.  Some writers can compose articles through dictation alone.  Churchill dictated many of his books, talking while his secretaries dutifully typed everything he said.  Not me.  I have to sweat every word.

When writing begins, I sometimes use ChatGPT, Llama, or another large language model to flesh out an initial draft.  This has changed recently, as I increasingly find AI-generated writing bland and boring.  I usually end up rewriting everything anyway.  Maybe it’s just me, but I like to write.  Having said that, I have fallen in love with Grammarly.  Grammarly sharpens my writing, improves my style, and fixes hundreds of grammatical errors.  This is the way AI tools should work: as knowledgeable assistants, not substitutes.

7.4 Archive Data Pipelines

The archive lies at the heart of historical research.  And to highlight its importance, the archive box is colored light green in Figure 7.1 from the previous section.

Let’s jump into the archive process to see where AI might be helpful.  Broadly speaking, archive documents fall into four categories: Images, Mixed Image / Text, Handwritten, and Printed.  Again, these are rough categories; I am not trying to define a final or ultimate archive classification here.  The critical point to keep in mind right now is that AI models train best with simple text or image files.  They can train on annotated files, but those files should be free of XML tags.  I realize this contradicts accepted digital humanities data preparation practices, specifically those promoted by the Text Encoding Initiative (TEI).  In a TEI project, the scholar’s first task is to mark up a document set using the organization’s standardized XML tags.  The repetition of tags, though, can gum up model training processes, creating data distortions where they are not wanted or needed.

AI models train best with simple text or image files.

The Text Encoding Initiative (TEI)

When TEI was launched in 1987, it was a great idea, allowing humanities scholars to share annotated documents with each other seamlessly.  It still is.  TEI delivers on its promise of document interoperability, and it should not be abandoned.  So, how should one move forward?  Here’s what we did on a recent project.  We first made a copy of the dataset and then wrote and ran a function to strip all the XML tags from the documents in that copy.  This left us with two datasets, one with tags for digital humanities work and another for AI model training.
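
Here’s a minimal Python sketch of the kind of stripping function I have in mind.  The folder names are placeholders for however your project is organized; the real work is done by itertext(), which walks the XML tree and keeps only the textual content.

# Strip the XML tags from a copy of a TEI corpus, leaving plain text for model training.
# A sketch only: assumes the files are well-formed XML and that the folder names match your project.
from pathlib import Path
import xml.etree.ElementTree as ET

def strip_xml_tags(src_dir: str, dst_dir: str) -> None:
    """Write a tag-free .txt copy of every .xml file in src_dir to dst_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for xml_file in Path(src_dir).glob("*.xml"):
        root = ET.parse(xml_file).getroot()
        text = " ".join(root.itertext())          # textual content only, no tags or attributes
        (out / xml_file.with_suffix(".txt").name).write_text(text, encoding="utf-8")

strip_xml_tags("corpus_tei", "corpus_plain")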

Let’s now turn our attention to specific document processing workflows.  The first is the mixed document workflow, for documents that contain both images and text.  Here we see a book created by scribes and illuminators prior to or shortly after Gutenberg’s invention of the printing press circa 1436.  Books printed before 1501 are now called incunabula, from the Latin word for cradle.  During this period, handwritten texts continued to be produced even as printed works entered the market.  Here’s the initial workflow for these mixed, hand-produced artifacts.  It all begins with a scan of the document, letter, or text.

Figure 7.2. The Mixed Document Pipeline

Later, I will discuss do-it-yourself (DIY) document scanning.  For now, I want to say that it’s always a good idea to consult with a friendly archivist, a professional who can help you design a scanning workflow that fits your budget and project.  Once a clean scan is available, we use AI object detection to split the document into two parts, iconography (images) and handwritten text.  This split is required because the underlying AI models are format-specific; model architecture differs depending on the kind of data being processed.  From there, content is passed to either the iconography or the handwriting workflow.
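
To make that split concrete, here’s a minimal Python sketch of how the object detection step might be scripted.  The model name is a placeholder for whatever layout detection model a project trains or adopts, and the file and folder names are purely illustrative.

# Split a scanned page into image regions and text regions with an object detection model.
# "my-org/page-layout-detector" is a placeholder for a layout model trained to label
# "illustration" and "text" regions on scanned pages.
from pathlib import Path
from PIL import Image
from transformers import pipeline

detector = pipeline("object-detection", model="my-org/page-layout-detector")

Path("iconography").mkdir(exist_ok=True)
Path("handwriting").mkdir(exist_ok=True)

page = Image.open("scans/page_017.png").convert("RGB")
for i, region in enumerate(detector(page)):
    box = region["box"]
    crop = page.crop((box["xmin"], box["ymin"], box["xmax"], box["ymax"]))
    # Illustrations feed the iconography pipeline; everything else goes to handwriting.
    folder = "iconography" if region["label"] == "illustration" else "handwriting"
    crop.save(f"{folder}/page_017_region{i}.png")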

Let’s now consider the iconography workflow, labelled 1 in Figure 7.2.  Content arrives from two sources – the mixed workflow pictured in Figure 7.2 and document scans of iconography or artwork.  Because this workflow only processes visual images, the content coming from the mixed workflow or from new page scans contains only images, no text.

 

Figure 7.3. The Iconography Pipeline

Next, we use an AI-enabled object detection model to read and classify the images.  As shown here, the model has added captions to each image.  We might then run this data through a visual natural language processing (VNLP) model, resulting in an annotated document with a narrative of the action depicted in the artwork.  VNLP is a relatively new field at the nexus of computer vision and natural language processing (NLP).  This technology allows machines to derive meaning from visuals and any accompanying text.
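
Here’s a minimal Python sketch of the captioning step using an off-the-shelf image captioning model from Hugging Face.  A real project would likely fine-tune a domain-specific model, since a generic captioner knows little about medieval iconography; the folder name is illustrative.

# Caption each extracted image with a vision-language model.
# A sketch using a general-purpose captioner; domain-specific fine-tuning would improve results.
from pathlib import Path
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

for img in sorted(Path("iconography").glob("*.png")):
    caption = captioner(str(img))[0]["generated_text"]
    print(img.name, "->", caption)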

Pictured in Figure 7.4 is the handwriting workflow.  Like its iconography counterpart, it receives data from the mixed workflow as well as from new document scans.

 

Figure 7.4. The Handwriting Pipeline

Many handwritten documents are in a cursive script and need to be converted to plain text.  Paleography is the field in which scholars acquire the skills to read and transcribe these documents.  AI models can now do this same kind of work.  Transkribus, a European platform, offers a variety of AI paleography models for a small monthly fee.  Interestingly, cursive scripts and even typed or block print can vary considerably, so models are specific to a time and place.  On the Transkribus public AI models website, for example, the scholar can select from a wide variety of models, including “Nordic Typewriter 1900 – 1950”, “Portuguese Handwriting 16th – 19th Centuries”, “Russian Print of the 18th Century”, and many others.  I think you get the idea.  There is, however, one caveat: the technology is not perfect.  In some cases, AI model accuracy rates equal those of humans, but in many others they fall short.  The situation will probably improve over time, though accuracy remains a limiting factor right now.
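
Transkribus handles the transcription work through its own interface, so no code is needed there.  For scholars who prefer a do-it-yourself route, open handwriting models such as Microsoft’s TrOCR are available on Hugging Face.  Here’s a minimal sketch; note that TrOCR expects a single line of text per image, so a complete pipeline would first segment each page into lines.  The file name is illustrative.

# Transcribe one line of handwriting with TrOCR (a DIY alternative to a hosted service).
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

line = Image.open("handwriting/line_001.png").convert("RGB")
pixel_values = processor(images=line, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])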

After a document has been transcribed, it can then be annotated using a named entity recognition (NER) model.  NER models can detect personal names, geographic locations, and public and private organizations.  As always, NER accuracy is a function of the data on which the model was trained.  Because the names of geographic features (cities, rivers, roads, etc.) can change over time, models trained on period-specific data will need to be created, tested, and deployed.  At a recent conference, a scholar told me that their experience with NER had been a total failure.  The model had identified just a handful of entities in a corpus of medieval documents.  The problem, as I quickly discovered, was that they had used a model trained on modern documents, not one exposed to medieval names, places, and organizations.  In this case, failure was inevitable.  The answer is a model trained on period-appropriate data.
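
Here’s a minimal sketch of the NER step in Python using spaCy.  The stock English model shown is trained on modern text, which is exactly the mismatch described above; for a historical corpus you would swap in a model trained on period-specific data.  The file path is illustrative.

# Annotate a transcribed document with named entities using spaCy.
# en_core_web_sm is trained on modern text; a historical project would load a custom model here.
import spacy

nlp = spacy.load("en_core_web_sm")
text = open("transcripts/letter_042.txt", encoding="utf-8").read()

for ent in nlp(text).ents:
    print(ent.text, ent.label_)   # e.g., "Venice GPE" or "Council of Ten ORG"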

Figure 7.5, pictured here, shows the workflow for printed documents.

Figure 7.5. The Printed Documents Pipeline

Earlier, I mentioned that I would discuss a simplified scanning process.  What follows is a description of a low-volume (personal) scanning workflow, not a high-volume one for a large project.  At the University of Florida, the libraries have placed KIC Bookeye scanners in various locations.  The Bookeye is an excellent choice because it offers adjustable cradles for holding a book’s left and right sides.  This can be helpful when a book refuses to lie flat while scanning.   A user can even separate the cradles from each other, creating a slot for the spine.  Better still, KIC’s touchscreen interface is simple and intuitive.  You’ll be scanning and emailing .pdfs to yourself in no time.

Once a scan is complete, the software packages everything into a single .pdf file and asks for an email address.  If your organization limits the size of email attachments, you’ll need to limit the number of pages scanned.  At UF, that limit is 25 pages or so.  After the .pdf file hits my inbox, I save it to a local drive and then upload it to Adobe Acrobat’s OCR tool if it has text in it.  OCR is an acronym for Optical Character Recognition, a technology that converts scanned images of text in a .pdf file into searchable, editable text.  This allows users to copy, edit, or highlight the content.  Once again, the objective is to end up with simple text files as this is what our AI models need to do their work.
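
Acrobat works well, but this step can also be scripted.  Here’s a minimal sketch using the open-source Tesseract engine; it assumes the Tesseract and poppler utilities are installed locally, and the file names are illustrative.

# Convert each page of a scanned .pdf to an image, OCR it with Tesseract,
# and save the result as a plain text file ready for the AI models.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("scans/chapter_scan.pdf", dpi=300)
with open("scans/chapter_scan.txt", "w", encoding="utf-8") as out:
    for page in pages:
        out.write(pytesseract.image_to_string(page))
        out.write("\n")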

And finally, Figure 7.6 zooms out to show the larger picture.  As shown here, the annotated, plain-text documents produced by the three workflows are staged in a shared location, making them accessible to our AI tools.

 

Figure 7.6. The All Documents Pipeline
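
Here’s a minimal sketch of that staging step in Python.  The folder names are placeholders for however a project organizes the output of the three workflows.

# Copy the plain-text output of the three workflows into one shared staging folder.
import shutil
from pathlib import Path

staging = Path("staging")
staging.mkdir(exist_ok=True)

for source in ("iconography_text", "handwriting_text", "printed_text"):
    for txt in Path(source).glob("*.txt"):
        # Prefix with the source folder so files from different workflows never collide.
        shutil.copy2(txt, staging / f"{source}__{txt.name}")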

In summary, AI has added some new tools to the historian’s research toolbox. Even though the tools have changed, the craft of historical research remains unchanged.  Nothing can replace the thoughtful work of a human researcher.




Media Attributions

  • Top Level Workflow
  • Mixed Document Workflow
  • The Iconography Pipeline
  • The Handwriting Pipeline
  • The Printed Documents Pipeline
  • The All Documents Pipeline

License

Advanced Language Technology for Scholars Copyright © by Daniel Maxwell. All Rights Reserved.