2 AI and the Research Project Lifecycle
2.1 Introduction & Steps
In this chapter, we introduce the AI project lifecycle and the steps a principal investigator takes to execute a project. A quick overview is presented here, with a minimum of detail.

Step One
The first step is to pose a creative and interesting research question. If your goal is to deliver a service, the process is similar but with a slightly altered outcome. A researcher can arrive at their research question in multiple ways. As shown in the figure below, bi-directional relationships exist between the question, the data, and the method.

In most cases, a researcher already has a research question or a set of questions that reflect their existing interests and area of specialization. But sometimes, a dataset sets a project in motion. An organization, for example, may have just digitized a collection of documents and released them via a publicly accessible web interface. When that’s the case, research opportunities often pop up, and the astute researcher takes advantage of whatever comes their way. Indeed, data informs and delimits the questions that can be answered. The data and questions, in turn, drive the selection of suitable research methods. In other words, the research question, the data, and the method mutually influence each other. Together, they form a holy and indivisible trinity. That is, the researcher cannot isolate one from the other two. On this three-part foundation, an AI research project rests.
Much less frequently, a researcher has an in-depth understanding of a machine learning technique and then looks for ways to apply it. Although this is a bit backwards, it, too, is an acceptable way to launch a project. With a technique selected, the researcher must then find a dataset and formulate their research questions. As in all research projects, the key here is to pose questions that can be answered, given your time and budget constraints.
Step Two
Once you’ve articulated your research questions, it’s time to gather data. The critical point to remember during this step is that the data must represent the population under study. In statistics, we say that a data sample ought to be representative. That is, it ought to accurately reflect the attributes of a specific population. The same holds true when collecting data for an AI project. The goal here is a balanced dataset, largely free from distorting bias, able to answer your research questions, and aligned with your goals and values. A bigger dataset is generally better, provided it meets the criteria outlined above.
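One quick, concrete way to check how balanced a freshly gathered dataset is against the population it is meant to represent is to compare class proportions. The sketch below assumes a hypothetical CSV of labeled documents with a "label" column and illustrative population figures; adapt it to your own data.

```python
# A minimal sketch, assuming a hypothetical file "collected_documents.csv"
# with a "label" column. Comparing sample proportions to known population
# proportions is one quick check on representativeness.
import pandas as pd

df = pd.read_csv("collected_documents.csv")  # hypothetical file name

# Relative frequency of each class in the sample
sample = df["label"].value_counts(normalize=True)
print(sample)

# Illustrative population proportions, if you happen to know them
population = {"letters": 0.50, "diaries": 0.30, "photographs": 0.20}
for label, expected in population.items():
    observed = sample.get(label, 0.0)
    print(f"{label}: sample={observed:.2f}, population={expected:.2f}")
```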
Step Three
Step three is data cleaning and preparation. On a typical AI project, about 80% of one’s time is spent gathering data (Step Two) and then cleaning and preparing it. For example, identifying outliers and imputing missing data are often done during this step. Visualization is an excellent way to spot these kinds of problems while acquiring a feel for the dataset. Typically, any feature or data engineering is also performed at this time. Normalizing data so it’s all on the same scale, converting categorical data to numeric data (one-hot encoding, etc.), and augmenting data are but three examples of data engineering. The list of options here is practically endless.
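To make this step more tangible, here is a minimal preprocessing sketch using scikit-learn. It chains imputation of missing values, scaling of numeric features, and one-hot encoding of categorical features into a single pipeline. The column names and file name are hypothetical stand-ins for whatever your dataset actually contains.

```python
# A minimal cleaning-and-preparation sketch with scikit-learn.
# Column names and the input file are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("collected_documents.csv")   # hypothetical file

numeric_cols = ["word_count", "year"]          # assumed numeric features
categorical_cols = ["language", "doc_type"]    # assumed categorical features

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill missing values
    ("scale", StandardScaler()),                    # put features on the same scale
])

categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # categorical -> numeric
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

X = preprocess.fit_transform(df)  # cleaned, numeric feature matrix
```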
Step Four
With the data staged, you’re ready to choose a model. This step is closely linked to the previous one, as AI models require data to be presented to them in specific formats. The variety of model types and the total number of available models are expanding daily. Hugging Face is a great place to find pre-trained models. Two things primarily drive the choice of model: 1) the research question(s) and 2) the type of data collected. Convolutional neural networks (CNNs), for example, are designed to handle unstructured image data. Thus, a CNN is usually the first choice when analyzing or classifying images.
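As a sketch of how little code it can take to try out a pre-trained model from Hugging Face, the snippet below loads one publicly available CNN checkpoint through the transformers library. The model id and image file are examples; substitute whichever model best matches your question and data.

```python
# A minimal sketch: loading a pre-trained image classifier from Hugging Face.
# "microsoft/resnet-50" is one publicly available CNN checkpoint; the image
# file name is hypothetical.
from transformers import pipeline

classifier = pipeline("image-classification", model="microsoft/resnet-50")

predictions = classifier("sample_photo.jpg")  # hypothetical image file
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```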
Step Five
Once a model has been chosen, it’s time to train it. Training time can be dramatically reduced when you select a pre-trained model. This is called transfer learning. With transfer learning, we take a model trained on one task and then partially retrain it to work for us on a similar task. The other option is to design a model from the ground up. This, however, requires a high level of technical expertise and can be slow and expensive. Training commences as cleaned data is fed to the model. As training progresses, the model adjusts its parameters in response to what it learns about the data in relation to a specific learning goal. If all goes well, the model converges on (that is, finds) an optimal solution, improving its accuracy along the way.
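The sketch below shows one common form of transfer learning in Keras: a pre-trained CNN is loaded, its parameters are frozen, and only a small new classification head is trained on your data. The input shape, number of classes, and training data are hypothetical.

```python
# A minimal transfer-learning sketch in Keras. The number of classes and the
# training data are hypothetical; only the new head is trained.
import tensorflow as tf

num_classes = 5  # hypothetical number of target classes

base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False  # keep the pre-trained parameters fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # new head
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# train_images and train_labels would come from Steps Two and Three:
# model.fit(train_images, train_labels, epochs=5, validation_split=0.1)
```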
Step Six
In this step, we evaluate the trained model. This is done by presenting it with data it has never seen before. Just before model training, we often divide our data into two parts: a training dataset to train the model and a test dataset to evaluate it. In this step, we use the test dataset to see how well the model’s predictions match the real world, which is called ground truth. A researcher can take several actions to enhance model fit if a model does not perform well during testing. Hyperparameters can be tweaked, or additional feature/data engineering performed. We then repeat the previous step (training) and evaluate the model once again. Stated another way, the relationship between Steps Five and Six can be circular, repeated until the desired results are obtained.
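Here is a minimal sketch of the split-train-evaluate loop just described, using a small built-in dataset and a simple scikit-learn classifier as a stand-in for whatever model you actually chose. The held-out test set plays the role of the unseen data and ground truth.

```python
# A minimal split-train-evaluate sketch with scikit-learn.
# load_digits is a small built-in dataset used purely for illustration.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Hold out a test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Compare predictions against the ground-truth labels
predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
```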
Step Seven
And finally, we deploy our trained model, making it available to a larger audience. In this final step, you might decide to create a custom web interface that allows users to interact easily with your model. The Hugging Face platform will do some of this work for you: models uploaded to its hub are published with a model card, and many can be tried out directly in the browser through a hosted inference widget. If a custom interface is unnecessary, the best practice is to make all project-related code and data available on an open platform like GitHub. Some journals require that this be done as part of the article submission process.
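For a custom interface, a few lines of Gradio are often enough. The sketch below wraps the hypothetical Hugging Face image classifier from Step Four in a simple web app; Gradio apps like this can also be hosted on Hugging Face Spaces.

```python
# A minimal Gradio sketch for a web interface around a trained model.
# The model id is an example; swap in your own deployed model.
import gradio as gr
from transformers import pipeline

classifier = pipeline("image-classification", model="microsoft/resnet-50")

def predict(image):
    # Return a {label: score} dictionary for Gradio's Label component
    return {p["label"]: float(p["score"]) for p in classifier(image)}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=3),
)
demo.launch()
```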
Media Attributions
- Question, Data, Method © Dan Maxwell