"

2 Alex API

The Alex API: Module { xml_to_text }

Overview

The xml_to_text module contains functions that strip XML tags from plain text files.  It can process either an entire directory of files with process_directory() or a single file with process_xml_file().

Call Tree

process_directory()
    process_xml_file()
        replace_tags()
        extract_metadata()
        process_body()
            clean_text()
    format_metadata()

Function Descriptions

extract_metadata()

The extract_metadata function is designed to extract metadata from the XML header section. It begins by initializing an empty dictionary named metadata to store the extracted information. The function then searches for the HEADER section within the XML tree, starting from the root element using the find method. If the HEADER section is found, the function iterates through all elements within this section. For each element, it converts the tag name to lowercase and checks if the tag is already present in the metadata dictionary. If the tag is not present, it adds the tag as a key with its text content as the value, stripping any leading and trailing whitespace. If the tag is already present, it appends the new text content to the existing value, separated by a semicolon. Finally, the function returns the metadata dictionary containing the extracted metadata from the HEADER section.
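The steps described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual source: the search path ".//HEADER", the "; " separator, and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def extract_metadata(root):
    """Collect HEADER fields into a dict keyed by lowercase tag name."""
    metadata = {}
    header = root.find(".//HEADER")  # assumed search path for the header
    if header is None:
        return metadata
    for elem in header.iter():
        if elem is header:
            continue  # iter() yields the HEADER element itself first
        tag = elem.tag.lower()
        text = (elem.text or "").strip()
        if tag not in metadata:
            metadata[tag] = text
        else:
            # Repeated tags are merged; the "; " separator is an assumption.
            metadata[tag] += "; " + text
    return metadata


# Hypothetical sample document, for illustration only.
sample = ET.fromstring(
    "<DOC><HEADER><TITLE> Poems </TITLE><AUTHOR>Donne</AUTHOR>"
    "<AUTHOR>Anon</AUTHOR></HEADER></DOC>"
)
print(extract_metadata(sample))
```

Merging repeated tags into one value keeps a single entry per field, which suits a header in which an element such as AUTHOR may appear more than once.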

process_body()

The process_body function processes and cleans the body content of an XML element. It starts by initializing two empty lists, sections and current_section, and setting a flag, start_processing, to False. It then iterates through all elements in the XML. When it encounters a DIV1 tag, it sets start_processing to True and moves on to the next element. Once start_processing is True, it processes each element according to its tag (HEAD, Q, L, or P) and appends the cleaned text to current_section or sections. Finally, it returns the processed sections as a single string.
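One possible shape for this loop is sketched below. The description does not say how each tag is handled, so the rules here are assumptions: a HEAD starts a new section, Q, L, and P add a line to the current section, and a simple whitespace collapse stands in for clean_text().

```python
import xml.etree.ElementTree as ET


def process_body(root):
    sections = []          # finished sections, one text block each
    current_section = []   # lines gathered for the section in progress
    start_processing = False

    for elem in root.iter():
        if elem.tag == "DIV1":
            start_processing = True
            continue
        if not start_processing:
            continue
        # Stand-in for clean_text(): join the element's text, collapse whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if not text:
            continue
        if elem.tag == "HEAD":
            # Assumption: a heading closes the previous section.
            if current_section:
                sections.append("\n".join(current_section))
            current_section = [text]
        elif elem.tag in ("Q", "L", "P"):
            current_section.append(text)

    if current_section:
        sections.append("\n".join(current_section))
    return "\n\n".join(sections)


# Hypothetical sample; content before DIV1 is skipped.
body = ET.fromstring(
    "<TEXT><FRONT><P>skip</P></FRONT><DIV1><HEAD>One</HEAD>"
    "<P>First  para.</P><L>A line</L><HEAD>Two</HEAD><P>Second.</P></DIV1></TEXT>"
)
print(process_body(body))
```

Note that this sketch does not guard against nested elements (for example an L inside a Q), whose text would be collected twice.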

get_text()

This function concatenates the element’s text and the text of its children. Then it returns that text.
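A minimal sketch of such a helper, assuming the element comes from Python's xml.etree.ElementTree; the module's own implementation may differ:

```python
import xml.etree.ElementTree as ET


def get_text(element):
    """Return the element's text joined with the text of all its descendants."""
    return "".join(element.itertext())


# Hypothetical sample element.
para = ET.fromstring("<P>Go and <HI>catch</HI> a star</P>")
print(get_text(para))
```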

clean_text()

The clean_text function is designed to clean and format text content. It starts by removing specific unwanted characters, such as ‘∣’ and ‘▪’, using the replace method. Next, it removes any HTML-like tags from the text by using the re.sub function with a regular expression that matches any content within angle brackets. The function then replaces multiple spaces or tabs with a single space, again using the re.sub function with a regular expression that matches one or more spaces or tabs. It also replaces multiple newlines with double newlines. Finally, the function returns the cleaned text with leading and trailing whitespace removed by using the strip method.
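The sequence of replacements might look like the sketch below. The regular expressions are assumptions chosen to match the description; in particular, "multiple newlines" is read here as two or more consecutive newline characters.

```python
import re


def clean_text(text):
    # Remove the unwanted characters named in the description.
    text = text.replace("∣", "").replace("▪", "")
    # Remove anything that looks like a tag: content within angle brackets.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse runs of two or more newlines into exactly two (assumed pattern).
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text.strip()


print(clean_text("  <P>Go∣ and \t catch</P>\n\n\n\na star▪  "))
```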

replace_tags()

This function uses string replacement and regular expressions to replace specific tags in the XML content that are considered important, and it returns the modified XML content. More tags can be added based on the meeting with the scholars.

process_xml_file()

The process_xml_file function is designed to parse an XML file and process its content. It opens the XML file and reads its content, replaces specific tags in that content, and then parses the modified content. The root element is retrieved, and the metadata is extracted from it. The body content is then processed and cleaned. The metadata and the cleaned content are returned separately because two output files are needed.

format_metadata()

The format_metadata function is designed to format metadata for display. It takes a dictionary named metadata as input, where each key-value pair represents a piece of metadata. The function uses a list comprehension to iterate over each key-value pair in the dictionary. For each pair, it converts the key to uppercase and formats it together with the value into a string in the format KEY: value. The function includes only those key-value pairs where the value is not empty. It then joins the formatted strings with newline characters to create a single string, with each key-value pair on a new line. Finally, the function returns this formatted string, making the metadata easy to read and display.
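A short sketch of this formatting step, written from the description above rather than taken from the module's source:

```python
def format_metadata(metadata):
    """Format non-empty metadata entries as 'KEY: value', one per line."""
    lines = [f"{key.upper()}: {value}" for key, value in metadata.items() if value]
    return "\n".join(lines)


# Hypothetical metadata; the empty author value is skipped.
print(format_metadata({"title": "Poems", "author": "", "date": "1633"}))
```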

process_directory()

The process_directory function processes all XML files in a specified directory and saves the output to the specified output directories. It begins by iterating through all files in the input directory. Each XML file is processed to extract its metadata and cleaned content, and the metadata is formatted for display. The cleaned content and the metadata are then written to their respective output directories (preferably “text” and “metadata” when calling the function). Finally, the function prints a message once the files have been processed and saved.
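The overall flow could be sketched as follows. The three-argument signature, the ".txt" output names, and the printed message are assumptions, and process_xml_file() is replaced by a placeholder so that the example runs on its own.

```python
from pathlib import Path


def process_xml_file(path):
    """Placeholder for the real function: returns (metadata, cleaned_text)."""
    return {"source": path.name}, path.read_text(encoding="utf-8")


def format_metadata(metadata):
    return "\n".join([f"{k.upper()}: {v}" for k, v in metadata.items() if v])


def process_directory(input_dir, text_dir, metadata_dir):
    # Hypothetical signature: one input directory, two output directories.
    text_dir, metadata_dir = Path(text_dir), Path(metadata_dir)
    text_dir.mkdir(parents=True, exist_ok=True)
    metadata_dir.mkdir(parents=True, exist_ok=True)

    for path in sorted(Path(input_dir).iterdir()):
        if path.suffix.lower() != ".xml":
            continue  # only XML files are processed
        metadata, content = process_xml_file(path)
        (text_dir / f"{path.stem}.txt").write_text(content, encoding="utf-8")
        (metadata_dir / f"{path.stem}.txt").write_text(
            format_metadata(metadata), encoding="utf-8"
        )

    print(f"Processed files saved to {text_dir} and {metadata_dir}")
```

A call such as process_directory("xml", "text", "metadata") would then write one text file and one metadata file per XML input.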


The Alex API: Module { search_dataset }

Overview

This module is still under development and has not yet been released.

Signature

search_dataset (dataset, search_terms[list], bool_oper, search_type, variants, meta_only, from_date, to_date)

Function Arguments

dataset

The document dataset to be searched.  The documents for each of our curated datasets reside in a top-level folder named for that dataset.  The complete list of curated Alex datasets is at: /blue/data/reference/humanities

Global constants:

EEBO

EEBOP

MEDICI

DONNE

search_terms[list]

This argument can accept a single search term (type: string) or multiple search terms in a Python list.  If no value is passed, the function raises a missing search term error.

When the search_type is ‘Hybrid’, this function has two options for the vector search.  It can search either the dense or sparse vector.  If the argument has one or two search terms, the function searches the sparse vector. If more than two search terms are given – that is, a phrase is passed – then the function searches the dense vector.  The sparse vector is best for keyword searches whereas the dense vector performs better with semantic searches.
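The selection rule described above can be illustrated with a small helper. Because the module has not been released, everything here is hypothetical: the function name, the word-counting approach, and the threshold parameter are assumptions made for the sketch.

```python
def choose_vector_field(search_terms, max_sparse_terms=2):
    """Pick 'sparse' for short keyword queries and 'dense' for phrases."""
    if not search_terms:
        raise ValueError("missing search term")
    if isinstance(search_terms, str):
        search_terms = [search_terms]
    # Assumption: words are counted across all terms, so a phrase passed
    # as one string is treated the same as a list of single words.
    words = [word for term in search_terms for word in term.split()]
    if not words:
        raise ValueError("missing search term")
    return "sparse" if len(words) <= max_sparse_terms else "dense"


print(choose_vector_field(["falling", "star"]))
print(choose_vector_field("go and catch a falling star"))
```

Keeping the threshold as a parameter makes it easy to adjust if the cut-off between keyword and phrase searches changes.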

bool_oper

The boolean operator is used when multiple search terms are provided in the search_terms argument.  This value can be either ‘and’ or ‘or’.  It defaults to ‘or’ when no value is passed.  When searching for a single term, this argument can be omitted.
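As an illustration of how the two operators differ, the sketch below combines per-term result sets. The helper name and the use of document IDs are hypothetical; the unreleased module may combine results differently.

```python
def combine_matches(matches_per_term, bool_oper="or"):
    """Combine per-term sets of document IDs with 'and' or 'or'."""
    oper = (bool_oper or "or").lower()  # default to 'or' when nothing is passed
    if oper not in ("and", "or"):
        raise ValueError("bool_oper must be 'and' or 'or'")
    sets = [set(matches) for matches in matches_per_term]
    if not sets:
        return set()
    if oper == "and":
        return set.intersection(*sets)  # documents matching every term
    return set.union(*sets)             # documents matching any term


# Hypothetical document IDs matched by two search terms.
hits = [{"doc1", "doc2"}, {"doc2", "doc3"}]
print(sorted(combine_matches(hits, "and")))
print(sorted(combine_matches(hits)))
```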

search_type

This argument offers multiple ways to search a dataset.  Here are the available options for this argument with brief descriptions:

  1. ‘ANN’   : The Approximate Nearest Neighbor or ANN search works well when you need to find documents in the dataset that have a similar meaning to the phrase or sentence entered as your search term.  ANN is an inexact kind of search that calculates a semantic distance when deciding if a specific document ought to be returned or not.
  2. ‘Hybrid’: The Hybrid search option allows you to execute a semantic search if the number of words in the search term is more than three.  Otherwise, it executes a case-sensitive keyword search. This search type uses the BGEM3 word embedding, a different one from that used in ANN.
  3. ‘Index’ : With this option, the algorithm searches a full-text index of all the words in the dataset.  Additionally, it generates a list of spelling variants and searches those too.  Note: the first search can be a bit slower as the index is loaded into memory.  This search type is similar to traditional index searches from vendors such as ProQuest.

For additional information about ANN and Hybrid searches, please see the Milvus User Guide.  Both execute searches against the Milvus vector database.

variants

This argument can be set to True or False, depending on whether the user wants a full-text (‘Index’) search to include variant spellings of the search_terms (True) or not (False).  If not specified, this argument defaults to False.

meta_only

This argument accepts a Python boolean (True or False).  If not specified, it defaults to False.  Set it to True if you want to search only the metadata and not the full text.  Technical note: each of our dataset folders has both /meta and /text sub-folders.  Thus, this argument determines the physical location of the document files to be searched.

from_date

This argument sets the start date for a date range search.  It limits the retrieved documents to those dated on or after the date specified in this argument.  Dates should be formatted as: ‘mm/dd/yyyy’.

to_date

This argument sets the end date for a date range search.  It limits the retrieved documents to those dated on or before the date specified in this argument.  Dates should be formatted as: ‘mm/dd/yyyy’.
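The inclusive date range described for from_date and to_date can be sketched with Python's standard datetime module. The helper names are hypothetical, and how the unreleased module stores document dates is not stated, so a datetime.date value is assumed here.

```python
from datetime import date, datetime

DATE_FORMAT = "%m/%d/%Y"  # matches the documented 'mm/dd/yyyy' format


def parse_date(value):
    """Parse an 'mm/dd/yyyy' string; raises ValueError on a bad format."""
    return datetime.strptime(value, DATE_FORMAT).date()


def in_date_range(doc_date, from_date=None, to_date=None):
    """Return True if doc_date falls within the inclusive range."""
    if from_date and doc_date < parse_date(from_date):
        return False
    if to_date and doc_date > parse_date(to_date):
        return False
    return True


# Hypothetical document date checked against a range.
print(in_date_range(date(1633, 1, 1), "01/01/1633", "12/31/1640"))
```

Leaving either bound unset makes the range open-ended on that side.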

 

License

The Alex Research System Copyright © by Daniel Maxwell. All Rights Reserved.