CoCo Data Transformation: A Note and Reflection
One of the main goals of the CoCo project is integrating data that were scattered across the siloed collections of many institutions. The integration is meant to enrich the data and give users a platform to analyze it and harvest insights from it. One of the main phases of data integration is data transformation, whose goal is to standardize the structure and format of the various data obtained from different cultural heritage institutions. I have been developing data transformation scripts for the CoCo project, transforming data from SKS, SLS, and Edelfelt, and here I want to highlight some of the interesting insights and lessons I have taken from that work.
1) The Diversity of Data Structure
When I first started doing data transformation, I was briefed about the circumstances at hand. We have a sheet recording the information of each dataset: name, size, format, and so on. I could see from that sheet that the data comes in all sorts of formats: tabular CSV and Excel files, structured XML files, JSON endpoints that we need to crawl first, and even less structured Word files. To an extent, I had expected this diversity of formats. What exceeded my expectations was how differently information is stored within the files.
My first task was to transform JSON files served through an HTTP API endpoint containing the Edelfelt data. This data source provides neat, structured data in an easy-to-process JSON format. Because the JSON is exposed through an API, I also had to crawl the API and look for every relevant JSON document. I created a Python script to do the scraping; luckily the API does not impose a rate limit and its performance is pretty good, so the process was not too time-consuming. So while the crawl posed some problems, overall processing this data source did not cause too many difficulties.
The second file I handled is an Excel file provided by the Swedish Literature Society in Finland, SLS. The interesting part started with obtaining the Excel file, which was given to us on a USB stick (one with, I have to say, a beautiful custom design) that I had to pick up in person from the SLS office. The Excel file has several columns that describe the data in Swedish, and each row contains the relevant data. While the data is nicely structured in a tabular format, some processing is still required: for example, I had to combine several columns to create a full name. Interesting information extraction also starts to emerge here, such as determining whether the sender or receiver is a person or an institution based on which of the name columns is filled (a minimal sketch of this logic appears at the end of this section).
The last file I processed is the CSV file from the Finnish Literature Society, SKS. The CSV is simple, with not too many columns, and the way I handle its structure is pretty straightforward, much like SLS's file. But behind this simplicity lies a complexity that took me by surprise.
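To make the SLS column handling concrete, here is a minimal Python sketch of the idea, assuming pandas and hypothetical column and file names (the real sheet uses different Swedish headers); it illustrates the approach rather than the actual transformation script:

import pandas as pd

# Hypothetical column and file names for illustration only; the real SLS
# sheet uses different Swedish headers.
PERSON_FIRST, PERSON_LAST, ORGANISATION = "fornamn", "efternamn", "organisation"

def cell(row, column):
    """Read one cell as a clean string, treating missing values as empty."""
    value = row.get(column)
    return "" if pd.isna(value) else str(value).strip()

def build_actor(row):
    """Return (name, actor_type) for one side of a correspondence.

    If the personal-name columns are filled, join them into a full name and
    treat the actor as a person; otherwise fall back to the organisation
    column and treat the actor as an institution.
    """
    first, last, org = (cell(row, c) for c in (PERSON_FIRST, PERSON_LAST, ORGANISATION))
    if first or last:
        return " ".join(part for part in (first, last) if part), "person"
    if org:
        return org, "institution"
    return "", "unknown"

df = pd.read_excel("sls_letters.xlsx")
actors = df.apply(build_actor, axis=1)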
2) Harvesting Information From the Nooks and Crannies
Harvesting information is not always straightforward. While in most cases all we need to do is extract values from structured fields, in some cases we have to delve deeper to get the information that we want. The SKS file conversion is a good example of that.
As explained above, the SKS CSV has relatively few columns compared to the other datasets. While at first glance this seems to indicate less work, it turns out that extracting information from it is more complex than from the other datasets. One example is the extraction of the sender and receiver. The dataset only has a column denoting the name of the sender of a correspondence, while the name of the receiver is not described in any column. However, I noticed that the data has a potentially helpful column containing the archive name. An anonymized, typical example would look like “W. Ainosen kirje J.W. Erikssonille” (“W. Ainonen’s letter to J.W. Eriksson”). As this example shows, the column describes the sender and receiver names, albeit in natural language and in inflected Finnish forms. Hence, in order to extract the sender and receiver, we have to use a natural language processing method, namely lemmatization, to recover the receiver’s name in its base form. As I had never done Finnish NLP before, I looked around for tools and libraries, and for this case I use a web API tool provided by TurkuNLP, an NLP research group at the University of Turku.
Another case from the same dataset is extracting the number of letters exchanged. The dataset only has a column whose Finnish name translates to “table of contents” and which roughly describes the number of letters exchanged. However, the values are not-so-nicely decorated with keywords, with entries like “5 kirjettä, 1 kirjekortti.” (5 letters, 1 letter card) or “22 kirjettä, 4 korttia.” (22 letters, 4 cards). As can be seen, to extract the size we have to pull out the numbers and add them together (a small sketch of this follows below). I also want to add that these “keywords” could get, to put it lightly, interesting, with values ranging from the rather normal “korttia” or “kirjettä” to the more interesting yet significantly rarer “lippu” or “kiitoskirje” (thank-you letter). Thus, we had a discussion about whether to include all keywords or not.
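As an illustration of the count extraction, here is a small Python sketch; the keyword list below is a hypothetical placeholder for the one we actually settled on in that discussion:

import re

# Hypothetical keyword list; the real list came out of the project
# discussion mentioned above.
COUNTED_KEYWORDS = {"kirje", "kirjettä", "kortti", "korttia", "kirjekortti"}

def count_items(value):
    """Sum the counts of recognized items in a free-text 'table of contents' value."""
    total = 0
    # Match "<number> <word>" pairs such as "5 kirjettä" or "1 kirjekortti".
    for number, keyword in re.findall(r"(\d+)\s+(\w+)", value):
        if keyword.lower() in COUNTED_KEYWORDS:
            total += int(number)
    return total

print(count_items("5 kirjettä, 1 kirjekortti."))  # 6
print(count_items("22 kirjettä, 4 korttia."))     # 26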
3) The Reaped Benefits
The data transformation effort has been anything but simple and straightforward. However, this complexity also shows the potential benefit of the data integration being done in Project CoCo. With different languages, data formats, and data structures, the scattered, untidy, and inconsistent state of these datasets would have made analyzing and gaining insight from the data an almost impossible task. The siloed nature of the data not only leaves the information in one dataset disconnected from the others, but also allows the datasets to drift very far apart from one another in form.
We have a routine SPARQL query session where we practice our SPARQL skills by creating queries that answer questions about our dataset. Questions such as “Which actors send the most letters?” or “Which occupations send the most letters?” not only bring learning challenges but also made me realize just what kind of insight is made available by the data integration that we did. On a side note, a positive side effect of these SPARQL sessions is detecting mistakes from the data transformation process, such as when we found a correspondence between two persons that lasted more than 100 years, or when we detected an error because a data pattern from one particular data source differed from the rest.
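To give a flavor of those sessions, here is a rough Python sketch of a “which actors send the most letters” style query using SPARQLWrapper; the endpoint URL, prefix, and property names are hypothetical placeholders and do not reflect the actual CoCo data model or service:

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and schema, for illustration only.
ENDPOINT = "https://example.org/coco/sparql"
QUERY = """
PREFIX ex: <http://example.org/schema/>
SELECT ?sender (COUNT(?letter) AS ?letters)
WHERE {
  ?letter a ex:Letter ;
          ex:hasSender ?sender .
}
GROUP BY ?sender
ORDER BY DESC(?letters)
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["sender"]["value"], row["letters"]["value"])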
4) Personal Notes
For me, developing CoCo’s data transformation has been a colorful and fruitful journey. Aside from the technical learning experience, I also met and worked with people from a variety of backgrounds. I learned new perspectives, new approaches to problems, and how to face problems and design solutions with the diversity of minds that we have.
For my old self, a simple computer science student in Indonesia, working hand in hand with professional computer scientists, historians, and humanists on historical data in Europe seemed like an idea straight out of a daydream. While the reality is less than ideal and I have had to overcome various challenges, it has been an even more exciting and fruitful experience than I imagined. It is my sincere hope that my contribution to this project will bring merit and benefit to many people.
M. F. W.
June 2023