Mining the Meaning (MiMe) is a project financed by the University of Copenhagen Data+ grant that aims to explore the literary and social modernization of Scandinavian societies during the latter part of the 19th century. Unlike traditional historiography, MiMe employs advanced natural language processing (NLP) techniques and semantic analysis to examine a broad corpus of 895 Danish and Norwegian novels published between 1870 and 1900. The project’s methodology includes developing state-of-the-art computational semantic methods and adapting large language models to written late 19th-century Danish and Norwegian. The project also plans to integrate existing conceptual knowledge resources and develop methods for incorporating extra-linguistic information in semantic parsing and analysis.
The MiMe project is closely related to the Measuring Modernity (MeMo) project financed by the Carlsberg Foundation, but the two have distinct focuses. While both initiatives explore how societal change is reflected in Scandinavian literature from the same historical period, MeMo focuses primarily on literary analysis, investigating how Denmark and Scandinavia became modern and the role of literature in that process. MiMe, on the other hand, is more oriented towards the development of computational methodology, aiming to create a higher level of abstraction in text analysis that enables a fine-grained but large-scale investigation of various phenomena. Both projects, however, share the common goal of offering new insights into the processes of modernization in this formative period in the literary and social history of Scandinavia.
People
- Jens Bjerring-Hansen, Department of Nordic Studies and Linguistics, University of Copenhagen (Co-PI)
- Daniel Hershcovich, Department of Computer Science, University of Copenhagen (Co-PI)
- Ali Al-Laith, Department of Computer Science, University of Copenhagen (Postdoc)
Sub-projects
- Distinguishing Contemporary and Historical Novels. The classification of historical and contemporary novels is a nuanced task that has traditionally relied on expert literary analysis. We introduce a novel dataset, annotated by literary scholars, to distinguish between historical and contemporary works. While this manual classification is time-consuming and subjective, our approach leverages pre-trained language models to streamline and potentially standardize this process. We evaluate their effectiveness in automating this classification by examining their performance on titles and the first few sentences of each novel. After fine-tuning, the models show good performance but fail to fully capture the nuanced understanding exhibited by literary scholars. Our results underscore the potential and limitations of NLP in literary genre classification and suggest avenues for further improvement, such as incorporating more sophisticated model architectures or hybrid methods that blend machine learning with expert knowledge. We contribute to the broader field of computational humanities by highlighting the challenges and opportunities in automating literary analysis.
- Noise and Sound. We develop a framework for detecting and categorizing noise in literary texts. Noise, understood as “aberrant sonic behaviour,” is not only an auditory phenomenon but also a cultural construct tied to the processes of civilization and urbanization. By leveraging topic modeling techniques and fine-tuned BERT-based language models trained on Danish and Norwegian texts, we analyze the MeMo corpus to extract and examine noise-related topics. We identify and track the prevalence of noise in these texts, offering insights into the literary perceptions of noise during the Modern Breakthrough period. We develop a comprehensive dataset annotated for noise-related segments and their categorization into human-made, non-human-made, and musical noises. We illustrate the framework’s potential for enhancing the understanding of the relationship between noise and its literary representations, providing a deeper appreciation of the auditory elements that enrich literary works.
- The Unhappy Texts. This term is rooted in a literary hypothesis that suggests 19th-century Scandinavian texts written by female authors were characterized by a negative sentiment, reflecting the societal constraints and patriarchal norms that women faced during this era. The authors of these texts often depicted characters who lacked agency and were disillusioned, reflecting their own experiences in a restrictive society. However, it’s important to note that this hypothesis is based on a limited selection of texts and is subject to ongoing analysis and interpretation. The use of sentiment analysis tools and methodologies, such as those developed for the analysis of historical Danish and Norwegian literary texts, can provide a more nuanced understanding of these ‘unhappy texts’ and the societal conditions they reflect.
- Language Models for Historical Literary Scandinavian Texts. This sub-project develops pre-trained language models specifically designed for historical Danish and Norwegian texts. It fills a crucial gap in NLP, which, despite a wealth of English language resources, lacks models tailored for historical Scandinavian literature. Leveraging the unique MeMo corpus, we investigate the potential of both fine-tuning existing pre-trained language models on historical data and training a language model from scratch on it. As part of this endeavor, we will collect additional historical documents, including newspapers and novels from various time periods, to pre-train new language models that encapsulate a richer historical context. We plan to experiment with various architectures, including encoder-only, encoder-decoder, and decoder-only. Furthermore, we create annotated datasets to benchmark these models, and use them to enable large-scale and nuanced literary analysis. We also aim to explore the applicability of these models to other corpora and their potential to enhance other research initiatives. While our project shares similar goals with initiatives like the Danish Foundation Models project, it stands out due to its specific focus on historical Danish and Norwegian texts. We anticipate synergies with other projects, such as resource sharing and expertise exchange, and potentially using their contemporary models as a foundation for our historical models.
- The Fate of the Modern Breakthrough. The modernization processes in Scandinavia in the latter half of the 19th century changed how we perceive the world around us and existence in general, but that did not happen overnight. Through a conceptual historical analysis of the concept of skæbne (fate/destiny), this sub-project explores the dialectical relationship between pre-modern and modern perceptions of the world as it unfolds in Scandinavian literary history. The literary use of the concept reflects, on the one hand, the new secular and scientific ideals that gained ground during the period, while on the other hand, it retains the concept’s religious and metaphysical roots. With many of the novels now forgotten, the project adapts and utilizes unsupervised machine learning models to provide a conceptual overview of the period. Further, it creates annotated datasets to fine-tune and deploy language models for more fine-grained analyses.
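The input setup described in the first sub-project above (classifying a novel from its title and first few sentences) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual preprocessing: the function names, the naive punctuation-based sentence splitter, the default of three sentences, and the " [SEP] " separator are all hypothetical choices made for the example.

```python
import re


def first_sentences(text: str, k: int = 3) -> list[str]:
    """Return the first k sentences of text.

    Assumption: a naive splitter that breaks after '.', '!' or '?'
    followed by whitespace. A real pipeline would likely use a proper
    sentence segmenter for historical Danish and Norwegian.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p][:k]


def build_input(title: str, text: str, k: int = 3, sep: str = " [SEP] ") -> str:
    """Join the title and the opening sentences into one classifier input.

    The separator string is a hypothetical placeholder; the appropriate
    one depends on the tokenizer of the model being fine-tuned.
    """
    return title.strip() + sep + " ".join(first_sentences(text, k))
```

A string built this way could then be paired with a scholar-assigned label (historical or contemporary) to form one training example for fine-tuning.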
Publications
Unhappy Texts? A Gendered and Computational Rereading of The Modern Breakthrough. Kirstine Nielsen Degn, Jens Bjerring-Hansen, Ali Al-Laith and Daniel Hershcovich. Scandinavian Studies, 2. 97, 2025 (Accepted/In press).
Literary Time Travel: Distinguishing Past and Contemporary Worlds in Danish and Norwegian Fiction. Jens Bjerring-Hansen, Ali Al-Laith, Daniel Hershcovich, Alexander Conroy and Sebastian Ørtoft Rasmussen. CHR 2024.
Noise, Novels, Numbers. A Framework for Detecting and Categorizing Noise in Danish and Norwegian Literature. Ali Al-Laith, Daniel Hershcovich, Jens Bjerring-Hansen, Jakob Ingemann Parby, Alexander Conroy and Timothy R Tangherlini. EMNLP 2024.
Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts. Ali Al-Laith, Alexander Conroy, Jens Bjerring-Hansen and Daniel Hershcovich. LREC-COLING 2024.
Sentiment Classification of Historical Danish and Norwegian Literary Texts. Ali Al-Laith, Kirstine Nielsen Degn, Alexander Conroy, Bolette S. Pedersen, Jens Bjerring-Hansen and Daniel Hershcovich. NoDaLiDa 2023.