Dr Eszter Kovacs
About two months ago, I had the opportunity to participate in the second ConCorDial Conference, held in Lyon at the École normale supérieure de Lyon. This event followed the first ConCorDial Conference, held in Grenoble in 2022, and brought together linguists, literary scholars, historians, and historians of ideas, even though the conference’s main scope was historical linguistics.
My paper, based on our teamwork, presented how the VERITRACE project has constituted its corpus on three levels: the four ancient corpora, the Close Reading Corpus, and the Distant Reading Corpus. I underscored that our CRC is, technically speaking, similar to a smaller corpus in historical linguistics, while our DRC is a plurilingual big-data corpus built to trace the influence of the prisca sapientia between 1540 and 1728. Even though we are studying the change in belief in a universal knowledge of divine origin in the works of early modern philosophers, and not change in language usage during the same period, we can benefit from the computational methods developed in corpus linguistics.
The conference was very instructive with regard to our goals and methodology: the constitution and handling of large corpora covering several centuries were the focus of the papers. Presentations on medieval and early modern corpora were particularly helpful and gave us new insights.
Three keynote speakers presented their research during this two-day conference. First, Sascha Diwersy, in his talk “Analyse de données diachroniques en linguistique de corpus et textométrie” (the analysis of diachronic data in corpus linguistics and textometry), considered the aspects of time, text, data, and change, and how these factors can be analysed with computational methods. A decisive question is how to identify turning points in a corpus from a historical point of view. Secondly, Thierry Poibeau, in his talk “L’IA peut-elle vraiment nous aider à explorer les grands corpus littéraires” (can AI really help us explore large literary corpora?), presented automated methods for the analysis of 19th-century novels, among them the French version of BookNLP, a natural language processing pipeline for fiction. How can large literary corpora be analysed with AI? What are the benefits, and what kind of loss does a non-human reading produce? Poibeau raised such questions for collective discussion. Finally, on the second day, Céline Poudat presented “Le projet Open French Corpus”, an annotated and indexed reference corpus for the French language.
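For readers unfamiliar with such pipelines, the sketch below shows how the English BookNLP package is typically run on a single novel. It is only an orientation point: the French adaptation discussed in the talk may expose a different interface, and the file paths and book identifier here are hypothetical.

```python
# Minimal sketch of running the (English) BookNLP pipeline on one novel.
# The French version mentioned in the talk may differ; paths and book_id are illustrative.
from booknlp.booknlp import BookNLP

model_params = {
    "pipeline": "entity,quote,supersense,event,coref",  # analyses to run
    "model": "big",                                      # the larger pretrained models
}

booknlp = BookNLP("en", model_params)

input_file = "novels/example_novel.txt"          # hypothetical plain-text novel
output_directory = "booknlp_output/example_novel/"
book_id = "example_novel"

# Writes tokens, entities, quotations and character clusters to the output directory.
booknlp.process(input_file, output_directory, book_id)
```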
The panel papers also presented insightful studies, for example the presentation of the ECOLE project by Denisa Bumba and Gabriella Parusa, on the growing importance of automatic transcription of manuscripts by AI, namely with tools trained on a medieval and early modern corpus. David Kahn and Sascha Diwersy presented research on the evolution of word usage in a specific context, namely terms relating to heterodoxy in 16th-century Spain; their inquiry is based on spiritual works and two Inquisition trials. Another key question was how to detect relevant changes in language usage through digital corpora. Tanguy Lemoine studied the role of the semicolon in different genres in a 16th–17th-century corpus.
Automatic recognition of specialized discourse in large corpora is also possible, as in the GÉODE project led by Aline Brenon, which focuses on geographical discourse in encyclopaedias, from the Encyclopédie of D’Alembert and Diderot to 19th-century encyclopaedias. Lucas Lévêque, Florian Cuny, and Noé Gasparini presented state-of-the-art methods for the computational processing of public-domain dictionaries.
An inspiring paper by Lucence Ing on the disappearance of lexemes between the 13th and 16th centuries showed that a word’s becoming rare does not in itself amount to complete disappearance; disappearance can only be established once a systematic replacement is detected in the corpus. Mathieu Dehouck, Sophie Prévost, Mathilde Regnault, and Loïc Grobol, in a collective paper on syntactic change in a Neolatin corpus, pointed out that “noise” can appear in the corpus when linear regression is applied. While the authors’ mother tongues influenced the way they wrote in Latin, Latin as the learned language also had rules of its own. The last paper, by Natasha Romanova and Rayan Ziane, dealt with the syntactic annotation of diachronic corpora and the progress that large language models allow when trained on the target corpus.
As the papers of this conference showed, quantitative methods are increasingly important in corpus linguistics and related fields, but human analysis remains necessary for formulating research questions and answering them. Computational analysis has progressed enormously over the past decades, and especially in recent years. However, researchers are aware of the shortcomings of these methods when it comes to an in-depth analysis of certain linguistic or thematic phenomena found in a corpus. LLMs and AI assist our work, but they have to be trained and fine-tuned for the target corpus.
The notion of “noise” in the corpus is of particular interest for our methodology, because noise can arise from several factors, such as poor OCR quality, changes in printing types over the period, and changes in orthography, but it can also appear on a semantic and conceptual level. “Noise”, however, can also become an indicator of interesting phenomena to study in the corpus.
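To make the orthographic side of this concrete, the sketch below shows one crude way such noise can be reduced before counting word frequencies: normalising a few well-known early modern printing conventions (long s, interchangeable u/v and i/j). This is a generic illustration under simplified assumptions, not VERITRACE’s actual pre-processing, and the sample tokens are invented.

```python
import re
from collections import Counter

def normalise_early_modern(token: str) -> str:
    """Crude, illustrative normalisation of a few early modern printing conventions."""
    token = token.lower()
    token = token.replace("ſ", "s")                        # long s -> s
    token = re.sub(r"^v(?=[^aeiou])", "u", token)          # initial v before consonant -> u ("vniuersal")
    token = re.sub(r"(?<=[a-z])u(?=[aeiou])", "v", token)  # medial u before vowel -> v ("diuine")
    token = token.replace("j", "i")                        # i/j were largely interchangeable
    return token

# Invented tokens standing in for OCR output from early modern prints.
tokens = ["Vniuersal", "wiſdome", "diuine", "universal", "wisdome", "divine"]

counts = Counter(normalise_early_modern(t) for t in tokens)
print(counts)  # spelling variants of the same word now fall together
```

Real corpora need far more careful, language- and period-specific rules, but even a rough pass like this shows how much apparent lexical variation is an artefact of spelling rather than of meaning.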
Large corpora, diachronic corpora, and corpora that can only be handled by computational methods have become part of linguistic, literary, and historical studies, and are spreading into fields such as the history of science, the history of religion, and the history of medicine. While a syntactically annotated corpus may seem less important for historians (after all, parts of speech do not seem to change facts), such annotation can also play a role for them in comparative studies of style and content. Semantic annotation, topic modelling, text matching, named entity recognition, latent semantic analysis, and similar methods can be beneficial for all these fields, particularly those centred on such key human and societal factors as language usage and intellectual debates.
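As a small illustration of what one of these techniques looks like in practice, the toy sketch below runs latent semantic analysis with scikit-learn on a handful of invented sentences, reducing TF-IDF vectors to a two-dimensional “semantic” space. It is a generic example under my own assumptions, not the project’s actual workflow.

```python
# Toy latent semantic analysis (LSA): TF-IDF vectors reduced via truncated SVD.
# The documents are invented; a real study would use the full corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "ancient wisdom of divine origin transmitted by the first theologians",
    "natural philosophy and experimental method in the seventeenth century",
    "divine knowledge revealed to the ancients and handed down in secret",
    "experiments, observation and the new science of nature",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)

# Two latent dimensions are enough for a toy example.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_coords = lsa.fit_transform(X)

# Documents sharing vocabulary end up close together in the reduced space.
for doc, coords in zip(documents, doc_coords):
    print(f"{coords.round(2)}  {doc[:50]}")
```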
