WP1-en – CORTEX

work plan

WP1. Commonsense, Semantic, World Knowledge and Infrastructures for Natural Language Generation

Semantic world knowledge is essential for resolving a variety of deep, complex decisions in natural language understanding and generation. Appropriate infrastructures are also necessary to manage all existing and potential new knowledge efficiently and effectively. The purpose of this work package is to explore multiple heterogeneous knowledge sources and existing infrastructures in order to obtain, infer and manage knowledge, ensuring its quality for later integration in the NLG process (WP2). This WP will enable the achievement of objectives OB1 and OB2 through the successful completion of the following three tasks.

Task 1.1 Exploration of existing knowledge sources and language infrastructures

Knowledge in NLP was initially studied through the creation and development of knowledge resources and ontologies (e.g. WordNet (Fellbaum, 1998), BabelNet (Navigli and Ponzetto, 2010), Cyc (Lenat, 1995)). Although more recent initiatives, such as ConceptNet (Speer, Chin and Havasi, 2017), ATOMIC (Sap et al., 2019) or LETO (Estevez-Velarde et al., 2019), are very powerful for natural language understanding, to the best of our knowledge their exploitation for NLG has been limited. Alongside these knowledge resources, there are several language infrastructures, for instance CLARIN or DARIAH-EU, whose development and use have thus far lacked participation from Spanish-based research groups. There are also projects whose main objective is to obtain open, generic language models for research and industry development purposes. Examples are MarIA and LEIA (focused on the Spanish language), and Nós and AINA, which aim to ensure the use of Galician and Catalan, respectively, in the digital age. Together with these infrastructures and models, huge datasets are also available, e.g., the Colossal Clean Crawled Corpus (C4) and its multilingual version mC4, which covers more than 100 languages.

Therefore, the objective of this task is to explore and analyse in depth, and then compile, the existing and available knowledge, infrastructures and language models, thereby identifying the potential of these resources as well as their limitations for multilingual NLG. The task will result in a specific, computationally appropriate knowledge compilation for NLG. Moreover, it will explore to what extent the available large language models and infrastructures can be used as a basis for further research in Tasks 1.2 and 1.3, as well as in the forthcoming work packages.

Milestone: Multilingual language models, linguistic infrastructures and existing knowledge sources and datasets for NLG.

Task 1.2 Knowledge quality assurance and extraction

Ensuring high knowledge quality and precision is crucial for creating NLG models that avoid incorporating societal biases and inaccurate information in further steps (Sheng et al., 2021). It is also necessary before extracting information that may be used in subsequent tasks and applications, since outcomes can be negative if the datasets and language resources employed are not properly cleaned and bias-free. Hence, the aim of this task is to define a methodology and metrics to analyse and detect possible biases in the knowledge sources analysed in Task 1.1.
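As an illustration only, one very simple kind of metric that such a methodology might include is a co-occurrence gap: how much more often one group of terms appears alongside a set of attribute terms than another group does. The sketch below is an assumption made for illustration, not the metric this task will define; the function name `cooccurrence_gap`, the whitespace tokenisation and the example term sets are all hypothetical.

```python
from collections import Counter


def cooccurrence_gap(sentences, group_a, group_b, attributes):
    """Difference in attribute co-occurrence rates between two term groups.

    A value near 0 suggests both groups co-occur with the attribute terms
    at similar rates; a large absolute value flags a possible bias.
    Hypothetical illustrative metric, not the one defined by the project.
    """
    counts = Counter()
    for sentence in sentences:
        # Naive whitespace tokenisation, sufficient for the illustration.
        tokens = set(sentence.lower().split())
        has_attribute = bool(tokens & attributes)
        for name, group in (("a", group_a), ("b", group_b)):
            if tokens & group:
                counts[name, "total"] += 1
                if has_attribute:
                    counts[name, "attr"] += 1

    def rate(name):
        total = counts[name, "total"]
        return counts[name, "attr"] / total if total else 0.0

    return rate("a") - rate("b")


# Toy corpus, invented for the example.
corpus = [
    "she is a nurse",
    "he is a doctor",
    "he is a nurse",
    "she is a nurse today",
]
gap = cooccurrence_gap(corpus, {"she"}, {"he"}, {"nurse"})
print(gap)
```

In practice, a metric of this kind would need proper tokenisation, multilingual term lists and statistical significance testing before it could support any conclusion about a knowledge source.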

Milestone: Methodology and set of metrics to determine and ensure the quality of language models and to guarantee that they are bias-free.

Task 1.3 Knowledge discovery and representation

Once the knowledge from the previous tasks (Task 1.1 and Task 1.2) has been properly cleaned and prepared, the goal of this task is to centralise and represent it by means of an interactive Knowledge Lake (KL) that will contain heterogeneous and multilingual information. Our idea is to draw inspiration from the resource “Know Your Data” (28) created by Google, but with the novelty of extending it with unstructured information (i.e., text) and not only structured data, as is the case now. The KL we propose could be exploited in the following ways: i) as an independent knowledge resource to perform analytical activities to obtain further insights about a topic, entity, etc., potentially relying on ontologies for this purpose; ii) to discover implicit knowledge that can be inferred from the knowledge it already contains, for instance via deep learning and neural networks; and iii) by being extended and enriched with new input knowledge, while always ensuring the quality of the knowledge to be incorporated.

Whether discovering new implicit knowledge or extending existing knowledge, knowledge graph-based approaches, as in COMET (Bosselut et al., 2019), as well as bootstrapping techniques combined with machine learning or deep learning (Consuegra-Ayala et al., 2021), could be employed.
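To make the intended behaviour of the KL more concrete, the following minimal sketch shows one way heterogeneous, multilingual entries could be stored together and admitted only if they pass a quality threshold. It is an assumption for illustration and not the project's design: the class names `Entry` and `KnowledgeLake`, the `quality` score (imagined as an output of Task 1.2), the `min_quality` threshold and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    content: str     # a text passage or a serialised triple
    kind: str        # "text" (unstructured) or "triple" (structured)
    language: str    # language code such as "es", "gl", "ca" or "en"
    source: str      # where the entry came from
    quality: float   # hypothetical score in [0, 1], e.g. from Task 1.2


@dataclass
class KnowledgeLake:
    min_quality: float = 0.5
    entries: list = field(default_factory=list)

    def add(self, entry):
        """Admit an entry only if it passes the quality gate."""
        if entry.quality < self.min_quality:
            return False
        self.entries.append(entry)
        return True

    def query(self, term, language=None):
        """Return entries mentioning `term`, optionally for one language."""
        term = term.lower()
        return [
            e for e in self.entries
            if term in e.content.lower()
            and (language is None or e.language == language)
        ]


# Invented sample entries, for illustration only.
kl = KnowledgeLake(min_quality=0.6)
accepted_text = kl.add(Entry("Alicante is a city in Spain", "text", "en", "demo", 0.9))
rejected = kl.add(Entry("unreliable claim about Alicante", "text", "en", "demo", 0.2))
accepted_triple = kl.add(Entry("(Alicante, IsA, ciudad)", "triple", "es", "demo", 0.8))
print(len(kl.query("alicante")), len(kl.query("alicante", language="es")))
```

The sketch covers uses i) and iii) only in a rudimentary way; inferring implicit knowledge, as in use ii), would require a model-based component on top of a store like this one.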

Milestone: Development of an interactive, heterogeneous, and multilingual Knowledge Lake.