Research topics

My research activity is multidisciplinary and concerns several areas: computer science, linguistics, Natural Language Processing and the biomedical field. More particularly, I am involved in several research topics, whose results have been used and tested in several research projects:

  • Automatic creation of terminological resources, which has become a research area in its own right with the development of semantic methods for information access (i.e., the Semantic Web). More specifically, I investigate the detection of semantic relations between terms (equivalence, hierarchical or transversal relations). The methods designed and exploited rely on lexical inclusion and on the contribution of morphology. More recently, in collaboration with Thierry Hamon (LIMSI, Université Paris 13), I have been working on the detection of synonymy relations through the compositionality principle. The evaluation of the acquired resources shows that the precision obtained is often higher than 95%. The comparison with the existing synonymy resource WordNet indicates that the overlap between the resources is very low: our method provides many new synonymy relations not recorded in the existing resource. Currently, we also work on the reliability of the acquired synonyms, exploiting for this purpose endogenously generated clues and the structure of the graphs. A minimal sketch of the lexical inclusion and compositionality heuristics is given after this list.
  • Improvement of access to information thanks to information retrieval and information extraction methods. My first experiments addressed access to the medical portal CISMeF: query expansion and suggestion of additional keywords to the users. The expansion was performed with morphological variants of the medical terms and showed a positive effect on the search results.
  • Characterization of Web information addresses various points of view: detection of racist content on the Web and detection of the technicality level of health documents. In both areas, the methodology relies on a contrastive analysis of the documents and on the exploitation of their internal properties (lexicon, document structure, colors, morphology, stylistics...). Machine learning and lexicometric algorithms have been used and provided similar and convergent results. Moreover, the detection of the technicality of health documents reached nearly 90% precision and recall, which is a very good performance for the automatic system and for the chosen features (morphology).
  • Quality of online health information consists in the automatic detection of the reliability and medical quality of online health literature. I started working on this topic in 2006 as a member of the Health on the Net Foundation in Geneva, Switzerland. The tool developed implements the ethical HONcode and is based on machine learning algorithms. The evaluation indicates that the generated results are at least as good as those provided by human annotators.
  • Information extraction consists in detecting, in narrative documents, the elements which are of interest for a given task. My main experience is related to participation in the international NLP challenges led by the i2b2 initiative. The methods exploited are based upon semantic resources, rule-based approaches and/or machine learning. Some of the addressed tasks are: extraction of medications and of their characteristics (dosage, frequency, duration...), of clinical events (medical problems, lab examinations, treatments), and of causal and temporal relations between different clinical events.
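
For illustration, here is a minimal sketch of the two heuristics mentioned above (lexical inclusion for hierarchical relations, compositionality for synonymy), written in Python with invented toy data; the actual systems are corpus-based and considerably more elaborate.

    # Minimal sketch of two term-structuring heuristics; toy data only.
    from itertools import product

    def lexical_inclusion(short_term: str, long_term: str) -> bool:
        """Hierarchy heuristic: if all the words of the short term appear in the
        long term, the long term is assumed to be the more specific one."""
        short_words = set(short_term.lower().split())
        long_words = set(long_term.lower().split())
        return short_words < long_words  # strict inclusion

    def compositional_synonyms(term: str, word_synonyms: dict[str, set[str]]) -> set[str]:
        """Synonymy heuristic: generate candidate synonyms of a multi-word term
        by substituting each component with its known synonyms."""
        alternatives = [{w} | word_synonyms.get(w, set()) for w in term.lower().split()]
        return {" ".join(combo) for combo in product(*alternatives)} - {term.lower()}

    if __name__ == "__main__":
        print(lexical_inclusion("cardiac failure", "acute cardiac failure"))   # True
        syns = {"cardiac": {"heart"}, "failure": {"insufficiency"}}
        print(compositional_synonyms("cardiac failure", syns))
        # {'heart failure', 'cardiac insufficiency', 'heart insufficiency'} (order may vary)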

Current projects

Demonext (Dérivation Morphologique en Extension), French project accepted within the ANR frame

  • Duration: April 2018 - March 2022
  • Main investigator: Fiammetta Namer
  • Partners: ATILF (UMR 7118 CNRS, University of Lorraine), CLLE-ERSS (UMR 5263 CNRS, University Toulouse Jean-Jaurès), STL (UMR 8163 CNRS, University of Lille), LLF (UMR 7110 CNRS, University Paris-Cité)
  • Objectives: Demonext consists in the construction of a French morphological database (MDB) that describes the derivational properties of words in a systematic manner. The MDB will meet multiple needs, such as the empirical confirmation of morphological hypotheses and the elaboration of new ones, the design of natural language processing (NLP) tools, vocabulary teaching, and the treatment of developmental or acquired language disorders.
    The lexicon of a language like French is composed mainly of morphologically complex words: prefixed, suffixed, converted or compound. This structural information is generally available in the etymological sections of dictionaries, but the variability of its formulation makes it difficult to exploit. For languages such as English, German, Dutch or Czech, there are morphological databases (MDBs) that describe the derivational properties of words in a systematic way: CELEX, CatVar, DerivBase, etc. This information is essential because many other kinds of information can be inferred from it, the most important being the meaning of these words. Currently, there is a prototype of the MDB, the Demonette database (see here and here), developed by the two main partners of the project, which can be considered as an exploratory study for the present project. Having a wide-coverage MDB with rich and reliable descriptions for French would make it possible to meet multiple needs, such as empirical confirmation and hypothesis development in morphology, the development of NLP tools, vocabulary teaching, and the diagnosis and treatment of developmental or acquired lexical disorders.
    To meet these challenges, we propose to build the Demonext MDB. This large-scale resource will provide rich descriptions of lexemes (i.e., lexical units), of derivational relationships and of the paradigms in which they fit; it will represent information explicitly and uniformly, ensure systematic traceability of all the information it provides, and be compatible with the main current morphological theories (morpheme-based, lexeme-based, paradigm-based). An illustrative sketch of such a derivational record is given after this entry.
  • Responsibility:
    • Coordination of the subtask Evaluation/Validation
    • Applications
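  • Illustration: the sketch below shows, in Python, what a systematic description of one derivational relation could look like; the field names and the example are invented for illustration and do not reflect the actual Demonext schema.

      # Hypothetical derivational relation record, for illustration only;
      # the actual Demonext schema is richer and organised differently.
      from dataclasses import dataclass

      @dataclass
      class DerivationalRelation:
          base: str           # base lexeme
          base_pos: str       # part of speech of the base
          derived: str        # derived lexeme
          derived_pos: str    # part of speech of the derivative
          process: str        # morphological process (e.g. suffixation)
          exponent: str       # affix or marker involved
          semantic_type: str  # semantic relation between base and derivative

      # French "laver" (to wash) -> "lavage" (washing, action noun)
      relation = DerivationalRelation(
          base="laver", base_pos="V",
          derived="lavage", derived_pos="N",
          process="suffixation", exponent="-age",
          semantic_type="action noun",
      )
      print(relation)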

CLEAR (Communication, Literacy, Education, Accessibility, Readability), French project accepted within the ANR frame

  • Duration: Jan 2018 - Dec 2020
  • Main investigator: Natalia Grabar
  • Partners: MESHS-STL, UPR 3251 LIMSI, EA3412 LEPS, AFH (Association Française des Hémophiles), Synapse Développement
  • Objectives: The CLEAR project proposes innovative methods for the creation of linguistic resources and software dedicated to the simplification of medical texts written in French. The software is expected to act as a mediator in the communication between patients and medical professionals. The project addresses several challenges, such as: researching patient needs, processing large corpora with heterogeneous and unstructured data, adapting automatic methods to the medical field, and creating a knowledge base adapted to the explanation of medical terms in French. The project will produce resources that can be exploited by medical professionals to improve their interactions with patients. As for the patients, they will obtain a tool that provides access to knowledge on pathologies and their treatments, in order to allow a better management of pathologies by the patients and their increased participation in social life despite their disease.
  • Responsibility:
    • Coordination of the project
    • Acquisition of resources for simplification
    • Methods for simplification
    • Evaluation and Valorization

MIAM (Maladies, Interactions Alimentation-Médicaments), French project accepted within the ANR frame

  • Duration: Jan 2017 - Dec 2019
  • Main investigator: Thierry Hamon
  • Partners: UPR 3251 LIMSI, MESHS-STL, Université de Bordeaux, Centre National Hospitalier d’Information sur le Médicament (CNHIM), ANTIDOT
  • Objectives: Given the huge amount of unstructured data in bibliographic databases, but also the development of open knowledge bases, accessing the knowledge they contain requires a global view of multiple heterogeneous sources of information. To achieve this purpose, the MIAM project aims at proposing methods which rely on Natural Language Processing and text mining, but also on knowledge representation and modeling, in order to aggregate the data and knowledge issued from knowledge bases, Linked Open Data, scientific articles reporting research results, etc. To evaluate the results of the project in a real use case, the MIAM project focuses on the interactions existing between drugs and food which might lead to an adverse drug effect. Indeed, such information is currently fragmented and scattered over heterogeneous sources. Aggregating this information will help to formalize and visualize the description of these interactions in order to avoid such adverse effects.
  • Responsibility:
    • responsibility for the MESHS-STL team
    • information extraction
    • certainty of information

REM (Vers une nouvelle conception des constructions modales en anglais : des paradigmes basés sur des traits distinctifs à la représentation probabiliste basée sur l’usage), international France-Switzerland project accepted within the ANR frame

  • Duration: Jan 2017 - Dec 2019
  • Main investigators: Ilse Depraetere, Martin Hilpert
  • Partners: Université de Neuchâtel, Switzerland; Université Lille 3, France
  • Objectives: One of the central features of human language is that speakers can verbalize states of affairs that are not factual, but that rather should, might, or could be the case. Non-factual ideas can be expressed through words and constructions that belong to the grammatical domain of modality (Palmer 2001). In linguistics, the study of modality has given rise to a substantial research literature (De Haan & Hanssen 2009, Nuyts & Van der Auwera 2016) that forms the context of this project, which focuses on modal verbs in the grammar of English. Specifically, its focus will be on five core modal auxiliaries (may, might, can, should, and must), two semi-modals (have to, ought to), and a periphrastic construction (be able to). The main question of this project relates modality to human cognition and the mental representation of language: How are modal expressions mentally represented? It is here that we see a gap in the research landscape that has so far not been sufficiently addressed: We are interested in the linguistic knowledge that speakers of English have that allows them to choose between expressions such as You should go home now, You have to go home now, or You ought to go home now. These examples express non-factual ideas that are very similar, but subtly different. An idea that is still relatively widely held in the literature on modality (cf. Van der Auwera and Plungian 1998) is that the meanings of modal expressions can be distinguished on the basis of binary features such as the distinction between obligation and permission, “weak” and “strong” modality, and deontic and epistemic modality. To illustrate, the sentence You should go home now encodes an obligation, whereas the sentence You may go home now denotes a permission. You must go home now denotes a stronger obligation than the sentence You should go home now. While we do not dispute the usefulness of categorical semantic distinctions between different expressions of modality, we question whether these distinctions exhaustively capture speakers’ linguistic knowledge of modal expressions and whether matrices of cross-cutting categorical features adequately represent that knowledge. This project advances an alternative view that aligns itself with two recent theoretical developments in linguistics, namely the frameworks of Cognitive Construction Grammar (Goldberg 1995, 2006) and usage-based linguistics (Bybee & Hopper 2001, Bybee 2010). We hypothesize that knowledge of modal expressions is exemplar-based and probabilistic. In other words, speakers’ knowledge of modal expressions is not to be modeled as a paradigm of forms that can be fully described through a set of cross-cutting categorical features, but rather as a network of form-meaning pairs (Hilpert 2014, Hilpert & Diessel 2016) in which the forms of modal expressions are connected to a range of meanings through associative links. Differences in association strength account for the fact that speakers choose a certain modal expression in a certain speech situation. We thus view speakers’ knowledge of modal expression not as a discrete one-to-one mapping between a form and a list of semantic features, but rather as knowledge of the probability that a given form will convey a certain meaning in a certain context.
  • Responsibility:
    • study of modality

FIGTEM (Fine-grained text mining), international bilateral France-Brazil project accepted within the INS2I CNRS frame

  • Duration: 2016 - 2018
  • Main investigators: Vincent Claveau, Claudia Moro
  • Partners: IRISA-CNRS (Institute for Research in IT and Random Systems), Rennes, France; PUCPR (Pontifícia Universidade Católica do Paraná), Curitiba, Brazil; HBD (Health Big Data), LTSI/Inserm UMR1099, Rennes, France; STL (Savoirs, Textes, Langage), UMR CNRS 8163, Université de Lille 3, Lille, France
  • Objectives: Current medical needs, the growth of targeted therapies and personalized medicines, and escalating R&D costs result in formidable cost pressures on healthcare systems and the pharmaceutical industry. At the same time, clinical research grows in complexity, labour intensity and cost. There is a growing realization that the development and integration of advanced Electronic Health Record systems (EHRs) for medical research can enable substantial efficiency gains and thereby make clinical centers more attractive for R&D investment, whilst also providing patients in the region with more rapid access to innovative medicines and improved health outcomes. In the clinical research process, meeting patient recruitment targets for the growing portfolio of clinical trials and observational studies conducted across the globe is an unprecedented challenge for the industry.
    Clinical trials (CTs) are fundamental for evaluating therapies or new diagnostic techniques. They are the most common research studies designed to test the safety and/or the effectiveness of interventions. A CT may address issues such as prevention, screening, diagnosis, treatment, quality of life or genetics, and each trial is designed to answer specific scientific questions. CTs are based on statistical tests and population sampling, and because they rely on adequate sample sizes, it is common for CTs to fail in their objectives because of the difficulty of meeting the necessary recruitment targets in an effective time and at a reasonable cost [Fletcher et al, 2012]. The number of clinical trials being conducted worldwide has increased from 49,000 in 1997 to 199,313 in 2015 [https://www.clinicaltrials.gov/ct2/resources/trends], while the number of hospitals has not increased proportionally. This increases the pressure on resources at the sites and also results in greater competition within the same patient pool, limiting the number of patients available to participate in clinical trials. Enrolling participants with similar characteristics helps to ensure that the results of the trial will be due to what is under study and not to other factors. A second function of eligibility criteria is to exclude patients who are likely to be put at risk by the study, minimizing the risk of a subject’s condition worsening through participation. The features of the population of interest for a clinical trial are defined by the eligibility criteria of the trial. These characteristics determine the rules to be applied for building the sample of subjects. They may include age, gender, medical history, treatment, biomarkers or any other information about the patient. Eligibility criteria, which are part of the study protocol, are still written in free text. They are available in English on the clinicaltrials.gov website, which is a worldwide trial databank.
    Paper-based records and EHRs are the main sources of information used to rule in or rule out the CT criteria. It is worth noting that, despite the development of EHRs, a significant amount of patient data, such as medical reports or clinical notes, is still captured and available as free narrative text in the native language of a given country. This is the reason why human operators (i.e., principal investigators and clinical research assistants) are still the only ones capable of efficiently detecting eligible patients [Campillo-Gimenez et al, 2015]. Moreover, this task is laborious, time-consuming and costly, and leads to a bottleneck in the clinical research process. Indeed, a recent study by the Tufts Center for the Study of Drug Development found that the median number of procedures per clinical trial increased by 49% between 2000-03 and 2004-07, while the total effort required to complete those procedures grew by 54%. This puts greater strain on investigational sites and dissuades volunteers from participating. In fact, almost half of all trial delays are caused by participant recruitment problems, and the percentage of studies that complete enrolment on time is extremely low within the world's clinical trial markets: 18% in Europe, 17% in Asia-Pacific, 15% in Latin America and 7% in the USA [Center Watch, 2013]. Many attempts have been made in the last two decades [Cuggia et al, 2011] to develop computerized recruitment support systems. These systems face the gap between, on the one hand, the free-text representation of clinical information and eligibility criteria and, on the other hand, the formal representation which is required to perform the automatic reasoning needed for the recruitment task [Ross et al, 2010; Embi et al, 2005; Pressler et al, 2012; Olasov & Sim, 2006; Tu et al, 2011; Shivade et al, 2014]. Most of these systems propose to fill this gap manually: human operators have to transform eligibility criteria into formal rules and use only the structured (and poor) part of the patient data coming from EHRs. Natural Language Processing methods should help to overcome this issue by automatically extracting the eligibility criteria and the corresponding patient data in order to populate the formal representation framework that can be used by a recruitment support system (a toy sketch is given at the end of this entry).
    In this project, we will demonstrate by a proof of concept, how clinical research might leverage NLP and automatic reasoning methods in order to speed up the recruitment process at an international level. The specific objectives are:
    1. To develop methods of information extraction and indexing dedicated to the clinical trial domain, for three languages (French, Portuguese and English);
    2. To populate a Recruitment Support System with this information; for that purpose, a formal data model adapted to clinical data and eligibility criteria will be defined;
    3. To evaluate the added value of these methods in a cross-border recruitment scenario.
  • Responsibility:
    • creation and testing of methods for information extraction
    • terminological annotation and indexing
    • uncertainty of information
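  • Illustration: as a toy example of the kind of processing involved, the sketch below turns a free-text age criterion into a structured form with a simple regular expression; the pattern and the data are invented for illustration, and real eligibility criteria require far richer NLP than this.

      # Toy sketch: structure a free-text age eligibility criterion.
      # Illustrative assumption only, not the project's actual pipeline.
      import re
      from typing import Optional

      AGE_RANGE = re.compile(r"aged\s+(\d+)\s+to\s+(\d+)\s+years", re.IGNORECASE)

      def extract_age_criterion(text: str) -> Optional[dict]:
          """Return {'min_age': ..., 'max_age': ...} if an age range is found."""
          match = AGE_RANGE.search(text)
          if match is None:
              return None
          return {"min_age": int(match.group(1)), "max_age": int(match.group(2))}

      criterion = "Inclusion: patients aged 18 to 65 years with type 2 diabetes"
      print(extract_age_criterion(criterion))  # {'min_age': 18, 'max_age': 65}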

BIGCLIN (Big data analytics for unstructured clinical data), French project accepted within the LABEX CominLabs frame

  • Duration: 2016 - 2018
  • Main investigator: Vincent Claveau, Marc Cuggia
  • Partners: IRISA/Inria LinkMedia; IRISA/Inria Dionysos & Cidre; INSERM/LTSI Health Big Data; CNRS/STL
  • Objectives: As defined by the Data To Knowledge initiative, Health Big Data (HBD) is more than just a very large amount of data or a large number of data sources. HBD refers to the complexity, challenges, and new opportunities presented by the combined analysis of data. The data collected or produced during the clinical care process are now potentially sharable and reusable. They can be exploited at different levels and across different domains, especially concerning questions related to clinical and translational research. It has been demonstrated, for instance, that Electronic Health Records (EHRs) surpass many existing registries and data repositories in volume, and that the reuse of these data may diminish the costs and inefficiencies associated with clinical research. To leverage these big, heterogeneous, sensitive and multidomain clinical data, new infrastructures are arising in most academic hospitals, intended to integrate, reuse and share data for research. For instance, the academic hospitals of the French West region (Tours, Poitiers, Nantes, Angers, Rennes, Brest) are currently implementing the same Clinical Data Warehouse technology (eHOP) and are creating the first clinical data research network at the national level. A wide range of applications are potentially concerned by this new infrastructure and organisation dedicated to health big data reuse: clinical research, epidemiology, biosurveillance, clinical practice assessment and hospital management are some of the applications where data reuse can have decisive benefits.
    However, while the integration and exploitation of structured data is now largely addressed, a well-known challenge for the secondary use of EHR data is that much of the detailed patient information is embedded in narrative text, mostly stored as unstructured data. Natural Language Processing (NLP) technologies, which are able to convert unstructured clinical text into coded data, have naturally been introduced into the biomedical domain and have demonstrated promising results in English-speaking countries. For instance, it has been shown that such narratives contain a huge amount of additional information compared with what is actually encoded (i.e. available as structured data).
    However, the lack of efficient NLP resources dedicated to clinical narratives, especially for French, leads to the development of ad hoc NLP tools with limited, targeted purposes. Moreover, scalability and real-time issues are rarely taken into account for these possibly costly NLP tools, which makes them inappropriate in real-world scenarios.
    The secondary use of health data still needs to become more mature, with the above barriers being resolved, before it can lead to more significant consequences for knowledge and practice; only then does the further processing of such massive data become possible. Other current challenges of health data reuse also remain unresolved: data quality assessment for research purposes, scalability issues when integrating heterogeneous health “big data”, and patient data privacy and protection. These barriers are completely interwoven with unstructured data reuse and thus constitute an overall issue which must be addressed globally.
    This project thus proposes to address the essential need to overcome the above barriers when reusing unstructured clinical data at a large scale:
    1. We propose to develop new clinical record representations relying on fine-grained semantic annotation, thanks to new NLP tools dedicated to French clinical narratives.
    2. Since the aim is to efficiently map this added semantic information to existing structured data, to be further analysed in a Big Data infrastructure, the project also addresses distributed systems issues: scalability, management of uncertain data and privacy, stream processing at runtime, etc.
  • Responsibility:
    • creation of NLP tools and methods for clinical texts
    • testing of tools in various applications

PACHA (Développement et validation d’indicateurs automatisés de pertinence de la prescription des anticoagulants oraux en médecine adulte à partir du système d’information hospitalier), French project accepted within the PREPS frame

  • Duration: 2016 - 2019
  • Main investigator: Frantz Thiessart
  • Partners: CHU de Bordeaux; Hôpital Européen Georges-Pompidou; CHU de Rennes; UMR8163 STL, Lille
  • Objectives: The appropriateness of oral anticoagulant prescriptions (vitamin K antagonists and direct oral anticoagulants) is a major issue for improving the quality, safety and efficiency of care. The large target population of oral anticoagulants, their prescription frequency and their high iatrogenic risk, particularly in healthcare institutions, justify the interest of developing indicators of the appropriateness of oral anticoagulant prescriptions and of computing them automatically from the hospital information system, with a view to regular feedback to prescribers. The starting point is the lack of validated indicators and the need to develop them and apply them to hospital clinical practice, as part of a continuous improvement of professional practices. Since each hospital information system (HIS) is different, we want to propose tools for the automated computation of these indicators that are as transferable as possible to other healthcare institutions.
    Main objective: to develop and study the criterion validity of indicators of the appropriateness of oral anticoagulant prescriptions in adult medicine, computed automatically from the HIS.
    Secondary objectives: (i) to analyse the reliability and robustness of the indicators of the appropriateness of oral anticoagulant prescriptions; (ii) to analyse the ability of the queries used to build the indicators to be reused in several healthcare institutions with different hospital information systems; (iii) to identify ways of reusing the same indicators in hospitals that do not use the same data warehouse, and to test this adaptation in another institution.
    The study design will successively combine a consensus method, information retrieval and Natural Language Processing techniques together with a synthesis of the medical data from the information system, and then a cross-sectional study for the analysis of the metrological performance of the indicators.
    The project will comprise three successive steps: 1) identification of the indicators of the appropriateness of oral anticoagulant prescriptions and of their appropriateness thresholds, and analysis of their potential usefulness and operational character on the basis of a modified Delphi consensus; 2) operational implementation of the indicators from the hospital information system, using techniques and tools allowing their generalization to other hospital information systems; 3) evaluation of the metrological performance and robustness of these indicators.
    This project will allow the development of a battery of indicators of the appropriateness of oral anticoagulant prescriptions that are useful (relevant), feasible, valid, reliable and robust, and that can be extracted automatically from the information system of healthcare institutions. The integration of heterogeneous medical data will be carried out through a data warehouse with the same structure in two investigating centres. The indicators will be computed from common queries which can be reused by other institutions: either by using the same tools developed within the data warehouse framework, or by using a web service for NLP. The evaluation of the clinical and medico-economic impact of communicating dashboards gathering these indicators will serve as a basis for the analysis of the responsiveness of these indicators to change. This project may serve as a model for extending the work to other drug classes and to primary care. In particular, the constitution of data warehouses in healthcare institutions simplifies subsequent studies by facilitating the availability of the necessary data within the institution. A toy illustration of such an indicator is given at the end of this entry.
  • Responsibility:
    • modeling of the prescription quality indicators
    • information extraction for the computation of prescription quality
    • tests and evaluation
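  • Illustration: as a simple sketch of what an automated indicator computed from hospital data extracts could look like, the Python example below computes the share of vitamin K antagonist prescriptions accompanied by a recent INR result; the records, fields and the indicator itself are invented for illustration, the project's real indicators being defined through the modified Delphi consensus.

      # Hypothetical appropriateness indicator over invented prescription records.
      def inr_monitoring_rate(prescriptions: list[dict]) -> float:
          """Share of vitamin K antagonist (VKA) prescriptions with a recent INR."""
          vka = [p for p in prescriptions if p["drug_class"] == "VKA"]
          if not vka:
              return 0.0
          monitored = [p for p in vka if p.get("recent_inr") is not None]
          return len(monitored) / len(vka)

      records = [
          {"patient": "A", "drug_class": "VKA", "recent_inr": 2.4},
          {"patient": "B", "drug_class": "VKA", "recent_inr": None},
          {"patient": "C", "drug_class": "DOAC", "recent_inr": None},
      ]
      print(f"INR monitoring rate: {inr_monitoring_rate(records):.0%}")  # 50%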

Past projects

ADELP (Analysis of the DPRK English Language Propaganda (2007-2015)), French project accepted within the MESHS Partenarial frame

  • Duration: 2016 - 2017
  • Main investigators: Mason Richey, Natalia Grabar
  • Partners: Hankuk University of Foreign Studies, Seoul, South Korea; UMR 8163 Savoirs Textes Langage - STL, Lille, France
  • Objectives: Domestic and international politics intersect, and their rhetoric can often be contradictory. Naturally, most international leaders adapt their discourse to better reflect their intentions. In practice, this often means that political leaders use (a) relatively harsh language when communicating with their own people, and (b) rather diplomatic language for international communication. This phenomenon is well understood by theorists of international relations and international affairs (Fearon 1994; Weeks 2008; Weiss 2013). When used well, these two discourses interact to maximize the credibility of potential coercion while signalling openness towards cooperative approaches on a given issue.
    Despite the interesting nature of the North Korean case, little work exists on the international propaganda of this country. Currently, most work on North Korean rhetoric focuses on the analysis of discussions of its nuclear programme (Rich 2012, 2014). Other work examines the bellicose rhetoric, but only along one dimension, for instance military provocation (Joo 2015). The lack of attention to North Korea's bellicose rhetoric in English leaves a large gap in our current knowledge, and the question is why this country has adopted this mode of communication at the international level.
    The ADELP project aims to investigate the question of North Korea's use of bellicose rhetoric aimed at an international audience, despite the fact that most leaders consider such diplomatic methods ineffective. To study the various research questions at stake, the project starts with the collection of the necessary data, gathering North Korea's propaganda in English as diffused by channels such as KCNA and Twitter. The period covered runs from 2007 to 2016, distinguishing the period of intense North Korean provocations and the transition towards the Kim Jong Un regime. Natural Language Processing methods are then used to explore these data. This consists of several steps: classification by topic, linguistic description, indexing, negation and uncertainty.
    It is hypothesized, for instance, that depending on the period and the topic, the propaganda patterns change by increasing or decreasing the level of provocation.
  • Responsibility:
    • categorization of the propaganda articles
    • indexing
    • analysis of the propaganda

Les espaces du patrimoine culturel numérique : topologies et topographies des itinéraires culturels, French project accepted within the PEPS/Université de Lille frame

  • Duration: 2015 - 2016
  • Main investigator: Marta Severo
  • Partners: Université Lille 3 (CNRS UMR 8163 STL); Université Lille 2
  • Objectives:
  • Responsibility:
    • Mining of Instagram posts for tourism-related information about the Via Francigena

Révolutions scientifiques et histoire de la macroéconomie d'hier à aujourd'hui, French project accepted within the MESHS Partenarial frame

  • Duration: May 2015 - Dec 2016
  • Main investigator: Goulven Rubin
  • Partners: Université Lille 3 (CNRS UMR 8163 STL); Université Lille 2
  • Objectives:
  • Responsibility:
    • Study of the evolution of topics across the years

SEPAPH (Segmentation de la parole chez les patients aphasiques), French project accepted within the MESHS Emergent frame

  • Duration: May 2015 - Dec 2016
  • Main investigator: Anahita Basirat
  • Partners: Université Lille 3 (CNRS UMR 9193 SCALAB; CNRS UMR 8163 STL); Université Lille 2; Centre Espoir
  • Objectives: Study of the capacity of aphasic patients to segment speech into words.
  • Responsibility:
    • Collecting and preparing linguistic data for the tests

EQU1, EQU2 (Ethique Qualité Urgence), French project accepted within the Projet de l'Etablissement frame at Université Lille 3

  • Duration: Dec 2014 - Dec 2016
  • Main investigator: Natalia Grabar
  • Partners: Université Lille 3 (CNRS UMR 8163 STL, GERIICO); SAMU CH d'Arras
  • Objectives: Study of the communication during the phone calls received at the SAMU call center in Arras.
  • Responsibility:
    • Management of the project, organization of meetings
    • Collecting and preparing linguistic data
    • Detection of the paraphrases from expert and non-expert discourse

TECTONIQ (les TEchnologies de l'information et de la communication au Coeur du TerritOire NumérIQue pour la valorisation du patrimoine: analyse des dispositifs et de leurs usages), French project accepted within the PEPS and inter-MSH frame

  • Duration: Dec 2014 - Dec 2015
  • Main investigator: Éric Kergosien
  • Partners: Université Lille 3 (GERIICO; CNRS UMR 8163 STL); Université de Montpellier 2 (TETIS); Université de Lyon (ERIC); Université de Strasbourg (LIVE)
  • Objectives: Analysis of the electronic sources created for the diffusion and exchange of information related to the natural and cultural heritage of the region, as well as of their usage by different actors (citizens, companies, scientists, social and local actors, etc.).
  • Responsibility:
    • NLP and linguistic methods: information extraction
    • Analysis of the real usage of this information

Parlons de nous, Patients' mind, French projects accepted within the MSH-M and inter-MSH calls

  • Duration: Jan 2013 - June 2015
  • Main investigator: Sandra Bringay
  • Partners: AMIS-TATOO/LIRMM, Université Montpellier 3; Centre d'Investigation Clinique, Université Montpellier 1; CNRS UMR 8163 STL, Université Lille 3; UPS-IRIT (UMR 5505), Université Paul Sabatier
  • Objectives: Mining of social media sources to process and extract patient-related information on well-being and quality of life
  • Responsibility:
    • Relation between uncertainty and emotions
    • Automatic distinction between medical genres

Ravel (Recherche et Visualisation des informations dans le dossier patient électronique), French project accepted within the TecSan call of the ANR (Agence Nationale de la Recherche)

  • Duration: Jan 2012 - July 2015
  • Partners: Inserm U936 - Rennes Université Rennes 1; CHU Rouen; ISPED, Université de Bordeaux; CNRS UMR 8163 STL Université Lille 3; VIDAL; MEDASYS
  • Objectives: In the Ravel project, we propose to exploit semantic information in clinical records in order to reach two operational objectives:
    • Allow an information retrieval centered on patient within clinical heterogeneous data,
    • Group and present the patient data efficiently, and satisfy the existing clinical needs and challenges
    Within the frame of the Ravel project, we will develop a beta-system which will be evaluated in real clinical conditions by health professionals.
  • Responsibility:
    • Participation in all the tasks
    • Alignment of terminologies (MedDRA/SNOMED CT/ICD10)
    • Indexing of clinical patient records
    • Status and contextualization of information (negation, modality, temporality...)
    • Semantic extension of queries
    • Adaptation of the content of the clinical records for patients (exploratory task)
    • Visualization of clinical data

Scientific Research Network (WOG) Contrastive Linguistics: constructional and functional approaches, project of the Research Foundation - Flanders (FWO)

  • Duration: Jan 2011 - Jan 2016
  • Partners: Universiteit Gent (Contragram) -- leader, Katholieke Universiteit Leuven ((i) Functional Linguistics Leuven ; (ii) Franitalco), Universiteit Antwerpen (Centre for Grammar, Cognition and Typology), Université Catholique de Louvain (Centre for English Corpus Linguistics), Universiteit Leiden (Cognitive linguistics and Construction Grammar), University of Edinburgh (Construction Grammar Research Group), University of Bergen (Indo-European Case and Argument structure in a Typological Perspective), Université de Lille 3 (Savoirs, Textes, Langage (STL)), Université de Caen (Crisco), Universidad Complutense de Madrid (Functional Linguistics English-Spanish and its applications), University of Santiago de Compostela (Scimitar), University of Hong Kong (School of English)
  • Composition of the STL team: Georgette Dal, Katia Paykin-Arroues, Sandra Benazzo, Natalia Grabar
  • Objectives: to promote expertise in the interaction between contrastive linguistics and constructional and/or functional linguistics
  • Responsibility:
    • Contrastive study of biomedical documents in French and English
    • Diachronic contrastive study in French
    • Contrastive evaluative morphology

POMELO (PaThologies, MEdicaments, aLimentatiOn), French project accepted within the MESHS émergent frame

  • Duration: Nov 2013 - March 2015
  • Partners: CNRS UMR 8163 STL Université Lille 3; ISPED, Université de Bordeaux; LIMSI
  • Objectives: Study of the interactions between drugs and food for a given disorder
  • Main investigator: Natalia Grabar
  • Responsibility:
    • NLP and linguistic methods: information extraction
    • Exploring existing databases
    • NLP queries over linked data

DICO-Risque (Développement d'une boite à outils pour l'analyse de l'incertitude et de la qualité de la connaissance, dans les évaluations des risques des perturbateurs endocriniens: application à l'étude de cas du Bisphenol-A), French project accepted within the PNRPE frame (Programme National de Recherche sur les Perturbateurs Endocriniens of the Ministère de l'Écologie, du Développement durable, des Transports et du Logement)

  • Duration: Oct 2011 - Oct 2014
  • Partners: Laura Maxim, ISCC, Paris; Natalia Grabar, UMR 8163 STL, Lille; Sandrine Blanchemanche, Unité Met@risk, INRA; Jeroen van der Sluijs, Utrecht University, the Netherlands; Akos Rona-Tas, University of California San Diego, USA.
  • Associated partner: ANSES (Agence nationale de sécurité sanitaire de l'alimentation, de l'environnement et du travail)
  • Objectives: Development and evaluation, on the case study of the chemical risks of Bisphenol A, of a set of tools for the qualitative and quantitative characterization of uncertainty, mobilizing both expert assessment methods and Natural Language Processing methods.
  • Responsibility:
    • Automatic information extraction: quality criteria defined by the experts and the modality associated with these criteria
    • Role of certainty and modality in the field of chemical risk
    • Collaboration on the ontology of food and chemical risk

SKATE (La Sous-[k]atégorisation verbale et son évaluation en contexte multilingue), French BQR project at Université Lille 1&3

  • Duration: Jan 2012 - Dec 2012
  • Partners: Cédric Patin, Fayssal Tayalati, Natalia Grabar
  • Objectives: We propose to study (i) the meaning and (ii) the syntactic realization of the clausal complements of matrix verbs, relying on a corpus drawn from four languages belonging to different families. This choice will give the future conclusions a high degree of generality, which is lacking in the few existing studies devoted to this question.
  • Responsibility:
    • Preparation of and experiments with the clauses in Ukrainian
    • Preparation of the corpora
    • Analysis of the results

CoMeTe (COmposition MEdicale en TErminologie), French project accepted within the call for emerging projects of the MESHS (Maison européenne des sciences de l'homme et de la société)

  • Duration: Oct 2011 - Oct 2012
  • Partners: STL UMR8163 U Lille 3 (Georgette Dal, Dany Amiot, Thi Mai Tran, Natalia Grabar); URECA EA1059 U Lille 3 (Sévérine Casalis); ATILF UMR7118 U Nancy 2 (Stéphanie Lignon, Fiammetta Namer); LIM-BIO EA3969 U Paris 13 (Thierry Hamon); CLLE-ERSS UMR5263 U Toulouse-Le Mirail (Nabil Hathout)
  • Objectives: We propose to provide a better description of medical language and its regularities, in order to help ensure better communication between specialists and patients. Particular attention will be paid, on the one hand, to the most opaque formations in medical language, namely neoclassical compounds involving at least one constituent from the classical stock (Latin or Greek), and, on the other hand, to the conditions of production and reception of these formations by medical specialists, medical students and patients.
  • Responsibility:
    • Project coordinator, with Stéphanie Lignon
    • Preparation of the data (corpora, terminology)
    • Participation in the work on the creation of the grammar of neoclassical composition
    • Participation in the creation of the protocol for the study of the perception of compounds
    • Participation in the work on the automatic detection of compounds

REACH: Expertise in the regulatory assessment of chemical risks, French project accepted within the PIR frame (Programme Interdisciplinaire de Recherche of the CNRS)

  • Duration: Sep 2011 - Sep 2012
  • Partners: Laura Maxim, ISCC, Paris; Natalia Grabar, UMR 8163 STL, Lille; Thierry Hamon, LIM&BIO, Université Paris 13
  • Objectives: Characterization of the expertise process in the risk assessments carried out within the framework of the European REACH regulation. The focus is in particular on the work of the interface structures between scientists, industry and policy makers, namely the national and European agencies concerned (ANSES and ECHA). These interface structures play a major role in orienting public health decisions, since they assess the knowledge submitted by industry and authorize, or not, the marketing of chemical substances. An additional original aspect consists in adapting know-how and protocols from the field of public health (systematic reviews of articles) and from Natural Language Processing.
  • Responsibility:
    • Adaptation and evaluation of the Cochrane protocol for systematic reviews of articles on documents related to chemical risk (articles, reports...)
    • Study of certainty and modality and of their role in documents related to chemical risk
    • Automatic information extraction

FP7 PROTECT, European project accepted within the IMI (Innovative Medicines Initiative) call

  • Duration: Sep 2009 - Sep 2014
  • Partners: EMEA, pharmaceutical industry (GSK, Sanofi, Pfizer, Roche, Novartis, Amgen, Genzyme, Merck, Bayer, AstraZeneca, H Lundbeck, Novo Nordisk), European research laboratories (Inserm, FR; University of München, DE; University Mario Negri, IT; University of Groningen, NL; University of Utrecht, NL; Imperial College, UK; University of Newcastle, UK), European organisations, associations and companies (IAPO, IO; Outcome, CH; AEMPS, ES; FICF, ES; CEIFE, ES; LASER, FR; WHO Uppsala, SE; MHRA, UK; GPRD, UK)
  • Objectives: Improvement of the detection of adverse drug effects and of patient safety
  • Responsibility: participation in the project and supervision of a PhD thesis

ReSyTAL, French project accepted within the PHRC call

  • Duration: Jan 2009 - Jan 2012
  • Partners: HEGP
  • Objectives: Design and test a methodology and a tool for the selection of diagnostic studies within the framework of systematic reviews of articles
  • Responsibility: scientific coordinator of the project, supervision of students and of an engineer, completion of some of the tasks:
    • Creation of a Web service for carrying out systematic reviews
    • Creation of a structured terminology of diagnostic studies
    • Thematic localization of existing clinical terminologies

FP6 Network of Excellence Semantic Mining

  • Duration: 2005 - 2007
  • Partners: European medical informatics laboratories (France, Sweden, Finland, Germany, United Kingdom, Italy, Switzerland)
  • Responsibility: participation in the following tasks:
    • WP20: Medical multilingual lexicon
    • WP25: Biological text mining
    • WP27: Clinical information for patients

DECO: CNRS TCAN project

  • Duration: 2004 - 2006
  • Partners: CRIM/Paris, LINA/Nantes, NII/Tokyo
  • Objectives: Study of multilingualism and cultural diversity
  • Responsibility:
    • task leader, supervision of students
    • methods for the extraction of linguistic descriptors for the distinction between expert and non-expert documents

UMLF (Unified Medical Language for French): Tecsan 2002 project

  • Duration: 2003 - 2005
  • Partners: STIM/AP-HP, CISMeF/Rouen, CHU/Genève, CHU/Rennes, U Nancy II
  • Objectives: Creation of a morphological lexicon for the medical domain in French
  • Responsibility: participation in the project, methodological contribution

FP5 PRINCIP: European project accepted within the Safer Internet Action Plan call

  • Duration: 2002 - 2004
  • Partners: CRIM/INaLCO, Cognitec/Bruxelles, DCU/Dublin, University of Magdeburg, Ligue des Droits de l'Homme/Bruxelles
  • Objectives: Detection of illicit content on the Internet
  • Responsibility: writing of the project proposal, responsibility for several tasks:
    • Definition and writing of the linguistic and NLP specifications of the project
    • Construction of corpora
    • Proposal of methods for the automatic analysis of the corpora
    • Modeling of the database for its integration into the final system

SAFIR: French TIM'99 project

  • Duration: 2000 - 2002
  • Partners: CRIM/INaLCO, LIP6/Paris, XEROX/Grenoble, EDF/Clamart
  • Objectives: Creation of a semantic search engine.
    The Safir project proposed to develop software aiming to improve information retrieval on networks. The application is designed around a precise usage context: a French-speaking user submits a rather vague query but expects a precise answer, focused on their domain of activity. The assistance to the user covers the formulation of the query, multilingual information retrieval (if this option is requested), and the validation and sorting of the results. The query is formulated through a dedicated interface. The tool addresses the needs of technology watch. The knowledge specific to the domain under study is taken into account through a terminology of this domain. The domain studied is cogeneration, a technique for the combined production of electricity and heat (steam, hot water). The integration of robust and validated linguistic tools makes it possible to process the queries and to increase the relevance of the information retrieved in multilingual mode (French, English, German).
  • Responsibility: responsibility for several tasks, supervision of students
    • Construction of corpora (French, English, German)
    • Creation of the tool for the automatic querying of existing search engines and for the filtering of the retrieved documents
    • Terminological acquisition in the three languages
    • Structuring of the terms in French
    • Validation of the structured terminology with domain experts
    • Alignment of texts and terms in the three languages
    • Protocol for the linguistic processing of online documents

CLEF: French project

  • Duration: 1999 - 2000
  • Objectives: Creation of a corpus of modern French
  • Responsibility: collection of clinical documents, definition of the XML DTD, XML encoding and description of the corpus

Écritures du monde: project of the Ministère de la Culture et de la Communication

  • Duration: May 1999 - Oct 1999
  • Partners: CRIM/INaLCO, Ministère de la Culture et de la Communication
  • Objectives: Creation of a website presenting the alphabets of the world
  • Responsibility: Creation of the pages on the Cyrillic alphabet