Abuse detection in online conversations with text and graph embeddings
- Copyright 2021-24 Noé Cécillon
AlertEmbeddings is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information see LICENCE
- Lab site: http://lia.univ-avignon.fr/
- GitHub repo: https://github.com/CompNet/Alert
- Contact: Noé Cécillon [email protected]
This set of scripts learns various embeddings from online conversations in order to detect online abuse. Two main approaches are implemented: a content-based approach and a graph-based approach, which can also be used jointly. It leverages our SWGE library, described in [C'24, CLDA'24], as well as methods from the literature, following the experimental protocol of [C'24, CLDA'24]. The Alert repository implements similar functionality, but relies on feature engineering instead of learned embeddings. This software is used in [CLDL'20a, C'24, CLD'24] (see these publications for more details).
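To illustrate how the two approaches can be used jointly, here is a minimal sketch of early fusion: a text embedding and a graph embedding of the same conversation are concatenated and fed to a classifier. The vectors, their dimensions and the labels below are random placeholders standing in for embeddings actually learned by this repository; none of this is part of the project's API.

```python
# Minimal sketch of the joint (content + graph) approach. The embedding
# vectors are random placeholders: in practice they would be loaded from
# the files produced by the scripts in this repository.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_conversations = 100

# Hypothetical learned embeddings: one text vector and one graph vector
# per conversation (the dimensions 128 and 64 are arbitrary here).
text_emb = rng.normal(size=(n_conversations, 128))
graph_emb = rng.normal(size=(n_conversations, 64))
labels = rng.integers(0, 2, size=n_conversations)  # 1 = abusive

# Early fusion: concatenate both representations, then classify.
features = np.concatenate([text_emb, graph_emb], axis=1)
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features[:5]))
```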
This software was applied to a corpus of chat messages from the French MMORPG SpaceOrigin, already used for Alert, and presented in [PLDL'17, PLDL'17a, PLDL'17b, PLDL'18, PLDL'19, C'19, CLDL'19]. It also requires some signed graphs extracted from this textual corpus, which are available on Zenodo.
These conversational networks are also included as a zip file in this repository: unzip the `SpaceOrigin_graphs.zip` archive into the `in/graphs` folder. Conversations should be added to the `in/text_conversations` folder, as a separate file for each conversation, with each line corresponding to a message. An example is available in this repository.
Here are the folders composing the project:
- Folder `in`: input data, including the textual conversations and graphs.
- Folder `SGCN`: set of scripts to learn embeddings using the SGCN method [CLD'24].
- Folder `signed_graph2vec`: set of scripts to learn embeddings using the SG2V method [CLD'24].
- Folder `emb`: contains all the learned embeddings (a loading sketch follows this list).
- Folder `output`: output files generated by the methods, such as the weights.
- Folder `src`: set of scripts to apply the standard unsigned graph embedding models and the text embedding methods.
- `main.py`: main script used to launch all the experiments.
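To give an idea of how the contents of `emb` might be consumed downstream, here is a small sketch that loads one embedding file. The file format is an assumption (one line per conversation: an identifier followed by whitespace-separated floats); check the files actually produced by each method before reusing it.

```python
# Hypothetical loader for a learned embedding file in emb/. The assumed
# format is one line per conversation: "<conversation_id> <v1> <v2> ...".
# Adjust the parsing to the format actually produced by each method.
import numpy as np

def load_embeddings(path):
    ids, vectors = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip headers or empty lines
            ids.append(parts[0])
            vectors.append([float(x) for x in parts[1:]])
    return ids, np.array(vectors)
```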
This library requires Python 3.8+. Dependencies can be installed with `pip install -r requirements.txt`.
The Graphormer library requires a separate installation and Python 3.9. It can be installed with:
```bash
git clone --recursive https://github.com/microsoft/Graphormer.git
cd Graphormer
bash install.sh
```
See the Graphormer repository for its documentation.
The main script is the entry point to launch all the experiments. Use `python main.py` to run it.
- [PLDL'17] É. Papegnies, V. Labatut, R. Dufour, and G. Linarès. Detection of abusive messages in an on-line community, 14ème Conférence en Recherche d'Information et Applications (CORIA), Marseille, FR, p.153–168, 2017. doi: 10.24348/coria.2017.16 - ⟨hal-01505017⟩
- [PLDL'17a] É. Papegnies, V. Labatut, R. Dufour, and G. Linarès. Graph-based Features for Automatic Online Abuse Detection, 5th International Conference on Statistical Language and Speech Processing (SLSP), Le Mans, FR, Lecture Notes in Artificial Intelligence, 10583:70-81, 2017. doi: 10.1007/978-3-319-68456-7_6 - ⟨hal-01571639⟩
- [PLDL'17b] É. Papegnies, V. Labatut, R. Dufour, and G. Linarès. Détection de messages abusifs au moyen de réseaux conversationnels, 8ème Conférence sur les modèles et l'analyse de réseaux : approches mathématiques et informatiques (MARAMI), La Rochelle, FR, 2017. ⟨hal-01614279⟩
- [PLDL'18] É. Papegnies, V. Labatut, R. Dufour, and G. Linarès. Impact Of Content Features For Automatic Online Abuse Detection, 18th International Conference on Computational Linguistics and Intelligent Text Processing (CICling 2017), Budapest, HU, Lecture Notes in Computer Science, 10762:153–168, 2018. doi: 10.1007/978-3-319-77116-8_30 - ⟨hal-01505502⟩
- [PLDL'19] É. Papegnies, V. Labatut, R. Dufour, and G. Linarès. Conversational Networks for Automatic Online Moderation, IEEE Transactions on Computational Social Systems, 6(1):38–55, 2019. doi: 10.1109/TCSS.2018.2887240 - ⟨hal-01999546⟩
- [C'19] N. Cécillon. Exploration de caractéristiques d’embeddings de graphes pour la détection de messages abusifs, MSc Thesis, Avignon Université, Laboratoire Informatique d'Avignon (LIA), Avignon, FR, 2019. ⟨dumas-04073337⟩
- [CLDL'19] N. Cécillon, V. Labatut, R. Dufour & G. Linarès. Abusive Language Detection in Online Conversations by Combining Content- and Graph-Based Features, AAAI ICWSM International Workshop on Modeling and Mining Social-Media Driven Complex Networks (Soc2Net), Munich, DE, Frontiers in Big Data 2:8, 2019. doi: 10.3389/fdata.2019.00008 - ⟨hal-02130205⟩
- [CLDL'20a] N. Cécillon, V. Labatut, R. Dufour & G. Linarès. Tuning Graph2vec with Node Labels for Abuse Detection in Online Conversations, 11ème Conférence sur les modèles et l'analyse de réseaux : approches mathématiques et informatiques (MARAMI), Montpellier, FR, 2020. Conference version - ⟨hal-02993571⟩
- [CLDL'20b] N. Cécillon, V. Labatut, R. Dufour & G. Linarès. Graph Embeddings for Abusive Language Detection, Springer Nature Computer Science 2:37, 2020. doi: 10.1007/s42979-020-00413-7 - ⟨hal-03042171⟩
- [CDL'21] N. Cécillon, R. Dufour & V. Labatut. Approche multimodale par plongements de texte et de graphes pour la détection de messages abusifs, Traitement Automatique des Langues 62(2):13-38, 2021. Journal version - ⟨hal-03527016⟩
- [C'24] N. Cécillon. Combining Graph and Text to Model Conversations: An Application to Online Abuse Detection, PhD Thesis, Avignon Université, Laboratoire Informatique d'Avignon (LIA), Avignon, FR, 2024. ⟨tel-04441308⟩
- [CLDA'24] N. Cécillon, V. Labatut, R. Dufour & N. Arınık. Whole-Graph Representation Learning For the Classification of Signed Networks, IEEE Access (in press), 2024. doi: 10.1109/ACCESS.2024.3472474 - ⟨hal-04712854⟩
- [CLD'24] N. Cécillon, R. Dufour & V. Labatut. Conversation-Based Multimodal Abuse Detection Through Text and Graph Embeddings, submitted, 2024.