Directions in abusive language training data, a systematic review: Garbage in, garbage out

PloS one, 2020-12, Vol.15 (12), p.e0243300-e0243300 [Peer Reviewed Journal]

COPYRIGHT 2020 Public Library of Science ; 2020 Vidgen, Derczynski ; ISSN: 1932-6203 ; EISSN: 1932-6203 ; DOI: 10.1371/journal.pone.0243300 ; PMID: 33370298

Title: Directions in abusive language training data, a systematic review: Garbage in, garbage out
Author: Vidgen, Bertie ; Derczynski, Leon
Grabar, Natalia
Subjects: Biology and Life Sciences ; Computer and Information Sciences ; Databases, Factual ; Electronic data processing ; Harassment (Law) ; Hate ; Hate speech ; Humans ; Language ; Machine Learning ; Science Policy ; Social Sciences
Is Part Of: PloS one, 2020-12, Vol.15 (12), p.e0243300-e0243300
Description: Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on the creation of a dedicated website for cataloguing abusive language data, hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits, it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.
Publisher: United States: Public Library of Science
Language: English
Identifier: ISSN: 1932-6203
EISSN: 1932-6203
DOI: 10.1371/journal.pone.0243300
PMID: 33370298
Source: Public Library of Science (PLoS) Journals Open Access
Freely Accessible Journals
AUTh Library subscriptions: ProQuest Central
MEDLINE
PubMed Central
DOAJ Directory of Open Access Journals
