Data and its (dis)contents: A survey of dataset development and use in machine learning research

Patterns (New York, N.Y.), 2021-11, Vol.2 (11), p.100336-100336, Article 100336 [Peer Reviewed Journal]

2021 The Authors ; ISSN: 2666-3899 ; EISSN: 2666-3899 ; DOI: 10.1016/j.patter.2021.100336 ; PMID: 34820643

Title: Data and its (dis)contents: A survey of dataset development and use in machine learning research
Author: Paullada, Amandalynne ; Raji, Inioluwa Deborah ; Bender, Emily M. ; Denton, Emily ; Hanna, Alex
Subjects: datasets ; machine learning ; Review
Is Part Of: Patterns (New York, N.Y.), 2021-11, Vol.2 (11), p.100336-100336, Article 100336
Description: In this work, we survey a breadth of literature that has revealed the limitations of predominant practices for dataset collection and use in the field of machine learning. We cover studies that critically review the design and development of datasets with a focus on negative societal impacts and poor outcomes for system performance. We also cover approaches to filtering and augmenting data and modeling techniques aimed at mitigating the impact of bias in datasets. Finally, we discuss works that have studied data practices, cultures, and disciplinary norms and discuss implications for the legal, ethical, and functional challenges the field continues to face. Based on these findings, we advocate for the use of both qualitative and quantitative approaches to more carefully document and analyze datasets during the creation and usage phases.
Datasets form the basis for training, evaluating, and benchmarking machine learning models and have played a foundational role in the advancement of the field. Furthermore, the ways in which we collect, construct, and share these datasets inform the kinds of problems the field pursues and the methods explored in algorithm development. In this work, we survey recent issues pertaining to data in machine learning research, focusing primarily on work in computer vision and natural language processing. We summarize concerns relating to the design, collection, maintenance, distribution, and use of machine learning datasets as well as broader disciplinary norms and cultures that pervade the field. We advocate a turn in the culture toward more careful practices of development, maintenance, and distribution of datasets that are attentive to limitations and societal impact while respecting the intellectual property and privacy rights of data creators and data subjects.
Datasets have become a critical component in the advancement of machine learning research. The ways in which such datasets are collected, constructed, and shared play a significant role in shaping the quality and impact of this research. We conduct a survey of the literature on concerns relating to the design, collection, maintenance, and distribution of machine learning datasets, as well as broader disciplinary norms and cultures that pervade the field.
Publisher: United States: Elsevier Inc
Language: English
Identifier: ISSN: 2666-3899
EISSN: 2666-3899
DOI: 10.1016/j.patter.2021.100336
PMID: 34820643
Source: PubMed Central
DOAJ Directory of Open Access Journals
