
What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?

Computational statistics, 2021-09, Vol.36 (3), p.2009-2031 [Peer Reviewed Journal]

This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply, 2020. ISSN: 0943-4062; EISSN: 1613-9658; DOI: 10.1007/s00180-020-00999-9

  • Title:
    What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?
  • Author: Marcot, Bruce G. ; Hanea, Anca M.
  • Subjects: Bayesian analysis ; Bias ; Calibration ; Classification ; Collinearity ; Datasets ; Economic Theory/Quantitative Economics/Mathematical Methods ; Independent variables ; Mathematics and Statistics ; Network analysis ; Original Paper ; Probability and Statistics in Computer Science ; Probability Theory and Stochastic Processes ; Statistical methods ; Statistics ; Variables
  • Is Part Of: Computational statistics, 2021-09, Vol.36 (3), p.2009-2031
  • Description: Cross-validation using randomized subsets of data—known as k-fold cross-validation—is a powerful means of testing the success rate of models used for classification. However, few if any studies have explored how values of k (number of subsets) affect validation results in models tested with data of known statistical properties. Here, we explore conditions of sample size, model structure, and variable dependence affecting validation outcomes in discrete Bayesian networks (BNs). We created 6 variants of a BN model with known properties of variance and collinearity, along with data sets of n = 50, 500, and 5000 samples, and then tested classification success and evaluated CPU computation time with seven levels of folds (k = 2, 5, 10, 20, n − 5, n − 2, and n − 1). Classification error declined with increasing n, particularly in BN models with high multivariate dependence, and declined with increasing k, generally levelling out at k = 10, although k = 5 sufficed with large samples (n = 5000). Our work supports the common use of k = 10 in the literature, although in some cases k = 5 would suffice with BN models having independent variable structures.
  • Publisher: Berlin/Heidelberg: Springer Berlin Heidelberg
  • Language: English
  • Identifier: ISSN: 0943-4062
    EISSN: 1613-9658
    DOI: 10.1007/s00180-020-00999-9
  • Source: ProQuest Central
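
The k-fold procedure described in the abstract can be sketched in plain Python: the data are shuffled once, partitioned into k folds, and each fold is held out in turn as the test set. The `fit_threshold` classifier and the synthetic data below are placeholder assumptions for illustration, not the discrete Bayesian network models the authors evaluated.

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, predict):
    """Mean classification error over k folds: each fold is held out once
    as a test set while the model is fit on the remaining k-1 folds."""
    folds = kfold_indices(len(xs), k)
    errors = []
    for test_idx in folds:
        train_idx = [j for f in folds if f is not test_idx for j in f]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        wrong = sum(predict(model, xs[j]) != ys[j] for j in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k

# Toy stand-in for a classifier: label 1-D points by the midpoint
# between the two class means (NOT the paper's Bayesian networks).
def fit_threshold(xs, ys):
    n1 = sum(ys)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, n1)
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, len(ys) - n1)
    return (m0 + m1) / 2

def predict_threshold(model, x):
    return 1 if x > model else 0

xs = [0.1 * i for i in range(50)]        # n = 50, the paper's smallest sample size
ys = [1 if x > 2.45 else 0 for x in xs]
for k in (2, 5, 10):                     # three of the paper's seven fold levels
    err = cross_validate(xs, ys, k, fit_threshold, predict_threshold)
    print(k, round(err, 3))
```

Increasing k enlarges each training set (at the cost of more model fits), which is why error typically declines and levels off near k = 10, consistent with the abstract's findings.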
