S-464 Automated Occupational Encoding to the Canadian National Occupation Classification using an Ensemble Classifier from TF-IDF and Doc2Vec Embeddings

Occupational and environmental medicine (London, England), 2021-10, Vol.78 (Suppl 1), p.A161-A161 [Peer Reviewed Journal]

Author(s) (or their employer(s)) 2021. No commercial re-use. See rights and permissions. Published by BMJ. ; ISSN: 1351-0711 ; EISSN: 1470-7926 ; DOI: 10.1136/OEM-2021-EPI.442

  • Title:
    S-464 Automated Occupational Encoding to the Canadian National Occupation Classification using an Ensemble Classifier from TF-IDF and Doc2Vec Embeddings
  • Author: Suarez Garcia, Cesar Augusto ; Adisesh, Anil ; Baker, Christopher JO
  • Subjects: Algorithms ; Automation ; Classification ; Classifiers ; Coders ; Coding ; Job titles ; Learning algorithms ; Machine learning ; Prototypes ; Support vector machines ; Symposia
  • Is Part Of: Occupational and environmental medicine (London, England), 2021-10, Vol.78 (Suppl 1), p.A161-A161
  • Description: Introduction: Occupational encoding is a technique that allows job titles provided by study participants to be categorized according to their role in the labor force. Encoding has primarily been a slow, error-prone manual process that is ripe for automation.
    Objectives: Our goal was to design and test an automated coding prototype using machine learning techniques.
    Methods: The prototype classification system ENENOC (the ENsemble Encoder for the National Occupational Classification) comprises a series of steps involving data cleaning, exact match search, multi-classifier ensembling, hierarchical classification, and multiple output selection. In the absence of an exact match between the job title input and the NOC category descriptions, the input data is embedded using the TF-IDF algorithm and Doc2Vec. The embeddings are fed into a hierarchical ensemble classifier that uses classical machine learning techniques: Random Forests, Support Vector Machine and K-Nearest Neighbour. Ensemble encoding is achieved using a majority-voting system. In the hierarchical two-tier classification methodology, the first tier predicts the first digit of the NOC code, while the second tier predicts the second, third and fourth digits of the NOC code for the input data. The combined approach produces a single 4-digit code as a top choice, as well as four alternate NOC codes that serve as additional ranked choices based on the Doc2Vec model.
    Results: The prototype was benchmarked on a manually annotated data set comprising 64,000 records. It produced a top-1 Per-Digit Macro F1-Score of 0.65 and a top-5 Per-Digit Macro F1-Score of 0.76, both of which are well within published accuracy ranges for manual coding (44% to 89% inter-annotator agreement). ENENOC coded 30,000 job titles in 3 hours.
    Conclusion: The ENENOC prototype is a sophisticated ENsemble Encoder for the National Occupational Classification with state-of-the-art accuracy and significant speed improvements over manual coding.
  • Publisher: London: BMJ Publishing Group Ltd
  • Language: English
  • Identifier: ISSN: 1351-0711
    EISSN: 1470-7926
    DOI: 10.1136/OEM-2021-EPI.442
  • Source: Alma/SFX Local Collection
    ProQuest Central
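
Illustrative sketch (not part of the catalog record or the authors' code): the Description mentions majority voting across several classifiers and a two-tier prediction of the NOC code. The Python below shows one way those two ideas could fit together, under stated assumptions. The names majority_vote, predict_noc and keyword_rule are hypothetical, the keyword rules are toy stand-ins for trained Random Forest, SVM and K-Nearest Neighbour models, the codes are placeholders rather than values looked up in the NOC, and the use of a separate second-tier ensemble per first digit is an assumption the abstract does not confirm. The TF-IDF and Doc2Vec embedding steps and the four ranked alternates are omitted.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# A classifier here is any callable mapping a job title to a label.
Classifier = Callable[[str], str]


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the earliest prediction."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable for non-empty input")


def predict_noc(
    title: str,
    tier1: List[Classifier],
    tier2: Dict[str, List[Classifier]],
) -> str:
    """Tier 1 votes on the first digit; tier 2 votes on digits two to four."""
    first_digit = majority_vote([clf(title) for clf in tier1])
    # Assumption: one second-tier ensemble per first digit.
    remaining = majority_vote([clf(title) for clf in tier2[first_digit]])
    return first_digit + remaining


def keyword_rule(keyword: str, if_match: str, otherwise: str) -> Classifier:
    """Toy stand-in for a trained model: label depends on one keyword."""
    def classify(title: str) -> str:
        return if_match if keyword in title.lower() else otherwise
    return classify


# Placeholder ensembles; labels are illustrative, not real NOC lookups.
tier1 = [
    keyword_rule("nurse", "3", "7"),
    keyword_rule("carpenter", "7", "3"),
    keyword_rule("hospital", "3", "7"),
]
tier2 = {
    "3": [
        keyword_rule("nurse", "012", "111"),
        keyword_rule("registered", "012", "111"),
        keyword_rule("doctor", "111", "012"),
    ],
    "7": [keyword_rule("carpenter", "271", "241")] * 3,
}

code_nurse = predict_noc("Registered Nurse", tier1, tier2)
code_carpenter = predict_noc("Carpenter", tier1, tier2)
print(code_nurse, code_carpenter)
```

In this toy run the three first-tier rules disagree on "Registered Nurse" (two vote for one digit, one for another), and the vote settles the first digit before the second tier is consulted. Breaking ties in favour of the earliest-listed classifier is a design choice made for this sketch only.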
