🔍📚 TextDoe: Thai Document Domain Classification Model Based on BoW, LSTM, and Pre-trained RoBERTa-Base

chotanansub/TextDoe

This project is supported by the AI Builder program. Its main objective is to classify Thai documents into eight domains:

  • 🔮 Imaginative
  • 🌱 Natural & Pure Science
  • 🔬 Applied Science
  • 📚 Social Science
  • 🔎 History
  • 💵 Commerce & Finance
  • 🖌️ Arts
  • 🙏 Belief & Thought
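
A classifier needs these eight domains as integer class ids. The mapping below is a minimal sketch: the names `DOMAINS`, `LABEL2ID`, and `ID2LABEL` are hypothetical, and the id order simply follows the list above, which may differ from the order used in the project's own code.

```python
# Hypothetical label mapping; the id order follows the README list and is an assumption.
DOMAINS = [
    "Imaginative",
    "Natural & Pure Science",
    "Applied Science",
    "Social Science",
    "History",
    "Commerce & Finance",
    "Arts",
    "Belief & Thought",
]

# Domain name -> integer class id, and the inverse for decoding predictions.
LABEL2ID = {name: idx for idx, name in enumerate(DOMAINS)}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}

print(len(DOMAINS))         # number of classes
print(LABEL2ID["History"])  # class id assigned to "History"
print(ID2LABEL[0])          # domain name for class id 0
```

Keeping both directions of the mapping makes it easy to encode training labels and to turn predicted ids back into readable domain names.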

🗂️ Data Information

  • Source: Thai National Corpus (TNC)

  • Organization: Department of Linguistics, Faculty of Arts, Chulalongkorn University

  • Data cleaning reduced the dataset from its original 45,000 articles to 36,000 articles

  • 📚 Article Sources

| Source | Proportion |
|---|---|
| Physical book | 60% |
| Journal | 25% |
| Newspaper | 5-10% |
| Other publications (e.g. advertising brochures) | 5-10% |
| Online content | <5% |

  • 📊 Data Splitting

| Data Split | Proportion | Volume (articles) |
|---|---|---|
| Training | 70% | 25,200 |
| Validation | 15% | 5,400 |
| Testing | 15% | 5,400 |
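
The split volumes follow directly from applying 70/15/15 to the 36,000 cleaned articles. The sketch below reproduces those sizes with a seeded random shuffle. It is an illustration only: the README does not say whether the actual split was stratified by domain or which seed was used, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_items, train_pct=70, val_pct=15, seed=42):
    """Shuffle indices 0..n_items-1 and cut them into train/validation/test lists.

    Percentages are integers so the cut points are computed without float rounding.
    The seed value is an arbitrary assumption, not taken from the project.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    n_train = n_items * train_pct // 100
    n_val = n_items * val_pct // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(36_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 25200 5400 5400
```

With imbalanced domains, a stratified split (sampling each domain separately in the same proportions) would keep the class distribution consistent across the three sets.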
