TextDoe: Thai Document Domain Classification Model Based on BoW, LSTM, and Pre-trained RoBERTa-base
This project is supported by the AI Builder program. Its main objective is to classify Thai documents into eight domains:
- Imaginative
- Natural & Pure Science
- Applied Science
- Social Science
- History
- Commerce & Finance
- Arts
- Belief & Thought
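
The eight domains above could be encoded as integer class labels for a classifier. The following is a minimal sketch, not the project's actual encoding: the names `DOMAINS`, `LABEL2ID`, and `ID2LABEL` are hypothetical, and the index order simply follows the order of the list above.

```python
# Hypothetical label mapping. The index order is an assumption taken from
# the order in which the domains are listed, not from the project's code.
DOMAINS = [
    "Imaginative",
    "Natural & Pure Science",
    "Applied Science",
    "Social Science",
    "History",
    "Commerce & Finance",
    "Arts",
    "Belief & Thought",
]

# Map each domain name to an integer id and back again.
LABEL2ID = {name: idx for idx, name in enumerate(DOMAINS)}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}
```

Keeping both directions of the mapping lets the same table serve for encoding training labels and for decoding model predictions.
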

- Source: TNC (Thai National Corpus)
- Organization: Department of Linguistics, Faculty of Arts, Chulalongkorn University
- After data cleaning, the dataset was reduced from 45,000 articles to 36,000 articles.

| Source | Proportion |
|---|---|
| Physical book | 60% |
| Journal | 25% |
| Newspaper | 5-10% |
| Other publications (e.g. advertising brochures) | 5-10% |
| Online content | <5% |

| Data Split | Proportion | Volume (articles) |
|---|---|---|
| Training | 70% | 25,200 |
| Validation | 15% | 5,400 |
| Testing | 15% | 5,400 |
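
The split volumes follow directly from applying 70% / 15% / 15% to the 36,000 cleaned articles. The sketch below reproduces that arithmetic and shows one possible way to produce such a split. It assumes a seeded random shuffle; the source does not state how the split was actually made (for example, whether it was stratified by domain), and the function names `split_sizes` and `split_indices` are hypothetical.

```python
import random


def split_sizes(total, train_frac=0.70, val_frac=0.15):
    """Return (train, validation, test) sizes; test takes the remainder."""
    n_train = round(total * train_frac)
    n_val = round(total * val_frac)
    n_test = total - n_train - n_val
    return n_train, n_val, n_test


def split_indices(total, seed=42):
    """Shuffle article indices and cut them into three disjoint parts.

    A seeded random shuffle is an assumption made for illustration only.
    """
    indices = list(range(total))
    random.Random(seed).shuffle(indices)
    n_train, n_val, _ = split_sizes(total)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


print(split_sizes(36_000))  # (25200, 5400, 5400)
```

Assigning the remainder to the test set guarantees the three parts always sum to the total, even when the fractions do not divide the dataset size evenly.
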