Korean text: WARNING: Untokenizable #1176
Version number of KH Coder
3.Beta.07e

Details about the error or the problem
OS
I was pre-processing a Korean text corpus when warnings started to appear, and then the program crashed. I have to redo my pre-processing.

To Reproduce
I got my dataset from https://data.statmt.org/cc-100/, and the text file is 50GB in size, so I can't realistically go in manually and exclude special characters/symbols. How can I avoid this error?

Contents of the console window
Encoding of this Console: cp932
Encoding of this file system: cp932
This is KH Coder 3.Beta.07e on MSWin32.
CWD: C:/khcoder3
Available Physical Memory: 2047MB
Checking MySQL connection...
R Version: 3.1, x86_64
Using un-threaded functions...
kh_msg: missing msg: screen_code::assistant, wordcloud_button2
Monitors: 0, 2560, 0, 1440
new window: 294, 337
Connected to MySQL 5.6, khc2.
MySQL integrity check: pass, c:/khcoder3/dep/mysql
Checking icode (en)... iso-8859-1 or utf8
NFD lines found: 40824
Server cmd: java -showversion -mx300m -cp "C:/khcoder3/dep/stanford-postagger/stanford-postagger.jar" edu.stanford.nlp.tagger.maxent.MaxentTaggerServer -outputFormat xml -outputFormatOptions lemmatize -port 32020 -model "C:/khcoder3/dep/stanford-postagger/models/wsj-0-18-left3words-distsim.tagger"
Starting server, pid: 13208, Connecting.openjdk version "1.8.0_262"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_262-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.262-b10, mixed mode)
Loading default properties from tagger C:/khcoder3/dep/stanford-postagger/models/wsj-0-18-left3words-distsim.tagger
Reading POS tagger model from C:/khcoder3/dep/stanford-postagger/models/wsj-0-18-left3words-distsim.tagger ... done [0.4 sec].
. ok. Tagging...12 24, 2023 9:19:06 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ㎡ (U+33A1, decimal: 13217)
12 24, 2023 9:19:06 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ㎡ (U+33A1, decimal: 13217)
12 24, 2023 9:19:06 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+338D, decimal: 13197)
12 24, 2023 9:19:11 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+321C, decimal: 12828)
12 24, 2023 9:19:17 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 『 (U+300E, decimal: 12302)
12 24, 2023 9:19:20 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+2027, decimal: 8231)
12 24, 2023 9:19:20 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+321C, decimal: 12828)
12 24, 2023 9:19:21 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+200D, decimal: 8205)
12 24, 2023 9:19:23 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+2024, decimal: 8228)
12 24, 2023 9:19:23 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+2024, decimal: 8228)
12 24, 2023 9:19:23 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+2024, decimal: 8228)
12 24, 2023 9:19:23 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+2024, decimal: 8228)
12 24, 2023 9:19:23 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅲ (U+2162, decimal: 8546)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅰ (U+2160, decimal: 8544)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅲ (U+2162, decimal: 8546)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅴ (U+2164, decimal: 8548)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅵ (U+2165, decimal: 8549)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅶ (U+2166, decimal: 8550)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅷ (U+2167, decimal: 8551)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅸ (U+2168, decimal: 8552)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅹ (U+2169, decimal: 8553)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:26 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅲ (U+2162, decimal: 8546)
12 24, 2023 9:19:27 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 「 (U+300C, decimal: 12300)
12 24, 2023 9:19:28 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 【 (U+3010, decimal: 12304)
12 24, 2023 9:19:28 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 【 (U+3010, decimal: 12304)
12 24, 2023 9:19:28 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 【 (U+3010, decimal: 12304)
12 24, 2023 9:19:28 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 【 (U+3010, decimal: 12304)
12 24, 2023 9:19:28 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+FE0E, decimal: 65038)
12 24, 2023 9:19:28 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+FE0E, decimal: 65038)
12 24, 2023 9:19:28 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+FE0E, decimal: 65038)
12 24, 2023 9:19:28 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+FE0E, decimal: 65038)
12 24, 2023 9:19:28 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+FE0E, decimal: 65038)
12 24, 2023 9:19:29 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+2024, decimal: 8228)
12 24, 2023 9:19:29 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+2024, decimal: 8228)
12 24, 2023 9:19:30 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+2027, decimal: 8231)
12 24, 2023 9:19:30 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+2027, decimal: 8231)
12 24, 2023 9:19:31 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅲ (U+2162, decimal: 8546)
12 24, 2023 9:19:31 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅳ (U+2163, decimal: 8547)
12 24, 2023 9:19:31 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅴ (U+2164, decimal: 8548)
12 24, 2023 9:19:31 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅵ (U+2165, decimal: 8549)
12 24, 2023 9:19:31 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: Ⅶ (U+2166, decimal: 8550)
12 24, 2023 9:19:31 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 《 (U+300A, decimal: 12298)
12 24, 2023 9:19:31 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 《 (U+300A, decimal: 12298)
12 24, 2023 9:19:32 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 《 (U+300A, decimal: 12298)
12 24, 2023 9:19:32 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 《 (U+300A, decimal: 12298)
12 24, 2023 9:19:32 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 《 (U+300A, decimal: 12298)
12 24, 2023 9:19:32 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 《 (U+300A, decimal: 12298)
12 24, 2023 9:19:32 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 《 (U+300A, decimal: 12298)
12 24, 2023 9:19:32 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 《 (U+300A, decimal: 12298)
12 24, 2023 9:19:32 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 《 (U+300A, decimal: 12298)
12 24, 2023 9:19:32 午後 edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: 《 (U+300A, decimal: 12298)
Replies: 1 comment
First, you have to select "Korean" in the "New Project" window. You currently have "English" selected.
Second, please try a 5MB file first, not 50GB.
I recommend that you perform random sampling to reduce the data size. I have tried files up to 200MB. I am not sure about files in the GB range.
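
For the random sampling step, here is a minimal Python sketch, not something KH Coder provides. It assumes the corpus is a plain text file with one sentence or paragraph per line, and it keeps each line with probability `target_bytes / file_size`, so the output is only approximately the requested size. The function name `sample_lines` and the file names in the usage note are hypothetical.

```python
import os
import random


def sample_lines(src_path, dst_path, target_bytes=5 * 1024 * 1024, seed=0):
    """Copy a random subset of lines from src_path to dst_path.

    Each line is kept independently with probability
    target_bytes / file_size, so the result is roughly target_bytes long.
    Returns the number of bytes written.
    """
    total_bytes = os.path.getsize(src_path)
    if total_bytes == 0:
        # Nothing to sample from: create an empty output file.
        open(dst_path, "wb").close()
        return 0

    keep_prob = min(1.0, target_bytes / total_bytes)
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    written = 0

    # Binary mode copies lines byte for byte, so the text encoding is untouched
    # and the file is streamed without loading it into memory.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            if rng.random() < keep_prob:
                dst.write(line)
                written += len(line)
    return written
```

Something like `sample_lines("ko.txt", "ko_sample_5mb.txt")` would then produce a file of roughly 5MB to load into a new project with "Korean" selected. Note that this samples individual lines, so lines that belong together in the original file may be separated.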