MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language
https://kitami-it.repo.nii.ac.jp/records/8938
Name / File | License | Action |
---|---|---|
Information 2019, 10(10), 317 (362.8 kB)
|
Item type | Journal Article(1) | |||||
---|---|---|---|---|---|---|
Release date | 2020-11-02 | |||||
Title | ||||||
Title | MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language | |||||
Language | en | |||||
Language | ||||||
Language | eng | |||||
Keyword | ||||||
Subject Scheme | Other | |||||
Subject | word segmentation | |||||
Keyword | ||||||
Subject Scheme | Other | |||||
Subject | tokenization | |||||
Keyword | ||||||
Subject Scheme | Other | |||||
Subject | language modelling | |||||
Keyword | ||||||
Subject Scheme | Other | |||||
Subject | n-gram models | |||||
Keyword | ||||||
Subject Scheme | Other | |||||
Subject | Ainu language | |||||
Keyword | ||||||
Subject Scheme | Other | |||||
Subject | endangered languages | |||||
Keyword | ||||||
Subject Scheme | Other | |||||
Subject | under-resourced languages | |||||
Resource type | ||||||
Resource type identifier | http://purl.org/coar/resource_type/c_6501 | |||||
Resource type | journal article | |||||
Access rights | ||||||
Access rights | open access | |||||
Access rights URI | http://purl.org/coar/access_right/c_abf2 | |||||
Author | Nowakowski, Karol; Ptaszynski, Michal; Masui, Fumito | |||||
Author alternative name | ||||||
Identifier Scheme | WEKO | |||||
Identifier | 90352 | |||||
Identifier Scheme | CiNii ID | |||||
Identifier URI | http://ci.nii.ac.jp/nrid/9000006924461 | |||||
Identifier | 9000006924461 | |||||
Identifier Scheme | KAKEN - Researcher Search | |||||
Identifier URI | https://nrid.nii.ac.jp/ja/nrid/1000060711504/ | |||||
Identifier | 60711504 | |||||
Name | プタシンスキ, ミハウ | |||||
Language | ja | |||||
Author alternative name | ||||||
Identifier Scheme | WEKO | |||||
Identifier | 304 | |||||
Identifier Scheme | KAKEN - Researcher Search | |||||
Identifier URI | https://nrid.nii.ac.jp/ja/nrid/1000080324549/ | |||||
Identifier | 80324549 | |||||
Name | 桝井, 文人 | |||||
Language | ja | |||||
Abstract | ||||||
Description type | Abstract | |||||
Description | Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter—a fast word segmentation algorithm, which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition. | |||||
Bibliographic information | en : Information, vol. 10, no. 10, p. 317, issued 2019 | |||||
ISSN | ||||||
Source identifier type | EISSN | |||||
Source identifier | 2078-2489 | |||||
DOI | ||||||
Identifier type | DOI | |||||
Related identifier | https://doi.org/10.3390/info10100317 | |||||
Publisher | ||||||
Publisher | MDPI | |||||
Author version flag | ||||||
Value | publisher | |||||
Publication type | ||||||
Publication type | VoR | |||||
Publication type Resource | http://purl.org/coar/version/c_970fb48d4fbd8a85 |
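
The abstract describes reducing word segmentation to finding the shortest sequence of lexical n-grams that matches the input text. The sketch below illustrates that idea only in general terms and is not the authors' implementation: it assumes a lexicon that maps each n-gram with its spaces removed to its segmented form, and uses simple dynamic programming to cover the input with the fewest entries. The function name `segment`, the lexicon `TOY_NGRAMS`, and its English entries are all hypothetical and made up for illustration; any tie-breaking or scoring used in the paper is not modelled here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: unspaced n-gram -> segmented form.
# These entries are invented for illustration, not taken from the paper's data.
TOY_NGRAMS: Dict[str, str] = {
    "the": "the",
    "cat": "cat",
    "sat": "sat",
    "thecat": "the cat",  # a 2-gram stored under its unspaced key
}


def segment(text: str, ngrams: Dict[str, str]) -> Optional[List[str]]:
    """Cover `text` with the fewest lexicon entries.

    Returns the segmented forms of the chosen n-grams in order,
    or None if the lexicon cannot cover the whole input.
    """
    n = len(text)
    max_len = max((len(key) for key in ngrams), default=0)
    # best[i] holds (n-grams used, segmented pieces) for text[:i], or None.
    best: List[Optional[Tuple[int, List[str]]]] = [None] * (n + 1)
    best[0] = (0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev = best[start]
            piece = text[start:end]
            if prev is None or piece not in ngrams:
                continue
            count = prev[0] + 1
            current = best[end]
            # Keep the first cover found with the smallest n-gram count.
            if current is None or count < current[0]:
                best[end] = (count, prev[1] + [ngrams[piece]])
    result = best[n]
    return None if result is None else result[1]


pieces = segment("thecatsat", TOY_NGRAMS)
print(" ".join(pieces) if pieces is not None else "no match")  # the cat sat
```

In this toy run the two-entry cover ("the cat" followed by "sat") is preferred over the three-entry cover of single words, which is the "shortest sequence" criterion in miniature. Inputs the lexicon cannot cover return `None`.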