MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language

Nowakowski, Karol; Ptaszynski, Michal; Masui, Fumito

DOI: https://doi.org/10.3390/info10100317

https://kitami-it.repo.nii.ac.jp/records/8938

Name / File	License	Action
Information 2019, 10(10), 317 (362.8 kB)

Item type

Journal Article(1)

Release date

2020-11-02

Title

MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language

Language

eng

Keywords (subject scheme: Other)

word segmentation
tokenization
language modelling
n-gram models
Ainu language
endangered languages
under-resourced languages

Resource type

Resource

http://purl.org/coar/resource_type/c_6501

Type

journal article

Access rights

open access

Access rights URI

http://purl.org/coar/access_right/c_abf2

Authors

Nowakowski, Karol
Ptaszynski, Michal
Masui, Fumito

Abstract

Description type

Abstract

Description

Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter—a fast word segmentation algorithm, which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.
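The core idea stated in the abstract, finding the shortest sequence of lexical n-grams that matches the input, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `build_table` and `segment`, the dynamic-programming formulation, and the toy English n-grams are all assumptions made for illustration. The sketch also assumes the input contains no spaces and breaks ties between equally short sequences arbitrarily.

```python
from typing import Dict, List, Optional, Tuple


def build_table(ngrams: List[Tuple[str, ...]]) -> Dict[str, Tuple[str, ...]]:
    """Map each n-gram's space-free surface form to its word sequence.

    If two n-grams share a surface form, the later one overwrites the earlier.
    """
    return {"".join(ng): ng for ng in ngrams}


def segment(text: str, table: Dict[str, Tuple[str, ...]]) -> Optional[List[str]]:
    """Return the words of a shortest n-gram sequence whose concatenation
    equals `text`, or None if no such sequence exists."""
    n = len(text)
    max_len = max((len(key) for key in table), default=0)
    # best[i] = (n-grams used, start index of the last n-gram) for text[:i]
    best: List[Optional[Tuple[int, int]]] = [None] * (n + 1)
    best[0] = (0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None or text[start:end] not in table:
                continue
            candidate = best[start][0] + 1
            if best[end] is None or candidate < best[end][0]:
                best[end] = (candidate, start)
    if best[n] is None:
        return None
    # Follow the back-pointers from the end of the text to recover the words.
    words: List[str] = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words[:0] = table[text[start:pos]]
        pos = start
    return words


# Hypothetical toy n-grams (English, for illustration only).
toy_table = build_table([
    ("the",), ("cat",), ("sat",), ("on",), ("mat",),
    ("the", "cat"), ("cat", "sat"), ("sat", "on"), ("the", "mat"),
])
print(segment("thecatsatonthemat", toy_table))
```

In this toy setup the input is covered by three bigrams rather than six unigrams, which is the sense in which the sequence of n-grams is "shortest".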

Bibliographic information

en: Information

Volume 10, Issue 10, p. 317, published 2019

ISSN

Source identifier type

EISSN

Source identifier

2078-2489

DOI

Identifier type

DOI

Versions

Ver.1

2021-03-01 06:11:56.056232
