Using fastText on Windows (Bash on Windows) - TadaoYamaokaの開発日記

I tried out fastText, which is said to train faster than word2vec and give better accuracy.

Environment

Windows Home 64bit
Bash on Windows

Preparing the training data

As a compact dataset for a quick check, I used the abstracts of all Japanese Wikipedia pages.

From the dump index page
Index of /jawiki/latest/
download jawiki-latest-abstract.xml.

To extract the text from the XML file, I used the tool from this repository:
GitHub - icoxfog417/fastTextJapaneseTutorial: Tutorial to train fastText with Japanese corpus

$ git clone https://github.com/icoxfog417/fastTextJapaneseTutorial.git 
$ cd fastTextJapaneseTutorial
$ mkdir source
$ mkdir corpus

Place the downloaded jawiki-latest-abstract.xml in the source directory, then run:

$ python3 parse.py source/jawiki-latest-abstract.xml --extract

The extracted text is saved to corpus/abstracts.txt.
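
For reference, the extraction step can be approximated with the standard library alone. The sketch below is not the code of parse.py. It assumes the dump is a sequence of <doc> elements, each containing an <abstract> child, and the function name iter_abstracts is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_abstracts(xml_source):
    """Yield the non-empty text of each <abstract> element, one per <doc>.

    Assumes a layout like <feed><doc><abstract>...</abstract></doc></feed>.
    """
    # iterparse streams the input, so the whole dump is never parsed at once.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "doc":
            text = (elem.findtext("abstract") or "").strip()
            if text:
                yield text
            elem.clear()  # drop the children of the finished <doc>


if __name__ == "__main__":
    # Tiny in-memory stand-in for source/jawiki-latest-abstract.xml
    sample = io.BytesIO(
        "<feed><doc><title>A</title><abstract>猫は動物である。</abstract></doc>"
        "<doc><title>B</title><abstract> </abstract></doc></feed>".encode("utf-8")
    )
    for line in iter_abstracts(sample):
        print(line)
```

Writing one abstract per line gives the plain-text form that the morphological analysis step consumes.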

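Morphological analysis

Japanese text does not separate words with spaces, so it cannot be fed to fastText as-is. Use MeCab to split it into words.

Install MeCab on Bash on Windows.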

$ sudo apt-get install mecab libmecab-dev mecab-ipadic
$ sudo aptitude install mecab-ipadic-utf8
$ sudo apt-get install python-mecab

Split the text into words.

$ mecab corpus/abstracts.txt -O wakati -o data.txt

The word-segmented result is saved to data.txt.
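
With -O wakati, MeCab writes the text back out as tokens separated by spaces, which is the whitespace-delimited form fastText reads. One way to sanity-check data.txt is to count token frequencies. The following is a minimal sketch; the helper name count_tokens is hypothetical, and the sample lines stand in for the real file.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


if __name__ == "__main__":
    # Stand-in for: open("data.txt", encoding="utf-8")
    sample = ["猫 は 動物 で ある 。", "犬 は 動物 で ある 。"]
    for token, freq in count_tokens(sample).most_common(3):
        print(token, freq)
```

If the most frequent entries are whole sentences rather than short tokens, the segmentation step probably did not run as intended.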

Installing fastText

Install make and g++ beforehand.

In a working directory of your choice:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make

Training with fastText

$ ./fasttext skipgram -input path/to/data.txt -output model -dim 100

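The -dim option specifies the dimensionality of the word vectors.
Because the corpus is fairly small, I used 100, which is fastText's default and the low end of the commonly used range.

model.bin and model.vec are written as output.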
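
model.vec is a plain-text file: the first line holds the vocabulary size and the dimension, and each following line is a word followed by its vector components. As a way to inspect it without extra libraries, here is a minimal sketch of a reader. The function name load_vec is hypothetical, and the sample data stands in for the real file.

```python
import io


def load_vec(fileobj):
    """Parse word2vec-style text vectors into a {word: [float, ...]} dict."""
    header = fileobj.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in fileobj:
        # rstrip() also removes a trailing space before the newline, if any.
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


if __name__ == "__main__":
    # Tiny stand-in for: open("model.vec", encoding="utf-8")
    sample = io.StringIO("2 3\n猫 0.1 0.2 0.3\n犬 0.4 0.5 0.6\n")
    size, dim, vecs = load_vec(sample)
    print(size, dim, sorted(vecs))
```

With -dim 100, the header's second number should be 100.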

Using the trained model

To use the model from Python, use gensim.
gensim also supports loading fastText models.

$ sudo pip3 install gensim

The command above installs gensim.

Run the following script in Python.

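# Works with gensim < 4.0 only; gensim 4.0 removed gensim.models.wrappers
# (newer releases provide gensim.models.fasttext.load_facebook_vectors instead).
from gensim.models.wrappers.fasttext import FastText
model = FastText.load_fasttext_format('model')  # 'model' is the fastText output prefix
model.most_similar(positive = ["王様","女性"], negative = ["男性"], topn = 3)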

This displays the result of 王様 - 男性 + 女性 (king - man + woman).

[('おかえり', 0.6713434457778931), ('恋', 0.6410529613494873), ('おんな', 0.6355608701705933)]
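
The query above is a vector-arithmetic analogy: add the vectors of the positive words, subtract the vectors of the negative words, and rank the remaining vocabulary by cosine similarity to the result. The following is a simplified sketch of that idea on toy two-dimensional vectors. It is not gensim's implementation, which differs in details such as vector normalization, and the names cosine and analogy are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # The query words themselves are excluded from the ranking.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


if __name__ == "__main__":
    # Toy vectors chosen by hand for illustration, not taken from a trained model.
    toy = {
        "king": [1.0, 1.0],
        "man": [1.0, 0.0],
        "woman": [0.0, 1.0],
        "queen": [0.1, 1.0],
        "apple": [1.0, -1.0],
    }
    print(analogy(toy, positive=["king", "woman"], negative=["man"], topn=2))
```

Because the ranking depends entirely on the geometry of the learned vectors, a small or narrow corpus can easily produce neighbors that look unrelated to the intended analogy.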

With only the abstracts of all Wikipedia pages as training data, the result was not really what I had hoped for.