2022-07-18

R環境で小説のテキストマイニングをやってみたら、○○○な結末になった件【その2: 形態素解析と辞書設定】

R言語テキストマイニング R - テキストマイニング R - RMeCab R - stringr R - 形態素解析形態素解析夏目漱石小説 - 坊っちゃん R - str_sub R - RMeCabC MeCab - Neologd辞書 MeCab - UniDic辞書 MeCab - IPA辞書 MeCab - juman辞書 RMeCab - 辞書設定自然言語処理

はじめに: 『R環境で小説のテキストマイニング』の連載シリーズです。
- 連載シリーズの目次
- 実行環境
形態素解析と辞書設定
- mecabを使った形態素解析
- MeCabの辞書設定
辞書による形態素解析結果の違い
RMeCab形態素解析 + neologd辞書を用いた「坊ちゃん」の第1章テキストの形態素解析
RMeCab形態素解析 + neologd辞書を用いた「坊ちゃん」の第2章以降のテキストの形態素解析
まとめ
- テキスト処理の関連記事
- 参考資料

はじめに: 『R環境で小説のテキストマイニング』の連載シリーズです。

テキストマイニングは、テキストデータを定量的に扱って、有益な情報を抽出するデータマイニング手法の1つです。このようなテキストの情報解析では、自然言語処理、形態素解析、キーワード抽出、共起分析、ネットワーク可視化、機械学習、感情分析など、様々な分析手法が用いられます。近年では各種のAPIやWebスクレイピングなどを利用して、ウェブやSNS、地図上のクチコミからテキストデータを取得して、テキストマイニングに活用されています。

この記事は、夏目漱石の「坊っちゃん」が対象小説で、 形態素解析と辞書設定を扱った内容となっています。

連載シリーズの目次

1. 青空文庫、対象小説の紹介
2. 「坊っちゃん」のテキストの前処理
3. 形態素解析と辞書設定
4. 形態素解析と複合語抽出 (ルールベース抽出、pytermextract)
5. テキストマイニングと形態素のワードクラウド
6. テキストマイニングとN-gramのネットワーク表現(章ごとの関係性とか)
7. テキストマイニングと単語の埋め込み生成(word2vecとか)
8. 文章の感情分析(感情ポジネガ分析とか、英語感情対応表とか、ML-askとか)
9. ミクロとマクロなテキスト解析〜応用篇として内容を考え中〜

【その2: 形態素解析と辞書設定】に関する内容をやっていきます。

実行環境

#ターミナル上のR環境
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20.6.0 (64-bit)

#RStudioターミナル上のR環境
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.0 (64-bit)

今回、M1 MacだとRStudioでRMeCabパッケージがうまく動作せず、やや面倒ですが、形態素解析の部分はターミナルからRを起動して実行しています。

形態素解析と辞書設定

mecabを使った形態素解析

mecabパッケージのインストールについては、過去記事の補足「M1 Macでのmecabのインストール方法 - Installation of RMeCab 1.07 on M1 Mac #13」を参照のこと。

skume.net

#mecabのインストール for M1 Mac
install.packages("RMeCab", repos = "https://rmecab.jp/R", type = "source")
library(RMeCab)

まずは、短文を使って試してみます。「坊ちゃん」の第1章テキストをGitHubからダウンロードして、先頭20文字のテキストに対して形態素解析にかけることにします。

#ダウンロード & readLines
file_path_1st <- "https://raw.githubusercontent.com/kumeS/Blog/master/R_text_analysis/R_01/Bochan_text.1st.txt"
Bochan_text.1st <- readLines(con = file(file_path_1st))

#第1章: 先頭20文字のテキストの表示
txt <- stringr::str_sub(Bochan_text.1st, start = 1, end = 20) 
txt
#[1] "親譲りの無鉄砲で小供の時から損ばかりして"


#RMeCab形態素解析: mypref = 0 (テキストに現れる形態素と同じ形態素を返す)
unlist(RMeCab::RMeCabC(txt, mypref = 0))
#    名詞     助詞     名詞     助詞   接頭詞     名詞     助詞     名詞 
#"親譲り"     "の" "無鉄砲"     "で"     "小"     "供"     "の"     "時" 
#    助詞     名詞     助詞     動詞     助詞 
#  "から"     "損" "ばかり"     "し"     "て" 

#RMeCab形態素解析: mypref = 1 (テキストに現れる形態素の基本形を返す)
unlist(RMeCab::RMeCabC(txt, mypref = 1))
#    名詞     助詞     名詞     助詞   接頭詞     名詞     助詞     名詞 
#"親譲り"     "の" "無鉄砲"     "で"     "小"     "供"     "の"     "時" 
#    助詞     名詞     助詞     動詞     助詞 
#  "から"     "損" "ばかり"   "する"     "て"

myprefのオプションは出力を、「0: テキストに現れる形態素と同じ形態素が返す」か、「1: テキストに現れる形態素の基本形が返す」かを制御します。

myprefの設定(0 → 1)を変更すると、この短文での実行結果であれば、「動詞 "し"」がその基本形である「動詞 "する"」に変わっています。

以後の形態素解析は、mypref = 1の設定で進めることにします。

MeCabの辞書設定

言わずもがなですが、 形態素解析で最も重要なものは辞書です。

ここでは、IPA辞書、Neologd辞書、UniDic辞書、juman辞書のインストール方法・RMeCabでの設定方法を紹介します。 Neologd辞書は、GitHubから辞書をダウンロード・インストールしますが、IPA辞書、UniDic辞書、juman辞書は、brew installでインストールします。

「ipadic」「IPA辞書」は、MeCabのデフォルト辞書として提供されています。ただ、2007年以降、IPA辞書は更新されていないようです*1。

Neologd辞書は、MeCabのためにカスタマイズされたシステム辞書です*2。 Neologd辞書は、Web上の多くの言語資源から抽出された新語を多く含んでいるようです。 Web文書を解析する際には、このシステム辞書とデフォルトの辞書（ipadic）を併用するとよいと提案されています。

UniDic辞書は、国立国語研究所が提供する、国語研短単位自動解析用辞書です*3。また、JUMAN辞書は、京都大学黒橋・河原研究室で開発されている*4日本語形態素解析システム JUMAN に含まれる辞書です。

#Neologd辞書のインストール
#GitHubからソース・ダウンロード
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git

#コンパイル
./mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n
#...
#[install-mecab-ipadic-NEologd] : Finish..
#[install-mecab-ipadic-NEologd] : Finish..
#なぜか、/opt/homebrew/lib/mecab/dic/mecab-ipadic-neologd にインストールされます

#mecab-ipadic（IPA辞書）: デフォルト
brew install mecab-ipadic
#/opt/homebrew/lib/mecab/dic/ipadicにインストールされます

#mecab-unidic（UniDic辞書）
brew install mecab-unidic
#/opt/homebrew/lib/mecab/dic/unidicにインストールされます

#mecab-jumandic (juman辞書)
brew install mecab-jumandic
#/opt/homebrew/lib/mecab/dic/jumandicにインストールされます

インストール後に、/opt/homebrew/lib/mecab/dicに各辞書の設定ファイルが保存されます。

辞書による形態素解析結果の違い

辞書を変えると、MeCabでの形態素解析の結果がどう変わるかを見ていきます。

このセクションは、R環境では無く、ターミナルのMecabコマンドで実行しています。出力結果はRMeCabと同じです。

デフォルトのipadic辞書を使った場合

echo "親譲りの無鉄砲で小供の時から損ばかりしている。小学校に居る時分学校の二階から飛び降りて一週間ほど腰を抜かした事" | mecab
#or
echo "親譲りの無鉄砲で小供の時から損ばかりしている。小学校に居る時分学校の二階から飛び降りて一週間ほど腰を抜かした事" | mecab -d /opt/homebrew/lib/mecab/dic/ipadic 

親譲り   名詞,一般,*,*,*,*,親譲り,オヤユズリ,オヤユズリ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
無鉄砲   名詞,一般,*,*,*,*,無鉄砲,ムテッポウ,ムテッポー
で 助詞,格助詞,一般,*,*,*,で,デ,デ
小 接頭詞,名詞接続,*,*,*,*,小,ショウ,ショー
供 名詞,サ変接続,*,*,*,*,供,キョウ,キョー
の 助詞,連体化,*,*,*,*,の,ノ,ノ
時 名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
から  助詞,格助詞,一般,*,*,*,から,カラ,カラ
損 名詞,一般,*,*,*,*,損,ソン,ソン
ばかり   助詞,副助詞,*,*,*,*,ばかり,バカリ,バカリ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て 助詞,接続助詞,*,*,*,*,て,テ,テ
いる  動詞,非自立,*,*,一段,基本形,いる,イル,イル
。 記号,句点,*,*,*,*,。,。,。
小学校   名詞,一般,*,*,*,*,小学校,ショウガッコウ,ショーガッコー
に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
居る  動詞,自立,*,*,一段,基本形,居る,イル,イル
時分  名詞,一般,*,*,*,*,時分,ジブン,ジブン
学校  名詞,一般,*,*,*,*,学校,ガッコウ,ガッコー
の 助詞,連体化,*,*,*,*,の,ノ,ノ
二 名詞,数,*,*,*,*,二,ニ,ニ
階 名詞,接尾,助数詞,*,*,*,階,カイ,カイ
から  助詞,格助詞,一般,*,*,*,から,カラ,カラ
飛び降り    動詞,自立,*,*,一段,連用形,飛び降りる,トビオリ,トビオリ
て 助詞,接続助詞,*,*,*,*,て,テ,テ
一 名詞,数,*,*,*,*,一,イチ,イチ
週間  名詞,接尾,助数詞,*,*,*,週間,シュウカン,シューカン
ほど  助詞,副助詞,*,*,*,*,ほど,ホド,ホド
腰 名詞,一般,*,*,*,*,腰,コシ,コシ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
抜かし   動詞,自立,*,*,五段・サ行,連用形,抜かす,ヌカシ,ヌカシ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
事 名詞,非自立,一般,*,*,*,事,コト,コト
EOS

neologd辞書を使った場合

echo "『親譲りの無鉄砲で小供の時から損ばかりしている。小学校に居る時分学校の二階から飛び降りて一週間ほど腰を抜かした事" | mecab -d /opt/homebrew/lib/mecab/dic/mecab-ipadic-neologd

『 記号,括弧開,*,*,*,*,『,『,『
親譲り   名詞,一般,*,*,*,*,親譲り,オヤユズリ,オヤユズリ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
無鉄砲   名詞,一般,*,*,*,*,無鉄砲,ムテッポウ,ムテッポー
で 助詞,格助詞,一般,*,*,*,で,デ,デ
小供  名詞,固有名詞,一般,*,*,*,小供,コドモ,コドモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
時 名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
から  助詞,格助詞,一般,*,*,*,から,カラ,カラ
損 名詞,一般,*,*,*,*,損,ソン,ソン
ばかり   助詞,副助詞,*,*,*,*,ばかり,バカリ,バカリ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て 助詞,接続助詞,*,*,*,*,て,テ,テ
いる  動詞,非自立,*,*,一段,基本形,いる,イル,イル
。 記号,句点,*,*,*,*,。,。,。
小学校   名詞,一般,*,*,*,*,小学校,ショウガッコウ,ショーガッコー
に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
居る  動詞,自立,*,*,一段,基本形,居る,イル,イル
時分  名詞,一般,*,*,*,*,時分,ジブン,ジブン
学校  名詞,一般,*,*,*,*,学校,ガッコウ,ガッコー
の 助詞,連体化,*,*,*,*,の,ノ,ノ
二階  名詞,固有名詞,地域,一般,*,*,二階,ニカイ,ニカイ
から  助詞,格助詞,一般,*,*,*,から,カラ,カラ
飛び降り    動詞,自立,*,*,一段,連用形,飛び降りる,トビオリ,トビオリ
て 助詞,接続助詞,*,*,*,*,て,テ,テ
一週間   名詞,固有名詞,一般,*,*,*,一週間,イッシュウカン,イッシューカン
ほど  助詞,副助詞,*,*,*,*,ほど,ホド,ホド
腰 名詞,一般,*,*,*,*,腰,コシ,コシ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
抜かし   動詞,自立,*,*,五段・サ行,連用形,抜かす,ヌカシ,ヌカシ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
事 名詞,非自立,一般,*,*,*,事,コト,コト
EOS

unidic辞書を使った場合

echo "『親譲りの無鉄砲で小供の時から損ばかりしている。小学校に居る時分学校の二階から飛び降りて一週間ほど腰を抜かした事" | mecab -d /opt/homebrew/lib/mecab/dic/unidic

『         『 補助記号-括弧開      
親譲り   オヤユズリ オヤユズリ 親譲り   名詞-普通名詞-一般      
の ノ ノ の 助詞-格助詞        
無鉄砲   ムテッポー ムテッポウ 無鉄砲   名詞-普通名詞-形状詞可能     
で デ ダ だ 助動詞   助動詞-ダ   連用形-一般
小供  コドモ   コドモ   子供  名詞-普通名詞-一般      
の ノ ノ の 助詞-格助詞        
時 トキ  トキ  時 名詞-普通名詞-副詞可能        
から  カラ  カラ  から  助詞-格助詞        
損 ソン  ソン  損 名詞-普通名詞-一般      
ばかり   バカリ   バカリ   ばかり   助詞-副助詞        
し シ スル  為る  動詞-非自立可能  サ行変格    連用形-一般
て テ テ て 助詞-接続助詞     
いる  イル  イル  居る  動詞-非自立可能  上一段-ア行    終止形-一般
。         。 補助記号-句点     
小 ショー   ショウ   小 接頭辞       
学校  ガッコー    ガッコウ    学校  名詞-普通名詞-一般      
に ニ ニ に 助詞-格助詞        
居る  イル  イル  居る  動詞-非自立可能  上一段-ア行    連体形-一般
時分  ジブン   ジブン   時分  名詞-普通名詞-副詞可能        
学校  ガッコー    ガッコウ    学校  名詞-普通名詞-一般      
の ノ ノ の 助詞-格助詞        
二 ニ ニ 二 名詞-数詞       
階 カイ  カイ  階 名詞-普通名詞-助数詞可能     
から  カラ  カラ  から  助詞-格助詞        
飛び降り    トビオリ    トビオリル 飛び下りる 動詞-一般   上一段-ラ行    連用形-一般
て テ テ て 助詞-接続助詞     
一 イチ  イチ  一 名詞-数詞       
週間  シューカン シュウカン 週間  名詞-普通名詞-助数詞可能     
ほど  ホド  ホド  ほど  助詞-副助詞        
腰 コシ  コシ  腰 名詞-普通名詞-一般      
を オ ヲ を 助詞-格助詞        
抜かし   ヌカシ   ヌカス   抜かす   動詞-一般   五段-サ行   連用形-一般
た タ タ た 助動詞   助動詞-タ   連体形-一般
事 コト  コト  事 名詞-普通名詞-一般

jumandic辞書を使った場合

echo "『親譲りの無鉄砲で小供の時から損ばかりしている。小学校に居る時分学校の二階から飛び降りて一週間ほど腰を抜かした事" | mecab -d /opt/homebrew/lib/mecab/dic/jumandic

『 特殊,括弧始,*,*,『,『,*
親譲り   名詞,普通名詞,*,*,親譲り,おやゆずり,代表表記:親譲り/おやゆずり カテゴリ:抽象物
の 助詞,格助詞,*,*,の,の,*
無鉄砲で    形容詞,*,ナ形容詞,ダ列タ系連用テ形,無鉄砲だ,むてっぽうで,代表表記:無鉄砲だ/むてっぽうだ
小 接頭辞,名詞接頭辞,*,*,小,しょう,代表表記:小/しょう 反義:接頭辞-名詞接頭辞:大/だい;接頭辞-名詞接頭辞:大/おお 内容語 換言:+N:Nが小さい
供 名詞,普通名詞,*,*,供,とも,代表表記:供/とも 漢字読み:訓 カテゴリ:人;抽象物
の 助詞,接続助詞,*,*,の,の,*
時 名詞,副詞的名詞,*,*,時,とき,代表表記:時/とき
から  助詞,格助詞,*,*,から,から,*
損 形容詞,*,ナ形容詞,語幹,損だ,そん,代表表記:損だ/そんだ 動詞派生:損する/そんする
ばかり   助詞,副助詞,*,*,ばかり,ばかり,*
して  動詞,*,サ変動詞,タ系連用テ形,する,して,連語
いる  接尾辞,動詞性接尾辞,母音動詞,基本形,いる,いる,代表表記:いる/いる
。 特殊,句点,*,*,。,。,*
小学校   名詞,普通名詞,*,*,小学校,しょうがっこう,代表表記:小学校/しょうがっこう 組織名末尾 カテゴリ:場所-施設 ドメイン:教育・学習
に 助詞,格助詞,*,*,に,に,*
居る  動詞,*,母音動詞,基本形,居る,いる,代表表記:居る/いる
時分  名詞,時相名詞,*,*,時分,じぶん,代表表記:時分/じぶん カテゴリ:時間
学校  名詞,普通名詞,*,*,学校,がっこう,代表表記:学校/がっこう カテゴリ:場所-施設 ドメイン:教育・学習
の 助詞,接続助詞,*,*,の,の,*
二 名詞,数詞,*,*,二,に,カテゴリ:数量
階 接尾辞,名詞性名詞助数辞,*,*,階,かい,代表表記:階/かい 準内容語
から  助詞,格助詞,*,*,から,から,*
飛び降りて 動詞,*,母音動詞,タ系連用テ形,飛び降りる,とびおりて,代表表記:飛び降りる/とびおりる
一 名詞,数詞,*,*,一,いち,カテゴリ:数量
週間  接尾辞,名詞性名詞助数辞,*,*,週間,しゅうかん,代表表記:週間/しゅうかん 準内容語 カテゴリ:時間
ほど  接尾辞,名詞性名詞接尾辞,*,*,ほど,ほど,代表表記:程/ほど
腰 名詞,普通名詞,*,*,腰,こし,代表表記:腰/こし 漢字読み:訓 カテゴリ:動物-部位;抽象物
を 助詞,格助詞,*,*,を,を,*
抜かした    動詞,*,子音動詞サ行,タ形,抜かす,ぬかした,代表表記:抜かす/ぬかす 自他動詞:自:抜ける/ぬける
事 名詞,普通名詞,*,*,事,こと,代表表記:事/こと 漢字読み:訓 カテゴリ:抽象物
EOS

Neologd辞書は、新語・固有表現に強く、カバーできる語彙数が多いと評価されています。 neologd辞書を使った形態素解析の結果を確認すると、「小供」、「学校」、「二階」、「一週間」がちゃんと複合語として抽出されていますね。

この結果を受けて、形態素解析は主に、「RMeCab」+「neologd辞書」を使うことにします。

RMeCabで辞書設定を変更する方法

このセクションは、結構、重要です。設定方法を知らないとハマる可能性があります。

RMeCabで辞書設定を変更するには、適時、mecabrcを修正します。

sudo vimで、そのテキストファイルを書き換えます。Macだと、/usr/local/etcディレクト内にあります。

sudo vim /usr/local/etc/mecabrc

そして、7行目に、以下のようにdicdirのディレクトリパスを追記します。

;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
;dicdir =  /opt/homebrew/lib/mecab/dic/ipadic
dicdir =  /opt/homebrew/lib/mecab/dic/mecab-ipadic-neologd

; userdic = /home/foo/bar/user.dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n

この例では、RMeCabでneologd辞書を使用するために、 dicdir = /opt/homebrew/lib/mecab/dic/mecab-ipadic-neologdの行を追記しています。先頭の「;」のコメントアウトは外しておきます。

RMeCab形態素解析 + neologd辞書を用いた「坊ちゃん」の第1章テキストの形態素解析

辞書設定の注意点ですが、RMeCabCのオプションで、 dic="/opt/homebrew/lib/mecab/dic/mecab-ipadic-neologd" のパスを指定すると、Abortします。今のところ、理由は不明です。

そこで、上記でmecabrcを設定変更した上で、 RMeCabCのオプションで、 mecabrc="/usr/local/etc/mecabrc" を指定します。mecabrcオプションだと、RMeCabCがうまく実行できます。

#RMeCab形態素解析 + neologd辞書
unlist(RMeCab::RMeCabC("『親譲りの無鉄砲で小供の時から損ばかりしている。", mypref = 1,
                       mecabrc="/usr/local/etc/mecabrc"))
#    記号     名詞     助詞     名詞     助詞     名詞     助詞     名詞 
#    "『" "親譲り"     "の" "無鉄砲"     "で"   "小供"     "の"     "時" 
#    助詞     名詞     助詞     動詞     助詞     動詞     記号 
#  "から"     "損" "ばかり"   "する"     "て"   "いる"     "。"

次に、RMeCab形態素解析 + neologd辞書を使って、「坊ちゃん」の第1章テキスト全体に対して形態素解析を実行します。解析後に、品詞と形態素の頻度集計も行いました。

#ダウンロード & readLines
file_path_1st <- "https://raw.githubusercontent.com/kumeS/Blog/master/R_text_analysis/R_01/Bochan_text.1st.txt"
Bochan_text.1st <- readLines(con = file(file_path_1st))

#RMeCab形態素解析 + neologd辞書
result0 <- unlist(RMeCab::RMeCabC(Bochan_text.1st, mypref = 1,
                                  mecabrc="/usr/local/etc/mecabrc"))

#データフレーム変換
result1 <- data.frame(parts = names(result0), 
                      mor = result0, 
                      row.names = 1:length(result0))

#結果表示
head(result1)
#  parts    mor
#1  名詞 親譲り
#2  助詞     の
#3  名詞 無鉄砲
#4  助詞     で
#5  名詞   小供
#6  助詞     の

#結果保存
#saveRDS(result1, file = "result1.Rds")
#write.table(result1, file = "result1.txt", append = F, row.names = F, sep="\t")
#result1 <- readRDS(file = "result1.Rds")

#GitHubからダウンロード
##Rds fileの場合
#Rds_path_all <- "https://github.com/kumeS/Blog/raw/master/R_text_analysis/R_02/result1.Rds"
#download.file(url=Rds_path_all, destfile = basename(Rds_path_all))
#result1 <- readRDS(file = "result1.Rds")
##text fileの場合
#result1 <- read.table(file="https://raw.githubusercontent.com/kumeS/Blog/master/R_text_analysis/R_02/result1.txt", header = T, sep = "\t")

次に、table()関数を用いて、品詞と形態素の頻度を集計しました。

#品詞のカウント
a <- table(result1$parts)
data.frame(parts=names(a), Freq=as.numeric(a))[order(Freq=as.numeric(a), decreasing = T),]
#      parts Freq
#5      助詞 1500
#11     名詞 1292
#9      動詞  775
#6    助動詞  535
#3      記号  438
#10     副詞  138
#4    形容詞   87
#12   連体詞   47
#7    接続詞   29
#8    接頭詞   24
#2    感動詞    4
#1  フィラー    1

#形態素のカウント: 名詞と動詞
b <- table(result1$mor[result1$parts %in% c("名詞", "動詞")])
head(data.frame(parts=names(b), Freq=as.numeric(b))[order(Freq=as.numeric(b), decreasing = T),], n=10)
#    parts Freq
#57   する   89
#172  云う   50
#16   いる   46
#21   おれ   43
#95   なる   37
#526    清   37
#281    兄   27
#116  もの   26
#196    何   25
#8    ある   24

#形態素のカウント: 名詞
b1 <- table(result1$mor[result1$parts %in% c("名詞")])
head(data.frame(parts=names(b1), Freq=as.numeric(b1))[order(Freq=as.numeric(b1), decreasing = T),], n=10)
#     parts Freq
#13    おれ   43
#363     清   37
#190     兄   27
#67    もの   26
#122     何   25
#55      の   24
#262     事   22
#12  おやじ   17
#242     三   16
#25    これ   14

RMeCab形態素解析 + neologd辞書を用いた「坊ちゃん」の第2章以降のテキストの形態素解析

続いて、RMeCab形態素解析 + neologd辞書を使って、「坊ちゃん」の各章のテキスト(2章から11章まで)に対して形態素解析を実行します。

#ダウンロード & readLines
file_path_all <- "https://raw.githubusercontent.com/kumeS/Blog/master/R_text_analysis/R_01/Bochan_text.all.txt"

#コネクション
txt <- file(file_path_all)
Bochan_text.all <- readLines(con = txt)
close(txt)

#RMeCab形態素解析 + neologd辞書
for(n in 2:length(Bochan_text.all)){
result00 <- unlist(RMeCab::RMeCabC(Bochan_text.all[n], mypref = 1,
                                  mecabrc="/usr/local/etc/mecabrc"))

#データフレーム変換
result01 <- data.frame(parts = names(result00), 
                      mor = result00, 
                      row.names = 1:length(result00))

#結果保存
saveRDS(result01, file = paste0("result", n, ".Rds"))
write.table(result01, file = paste0("result", n, ".txt"), 
append = F, row.names = F, sep="\t")
}

github.com

まとめ

MeCabでの形態素解析と辞書設定を紹介しました。

neologd辞書を使う方が、デフォルトの辞書よりは良い感じでした。

この経験を踏まえて、テキストマイニングの手法を使っていきたいと考えています。

テキスト処理の関連記事

skume.net

参考資料

*1:https://self-development.info/pip%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%A7ipadic%EF%BC%88ipa%E8%BE%9E%E6%9B%B8%EF%BC%89%E3%82%92%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB%E3%81%99%E3%82%8B/

*2:https://github.com/neologd/mecab-ipadic-neologd

*3:https://clrd.ninjal.ac.jp/unidic/

*4:https://nlp.ist.i.kyoto-u.ac.jp/?JUMAN

2022-07-16

R環境で小説のテキストマイニングをやってみたら、○○○な結末になった件【その1: 夏目漱石の小説「坊っちゃん」を使った、テキストの前処理編】

R言語テキストマイニング Webスクレイピング R - テキストマイニング R - テキストの前処理 R - Webスクレイピング R - rvest R - RMeCab R - stringr 夏目漱石小説 - 坊っちゃん R - write

はじめに
- 連載シリーズの目次
- 実行環境
1. 青空文庫、対象小説の紹介
- 夏目漱石が執筆した小説の無料公開リスト
2. 「坊っちゃん」のテキストの前処理
- 処理したテキストの保存
まとめ
- テキスト処理の関連記事
- 参考資料

はじめに

テキストマイニングは、簡単に言うと、テキストデータを定量的に扱って、有益な情報を抽出するデータマイニング手法の1つです。このようなテキストの情報解析では、自然言語処理、形態素解析、キーワード抽出、共起分析、ネットワーク可視化、機械学習、感情分析など、様々な分析手法が用いられます。近年では各種のAPIやWebスクレイピングなどを利用して、ウェブやSNS、地図上のクチコミからテキストデータを取得して、テキストマイニングに活用されています。『R環境で小説のテキストマイニング』の連載シリーズでは、テキストマイニングの活用事例として、小説テキストを対するテキストマイニングの基本的な手法を概説しています。各分析手法は、フリーソフトであるR言語やPythonのパッケージを主に用いて実施しています。

そもそも、自分が小説とかを通読するのが苦手ということもあって、テキストマイニングの観点からせめて、 「マイニング結果から小説を読んだ気になれる」とか、「全体のあらずじとか、盛り上がっている章とかがまぁまぁ分かる」といった辺りのことができるかどうかを以前から調査したかった経緯があります。そのため、テキストマイニング、形態素解析、ネットワーク可視化や感情分析による文脈解釈とかを扱う連載記事を書くことにしました。

他方、そもそもの小説を読む醍醐味とかは完全に無くなりますけどね。。。

テキストマイニングの対象小説は、夏目漱石の「坊っちゃん」にしました。理由は後ほど説明します。それでは、テキストマイニングの世界に行ってらっしゃい。

連載シリーズの目次

【その1: 夏目漱石の小説「坊っちゃん」を使った、テキストの前処理編】に関する内容をやっていきます。

実行環境

#ターミナル上のR環境
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20.6.0 (64-bit)

#RStudioターミナル上のR環境
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.0 (64-bit)

今回、M1 MacだとRStudioでRMeCabパッケージがうまく動作せず、やや面倒ですが、形態素解析の部分はターミナルからRを起動して実行しています。

1. 青空文庫、対象小説の紹介

夏目漱石が執筆した小説の無料公開リスト

実は、夏目漱石の小説本文は、ウェブ上で無料公開されています。

著作権切れ作品を集めた、青空文庫というサイトにアップされています。

www.aozora.gr.jp

例えば、「坊っちゃん」とかもhtml形式のテキストで公開されていて読むことができます。

www.aozora.gr.jp

「坊っちゃん」のテキスト部分をWebスクレイピングで取得すると、 R環境上でテキストを利用できます。

次のセクションでは、Webスクレイピングによるテキストの取得から、テキスト情報の前処理を扱います。

2. 「坊っちゃん」のテキストの前処理

上画像の「坊っちゃん」のテキスト部分をWebスクレイピングで取得します。ご丁寧に漢字にルビがふってあるので、それらを前処理で除去していきます。

また、テキスト部分のXpathは「/html/body/div[3]"」になります。

それでは、R環境を立ち上げて以下を実行して、テキストをざっくり取得します。

#パッケージのインストール
install.packages(c("rvest", "magrittr", "stringr"))
#ロード
library(rvest); library(magrittr); library(stringr)

##テキストのWebスクレイピング
#「坊っちゃん」のURL
Bochan_url <- "https://www.aozora.gr.jp/cards/000148/files/752_14964.html"

#ウェブサイトを確認するなら
#browseURL(Bochan_url)

#テキストの取得
Bochan_text <- Bochan_url %>%
  rvest::read_html() %>%
  rvest::html_nodes(xpath = "/html/body/div[3]") %>%
  rvest::html_text()

#取得結果の表示
str(Bochan_text)
#chr "一\r\n\r\n　親譲（おやゆず）りの無鉄砲（むてっぽう）で小供の時から損ばかりしている。小学#校に居る時分学校の二階"| __truncated__

テキストを取得すると、「\r\n\r\n」「\r\n」とかの改行記号があったり、ルビ部分があったりします。組み込み関数やstringrパッケージを使って、それらを除いていきます。

#テキストを章ごとに分割する
#\r\n\r\n
#\\p{Han}は「漢字」指定、. は「任意文字」指定
Bochan_text.sp <- stringr::str_split(Bochan_text, 
                                     pattern = "\\p{Han}\r\n\r\n.|\\p{Han}\\p{Han}\r\n\r\n.")[[1]][-1]

#改行「\r\n」を消す
Bochan_text.sp.rn <- gsub(pattern="\r\n　|\r\n", 
                          replacement="",
                          Bochan_text.sp)  

#ルビ部分を削除する
Rub <- c(".", "..", "...", "....", ".....", "......", ".......", "........", ".........", "..........", "................", ".................")

#ルビの任意文字数
nchar(Rub)
#[1]  1  2  3  4  5  6  7  8  9 10 16 17

#実行
for(n in Rub){
Bochan_text.sp.rn <- gsub(pattern=paste0("[（]", n, "[）]"), 
                        replacement="",
                        Bochan_text.sp.rn)  
}

#文字数
nchar(Bochan_text.sp.rn)
#[1]  7499  5983  5821  7583  7495 10412 10071  7133  8409  8058  9806

#先頭20文字のテキストの表示
stringr::str_sub(Bochan_text.sp.rn[1], start = 1, end = 20) 
#[1] "親譲りの無鉄砲で小供の時から損ばかりして"
stringr::str_sub(Bochan_text.sp.rn[2], start = 1, end = 20) 
#[1] "ぶうと云って汽船がとまると、艀が岸を離れ"
stringr::str_sub(Bochan_text.sp.rn[3], start = 1, end = 20) 
#[1] "いよいよ学校へ出た。初めて教場へはいって"

こんな感じで、章ごとにテキストを分けた、11要素のベクトル空間ができます。

次に、テキストファイルとして書き出して、GitHubにアップします。

#テキストファイルへの書き出し
#全文
write(Bochan_text.sp.rn, file = "Bochan_text.all.txt", append = F)
#saveRDS(Bochan_text.sp.rn, file = "Bochan_text.all.Rds")

#第1章
write(Bochan_text.sp.rn[1], file = "Bochan_text.1st.txt", append = F)
#saveRDS(Bochan_text.sp.rn[1], file = "Bochan_text.1st.Rds")

#第2章
write(Bochan_text.sp.rn[2], file = "Bochan_text.2nd.txt", append = F)
#saveRDS(Bochan_text.sp.rn[2], file = "Bochan_text.2nd.Rds")

坊ちゃんの第1章のテキスト

坊ちゃんの第2章のテキスト

処理したテキストの保存

次に、先ほど処理したテキストをGitHubからダウンロードする方法をのせています。

##GitHubにアップロードしたファイルたち
#テキスト全文
file_path_all <- "https://raw.githubusercontent.com/kumeS/Blog/master/R_text_analysis/R_01/Bochan_text.all.txt"
browseURL(file_path_all)
Bochan_text.all <- readLines(con = file(file_path_all))

#テキスト全文: Rds version
Rds_path_all <- "https://github.com/kumeS/Blog/raw/master/R_text_analysis/R_01/Bochan_text.all.Rds"
download.file(url=Rds_path_all, destfile = basename(Rds_path_all))
Bochan_text.all <- readRDS(basename(Rds_path_all))


##第1章
file_path_1st <- "https://raw.githubusercontent.com/kumeS/Blog/master/R_text_analysis/R_01/Bochan_text.1st.txt"
browseURL(file_path_1st)
Bochan_text.1st <- readLines(con = file(file_path_1st))

#第1章: Rds version
Rds_path_1st <- "https://github.com/kumeS/Blog/raw/master/R_text_analysis/R_01/Bochan_text.1st.Rds"
download.file(url=Rds_path_1st, destfile = basename(Rds_path_1st))
Bochan_text.all <- readRDS(basename(Rds_path_1st))


##第2章
file_path_2nd <- "https://raw.githubusercontent.com/kumeS/Blog/master/R_text_analysis/R_01/Bochan_text.2nd.txt"
browseURL(file_path_2nd)
Bochan_text.2nd <- readLines(con = file(file_path_2nd))

#第2章: Rds version
Rds_path_2nd <- "https://github.com/kumeS/Blog/raw/master/R_text_analysis/R_01/Bochan_text.2nd.Rds"
download.file(url=Rds_path_2nd, destfile = basename(Rds_path_2nd))
Bochan_text.all <- readRDS(basename(Rds_path_2nd))

まとめ

Html形式のテキストを取得して、ざっくりとテキストの前処理までをやってみました。

次回からは、形態素解析も取り入れて、テキスト解析を行います。

テキスト処理の関連記事

skume.net

参考資料

skume.net

2022-07-13

M1 Mac環境で、flash2を使ってペアエンドfastqをマージしてみた件

M1 Mac Mac flash2 ペアエンドリード FASTQ形式 FASTQファイルシーケンスデータシークエンスデータの取得 NGSデータ Terminalコマンド - wget Terminalコマンド - echo Terminalコマンド - gzip zsh - バイオインフォ zsh - make Mac - 環境構築 NGS - リードマージ Terminalコマンド Bioinfoツール設定 Bioinformatics

はじめに
- FLASHの原著論文
- 今回のzsh実行環境
M1 Macにおけるflash2のインストール
flash2によるペアエンドfastqのマージ実行
GAGEからペアエンドのfastqデータを取得する
まとめ
- ゲノム解析の関連記事
- 参考資料
補足

はじめに

この記事では、flash2を使った、ペアエンドリード(paired-end read)のマージ(read merge)、オーバーラップ（read overlap）とその他諸設定について紹介します。 flash2のもともとのアルゴリズムは、「FLASH (Fast Length Adjustment of SHort reads) 」と言われるものです。名前からも推測できる通り、flash2は、FLASHの改良版となります。

github.com

FLASH は、リードの長さが2倍より短いDNA断片から生成されたペアエンドリードを、正確かつ高速に、ペアエンドリードをマージする（リード同士を結合する）ためのツールの1つです。マージされたリードペアは長いリードとして、リード数のカウントをしたり、ゲノムアセンブリやゲノム解析のプロセスで活用されたりします。

このFLASHアルゴリズムでは、リードペア間の最小長以上のオーバーラップをすべて考慮し、ミスマッチ密度（オーバーラップ領域内のミスマッチ塩基の割合）が最も低くなるオーバーラップを選択しています。また、ミスマッチ部位の品質スコアも考慮しています。マージされた配列を構築する際に、FLASHはオーバーラップした領域のコンセンサス配列を計算しています。

一方で、flash2の実行には、いくつかの制限があります。当然ながら、オーバーラップしていないペアエンドリードのマージには使えません。また、ペアエンドのfastqデータに適していて、サンガーシーケンスのデータには基本的に適していません。

FLASHの原著論文

Title: FLASH: fast length adjustment of short reads to improve genome assemblies
Authors: Tanja Magoč and Steven L. Salzberg

詳しくは、原著論文を確認してみてください。

bioinformatics.oxfordjournals.org

今回のzsh実行環境

macOS Big Sur (バージョン11.5.2)
MacBook Air (M1, 2020)
チップ Apple M1
メモリ 16GB

M1 Macにおけるflash2のインストール

flash2は、GitHubからクローンして、ソースからコンパイルします。コンパイルといっても、「ダウンロードとmake実行」でインストールが完了します。

おそらく最近のMacでも、ダウンロードには事前にgitコマンドを設定する必要があります。

gitコマンドの設定方法や使い方は、過去記事を参考のこと。

skume.net

ターミナル(zsh)を起動して、以下のコマンドを実行していきます。

#ダウンロード
git clone https://github.com/dstreett/FLASH2.git

#ディレクトリ移動
cd FLASH2

#コンパイル
make

#パス設定の前段階: 手っ取り早く"/opt/homebrew/bin"に置いておく
cd ../
mv FLASH2 /opt/homebrew/bin

#HOMEに移動
#cd ~/

#パスを.zshrcを書き込む
echo 'export PATH="/opt/homebrew/bin/FLASH2:$PATH"' >> .zshrc

#いったんソースする
source .zshrc
 
#パス設定
which flash2
#/opt/homebrew/bin/FLASH2/flash2

#ヘルプで動作確認: 出力表示は補足に記載
flash2 -h

flash2によるペアエンドfastqのマージ実行

flash2の実行には、ペアエンドのfastqファイル（2つのfastq）が入力ファイルとして必要です。

GAGE (Genome Assembly Gold-standard Evaluation) というサイトがあって、 fastqのテストデータを色々と用意してくれているので、今回、それをテスト用のデータとして使うことにします。

gage.cbcb.umd.edu

詳しくは、GAGEの原著論文もあります。

genome.cshlp.org

GAGEからペアエンドのfastqデータを取得する

今回、GAGEから1. Staphylococcus aureus (黄色ブドウ球菌)のシーケンスデータをダウンロードします。

簡単なサンプル情報は下記に示します。

# テストデータ
# Staphylococcus aureus: Data download page
# Library 1: Fragment
# Avg Read length: 101bp
# Insert length: 180bp
# # of reads: 1,294,104

まずは、Read1とRead2のFastqファイルをそれぞれwgetでダウンロードして、gzipで解凍します。

#ダウンロード
wget http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_1.fastq.gz http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_2.fastq.gz

#解凍
gzip -d -k frag_1.fastq.gz
gzip -d -k frag_2.fastq.gz

#ファイルの詳細表示
ls -lh
#-rw-r--r--  1 sas  staff   140M  4 15  2011 frag_1.fastq
#-rw-r--r--  1 sas  staff    61M  4 15  2011 frag_1.fastq.gz
#-rw-r--r--  1 sas  staff   140M  4 15  2011 frag_2.fastq
#-rw-r--r--  1 sas  staff    65M  4 15  2011 frag_2.fastq.gz

データは、fastq.gz形式で約60Mバイト、生データで約150MBバイトという、低スペックPCにも優しい低容量です。

次に、実際に、flash2コマンドを使って、リードのマージを行います。入力は、2つのfastqファイルを使います。 -tオプションでは、worker threadsの数を指定します。

#実行
flash2 frag_1.fastq frag_2.fastq -t 2

#[FLASH] Starting FLASH v2.2.00
#[FLASH] Fast Length Adjustment of SHort reads
#[FLASH]  
#[FLASH] Input files:
#[FLASH]     frag_1.fastq
#[FLASH]     frag_2.fastq
#[FLASH]  
#[FLASH] Output files:
#[FLASH]     ./out.extendedFrags.fastq
#[FLASH]     ./out.notCombined_1.fastq
#[FLASH]     ./out.notCombined_2.fastq
#[FLASH]     ./out.hist
#[FLASH]     ./out.histogram
#[FLASH]  
#[FLASH] Parameters:
#[FLASH]     Min overlap:           10
#[FLASH]     Min overlap outie:     35
#[FLASH]     Max overlap:           65
#[FLASH]     Max mismatch density:  0.250000
#[FLASH]     Allow "outie" pairs:   true
#[FLASH]     Cap mismatch quals:    false
#[FLASH]     Combiner threads:      2
#[FLASH]     Input format:          FASTQ, phred_offset=33
#[FLASH]     Output format:         FASTQ, phred_offset=33
#[FLASH]  
#[FLASH] Starting reader and writer threads
#[FLASH] Starting 2 combiner threads
#[FLASH] Processed 25000 read pairs
#[FLASH] Processed 50000 read pairs
#[FLASH] Processed 75000 read pairs
#[FLASH] Processed 100000 read pairs
#[FLASH] Processed 125000 read pairs
#[FLASH] Processed 150000 read pairs
#[FLASH] Processed 175000 read pairs
#[FLASH] Processed 200000 read pairs
#[FLASH] Processed 225000 read pairs
#[FLASH] Processed 250000 read pairs
#[FLASH] Processed 275000 read pairs
#[FLASH] Processed 300000 read pairs
#[FLASH] Processed 325000 read pairs
#[FLASH] Processed 350000 read pairs
#[FLASH] Processed 375000 read pairs
#[FLASH] Processed 400000 read pairs
#[FLASH] Processed 425000 read pairs
#[FLASH] Processed 450000 read pairs
#[FLASH] Processed 475000 read pairs
#[FLASH] Processed 500000 read pairs
#[FLASH] Processed 525000 read pairs
#[FLASH] Processed 550000 read pairs
#[FLASH] Processed 575000 read pairs
#[FLASH] Processed 600000 read pairs
#[FLASH] Processed 625000 read pairs
#[FLASH] Processed 647052 read pairs
#[FLASH]  
#[FLASH] Read combination statistics:
#[FLASH]     Total pairs:       647052
#[FLASH]     Discarded pairs:   262
#[FLASH]     Percent Discarded: 0.04%
#[FLASH]     Combined pairs:    370303
#[FLASH]         Innie pairs:   369148 (99.69% of combined)
#[FLASH]         Outie pairs:   1155 (0.31% of combined)
#[FLASH]     Uncombined pairs:  276487
#[FLASH]     Percent combined:  57.25%
#[FLASH]  
#[FLASH] Writing histogram files.
#[FLASH]  
#[FLASH] FLASH v2.2.00 complete!
#[FLASH] 11.136 seconds elapsed

解析が終って、出力結果(out.histogram)を出力表示すれば、テキストのヒストグラムとして結果を確認できます。アスタリスクのヒストグラムは、なんとも乙な感じです。

#結果確認
#ヒストグラム確認
tail -n 90 out.histogram

#103   
#104   
#105   
#106   
#107   
#108   
#109   
#110   
#111   
#112   
#113   
#114   
#115   *
#116   *
#117   *
#118   *
#119   *
#120   *
#121   **
#122   **
#123   ***
#124   ***
#125   ****
#126   *****
#127   ******
#128   ********
#129   ********
#130   ***********
#131   ************
#132   **************
#133   *****************
#134   ********************
#135   **********************
#136   *************************
#137   *****************************
#138   *******************************
#139   ************************************
#140   *************************************
#141   ****************************************
#142   *********************************************
#143   ************************************************
#144   **************************************************
#145   ******************************************************
#146   ********************************************************
#147   ***********************************************************
#148   *************************************************************
#149   ***************************************************************
#150   ****************************************************************
#151   *****************************************************************
#152   *******************************************************************
#153   ********************************************************************
#154   **********************************************************************
#155   ********************************************************************
#156   ***********************************************************************
#157   **********************************************************************
#158   ***********************************************************************
#159   *********************************************************************
#160   ***********************************************************************
#161   **********************************************************************
#162   ************************************************************************
#163   ***********************************************************************
#164   *********************************************************************
#165   **********************************************************************
#166   ***********************************************************************
#167   *******************************************************************
#168   *******************************************************************
#169   ********************************************************************
#170   ******************************************************************
#171   *****************************************************************
#172   *****************************************************************
#173   ****************************************************************
#174   ****************************************************************
#175   *************************************************************
#176   *************************************************************
#177   **************************************************************
#178   **************************************************************
#179   **********************************************************
#180   **********************************************************
#181   **********************************************************
#182   *********************************************************
#183   ******************************************************
#184   ******************************************************
#185   *******************************************************
#186   ******************************************************
#187   **************************************************
#188   **************************************************
#189   ****************************************************
#190   ******************************************************
#191   **********************************************
#192   ***********************************************

また、マージされたリードが出力されたFastq(out.extendedFrags.fastq)やヒストグラムの数値のテキスト出力ファイルであるout.histもあります。

マージされなかったリードは、 out.notCombined_1.fastq、out.notCombined_2.fastqとしてアウトプットされます。

#Extended FASTQ 確認
head out.extendedFrags.fastq
#@SRR022868.1483
#TCATACATATTAATATAGTCAGAACTAGTAATATAATTTTGGGCATTTCTATATAAATATCTATTCCATGACAGAAATACACATTGCGCTGGTCTTCCCATTTCTTTAAATAAATTTAAACGATTAATAATTGCTTTCTCT
#+
#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIBECEIF8E;DEAE7<I1ICD2ICB%8I1354I25;A9E=>52?AI6;795:%6BIACII;II@<IIIIIA<0AFI3FIIIIIIIIIIIIIIIIIIIIII
#@SRR022868.1630
#AATATGAAGCAATTTTTTCAAAAACGCTATTAGTCACAAAATGTAAACATAATTTTAACAAATGTGTGAACCCACGTCGCACCTTGATTTATCAACATTTTTGATACAATTTTATTAATTTTTTTCCCATAAGCCCT
#+
#IIIIIIIIIIIIFIIIII>IIIBFIIIIII1II/5I:I5@6EI24%81+<,&+A4436(3,3,+*A+00(*#0/62,/&;&2.10-/3202+F+-8;+8,>31;?-:5%584+$)G2I6I438C9IIII.I,6)IIB
#@SRR022868.1854
#TTCTGTGTATGAAAATATTTTTCATAAAAATTGTTTGAATCATGTAACAATCATATAAATTGCTGTTTATATTGTTGTGAAAATGAGTTGACAAAAGTCGGTATAGATATGTAGACATCCTATTTTTAGCGAG

#ヒストグラムのヘッド表示
head out.hist
#35    15
#36    12
#37    4
#38    11
#39    9
#40    7
#41    12
#42    10
#43    8
#44    11

まとめ

flash2を使って、ペアエンド・リードのマージをやってみました。ソースからコンパイルすると、設定が少なくて助かります。あと、M1 Macでもちゃんとコンパイルが通るもんなんですね。感心。

便利なバイオインフォのツールがたくさんありそうので、引き続き、ツール発掘を続けてみます。

ゲノム解析の関連記事

skume.net

参考資料

kazumaxneo.hatenablog.com

https://ccb.jhu.edu/software/FLASH/ccb.jhu.edu

補足

FASTQ形式からFASTA形式への変換 - sedコマンドを使う方法

sedコマンドを使って、Fastqファイルから必要な行を選択的に出力することができます。 4行ごとに1行目と2行目を出力すれば、配列のヘッダーとFASTAフォーマットに必要な配列が得られます。

下記のような実行例になります。

sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta

bioinformaticsworkbook.org

FASTQ形式からFASTA形式への変換 - fastq_to_fastaを使う方法

fastq_to_fastaは、FASTX-toolkitで利用可能です。巨大なデータセットでもうまく対応できます。下記のような実行例になります。

fastq_to_fasta -n -i INFILE.fastq -o OUTFILE.fasta

hannonlab.cshl.edu

Anacondaの設定

結局使わなかったけども、、Anacondaの設定もメモしておきます

まずは、brewコマンドで、Anacondaをインストールします。

以下のbrew install --caskコマンドで入ります。

#インストール
brew install --cask anaconda
#...
#==> Downloading https://repo.anaconda.com/archive/Anaconda3-2022.05-MacOSX-arm64
#==> Installing Cask anaconda
#==> Running installer script 'Anaconda3-2022.05-MacOSX-arm64.sh'

#パス設定
echo 'export PATH="/opt/homebrew/anaconda3/bin:$PATH"' >> .zshrc
#tail -n 1 .zshrc
source .zshrc

#パス確認
echo $PATH

#condaパス確認
which conda
#/opt/homebrew/anaconda3/bin/conda

FLASH-1.2.11のインストール方法 from sourceforge.net

これは、FLASHのオリジナル版のインストール方法です。

ターミナルを起動して、以下のコマンドを実行していきます。

#ダウンロード
wget https://sourceforge.net/projects/flashpage/files/FLASH-1.2.11.tar.gz

#解凍
tar xzf FLASH-1.2.11.tar.gz

#ディレクトリ移動
cd FLASH-1.2.11

#コンパイル
make

#パス設定の前段階: 手っ取り早く"/opt/homebrew/bin"に置いておく
cd ../
mv FLASH-1.2.11 /opt/homebrew/bin

#HOMEに移動
cd ~/

#パスを.zshrcを書き込む
echo 'export PATH="/opt/homebrew/bin/FLASH-1.2.11:$PATH"' >> .zshrc

#いったんソースする
source .zshrc
 
#パス設定
which flash

#ヘルプで動作確認: 出力表示は補足に記載
flash -h

flash -h の出力

Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ
       flash [OPTIONS] --interleaved-input (MATES.FASTQ | -)
       flash [OPTIONS] --tab-delimited-input (MATES.TAB | -)

----------------------------------------------------------------------------
                                 DESCRIPTION                                
----------------------------------------------------------------------------

FLASH (Fast Length Adjustment of SHort reads) is an accurate and fast tool
to merge paired-end reads that were generated from DNA fragments whose
lengths are shorter than twice the length of reads.  Merged read pairs result
in unpaired longer reads, which are generally more desired in genome
assembly and genome analysis processes.

Briefly, the FLASH algorithm considers all possible overlaps at or above a
minimum length between the reads in a pair and chooses the overlap that
results in the lowest mismatch density (proportion of mismatched bases in
the overlapped region).  Ties between multiple overlaps are broken by
considering quality scores at mismatch sites.  When building the merged
sequence, FLASH computes a consensus sequence in the overlapped region.
More details can be found in the original publication
(http://bioinformatics.oxfordjournals.org/content/27/21/2957.full).

Limitations of FLASH include:
   - FLASH cannot merge paired-end reads that do not overlap.
   - FLASH is not designed for data that has a significant amount of indel
     errors (such as Sanger sequencing data).  It is best suited for Illumina
     data.

----------------------------------------------------------------------------
                               MANDATORY INPUT
----------------------------------------------------------------------------

The most common input to FLASH is two FASTQ files containing read 1 and read 2
of each mate pair, respectively, in the same order.

Alternatively, you may provide one FASTQ file, which may be standard input,
containing paired-end reads in either interleaved FASTQ (see the
--interleaved-input option) or tab-delimited (see the --tab-delimited-input
option) format.  In all cases, gzip compressed input is autodetected.  Also,
in all cases, the PHRED offset is, by default, assumed to be 33; use the
--phred-offset option to change it.

----------------------------------------------------------------------------
                                   OUTPUT
----------------------------------------------------------------------------

The default output of FLASH consists of the following files:

   - out.extendedFrags.fastq      The merged reads.
   - out.notCombined_1.fastq      Read 1 of mate pairs that were not merged.
   - out.notCombined_2.fastq      Read 2 of mate pairs that were not merged.
   - out.hist                     Numeric histogram of merged read lengths.
   - out.histogram                Visual histogram of merged read lengths.

FLASH also logs informational messages to standard output.  These can also be
redirected to a file, as in the following example:

  $ flash reads_1.fq reads_2.fq 2>&1 | tee flash.log

In addition, FLASH supports several features affecting the output:

   - Writing the merged reads directly to standard output (--to-stdout)
   - Writing gzip compressed output files (-z) or using an external
     compression program (--compress-prog)
   - Writing the uncombined read pairs in interleaved FASTQ format
     (--interleaved-output)
   - Writing all output reads to a single file in tab-delimited format
     (--tab-delimited-output)

----------------------------------------------------------------------------
                                   OPTIONS
----------------------------------------------------------------------------

  -m, --min-overlap=NUM   The minimum required overlap length between two
                          reads to provide a confident overlap.  Default:
                          10bp.

  -M, --max-overlap=NUM   Maximum overlap length expected in approximately
                          90% of read pairs.  It is by default set to 65bp,
                          which works well for 100bp reads generated from a
                          180bp library, assuming a normal distribution of
                          fragment lengths.  Overlaps longer than the maximum
                          overlap parameter are still considered as good
                          overlaps, but the mismatch density (explained below)
                          is calculated over the first max_overlap bases in
                          the overlapped region rather than the entire
                          overlap.  Default: 65bp, or calculated from the
                          specified read length, fragment length, and fragment
                          length standard deviation.

  -x, --max-mismatch-density=NUM
                          Maximum allowed ratio between the number of
                          mismatched base pairs and the overlap length.
                          Two reads will not be combined with a given overlap
                          if that overlap results in a mismatched base density
                          higher than this value.  Note: Any occurence of an
                          'N' in either read is ignored and not counted
                          towards the mismatches or overlap length.  Our
                          experimental results suggest that higher values of
                          the maximum mismatch density yield larger
                          numbers of correctly merged read pairs but at
                          the expense of higher numbers of incorrectly
                          merged read pairs.  Default: 0.25.

  -O, --allow-outies      Also try combining read pairs in the "outie"
                          orientation, e.g.

                               Read 1: <-----------
                               Read 2:       ------------>

                          as opposed to only the "innie" orientation, e.g.

                               Read 1:       <------------
                               Read 2: ----------->

                          FLASH uses the same parameters when trying each
                          orientation.  If a read pair can be combined in
                          both "innie" and "outie" orientations, the
                          better-fitting one will be chosen using the same
                          scoring algorithm that FLASH normally uses.

                          This option also causes extra .innie and .outie
                          histogram files to be produced.

  -p, --phred-offset=OFFSET
                          The smallest ASCII value of the characters used to
                          represent quality values of bases in FASTQ files.
                          It should be set to either 33, which corresponds
                          to the later Illumina platforms and Sanger
                          platforms, or 64, which corresponds to the
                          earlier Illumina platforms.  Default: 33.

  -r, --read-len=LEN
  -f, --fragment-len=LEN
  -s, --fragment-len-stddev=LEN
                          Average read length, fragment length, and fragment
                          standard deviation.  These are convenience parameters
                          only, as they are only used for calculating the
                          maximum overlap (--max-overlap) parameter.
                          The maximum overlap is calculated as the overlap of
                          average-length reads from an average-size fragment
                          plus 2.5 times the fragment length standard
                          deviation.  The default values are -r 100, -f 180,
                          and -s 18, so this works out to a maximum overlap of
                          65 bp.  If --max-overlap is specified, then the
                          specified value overrides the calculated value.

                          If you do not know the standard deviation of the
                          fragment library, you can probably assume that the
                          standard deviation is 10% of the average fragment
                          length.

  --cap-mismatch-quals    Cap quality scores assigned at mismatch locations
                          to 2.  This was the default behavior in FLASH v1.2.7
                          and earlier.  Later versions will instead calculate
                          such scores as max(|q1 - q2|, 2); that is, the
                          absolute value of the difference in quality scores,
                          but at least 2.  Essentially, the new behavior
                          prevents a low quality base call that is likely a
                          sequencing error from significantly bringing down
                          the quality of a high quality, likely correct base
                          call.

  --interleaved-input     Instead of requiring files MATES_1.FASTQ and
                          MATES_2.FASTQ, allow a single file MATES.FASTQ that
                          has the paired-end reads interleaved.  Specify "-"
                          to read from standard input.

  --interleaved-output    Write the uncombined pairs in interleaved FASTQ
                          format.

  -I, --interleaved       Equivalent to specifying both --interleaved-input
                          and --interleaved-output.

  -Ti, --tab-delimited-input
                          Assume the input is in tab-delimited format
                          rather than FASTQ, in the format described below in
                          '--tab-delimited-output'.  In this mode you should
                          provide a single input file, each line of which must
                          contain either a read pair (5 fields) or a single
                          read (3 fields).  FLASH will try to combine the read
                          pairs.  Single reads will be written to the output
                          file as-is if also using --tab-delimited-output;
                          otherwise they will be ignored.  Note that you may
                          specify "-" as the input file to read the
                          tab-delimited data from standard input.

  -To, --tab-delimited-output
                          Write output in tab-delimited format (not FASTQ).
                          Each line will contain either a combined pair in the
                          format 'tag <tab> seq <tab> qual' or an uncombined
                          pair in the format 'tag <tab> seq_1 <tab> qual_1
                          <tab> seq_2 <tab> qual_2'.

  -o, --output-prefix=PREFIX
                          Prefix of output files.  Default: "out".

  -d, --output-directory=DIR
                          Path to directory for output files.  Default:
                          current working directory.

  -c, --to-stdout         Write the combined reads to standard output.  In
                          this mode, with FASTQ output (the default) the
                          uncombined reads are discarded.  With tab-delimited
                          output, uncombined reads are included in the
                          tab-delimited data written to standard output.
                          In both cases, histogram files are not written,
                          and informational messages are sent to standard
                          error rather than to standard output.

  -z, --compress          Compress the output files directly with zlib,
                          using the gzip container format.  Similar to
                          specifying --compress-prog=gzip and --suffix=gz,
                          but may be slightly faster.

  --compress-prog=PROG    Pipe the output through the compression program
                          PROG, which will be called as `PROG -c -',
                          plus any arguments specified by --compress-prog-args.
                          PROG must read uncompressed data from standard input
                          and write compressed data to standard output when
                          invoked as noted above.
                          Examples: gzip, bzip2, xz, pigz.

  --compress-prog-args=ARGS
                          A string of additional arguments that will be passed
                          to the compression program if one is specified with
                          --compress-prog=PROG.  (The arguments '-c -' are
                          still passed in addition to explicitly specified
                          arguments.)

  --suffix=SUFFIX, --output-suffix=SUFFIX
                          Use SUFFIX as the suffix of the output files
                          after ".fastq".  A dot before the suffix is assumed,
                          unless an empty suffix is provided.  Default:
                          nothing; or 'gz' if -z is specified; or PROG if
                          --compress-prog=PROG is specified.

  -t, --threads=NTHREADS  Set the number of worker threads.  This is in
                          addition to the I/O threads.  Default: number of
                          processors.  Note: if you need FLASH's output to
                          appear deterministically or in the same order as
                          the original reads, you must specify -t 1
                          (--threads=1).

  -q, --quiet             Do not print informational messages.

  -h, --help              Display this help and exit.

  -v, --version           Display version.

Run `flash --help | less' to prevent this text from scrolling by.

flash2 --h の出力

Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ
       flash [OPTIONS] --interleaved-input (MATES.FASTQ | -)
       flash [OPTIONS] --tab-delimited-input (MATES.TAB | -)

----------------------------------------------------------------------------
                                 DESCRIPTION                                
----------------------------------------------------------------------------

FLASH (Fast Length Adjustment of SHort reads) is an accurate and fast tool
to merge paired-end reads that were generated from DNA fragments whose
lengths are shorter than twice the length of reads.  Merged read pairs result
in unpaired longer reads, which are generally more desired in genome
assembly and genome analysis processes.

Briefly, the FLASH algorithm considers all possible overlaps at or above a
minimum length between the reads in a pair and chooses the overlap that
results in the lowest mismatch density (proportion of mismatched bases in
the overlapped region).  Ties between multiple overlaps are broken by
considering quality scores at mismatch sites.  When building the merged
sequence, FLASH computes a consensus sequence in the overlapped region.
More details can be found in the original publication
(http://bioinformatics.oxfordjournals.org/content/27/21/2957.full).

Limitations of FLASH include:
   - FLASH cannot merge paired-end reads that do not overlap.
   - FLASH is not designed for data that has a significant amount of indel
     errors (such as Sanger sequencing data).  It is best suited for Illumina
     data.

----------------------------------------------------------------------------
                               MANDATORY INPUT
----------------------------------------------------------------------------

The most common input to FLASH is two FASTQ files containing read 1 and read 2
of each mate pair, respectively, in the same order.

Alternatively, you may provide one FASTQ file, which may be standard input,
containing paired-end reads in either interleaved FASTQ (see the
--interleaved-input option) or tab-delimited (see the --tab-delimited-input
option) format.  In all cases, gzip compressed input is autodetected.  Also,
in all cases, the PHRED offset is, by default, assumed to be 33; use the
--phred-offset option to change it.

----------------------------------------------------------------------------
                                   OUTPUT
----------------------------------------------------------------------------

The default output of FLASH consists of the following files:

   - out.extendedFrags.fastq      The merged reads.
   - out.notCombined_1.fastq      Read 1 of mate pairs that were not merged.
   - out.notCombined_2.fastq      Read 2 of mate pairs that were not merged.
   - out.hist                     Numeric histogram of merged read lengths.
   - out.histogram                Visual histogram of merged read lengths.

FLASH also logs informational messages to standard output.  These can also be
redirected to a file, as in the following example:

  $ flash reads_1.fq reads_2.fq 2>&1 | tee flash.log

In addition, FLASH supports several features affecting the output:

   - Writing the merged reads directly to standard output (--to-stdout)
   - Writing gzip compressed output files (-z) or using an external
     compression program (--compress-prog)
   - Writing the uncombined read pairs in interleaved FASTQ format
     (--interleaved-output)
   - Writing all output reads to a single file in tab-delimited format
     (--tab-delimited-output)

----------------------------------------------------------------------------
                                   OPTIONS
----------------------------------------------------------------------------

  -Q, --quality-cutoff=NUM  The cut off number for the quality score
                          corresponding wtih the percent cutoff.  Default:
                          2.
  -C, --percent-cutoff=NUM   The cutoff percentage for each read that will
                          be discarded if it falls below -Q option. (0-100)  Default:
                          50.
  -D, --no-discard         This turns off the discard logic Default: false

  -m, --min-overlap=NUM   The minimum required overlap length between two
                          reads to provide a confident overlap. Default 10bp.

  -M, --max-overlap=NUM   Maximum overlap length expected in approximately
                          90% of read pairs.  It is by default set to 65bp,
                          which works well for 100bp reads generated from a
                          180bp library, assuming a normal distribution of
                          fragment lengths.  Overlaps longer than the maximum
                          overlap parameter are still considered as good
                          overlaps, but the mismatch density (explained below)
                          is calculated over the first max_overlap bases in
                          the overlapped region rather than the entire
                          overlap.  Default: 65bp, or calculated from the
                          specified read length, fragment length, and fragment
                          length standard deviation.

  -e, --min-overlap-outie=NUM   The minimum required overlap length between two
                          reads to provide a confident overlap in an outie scenario.
                          Default: 35bp.

  -x, --max-mismatch-density=NUM
                          Maximum allowed ratio between the number of
                          mismatched base pairs and the overlap length.
                          Two reads will not be combined with a given overlap
                          if that overlap results in a mismatched base density
                          higher than this value.  Note: Any occurence of an
                          'N' in either read is ignored and not counted
                          towards the mismatches or overlap length.  Our
                          experimental results suggest that higher values of
                          the maximum mismatch density yield larger
                          numbers of correctly merged read pairs but at
                          the expense of higher numbers of incorrectly
                          merged read pairs.  Default: 0.25.

  -O, --allow-outies      Also try combining read pairs in the "outie"
                          orientation, e.g.

                               Read 1: <-----------
                               Read 2:       ------------>

                          as opposed to only the "innie" orientation, e.g.

                               Read 1:       <------------
                               Read 2: ----------->

                          FLASH uses the same parameters when trying each
                          orientation.  If a read pair can be combined in
                          both "innie" and "outie" orientations, the
                          better-fitting one will be chosen using the same
                          scoring algorithm that FLASH normally uses.

                          This option also causes extra .innie and .outie
                          histogram files to be produced.

  -p, --phred-offset=OFFSET
                          The smallest ASCII value of the characters used to
                          represent quality values of bases in FASTQ files.
                          It should be set to either 33, which corresponds
                          to the later Illumina platforms and Sanger
                          platforms, or 64, which corresponds to the
                          earlier Illumina platforms.  Default: 33.

  -r, --read-len=LEN
  -f, --fragment-len=LEN
  -s, --fragment-len-stddev=LEN
                          Average read length, fragment length, and fragment
                          standard deviation.  These are convenience parameters
                          only, as they are only used for calculating the
                          maximum overlap (--max-overlap) parameter.
                          The maximum overlap is calculated as the overlap of
                          average-length reads from an average-size fragment
                          plus 2.5 times the fragment length standard
                          deviation.  The default values are -r 100, -f 180,
                          and -s 18, so this works out to a maximum overlap of
                          65 bp.  If --max-overlap is specified, then the
                          specified value overrides the calculated value.

                          If you do not know the standard deviation of the
                          fragment library, you can probably assume that the
                          standard deviation is 10% of the average fragment
                          length.

  --cap-mismatch-quals    Cap quality scores assigned at mismatch locations
                          to 2.  This was the default behavior in FLASH v1.2.7
                          and earlier.  Later versions will instead calculate
                          such scores as max(|q1 - q2|, 2); that is, the
                          absolute value of the difference in quality scores,
                          but at least 2.  Essentially, the new behavior
                          prevents a low quality base call that is likely a
                          sequencing error from significantly bringing down
                          the quality of a high quality, likely correct base
                          call.

  --interleaved-input     Instead of requiring files MATES_1.FASTQ and
                          MATES_2.FASTQ, allow a single file MATES.FASTQ that
                          has the paired-end reads interleaved.  Specify "-"
                          to read from standard input.

  --interleaved-output    Write the uncombined pairs in interleaved FASTQ
                          format.

  -I, --interleaved       Equivalent to specifying both --interleaved-input
                          and --interleaved-output.

  -Ti, --tab-delimited-input
                          Assume the input is in tab-delimited format
                          rather than FASTQ, in the format described below in
                          '--tab-delimited-output'.  In this mode you should
                          provide a single input file, each line of which must
                          contain either a read pair (5 fields) or a single
                          read (3 fields).  FLASH will try to combine the read
                          pairs.  Single reads will be written to the output
                          file as-is if also using --tab-delimited-output;
                          otherwise they will be ignored.  Note that you may
                          specify "-" as the input file to read the
                          tab-delimited data from standard input.

  -To, --tab-delimited-output
                          Write output in tab-delimited format (not FASTQ).
                          Each line will contain either a combined pair in the
                          format 'tag <tab> seq <tab> qual' or an uncombined
                          pair in the format 'tag <tab> seq_1 <tab> qual_1
                          <tab> seq_2 <tab> qual_2'.

  -o, --output-prefix=PREFIX
                          Prefix of output files.  Default: "out".

  -d, --output-directory=DIR
                          Path to directory for output files.  Default:
                          current working directory.

  -c, --to-stdout         Write the combined reads to standard output.  In
                          this mode, with FASTQ output (the default) the
                          uncombined reads are discarded.  With tab-delimited
                          output, uncombined reads are included in the
                          tab-delimited data written to standard output.
                          In both cases, histogram files are not written,
                          and informational messages are sent to standard
                          error rather than to standard output.

  -z, --compress          Compress the output files directly with zlib,
                          using the gzip container format.  Similar to
                          specifying --compress-prog=gzip and --suffix=gz,
                          but may be slightly faster.

  --compress-prog=PROG    Pipe the output through the compression program
                          PROG, which will be called as `PROG -c -',
                          plus any arguments specified by --compress-prog-args.
                          PROG must read uncompressed data from standard input
                          and write compressed data to standard output when
                          invoked as noted above.
                          Examples: gzip, bzip2, xz, pigz.

  --compress-prog-args=ARGS
                          A string of additional arguments that will be passed
                          to the compression program if one is specified with
                          --compress-prog=PROG.  (The arguments '-c -' are
                          still passed in addition to explicitly specified
                          arguments.)

  --suffix=SUFFIX, --output-suffix=SUFFIX
                          Use SUFFIX as the suffix of the output files
                          after ".fastq".  A dot before the suffix is assumed,
                          unless an empty suffix is provided.  Default:
                          nothing; or 'gz' if -z is specified; or PROG if
                          --compress-prog=PROG is specified.

  -t, --threads=NTHREADS  Set the number of worker threads.  This is in
                          addition to the I/O threads.  Default: number of
                          processors.  Note: if you need FLASH's output to
                          appear deterministically or in the same order as
                          the original reads, you must specify -t 1
                          (--threads=1).

  -q, --quiet             Do not print informational messages.

  -h, --help              Display this help and exit.

  -v, --version           Display version.

Run `flash2 --help | less' to prevent this text from scrolling by.

はじめに: 『R環境で小説のテキストマイニング』の連載シリーズです。

連載シリーズの目次

実行環境

形態素解析と辞書設定

mecabを使った形態素解析

MeCabの辞書設定

辞書による形態素解析結果の違い

デフォルトのipadic辞書を使った場合

neologd辞書を使った場合

unidic辞書を使った場合

jumandic辞書を使った場合

RMeCabで辞書設定を変更する方法

RMeCab形態素解析 + neologd辞書を用いた「坊ちゃん」の第1章テキストの形態素解析

RMeCab形態素解析 + neologd辞書を用いた「坊ちゃん」の第2章以降のテキストの形態素解析

まとめ

テキスト処理の関連記事

参考資料

はじめに

連載シリーズの目次

実行環境

1. 青空文庫、対象小説の紹介

夏目 漱石が執筆した小説の無料公開リスト

2. 「坊っちゃん」のテキストの前処理

処理したテキストの保存

まとめ

テキスト処理の関連記事

参考資料

はじめに

FLASHの原著論文

今回のzsh実行環境

M1 Macにおけるflash2のインストール

flash2によるペアエンドfastqのマージ実行

GAGEからペアエンドのfastqデータを取得する

まとめ

ゲノム解析の関連記事

参考資料

補足

FASTQ形式からFASTA形式への変換 - sedコマンドを使う方法

FASTQ形式からFASTA形式への変換 - fastq_to_fastaを使う方法

Anacondaの設定

FLASH-1.2.11のインストール方法 from sourceforge.net

flash -h の出力

flash2 --h の出力

夏目漱石が執筆した小説の無料公開リスト