RcppMeCab is an Rcpp wrapper for MeCab, a morphological analyzer and part-of-speech tagger. It supports native UTF-8 encoding in its C++ code and works with the CJK (Chinese, Japanese, and Korean) MeCab libraries. Because the analysis runs in compiled C++ code through Rcpp, text analysis from R is faster.
Please see this for easy installation and usage examples in Korean.
RcppMeCab builds MeCab from source at install time. The MeCab variant is selected by the MECAB_LANG environment variable:
| MECAB_LANG | Backend | Version | Source |
|---|---|---|---|
| ko (default) | mecab-ko-msvc | 0.999 | Pusnow/mecab-ko-msvc |
| ja | MeCab | 0.996 | taku910/mecab |
On Linux and macOS, if MeCab is already installed system-wide (detected via mecab-config), RcppMeCab uses the system installation regardless of MECAB_LANG.
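To see in advance which route an installation is likely to take, you can check for the helper yourself. This is a minimal sketch that assumes "detected via mecab-config" means the `mecab-config` executable is on your `PATH`; the package's actual detection logic may differ.

```r
# Assumption: the installer finds a system MeCab by locating mecab-config on PATH.
mecab_config <- Sys.which("mecab-config")       # "" when the helper is not found
has_system_mecab <- nzchar(mecab_config)[[1]]

if (has_system_mecab) {
  message("mecab-config found at ", mecab_config,
          "; a system MeCab would likely be used and MECAB_LANG ignored.")
} else {
  # "ko" is the documented default when MECAB_LANG is unset.
  message("No mecab-config on PATH; MeCab would be built from source (MECAB_LANG = ",
          Sys.getenv("MECAB_LANG", unset = "ko"), ").")
}
```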
RcppMeCab automatically downloads and builds MeCab from source if it is not already installed on your system. No manual MeCab installation is required.
install.packages("RcppMeCab") # install from CRAN
# or install the development version
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab")

If you already have MeCab installed (e.g. via brew install mecab on macOS, or apt install libmecab-dev on Linux), RcppMeCab will use your system installation.
Set MECAB_LANG before installation to choose the MeCab variant:
# Korean (default)
install.packages("RcppMeCab", type = "source")
# Japanese
Sys.setenv(MECAB_LANG = "ja")
install.packages("RcppMeCab", type = "source")

A MeCab dictionary is automatically downloaded and installed during package installation:
- Korean (MECAB_LANG=ko, default): mecab-ko-dic (pre-compiled, from mecab-ko-msvc releases)
- Japanese (MECAB_LANG=ja): IPAdic (compiled from source during installation)
The bundled dictionary is stored in the package's dic/ directory and used automatically — no manual dictionary setup is required.
You can download and install dictionaries for other languages after installation using download_dic(). No system-level MeCab installation is required — dictionary compilation is handled entirely within R.
download_dic("ja") # download and compile Japanese IPAdic
download_dic("ko") # download Korean mecab-ko-dic
download_dic("zh") # download and compile Chinese mecab-jieba

Dictionaries are stored in the user data directory (tools::R_user_dir("RcppMeCab", "data")) and persist across R sessions.
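If you want to inspect that directory by hand, the following sketch resolves it with base R and lists what is there. It assumes each downloaded language sits in its own subdirectory, which is a guess about the layout and not something the package guarantees.

```r
# tools::R_user_dir() requires R >= 4.0.0.
dic_root <- tools::R_user_dir("RcppMeCab", which = "data")

if (dir.exists(dic_root)) {
  # Assumption: one subdirectory per downloaded language.
  print(list.dirs(dic_root, full.names = FALSE, recursive = FALSE))
} else {
  message("No downloaded dictionaries found under ", dic_root)
}
```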
Use list_dic() to see all installed dictionaries:
list_dic()
#> lang name path active
#> 1 bundled bundled /path/to/RcppMeCab/dic TRUE
#> 2 ja ipadic ~/.local/share/R/RcppMeCab/ja FALSE
#> 3 ko mecab-ko-dic ~/.local/share/R/RcppMeCab/ko FALSE
#> 4 zh mecab-jieba ~/.local/share/R/RcppMeCab/zh FALSE

This package has pos and posParallel functions.
pos(sentence) # returns a list
pos(sentence, join = FALSE) # morphemes only (tags as vector names)
pos(sentence, format = "data.frame") # returns a data frame
pos(sentence, user_dic = "path") # with a compiled user dictionary
posParallel(sentence) # parallelized, faster for large inputs

Use the lang parameter to select a dictionary by language:
pos("東京は日本の首都です。", lang = "ja")
pos("안녕하세요", lang = "ko")
pos("我是中国人。", lang = "zh")

Or set a default with set_dic():
set_dic("ja")
pos("東京は日本の首都です。") # uses Japanese dictionary
set_dic("ko")
pos("안녕하세요") # uses Korean dictionary
set_dic("bundled") # switch back to the build-time dictionary

You can also specify a custom dictionary path directly:
pos("text", sys_dic = "/path/to/custom-dic")
options(mecabSysDic = "/path/to/custom-dic")

- sentence: text to analyze
- join: if TRUE (default), output is morpheme/tag; if FALSE, output is morpheme with tag as attribute
- format: "list" (default) or "data.frame"
- lang: language code ("ja", "ko", or "zh") to select a dictionary installed via download_dic(). Overrides sys_dic when specified.
- sys_dic: directory containing dicrc, sys.dic, etc. Set a default with options(mecabSysDic = "/path/to/dic")
- user_dic: path to a user dictionary compiled by dict_index()
Note: provide full paths for sys_dic and user_dic (no tilde ~/ expansion).
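One way to satisfy this is to expand the path in R before passing it on. The sketch below uses only base R; `full_path()` is a hypothetical helper, not part of RcppMeCab, and the dictionary location is made up for illustration.

```r
# Hypothetical helper: expand a leading "~" and normalize the result.
# With mustWork = FALSE, a path that does not exist yet is returned without an error.
full_path <- function(path) {
  normalizePath(path.expand(path), winslash = "/", mustWork = FALSE)
}

user_dic <- full_path("~/mecab/userdic.dic")  # hypothetical location

# Only call pos() when the package and the dictionary file are both available.
if (requireNamespace("RcppMeCab", quietly = TRUE) && file.exists(user_dic)) {
  print(RcppMeCab::pos("some text", user_dic = user_dic))
}
```

The same helper can be applied to a sys_dic directory before passing it to pos() or storing it in options(mecabSysDic = ...).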
RcppMeCab provides the dict_index() function to compile user dictionaries directly from R, without needing the mecab-dict-index command-line tool.
Prepare your entries as a CSV file (Japanese format, Korean format), then compile:
dict_index(
dic_csv = "entries.csv",
out_dic = "userdic.dic",
dic_dir = "/path/to/mecab-dic"
)
# Then use the compiled dictionary:
pos("some text", user_dic = "userdic.dic")

Junhewk Kim (junhewk.kim@gmail.com), Taku Kudo
Akiru Kato, Patrick Schratz