Basic Usage of HanLP

格雅百科 2023-10-09 13:33

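1. Introduction to HanLP

HanLP is an NLP toolkit made up of a collection of models and algorithms. Its goal is to bring natural language processing into widespread use in production environments. HanLP is feature-complete, efficient, cleanly structured, trained on up-to-date corpora, and customizable.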

GitHub: https://github.com/hankcs/HanLP
Official website: https://www.gyballet.com/

2. Download and Configuration

Add the dependency to your pom.xml file:

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.8.2</version>
</dependency>

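With this dependency in place, the basic features are available (everything except character-based word construction and dependency parsing).
Customization requires installing the data package and configuring the hanlp.properties file.
Data package: www.gyballet.com
HanLP's data is divided into dictionaries and models. Lexical analysis requires the dictionaries, and syntactic analysis requires the models. Users can add, delete, or replace them freely. If you do not need features such as syntactic parsing, you can delete the model folder at any time.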

data
│
├─dictionary
└─model

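3. File Configuration

The dictionary data and the hanlp.properties configuration file are placed in the project directory (the location is arbitrary, as long as the configuration file points to where the dictionary data lives).

When editing hanlp.properties, the settings that matter most are the following:

On Windows, you only need to set root to the location of the data package. To use custom words, add your custom dictionary file to CustomDictionaryPath.
On Linux, in addition to adjusting root and CustomDictionaryPath, the default IO adapter also needs to be overridden.
When the custom vocabulary is small, words can be added in code instead of being persisted to a dictionary file, e.g. CustomDictionary.insert("custom word", "part of speech");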

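# Location of the HanLP data package
# root=D:/JavaProjects/HanLP/
# root=/home/aword/
root=src/main/resources/
# Custom dictionary paths. Separate multiple dictionaries with ';'. A leading space means the file is in the same directory as the previous entry. The form "filename POS" makes that part of speech the default for the dictionary. Priority decreases from left to right.
# All dictionaries use UTF-8 encoding. Each line represents one word, in the format [word] [POS A] [frequency of A] [POS B] [frequency of B] ... If no part of speech is given, the dictionary's default part of speech is used.
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; 现代汉语补充词库.txt; 全国地名大全.txt ns; 人名词典.txt; 机构名词典.txt; 上海地名.txt ns;data/dictionary/person/nrf.txt nrf;
# The default IO adapter, based on the ordinary file system, is:
#IOAdapter=com.hankcs.hanlp.corpus.io.FileIOAdapter
# Override the adapter with a custom class
IOAdapter=com.aword.config.ResourceFileIoAdapter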
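The path rules stated in the comments above (';' as separator, a leading space meaning "same directory as the previous entry", and an optional trailing part of speech) can be illustrated with a small standalone sketch. This is not HanLP's own loader, only an approximation of the rule as described. The class name CustomDictionaryPathSketch, the root /opt/hanlp/, and the file names extra.txt and places.txt are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CustomDictionaryPathSketch {

    /** One resolved entry: full file path plus an optional default part of speech. */
    static final class Entry {
        final String path;
        final String defaultPos; // null when the entry names no part of speech

        Entry(String path, String defaultPos) {
            this.path = path;
            this.defaultPos = defaultPos;
        }
    }

    /** Resolves a CustomDictionaryPath-style value against the given root (assumed to end with '/'). */
    static List<Entry> resolve(String root, String value) {
        List<Entry> entries = new ArrayList<>();
        String previousDir = root;
        for (String raw : value.split(";")) {
            if (raw.trim().isEmpty()) {
                continue; // skip empty segments such as the one after a trailing ';'
            }
            boolean sameDir = raw.startsWith(" "); // leading space: reuse the previous directory
            String item = raw.trim();
            String defaultPos = null;
            int space = item.indexOf(' ');
            if (space > 0) { // "filename POS" form
                defaultPos = item.substring(space + 1).trim();
                item = item.substring(0, space);
            }
            String path;
            if (sameDir) {
                path = previousDir + item;
            } else {
                path = root + item;
                int slash = path.lastIndexOf('/');
                if (slash >= 0) {
                    previousDir = path.substring(0, slash + 1);
                }
            }
            entries.add(new Entry(path, defaultPos));
        }
        return entries;
    }

    public static void main(String[] args) {
        String value = "data/dictionary/custom/CustomDictionary.txt; extra.txt;"
                + " places.txt ns;data/dictionary/person/nrf.txt nrf;";
        for (Entry e : resolve("/opt/hanlp/", value)) {
            System.out.println(e.path + " (default POS: " + e.defaultPos + ")");
        }
    }
}
```

With the hypothetical value above, extra.txt and places.txt resolve into data/dictionary/custom/ under the root, while nrf.txt starts a new directory and carries nrf as its default part of speech.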


package com.aword.config;

import com.hankcs.hanlp.corpus.io.IIOAdapter;
import org.springframework.core.io.ClassPathResource;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** IO adapter that reads HanLP data files from the classpath instead of the plain file system. */
public class ResourceFileIoAdapter implements IIOAdapter {

    @Override
    public InputStream open(String path) throws IOException {
        // return new FileInputStream(new ClassPathResource(path).getFile());
        return this.getClass().getClassLoader().getResourceAsStream(path);
    }

    @Override
    public OutputStream create(String path) throws IOException {
        return new FileOutputStream(new ClassPathResource(path).getFile());
    }

    // Alternative implementation that caches files in a temporary directory:
    //
    // @Override
    // public InputStream open(String path) throws IOException {
    //     String tempDir = Files.createTempDirectory("hanlp").toAbsolutePath().toString();
    //     String cachePath = new File(tempDir + "/" + path).getPath().intern();
    //     if (IOUtil.isFileExisted(cachePath)) {
    //         return new FileInputStream(cachePath);
    //     }
    //     InputStream inputStream = IOUtil.getResourceAsStream("/" + path);
    //     return inputStream;
    // }
    //
    // @Override
    // public OutputStream create(String path) throws IOException {
    //     String tempDir = Files.createTempDirectory("hanlp").toAbsolutePath().toString();
    //     String cachePath = new File(tempDir + "/" + path).getPath().intern();
    //     if (IOUtil.isResource(path)) {
    //         mkdir(cachePath);
    //         return new FileOutputStream(cachePath);
    //     }
    //     FileOutputStream fileOutputStream = new FileOutputStream(path);
    //     return fileOutputStream;
    // }
    //
    // private void mkdir(String cachePath) {
    //     if (new File(cachePath).exists()) {
    //         return;
    //     }
    //     String dir = cachePath.endsWith(File.separator) ? cachePath : StringUtils.substringBeforeLast(cachePath, File.separator);
    //     new File(dir).mkdirs();
    // }
}

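4. Basic Usage

The utility class HanLP gives quick access to almost all of HanLP's functionality. When you cannot remember a method name, just type HanLP. and the IDE will offer completions and show HanLP's full documentation. All demos are located under com.hankcs.demo.
HanLP part-of-speech table: HanLP part-of-speech tag set
First demo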

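import com.hankcs.hanlp.tokenizer.SpeedTokenizer;

public class DemoHighSpeedSegment
{
    public static void main(String[] args)
    {
        String text = "江西鄱阳湖干涸了，中国最大的淡水湖变成了草原";
        System.out.println(SpeedTokenizer.segment(text));
        // Segment the same sentence repeatedly to measure throughput
        long start = System.currentTimeMillis();
        int pressure = 1000000;
        for (int i = 0; i < pressure; ++i)
        {
            SpeedTokenizer.segment(text);
        }
        double costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("Segmentation speed: %.2f characters per second", text.length() * pressure / costTime);
    }
}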

Custom dictionary
For a detailed explanation of the algorithm, see "Trie树分词" (Trie-tree word segmentation).
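As a rough illustration of the idea behind trie-based dictionary matching, here is a minimal sketch of forward longest-match segmentation over a plain hash-map trie. It is a simplification for explanation only: HanLP itself relies on more compact structures (the demo below refers to AhoCorasickDoubleArrayTrie), and the class name TrieSegmentSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TrieSegmentSketch {

    /** A trie node: child nodes keyed by character, plus a flag marking the end of a word. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    /** Adds a word to the dictionary. */
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** At each position, emits the longest dictionary word found, or a single character if none matches. */
    List<String> segment(String text) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int longestEnd = i + 1; // fall back to one character
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break; // no dictionary word continues with this character
                }
                if (node.isWord) {
                    longestEnd = j + 1; // remember the longest match so far
                }
            }
            result.add(text.substring(i, longestEnd));
            i = longestEnd;
        }
        return result;
    }

    public static void main(String[] args) {
        TrieSegmentSketch trie = new TrieSegmentSketch();
        trie.add("攻城狮");
        trie.add("白富");
        trie.add("白富美");
        System.out.println(trie.segment("攻城狮迎娶白富美")); // prints [攻城狮, 迎, 娶, 白富美]
    }
}
```

Because the walk keeps going after the first hit, "白富美" wins over its prefix "白富", which is the behavior one usually wants from a user dictionary.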

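import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie;
import com.hankcs.hanlp.dictionary.CoreDictionary;
import com.hankcs.hanlp.dictionary.CustomDictionary;

/**
 * Demonstrates adding and removing user dictionary entries at runtime
 *
 * @author hankcs
 */
public class DemoCustomDictionary
{
    public static void main(String[] args)
    {
        // Add dynamically
        CustomDictionary.add("攻城狮");
        // Force insertion
        CustomDictionary.insert("白富美", "nz 1024");
        // Remove a word (try uncommenting)
        // CustomDictionary.remove("攻城狮");
        System.out.println(CustomDictionary.add("单身狗", "nz 1024 n 1"));
        System.out.println(CustomDictionary.get("单身狗"));

        String text = "攻城狮逆袭单身狗，迎娶白富美，走上人生巅峰"; // As if! Haha!

        // Scan the text for custom words with the AhoCorasickDoubleArrayTrie automaton
        final char[] charArray = text.toCharArray();
        CustomDictionary.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>()
        {
            @Override
            public void hit(int begin, int end, CoreDictionary.Attribute value)
            {
                System.out.printf("[%d:%d]=%s %s\n", begin, end, new String(charArray, begin, end - begin), value);
            }
        });

        // The custom dictionary takes effect in all tokenizers
        System.out.println(HanLP.segment(text));
    }
}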
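The attribute strings passed to CustomDictionary.insert and CustomDictionary.add in the demo, such as "nz 1024 n 1", follow the [POS A] [frequency of A] [POS B] [frequency of B] layout described in the hanlp.properties comments. The following standalone sketch shows one way to read such a string into part-of-speech and frequency pairs. It does not use HanLP's own parsing code, and the class name AttributeStringSketch is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AttributeStringSketch {

    /** Parses "POS freq POS freq ..." into an insertion-ordered map from part of speech to frequency. */
    static Map<String, Integer> parse(String attribute) {
        String[] tokens = attribute.trim().split("\\s+");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException("Expected POS/frequency pairs: " + attribute);
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i += 2) {
            result.put(tokens[i], Integer.parseInt(tokens[i + 1]));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("nz 1024 n 1")); // prints {nz=1024, n=1}
    }
}
```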

The examples above cover only a few basic tokenizers. For more tokenizers and further details, see the official HanLP documentation.
