NLP tasks frequently use jieba for Chinese word segmentation. How do you speed it up when the data volume is large? jieba segmentation cannot be accelerated with asyncio, since the work is CPU-bound rather than I/O-bound, but multiprocessing works well.
import jieba
import jieba.analyse
import multiprocessing

# Load the custom dictionaries and stop-word list
jieba.load_userdict("user_dic.txt")
jieba.load_userdict("cate_group.txt")
jieba.analyse.set_stop_words("stopwords_v1.txt")

def process_text(text):
    # Segment the text (full mode)
    words = jieba.cut(text, cut_all=True)
    # Keep tokens of 2 to 10 characters that are not pure digits
    filtered_words = [w for w in words if 2 <= len(w) <= 10 and not w.isdigit()]
    # Return the token list
    return filtered_words

# Create a process pool
pool = multiprocessing.Pool()

# The texts to segment
# texts = ["这是一段测试文本", "这是另一段测试文本"]
texts = data["new_text"]  # data is assumed to be a DataFrame loaded earlier
results = pool.map(process_text, texts)

# Show the results
results
Result:
[["估值", "有待", "修复", "煤炭", "平均", "市盈率", "美元"], ["国产", "医疗", "医疗器械", "器械", "行业", "发展", "迅速", "作为", "国内", "最大", "医疗", "医疗器械", "器械", "企业", "基本", "一枝", "一枝独秀", "独秀"], ["今日", "上海", "现货"], ["消息", "准备"],