
Natural language processing tasks often rely on jieba for word segmentation. How can you speed it up when the data volume is large? jieba segmentation cannot be accelerated with asyncio, because it is CPU-bound rather than I/O-bound, but multiprocessing works well.

import jieba
import jieba.analyse
import multiprocessing

# Load custom dictionaries and stop words
jieba.load_userdict("user_dic.txt")
jieba.load_userdict("cate_group.txt")
jieba.analyse.set_stop_words("stopwords_v1.txt")

def process_text(text):
    # Segment the text (full mode)
    words = jieba.cut(text, cut_all=True)
    # Drop tokens shorter than 2 or longer than 10 characters, and pure numbers
    filtered_words = [w for w in words if 2 <= len(w) <= 10 and not w.isdigit()]
    # Return the token list
    return filtered_words

# Create a process pool
pool = multiprocessing.Pool()
# The list of texts to process
# texts = ["这是一段测试文本", "这是另一段测试文本"]
texts = data["new_text"]
results = pool.map(process_text, texts)
# Show the results
results

Result:

[["估值", "有待", "修复", "煤炭", "平均", "市盈率", "美元"], ["国产",  "医疗",  "医疗器械",  "器械",  "行业",  "发展",  "迅速",  "作为",  "国内",  "最大",  "医疗",  "医疗器械",  "器械",  "企业",  "基本",  "一枝",  "一枝独秀",  "独秀"], ["今日", "上海", "现货"], ["消息", "准备"],


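jieba also ships a built-in parallel mode based on multiprocessing (documented in its README, not available on Windows): it splits the input text by newlines and segments the chunks across worker processes. A minimal sketch, with big_corpus.txt as a placeholder file name:

import jieba

jieba.enable_parallel(4)   # turn on parallel segmentation with 4 worker processes
with open("big_corpus.txt", encoding="utf-8") as f:
    # The text is split by lines and the pieces are segmented in parallel
    words = list(jieba.cut(f.read()))
jieba.disable_parallel()   # switch back to single-process segmentation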