A Relevancy Coefficient Adapted to Short Texts and Its Application
HE Hai-jiang
(Computer Center, Changsha University, Changsha 410003)
【Abstract】To address the Web spam that floods blog communities and BBS forums, a relevancy coefficient vector space model named cVSM is proposed. Its components serve as the features of a comment, and a support vector machine classification algorithm uses them to identify comment spam automatically. The cVSM includes a relevancy coefficient suited to short texts, which measures the degree of semantic relevance between a comment and its post. Experimental results on a Chinese blog test set and a Chinese BBS test set show that, compared with methods that use only the textual features of comments, applying the model improves F1 by at least 6%.
【Key words】blog; comment spam; support vector machine; text mining; relevancy coefficient
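
The abstract does not define the relevancy coefficient or the components of the cVSM, so the following is only an illustrative sketch of the general idea of scoring a short comment against its post. It substitutes a simple stand-in measure, the fraction of the comment's character bigrams that also occur in the post, which is not the coefficient proposed in the paper. The function names `char_bigrams`, `relevancy`, and `relevancy_features`, and the choice of title and body as the two components, are hypothetical.

```python
def char_bigrams(text):
    """Return the set of character bigrams in text, ignoring whitespace.

    Character bigrams avoid the need for a Chinese word segmenter
    (an assumption of this sketch, not a choice stated in the abstract).
    """
    chars = [c for c in text if not c.isspace()]
    return {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}


def relevancy(comment, post):
    """Stand-in relevancy score in [0.0, 1.0].

    Fraction of the comment's bigrams that also appear in the post.
    Normalizing by the comment alone keeps a short comment from being
    penalized for the length of the post.
    """
    comment_grams = char_bigrams(comment)
    if not comment_grams:
        return 0.0  # empty or one-character comment carries no bigrams
    post_grams = char_bigrams(post)
    return len(comment_grams & post_grams) / len(comment_grams)


def relevancy_features(comment, title, body):
    """Hypothetical feature vector: relevancy to the post title and body."""
    return [relevancy(comment, title), relevancy(comment, body)]
```

A vector such as the one returned by `relevancy_features` could then be passed, alone or alongside text features, to any support vector machine implementation for spam classification.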