Ourren

关注技术,记录生活.

NLTK笔记:词干提取与词性标注

| 留言

前面虽然对「nltk分句和分词」进行了分析和实现,但是在进一步进行处理之前,我们需要对这些词语进行规范化处理,例如我们需要统一所有单词的大小写,词语时态问题「相同的单词在不同的时态句型中属于不同的单词,我们需要转换为同一单词」等。nltk则针对问题提供了相关的类库,主要有两类操作:词干提取,词性标注。

词干提取

词干提取主要是将不同时态的词语转变为单词一般时态,例如seeing->see,extremely->extem,目前提供的类库在一定程度上可以实现这类转换「有些词语暂时不行,主要有些模块是基于字典的」。

nltk提供三种方法可以实现如上功能:Porter、Lancaster、WordNet。使用都相当简单,具体使用方法如下:

#! /usr/bin/env python
/# encoding: utf-8

"""
Copyright (c) 2014-2015 ourren
author: ourren <i@ourren.com>
"""
import nltk

def main():
    print 'enter main...'
    sentence = "Listen, strange women lying in ponds \
    distributing swords. Supreme executive power derives \
    from a mandate from the masses, not from some farcical aquatic ceremony"
    sentences = nltk.sent_tokenize(sentence)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    for sent in sentences:
        # Porter
        porter = nltk.PorterStemmer()
        words = [porter.stem(word) for word in sent]
        print words

        # Lancaster
        lancaster = nltk.LancasterStemmer()
        lwords = [lancaster.stem(t) for t in sent]
        print lwords

        # WordNet
        wnl = nltk.WordNetLemmatizer()
        wwrods = [wnl.lemmatize(t) for t in sent]
        print wwrods

if __name__ == "__main__":
    main()

词性标注

词性指作为划分词类的根据的词的特点。现代汉语的词可以分为两类12种词性。一类是实词:名词、动词、形容词、数词、量词和代词。一类是虚词:副词、介词、连词、助词、拟声词和叹词。因此nltk对应的pos_tag模块也是实现这类功能,将一个句子中的词性进行标注;

nltk中将词汇按它们的词性(parts-of-speec h,POS)分类以及相应的标注它们的过程被称为词性标注(part-of-speech tagging, POS tagging)或干脆简称标注。其中标注结果中缩写词所代表的词性如下:

ADJ     adjective   new, good, high, special, big, local
ADV     adverb  really, already, still, early, now
CNJ     conjunction and, or, but, if, while, although
DET     determiner  the, a, some, most, every, no
EX      existential there, there’s
FW      foreign word    dolce, ersatz, esprit, quo, maitre
MOD     modal verb  will, can, would, may, must, should
N       noun    year, home, costs, time, education
NP      proper noun Alison, Africa, April, Washington
NUM     number  twenty-four, fourth, 1991, 14:24
PRO     pronoun he, their, her, its, my, I, us
P       preposition on, of, at, with, by, into, under
TO      the word to to
UH      interjection    ah, bang, ha, whee, hmpf, oops
V       verb    is, has, get, do, make, see, run
VD      past tense  said, took, told, made, asked
VG      present participle  making, going, playing, working
VN      past participle given, taken, begun, sung
WH      wh determiner   who, which, when, what, where, how

下面看具体例如:

def sentence_pos(sentences):
    for sent in sentences:
        words = nltk.pos_tag(sent)
        print words

完整代码可以见learn_nltk