Converting a word between verb/noun/adjective forms
Problem description:
I want a Python library function that can translate/convert a word across different parts of speech. Sometimes it should output multiple words (e.g. "coder" and "code" are both nouns derived from the verb "to code", one being the subject and the other the object).
# :: String => List of String
print(verbify('writer'))      # => ['write']
print(nounize('written'))     # => ['writer']
print(adjectivate('write'))   # => ['written']
For the note-taking program I want to write, I mostly care about verb <=> noun. For example, I could write "caffeine antagonizes A1" or "caffeine is an A1 antagonist", and with some NLP it could figure out that they mean the same thing. (I know that isn't easy, and that it will take NLP that can parse rather than just tag, but I want to hack up a prototype.)
Similar question:
Converting adjectives and adverbs to their noun forms
(That answer only goes down to the root POS; I want to go between POSes.)
P.S. In linguistics this is called conversion: http://en.wikipedia.org/wiki/Conversion_%28linguistics%29
Solution 1:
This is more of a heuristic approach. I've just coded it, so forgive the style. It uses derivationally_related_forms() from WordNet. I've implemented nounify; I guess verbify works analogously. From what I've tested it works pretty well:
from nltk.corpus import wordnet as wn

def nounify(verb_word):
    """ Transform a verb to the closest noun: die -> death """
    verb_synsets = wn.synsets(verb_word, pos="v")
    # Word not found
    if not verb_synsets:
        return []
    # Get all verb lemmas of the word
    verb_lemmas = [l for s in verb_synsets
                   for l in s.lemmas if s.name.split('.')[1] == 'v']
    # Get related forms
    derivationally_related_forms = [(l, l.derivationally_related_forms())
                                    for l in verb_lemmas]
    # filter only the nouns
    related_noun_lemmas = [l for drf in derivationally_related_forms
                           for l in drf[1] if l.synset.name.split('.')[1] == 'n']
    # Extract the words from the lemmas
    words = [l.name for l in related_noun_lemmas]
    len_words = len(words)
    # Build the result in the form of a list containing tuples (word, probability)
    result = [(w, float(words.count(w)) / len_words) for w in set(words)]
    result.sort(key=lambda w: -w[1])
    # return all the possibilities sorted by probability
    return result
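The answer only spells out nounify. A mirror-image verbify is not given, so here is my own sketch of what it could look like, assuming NLTK 3.x (where lemmas, name and synset are methods, as Solution 2 below points out) and simply walking the derivationally related forms in the opposite direction:

from nltk.corpus import wordnet as wn

def verbify(noun_word):
    """ Sketch: transform a noun to the closest verb, e.g. death -> die """
    noun_synsets = wn.synsets(noun_word, pos="n")
    if not noun_synsets:
        return []
    # Get all noun lemmas of the word
    noun_lemmas = [l for s in noun_synsets
                   for l in s.lemmas() if s.name().split('.')[1] == 'n']
    # Follow derivationally related forms and keep only the verb lemmas
    related_verb_lemmas = [rl for l in noun_lemmas
                           for rl in l.derivationally_related_forms()
                           if rl.synset().name().split('.')[1] == 'v']
    words = [l.name() for l in related_verb_lemmas]
    if not words:
        return []
    # Rank candidates by how often they appear among the related forms
    result = [(w, words.count(w) / len(words)) for w in set(words)]
    result.sort(key=lambda w: -w[1])
    return result

# e.g. verbify('death') should put 'die' near the top of the returned list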
Solution 2:
Here is a function that, in theory, can convert words between noun/verb/adjective/adverb forms. I updated it from here (I believe it was originally written by bogs) to comply with nltk 3.2.5, now that synset.lemmas and synset.name are functions.
from nltk.corpus import wordnet as wn

# Just to make it a bit more readable
WN_NOUN = 'n'
WN_VERB = 'v'
WN_ADJECTIVE = 'a'
WN_ADJECTIVE_SATELLITE = 's'
WN_ADVERB = 'r'

def convert(word, from_pos, to_pos):
    """ Transform words given from/to POS tags """
    synsets = wn.synsets(word, pos=from_pos)
    # Word not found
    if not synsets:
        return []
    # Get all lemmas of the word (consider 'a' and 's' equivalent)
    lemmas = []
    for s in synsets:
        for l in s.lemmas():
            if (s.name().split('.')[1] == from_pos
                    or from_pos in (WN_ADJECTIVE, WN_ADJECTIVE_SATELLITE)
                    and s.name().split('.')[1] in (WN_ADJECTIVE, WN_ADJECTIVE_SATELLITE)):
                lemmas += [l]
    # Get related forms
    derivationally_related_forms = [(l, l.derivationally_related_forms()) for l in lemmas]
    # filter only the desired pos (consider 'a' and 's' equivalent)
    related_noun_lemmas = []
    for drf in derivationally_related_forms:
        for l in drf[1]:
            if (l.synset().name().split('.')[1] == to_pos
                    or to_pos in (WN_ADJECTIVE, WN_ADJECTIVE_SATELLITE)
                    and l.synset().name().split('.')[1] in (WN_ADJECTIVE, WN_ADJECTIVE_SATELLITE)):
                related_noun_lemmas += [l]
    # Extract the words from the lemmas
    words = [l.name() for l in related_noun_lemmas]
    len_words = len(words)
    # Build the result in the form of a list containing tuples (word, probability)
    result = [(w, float(words.count(w)) / len_words) for w in set(words)]
    result.sort(key=lambda w: -w[1])
    # return all the possibilities sorted by probability
    return result
convert('direct', 'a', 'r')
convert('direct', 'a', 'n')
convert('quick', 'a', 'r')
convert('quickly', 'r', 'a')
convert('hunger', 'n', 'v')
convert('run', 'v', 'a')
convert('tired', 'a', 'r')
convert('tired', 'a', 'v')
convert('tired', 'a', 'n')
convert('tired', 'a', 's')
convert('wonder', 'v', 'n')
convert('wonder', 'n', 'a')
As you can see below, it doesn't work very well. It can't switch between adjective and adverb forms (which was my specific goal), but it does give some interesting results in the other cases.
>>> convert('direct', 'a', 'r')
[]
>>> convert('direct', 'a', 'n')
[('directness', 0.6666666666666666), ('line', 0.3333333333333333)]
>>> convert('quick', 'a', 'r')
[]
>>> convert('quickly', 'r', 'a')
[]
>>> convert('hunger', 'n', 'v')
[('hunger', 0.75), ('thirst', 0.25)]
>>> convert('run', 'v', 'a')
[('persistent', 0.16666666666666666), ('executive', 0.16666666666666666), ('operative', 0.16666666666666666), ('prevalent', 0.16666666666666666), ('meltable', 0.16666666666666666), ('operant', 0.16666666666666666)]
>>> convert('tired', 'a', 'r')
[]
>>> convert('tired', 'a', 'v')
[]
>>> convert('tired', 'a', 'n')
[('triteness', 0.25), ('banality', 0.25), ('tiredness', 0.25), ('commonplace', 0.25)]
>>> convert('tired', 'a', 's')
[]
>>> convert('wonder', 'v', 'n')
[('wonder', 0.3333333333333333), ('wonderer', 0.2222222222222222), ('marveller', 0.1111111111111111), ('marvel', 0.1111111111111111), ('wonderment', 0.1111111111111111), ('question', 0.1111111111111111)]
>>> convert('wonder', 'n', 'a')
[('curious', 0.4), ('wondrous', 0.2), ('marvelous', 0.2), ('marvellous', 0.2)]
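A small usage note of my own, not part of the original answer: convert returns a probability-sorted list of (word, probability) tuples and needs the WordNet corpus to be downloaded once, so a thin wrapper such as the hypothetical best_conversion below is handy when you only want the single most likely form:

import nltk
nltk.download('wordnet')  # WordNet data must be available once per environment

def best_conversion(word, from_pos, to_pos):
    # Hypothetical helper: keep only the most probable candidate, if any
    candidates = convert(word, from_pos, to_pos)
    return candidates[0][0] if candidates else None

print(best_conversion('hunger', 'n', 'v'))  # -> 'hunger' (see the output above)
print(best_conversion('wonder', 'v', 'n'))  # -> 'wonder'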
Hope this saves you some trouble.
Solution 3:
I know this doesn't answer your whole question, but it does answer a large part of it. I would check out
http://nodebox.net/code/index.php/Linguistics#verb_conjugation
This Python library can conjugate verbs and recognize whether a word is a verb, noun, or adjective.
Example code
print en.verb.present("gave")
print en.verb.present("gave", person=3, negate=False)
>>> give
>>> gives
It can also categorize words:
print en.is_noun("banana")
>>> True
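Along the same lines, a sketch of my own (NodeBox Linguistics is Python 2 only, and I am assuming its documented en.verb.infinitive and en.is_verb calls here) that goes from an inflected form back to the base verb:

import en

print en.verb.infinitive("gave")   # -> 'give'
print en.is_verb("give")           # -> True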
The download is at the top of the linked page.
Solution 4:
One approach might be to use a dictionary that maps words to their POS tags and word forms. If you obtain or build such a dictionary (which is entirely possible if you have access to any conventional dictionary's data, since every dictionary lists a word's POS tags as well as the base forms of all derived forms), you could use something like the following:
def is_verb(word):
    if word:
        tags = pos_tags(word)
        return ('VB' in tags or 'VBP' in tags or 'VBZ' in tags
                or 'VBD' in tags or 'VBN' in tags)

def verbify(word):
    if is_verb(word):
        return word
    else:
        forms = []
        for tag in pos_tags(word):
            base = word_form(word, tag[:2])
            if is_verb(base):
                forms.append(base)
        return forms
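To make that concrete, here is a minimal sketch of my own of the kind of lexicon those helpers assume; pos_tags, word_form, the LEXICON/BASE_FORMS tables and their contents are all illustrative, not a real dictionary API:

LEXICON = {                     # word -> POS tags the dictionary lists for it
    'write':   ['VB'],
    'writer':  ['NN'],
    'written': ['VBN', 'JJ'],
}
BASE_FORMS = {                  # (word, POS) -> base form recorded by the dictionary
    ('writer', 'NN'): 'write',
    ('written', 'JJ'): 'write',
}

def pos_tags(word):
    return LEXICON.get(word, [])

def word_form(word, pos_prefix):
    return BASE_FORMS.get((word, pos_prefix), word)

print(verbify('writer'))  # -> ['write']
print(verbify('write'))   # -> 'write' (already a verb, returned as-is)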
Solution 5:
Since language models are fairly capable these days, a good approach might be to find the verb/noun/adjective with the smallest distance in vector space.
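One way that idea could be sketched (my own illustration: 'vectors.kv' is a placeholder for any pretrained gensim KeyedVectors file, and tagging a single word out of context with nltk.pos_tag is admittedly crude):

import nltk
from gensim.models import KeyedVectors

nltk.download('averaged_perceptron_tagger')
vectors = KeyedVectors.load('vectors.kv')  # placeholder: any pretrained embeddings

def nearest_with_pos(word, target_tag_prefix, topn=50):
    # Among the word's nearest neighbours in vector space, keep those whose
    # POS tag matches the requested prefix (e.g. 'NN', 'VB', 'JJ')
    neighbours = vectors.most_similar(word, topn=topn)
    return [(w, sim) for w, sim in neighbours
            if nltk.pos_tag([w])[0][1].startswith(target_tag_prefix)]

# e.g. nearest_with_pos('antagonize', 'NN') might surface 'antagonist' or 'antagonism'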