博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
MIT自然语言处理第四讲:标注(第一部分)
阅读量:6671 次
发布时间:2019-06-25

本文共 2973 字,大约阅读时间需要 9 分钟。

MIT自然语言处理第四讲:标注(第一部分)

自然语言处理:标注

Natural Language Processing: Tagging
作者:Regina Barzilay(MIT,EECS Department, November 15, 2004)
译者:( ,2009年2月24日)

上一讲主要内容回顾(Last time)

 语言模型(Language modeling):
  n-gram模型(n-gram models)
  语言模型评测(LM evaluation)
 平滑(Smoothing):
  打折(Discounting)
  回退(Backoff)
  插值(Interpolation)
本讲主要内容(Today):
 标注(Tagging)

一、 基本介绍

a) 标注问题(Tagging)
 i. 任务(Task): 在句子中为每个词标上合适的词性(Label each word in a sentence with its appropriate part of speech)
 ii. 输入(Input): Our enemies are innovative and resourceful , and so are we. They never stop thinking about new ways to harm our country and our people, and neither do we.
 iii. 输出(Output): Our/PRPenemies/NNSare/VBPinnovative/JJand/CCresourceful/JJ,/,and/CCso/RBare/VBwe/PRP?/?.They/PRPnever/RBstop/VBthinking/VBGabout/INnew/JJways/NNSto/TOharm/VBour/PROP country/NN and/CC our/PRP$ people/NN, and/CC neither/DT do/VB we/PRP.
b) Motivation
 i. 词性标注对于许多应用领域是非常重要的(Part-of-speech(POS) tagging is important for many applications)
  1. 语法分析(Parsing)
  2. 语言模型(Language modeling)
  3. 问答系统和信息抽取(Q&A and Information extraction)
  4. 文本语音转换(Text-to-speech)
 ii. 标注技术可用于各种任务(Tagging techniques can be used for a variety of tasks)
  1. 语义标注(Semantic tagging)
  2. 对话标注(Dialogue tagging)
c) 如何确定标记集(How to determine the tag set)?
 i. “The definition [of the parts of speech] are very far from having attained the degree of exactitude found in Euclidean geometry” Jespersen, The Philosophy of Grammar
 ii. 粗糙的词典类别划分基本达成一致至少对某些语言来说(Agreement on coarse lexical categories (at least, for some languages))
  1. 封闭类(Closed class): 介词,限定词,代词,小品词,助动词(prepositions, determiners, pronouns, particles, auxiliary verbs)
  2. 开放类(Open class): 名词,动词,形容词和副词(nouns, verbs, adjectives and adverbs)
 iii. 各种粒度的多种标记集(Multiple tag sets of various granularity)
  1. Penn tag set (45 tags), Brown tag set (87 tags), CLAWS2 tag set (132 tags)
  2. 示例:Penn Tree Tags
  标记(Tag) 说明(Description) 举例(Example)
  CC      conjunction     and, but
  DT      determiner      a, the
  JJ       adjective      red
  NN      noun, sing.      rose
  RB       adverb       quickly
  VBD     verb, past tense    grew
d) 标注难吗(Is Tagging Hard)?
 i. 举例:“Time flies like an arrow”
 ii. 许多单词可能会出现在几种不同的类别中(Many words may appear in several categories)
 iii. 然而,大多数单词似乎主要在一个类别中出现(However, most words appear predominantly in one category)
  1. “Dumb”标注器在给单词标注最常用的标记时获得了90%的准确率(“Dumb” tagger which assigns the most common tag to each word achieves 90% accuracy (Charniak et al., 1993))
  2. 对于90%的准确率我们满足吗(Are we happy with 90%)?
 iv. 标注的信息资源(Information Sources in Tagging):
  1. 词汇(Lexical): 观察单词本身(look at word itself)
  单词(Word) 名词(Noun) 动词(Verb) 介词(Preposition)
  flies      21      23      0
  like      10      30      21
  2. 组合(Syntagmatic): 观察邻近单词(look at nearby words)
  ——哪个组合更像(What is more likely): “DT JJ NN” or “DT JJ VBP“?

 

附:课程及课件pdf下载地址:

   http://people.csail.mit.edu/regina/6881/

转载于:https://www.cnblogs.com/renly/archive/2013/01/07/2849897.html

你可能感兴趣的文章
第二天的收获-----c中小问题
查看>>
【错误异常】 Maven出现错误No plugin found for prefix 'jetty' in the current
查看>>
扩展欧几里德算法
查看>>
openoffice启动8100端口
查看>>
cnetos 6.0下Chage的使用方法来提升系统安全级别
查看>>
tomcat启动没有8080端口
查看>>
ubuntu 16.04 安装lamp
查看>>
Javascript的匿名函数
查看>>
OC中类的属性与成员变量的区别
查看>>
SMTP命令邮件投递(无身份认证)
查看>>
Nginx + MySQL + PHP + Xcache + Memcached
查看>>
使用Windows live movie maker轻松与朋友分享视频
查看>>
我的友情链接
查看>>
数据库死锁的类型
查看>>
找水王
查看>>
grep及正则表达式
查看>>
MongoDB常用命令大全
查看>>
Python程序的执行过程
查看>>
Proxmox-VE搭配Ceph存储组建高可用虚拟化平台
查看>>
前端基础---JavaScript
查看>>