Image Caption Dataset


2020-01-14 | Updated 2023-10-31 | Machine Learning / Dataset | 5 minutes read (About 810 words)

Goals:

1. Required amount of data
2. Annotation standards
3. Annotation methods

Microsoft COCO Captions:

The data was collected through Amazon's Mechanical Turk (AMT) and then annotated.
“Each of our captions are also generated using human subjects on AMT.”

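Some other information (Caption Evaluation Server):

The server appears to score the quality of generated captions, but it seems to apply only to captions generated on COCO data, so this part is not analyzed here.
Section 3 of the paper introduces several evaluation metrics: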

BLEU
ROUGE
METEOR
CIDEr
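To make the list above more concrete, here is a minimal sketch of clipped (modified) n-gram precision, the quantity at the core of BLEU. It is only an illustration, not the evaluation server's code: the function names and the example captions are made up here, and full BLEU additionally combines several n-gram orders with a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    to the largest count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Hypothetical captions, already tokenized by whitespace.
candidate = "a man riding a horse".split()
references = [
    "a man rides a horse on a beach".split(),
    "a person riding a horse".split(),
]
p1 = clipped_precision(candidate, references, 1)  # unigram precision
p2 = clipped_precision(candidate, references, 2)  # bigram precision
print(p1, p2)
```

Clipping is what stops a degenerate caption such as "the the the" from scoring perfectly against a reference that contains "the" only once.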

Before evaluation, in the "Tokenization and preprocessing" step,
a tool is used to tokenize the captions:

Stanford PTBTokenizer in Stanford CoreNLP tools (version 3.4.1)

This tool mimics Penn Treebank 3 tokenization. Its reference and related link are as follows:

“The Stanford CoreNLP natural language processing toolkit,” in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60. related-link
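As a rough illustration of what Penn Treebank style tokenization does to a caption, the sketch below splits clitics such as "n't" and "'s" from the word they attach to. It is a simplified regex approximation, not the Stanford PTBTokenizer. The function name is hypothetical, and lowercasing and dropping punctuation tokens are assumptions made for this example rather than steps described in the notes above.

```python
import re
import string

PUNCTUATION = set(string.punctuation)

def simple_tokenize(caption):
    """Approximate PTB-style tokenization of a caption (illustration only)."""
    text = caption.lower().strip()  # assumption: captions are lowercased
    # Split "n't" from its host word, e.g. "isn't" -> "is n't".
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Split other common clitics, e.g. "dog's" -> "dog 's".
    text = re.sub(r"(\w)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text)
    # A token is a clitic, a run of word characters, or one punctuation mark.
    tokens = re.findall(r"n't|'\w+|\w+|[^\w\s]", text)
    # Assumption: single punctuation tokens are dropped before scoring.
    return [t for t in tokens if t not in PUNCTUATION]

tokens = simple_tokenize("A man isn't riding the dog's bike.")
print(tokens)
```

For real evaluation the actual Stanford tool should be used, since metric scores depend on the exact token boundaries.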

Dataset size:
