NLP IR
DocOIE Dataset
Document-level Open Information Extraction · ACL 2021 Findings · GitHub

DocOIE is a document-level, context-aware dataset for Open Information Extraction (OpenIE), comprising an evaluation subset and a training subset. The evaluation subset contains 800 expert-annotated sentences sampled from 80 documents across two domains: healthcare and transportation. The training subset contains 2,400 documents (1,200 per domain); all of their sentences are used to bootstrap pseudo labels for neural model training. Note: the training set provides document IDs only; the documents themselves are to be collected from PatFT.

800 Annotated sentences
2,400 Training documents
2 Domains
Cite this dataset
Kuicai Dong, Yilin Zhao, Aixin Sun, Jung-Jae Kim, Xiaoli Li. DocOIE: A Document-level Context-Aware Dataset for OpenIE. ACL 2021 Findings
IR DM
HSpam14 Dataset
14 Million Tweets · Spam Detection · SIGIR 2015 · ~308 MB uncompressed

HSpam14 is a collection of 14 million tweets for hashtag-oriented spam research. The dataset is a tab-delimited text file with three columns: tweet_id, label, and step. The label field takes one of three values: 0 (ham), 1 (spam), or -1 (ambiguous — could not be labelled even after manual inspection). The step field (1–6) records the annotation method used.

14M Tweets
74.7 MB Compressed size
308 MB Uncompressed
6 Annotation steps
Annotation Steps
Step 1
Manual annotation
Step 2
kNN-based annotation
Step 3
User-based annotation
Step 4
Domain-based annotation
Step 5
Reliable ham tweet detection
Step 6
EM-based annotation
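The three-column layout and the six annotation steps above can be read with a few lines of Python. This is a minimal sketch, not the authors' tooling: it assumes the columns appear in the order tweet_id, label, step, that any header row or malformed line can simply be skipped, and that tweet IDs are best kept as strings. The names `Record`, `read_hspam14`, and `summarize` are hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, TextIO

# Label and step codes as described on this page.
LABELS = {0: "ham", 1: "spam", -1: "ambiguous"}
STEPS = {
    1: "Manual annotation",
    2: "kNN-based annotation",
    3: "User-based annotation",
    4: "Domain-based annotation",
    5: "Reliable ham tweet detection",
    6: "EM-based annotation",
}


class Record(NamedTuple):
    tweet_id: str  # kept as text so large IDs are never altered
    label: int
    step: int


def read_hspam14(handle: TextIO) -> Iterator[Record]:
    """Yield one Record per well-formed tab-delimited line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # assumption: skip malformed lines
        tweet_id, label, step = row
        try:
            record = Record(tweet_id.strip(), int(label), int(step))
        except ValueError:
            continue  # assumption: a header row, if present, is skipped
        if record.label in LABELS and record.step in STEPS:
            yield record


def summarize(records: Iterable[Record]) -> Counter:
    """Count tweets per (label name, annotation step name)."""
    return Counter((LABELS[r.label], STEPS[r.step]) for r in records)


# Made-up sample rows; a real file would be opened with open(path, newline="").
sample = "111\t1\t1\n222\t0\t5\n333\t-1\t6\n444\t1\t1\n"
counts = summarize(read_hspam14(io.StringIO(sample)))
for (label, step), n in sorted(counts.items()):
    print(f"{label:9s} {step:30s} {n}")
```

Because the reader is a generator, the full 14-million-line file can be streamed without loading it into memory.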
Cite this dataset
Surendra Sedhai and Aixin Sun. HSpam14: A Collection of 14 Million Tweets for Hashtag-Oriented Spam Research. SIGIR 2015
IR NLP
Wikipedia Keyphraseness
4.3M phrases from English Wikipedia · SIGIR 2012 · SIGIR 2013 · CIKM 2012 · ~45 MB compressed

This dataset is a collection of keyphraseness values for phrases extracted from Wikipedia. The keyphraseness value Q(s) of a phrase s is the probability that it appears as anchor text in a Wikipedia article. The values were computed from the English Wikipedia dump of January 30, 2010. Phrases containing non-English characters are excluded, leaving 4,157,753 phrases, of which approximately 1.9 million have non-zero keyphraseness values. The dataset is released solely for research purposes; please cite at least one of the papers below if you use it.

4.34M Total phrases
4.16M English phrases
~1.9M Non-zero Q(s)
45 MB Compressed
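One way to make the keyphraseness definition concrete is to read Q(s) as a ratio of article counts: of the Wikipedia articles that contain the phrase s, the fraction in which s appears as anchor text. The sketch below assumes that reading. The function name `keyphraseness` and the counts passed to it are hypothetical and are not taken from the released file, whose exact layout is not described here.

```python
def keyphraseness(anchor_articles: int, containing_articles: int) -> float:
    """Estimate Q(s) as (articles where s is anchor text) / (articles containing s).

    Assumption: both arguments are article counts for the same phrase s.
    A phrase that occurs in no article is given Q(s) = 0.0.
    """
    if containing_articles <= 0:
        return 0.0
    if not 0 <= anchor_articles <= containing_articles:
        raise ValueError("anchor_articles must lie in [0, containing_articles]")
    return anchor_articles / containing_articles


# Made-up counts: a phrase found in 4 articles and linked in 3 of them.
print(keyphraseness(3, 4))   # 0.75
# A phrase that is never used as anchor text has zero keyphraseness.
print(keyphraseness(0, 10))  # 0.0
```

Under this reading, the roughly 1.9 million phrases with non-zero values are those that appear as anchor text at least once.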
Cite this dataset — at least one paper required
Chenliang Li, Aixin Sun, Jianshu Weng, Qi He. Exploiting Hybrid Contexts for Tweet Segmentation. SIGIR 2013
Chenliang Li, Aixin Sun, Anwitaman Datta. Twevent: Segment-based Event Detection from Tweets. CIKM 2012
Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, Bu-Sung Lee. TwiNER: Named Entity Recognition in Targeted Twitter Stream. SIGIR 2012
CV IR
Tag Visual-Representativeness
Image Tag Clarity · NUS-WIDE · ACM MM 2010 · WSM 2009 · Excel format

Normalised Image Tag Clarity (NITC) scores for the 5,981 most popular tags in the NUS-WIDE dataset, provided in Excel format. The clarity values measure the visual representativeness of social image tags. Note: values may differ slightly from those reported in WSM'09 because a different number of dummy tags was used for estimation (500 dummy tags were used for MM'10). The tag labels used in the MM'10 experiments are also available. Contact axsun AT ntu DOT edu DOT sg with questions about the paper or experimental results.

5,981 Tags scored
500 Dummy tags (MM'10)
Cite this dataset
Aixin Sun, Sourav S. Bhowmick. Quantifying Tag Representativeness of Visual Content of Social Images. ACM MM 2010 — Pages 471–480. Firenze, Italy.
Aixin Sun, Sourav S. Bhowmick. Image Tag Clarity: In Search of Visual-Representative Tags for Social Images. WSM 2009 (co-located with ACM MM) — Pages 19–26. Beijing, China.
NLP IR
Comments-Oriented Document Summarization
Blog Summarization · Readers' Feedback · SIGIR 2008 · CIKM 2007

The Blog Summarization dataset accompanies the SIGIR 2008 paper on comments-oriented document summarisation. It supports research on understanding documents through readers' feedback, using blog comments as signals for extracting representative summaries. Please refer to the paper for a detailed description of the dataset structure.

Cite this dataset
Meishan Hu, Aixin Sun, Ee-Peng Lim. Comments-Oriented Document Summarization: Understanding Documents with Readers' Feedback. SIGIR 2008 — Pages 291–298. Singapore.
Meishan Hu, Aixin Sun, Ee-Peng Lim. Comments-Oriented Blog Summarization by Sentence Extraction. CIKM 2007 — Pages 901–904. Lisboa, Portugal.
IR DM
Web Unit Mining (UnitSet)
Based on WebKB dataset · JASIST 2006 · CIKM 2003

UnitSet is the dataset used in the Web Unit Mining project and is built on the WebKB dataset from CMU. It supports research on finding and classifying subgraphs of web pages, and on mining homepage relationships at the web-unit level. Please cite one of the papers below when using this dataset.

Cite this dataset
Aixin Sun, Ee-Peng Lim. Web Unit Based Mining of Homepage Relationships. JASIST 57(3):394–407, February 2006
Aixin Sun, Ee-Peng Lim. Web Unit Mining: Finding and Classifying Subgraphs of Web Pages. CIKM 2003 — Pages 108–115. New Orleans, LA.
Need more information about any of these datasets? Drop an email to axsun AT ntu DOT edu DOT sg and I'll be happy to help.