Open research datasets released alongside published work. Note: Most recent datasets are hosted on external platforms and may not be listed directly on this page. For additional information, contact axsun AT ntu DOT edu DOT sg.
DocOIE is a document-level context-aware dataset for Open Information Extraction, comprising both evaluation and training subsets. The evaluation dataset contains 800 expert-annotated sentences sampled from 80 documents across two domains: healthcare and transportation. The training dataset contains 2,400 documents (1,200 per domain); all sentences in these documents are used to bootstrap pseudo labels for neural model training. Note: the training set includes document IDs only; the documents themselves are to be collected from PatFT.
HSpam14 is a collection of 14 million tweets for hashtag-oriented spam research. The dataset is a tab-delimited text file with three columns: tweet_id, label, and step. The label field takes one of three values: 0 (ham), 1 (spam), or -1 (ambiguous: could not be labelled even after manual inspection). The step field (1–6) records the annotation method used.
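The three-column layout described above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader: the names read_annotations, Annotation, and LABEL_NAMES are hypothetical, the tweet IDs in the sample are made up, and the handling of a possible header row is an assumption, since this page does not say whether the file has one.

```python
import csv
from collections import Counter
from typing import Iterable, Iterator, NamedTuple

# Label codes as described on this page.
LABEL_NAMES = {0: "ham", 1: "spam", -1: "ambiguous"}


class Annotation(NamedTuple):
    tweet_id: str  # kept as a string so large IDs are never reformatted
    label: int     # 0, 1, or -1
    step: int      # 1 to 6, the annotation method used


def read_annotations(lines: Iterable[str]) -> Iterator[Annotation]:
    """Parse tab-delimited rows of the form tweet_id, label, step."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        tweet_id, label, step = row
        if tweet_id == "tweet_id":
            continue  # assumption: tolerate a header row if one exists
        label_value, step_value = int(label), int(step)
        if label_value not in LABEL_NAMES or not 1 <= step_value <= 6:
            raise ValueError(f"unexpected label/step in row: {row}")
        yield Annotation(tweet_id, label_value, step_value)


# Made-up rows for illustration only; these are not real HSpam14 entries.
sample = [
    "100000000000000001\t0\t1",
    "100000000000000002\t1\t3",
    "100000000000000003\t-1\t6",
    "100000000000000004\t1\t3",
]
records = list(read_annotations(sample))
label_counts = Counter(LABEL_NAMES[r.label] for r in records)
print(label_counts)
```

For the real file, an open file object can be passed in place of the sample list, for example open(path, newline="", encoding="utf-8"), where path is wherever the downloaded file was saved.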
A collection of keyphraseness values for phrases extracted from Wikipedia. The keyphraseness value Q(s) of a phrase s is the probability that it appears as anchor text in a Wikipedia article. Extracted from the English Wikipedia dump of January 30, 2010. Phrases containing non-English characters are excluded, leaving 4,157,753 phrases — of which approximately 1.9 million have non-zero keyphraseness values. Released solely for research purposes; please cite at least one of the papers below if you use it.
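As an illustration of how such a probability might be estimated, the sketch below divides the number of articles in which a phrase appears as anchor text by the number of articles in which it appears at all. This ratio is an assumption made for illustration; this page does not state the exact estimator, so please consult the cited papers for the definition actually used. The function name and the counts in the example are hypothetical.

```python
def keyphraseness(anchor_article_count: int, total_article_count: int) -> float:
    """Estimate Q(s) as a ratio of article counts (assumed estimator).

    anchor_article_count: articles in which the phrase appears as anchor text
    total_article_count:  articles in which the phrase appears at all
    """
    if anchor_article_count < 0 or total_article_count < 0:
        raise ValueError("counts must be non-negative")
    if anchor_article_count > total_article_count:
        raise ValueError("anchor count cannot exceed total count")
    if total_article_count == 0:
        return 0.0  # a phrase that never appears gets zero keyphraseness
    return anchor_article_count / total_article_count


# Made-up counts: a phrase found in 120 articles, as anchor text in 30 of them.
example_value = keyphraseness(30, 120)
print(example_value)
```

Under this assumed estimator, a phrase that never appears as anchor text has a value of zero, which would be consistent with only part of the phrase list having non-zero values.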
Normalised Image Tag Clarity (NITC) scores for the 5,981 most popular tags from the NUS-WIDE dataset, available in Excel format. The clarity values are used to measure the visual representativeness of social image tags. Note: values may differ slightly from those reported in WSM'09 due to the number of dummy tags used for estimation (500 dummy tags were used for MM'10). Tag labels used in MM'10 experiments are also available. Contact axsun AT ntu DOT edu DOT sg for questions about the paper or experimental results.
The Blog Summarization dataset accompanies the SIGIR 2008 paper on comments-oriented document summarisation. It enables research into understanding documents through readers' feedback — leveraging blog comments as signals for extracting representative summaries. Please refer to the paper for a detailed description of the dataset structure.
UnitSet is the dataset used for the Web Unit Mining project, created based on the WebKB dataset from CMU. It supports research on finding and classifying subgraphs of web pages, and mining homepage relationships at the web-unit level. Please cite one of the papers below when using this dataset.