Datasets — Sun Aixin

NLP IR

DocOIE Dataset

◆ Document-level Open Information Extraction ◆ ACL 2021 Findings ◆ GitHub

DocOIE is a document-level, context-aware dataset for Open Information Extraction (OpenIE), comprising an evaluation subset and a training subset. The evaluation subset contains 800 expert-annotated sentences sampled from 80 documents across two domains: healthcare and transportation. The training subset contains 2,400 documents (1,200 per domain); all of their sentences are used to bootstrap pseudo labels for neural model training. Note: the training set includes only document IDs; the documents themselves are to be collected from PatFT.

800 Annotated sentences

2,400 Training documents

2 Domains

Cite this dataset

Kuicai Dong, Yilin Zhao, Aixin Sun, Jung-Jae Kim, Xiaoli Li. DocOIE: A Document-level Context-Aware Dataset for OpenIE. ACL 2021 Findings

PDF

IR DM

HSpam14 Dataset

◆ 14 Million Tweets · Spam Detection ◆ SIGIR 2015 ◆ ~308 MB uncompressed

⬇ Dropbox ⬇ OneDrive

HSpam14 is a collection of 14 million tweets for hashtag-oriented spam research. The dataset is a tab-delimited text file with three columns: tweet_id, label, and step. The label field takes one of three values: 0 (ham), 1 (spam), or -1 (ambiguous: the tweet could not be labelled even after manual inspection). The step field (1–6) records which of the six annotation steps produced the label.

14M Tweets

74.7 MB Compressed size

308 MB Uncompressed

6 Annotation steps

Annotation Steps

Step 1: Manual annotation

Step 2: kNN-based annotation

Step 3: User-based annotation

Step 4: Domain-based annotation

Step 5: Reliable ham tweet detection

Step 6: EM-based annotation
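
The file layout described above (three tab-separated columns: tweet_id, label, step) can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `read_hspam14` and the sample rows are made up for illustration, and whether the real file carries a header row is an assumption the code guards against rather than relies on.

```python
import csv
import io
from collections import Counter

# Label codes and step names as listed in the dataset description.
LABELS = {"0": "ham", "1": "spam", "-1": "ambiguous"}
STEPS = {
    1: "Manual annotation",
    2: "kNN-based annotation",
    3: "User-based annotation",
    4: "Domain-based annotation",
    5: "Reliable ham tweet detection",
    6: "EM-based annotation",
}


def read_hspam14(handle):
    """Yield (tweet_id, label_name, step) from a tab-delimited stream.

    Hypothetical helper: rows that do not have exactly three fields, an
    unknown label code, or a step outside 1-6 are skipped. This also
    skips a header row, should the file have one.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue
        tweet_id, label, step = (field.strip() for field in row)
        if label not in LABELS or not step.isdigit() or int(step) not in STEPS:
            continue
        yield tweet_id, LABELS[label], int(step)


# Made-up rows in the described format; the real file would be opened
# with open(path, newline="", encoding="utf-8") instead.
sample = io.StringIO("1001\t0\t5\n1002\t1\t1\n1003\t-1\t1\n1004\t1\t2\n")
rows = list(read_hspam14(sample))

label_counts = Counter(label for _, label, _ in rows)
step_counts = Counter(STEPS[step] for _, _, step in rows)
print(label_counts)
print(step_counts)
```

Because the function is a generator, the 14 million rows can be streamed without loading the whole file into memory. The tweet ID is kept as a string so that no precision is lost if the values are later passed through tools that treat large integers as floats.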

Cite this dataset

Surendra Sedhai and Aixin Sun. HSpam14: A Collection of 14 Million Tweets for Hashtag-Oriented Spam Research. SIGIR 2015

PDF

IR NLP

Wikipedia Keyphraseness

◆ 4.3M phrases from English Wikipedia ◆ SIGIR 2012 · SIGIR 2013 · CIKM 2012 ◆ ~45 MB compressed

⬇ Dropbox ⬇ OneDrive 📄 Readme

A collection of keyphraseness values for phrases extracted from Wikipedia. The keyphraseness value Q(s) of a phrase s is the probability that it appears as anchor text in a Wikipedia article. The values were extracted from the English Wikipedia dump of January 30, 2010. Phrases containing non-English characters are excluded, leaving 4,157,753 phrases, of which approximately 1.9 million have non-zero keyphraseness values. The collection is released solely for research purposes; please cite at least one of the papers below if you use it.

4.34M Total phrases

4.16M English phrases

~1.9M Non-zero Q(s)

45 MB Compressed
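
To make the definition of Q(s) concrete, the sketch below estimates it as a ratio of article counts: the number of articles in which s appears as anchor text, divided by the number of articles in which s appears at all. This estimator is an assumption made for illustration; the description above states only that Q(s) is a probability, so consult the Readme and the cited papers for the exact computation. The function name `keyphraseness` and all counts are hypothetical.

```python
def keyphraseness(anchor_articles: int, containing_articles: int) -> float:
    """Estimate Q(s) as anchor_articles / containing_articles.

    Assumed estimator, not taken from the dataset itself:
      anchor_articles     - articles where the phrase occurs as anchor text
      containing_articles - articles where the phrase occurs in any form
    """
    if anchor_articles < 0 or containing_articles < 0:
        raise ValueError("counts must be non-negative")
    if anchor_articles > containing_articles:
        raise ValueError("anchor count cannot exceed containing count")
    if containing_articles == 0:
        return 0.0  # phrase never seen: treat Q(s) as zero
    return anchor_articles / containing_articles


# Made-up counts: (articles with s as anchor text, articles containing s).
counts = {
    "new york": (80, 100),
    "of the": (0, 5000),
}
scores = {phrase: keyphraseness(a, c) for phrase, (a, c) in counts.items()}
print(scores)
```

Under this estimator, a phrase that is never used as anchor text gets Q(s) = 0, which would be consistent with only about 1.9 million of the 4,157,753 phrases having non-zero values.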

Cite this dataset — at least one paper required

Chenliang Li, Aixin Sun, Jianshu Weng, Qi He. Exploiting Hybrid Contexts for Tweet Segmentation. SIGIR 2013

PDF ACM

Chenliang Li, Aixin Sun, Anwitaman Datta. Twevent: Segment-based Event Detection from Tweets. CIKM 2012

PDF ACM

Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, Bu-Sung Lee. TwiNER: Named Entity Recognition in Targeted Twitter Stream. SIGIR 2012

PDF ACM

CV IR

Tag Visual-Representativeness

◆ Image Tag Clarity · NUS-WIDE ◆ ACM MM 2010 · WSM 2009 ◆ Excel format

⬇ Tag Clarity Scores ⬇ Tag Labels (MM'10)

Normalised Image Tag Clarity (NITC) scores for the 5,981 most popular tags from the NUS-WIDE dataset, available in Excel format. The clarity values measure the visual representativeness of social image tags. Note: values may differ slightly from those reported in WSM'09 because of the number of dummy tags used for estimation (500 dummy tags were used for MM'10). The tag labels used in the MM'10 experiments are also available. Contact axsun AT ntu DOT edu DOT sg with questions about the paper or experimental results.

5,981 Tags scored

500 Dummy tags (MM'10)

Cite this dataset

Aixin Sun, Sourav S. Bhowmick. Quantifying Tag Representativeness of Visual Content of Social Images. ACM MM 2010 — Pages 471–480. Firenze, Italy.

PDF

Aixin Sun, Sourav S. Bhowmick. Image Tag Clarity: In Search of Visual-Representative Tags for Social Images. WSM 2009 (co-located with ACM MM) — Pages 19–26. Beijing, China.

PDF

NLP IR

Comments-Oriented Document Summarization

◆ Blog Summarization · Readers' Feedback ◆ SIGIR 2008 · CIKM 2007

⬇ Download Dataset

The Blog Summarization dataset accompanies the SIGIR 2008 paper on comments-oriented document summarisation. It supports research on understanding documents through readers' feedback, using blog comments as signals for extracting representative summaries. Please refer to the paper for a detailed description of the dataset structure.

Cite this dataset

Meishan Hu, Aixin Sun, Ee-Peng Lim. Comments-Oriented Document Summarization: Understanding Documents with Readers' Feedback. SIGIR 2008 — Pages 291–298. Singapore.

PDF

Meishan Hu, Aixin Sun, Ee-Peng Lim. Comments-Oriented Blog Summarization by Sentence Extraction. CIKM 2007 — Pages 901–904. Lisboa, Portugal.

PDF

IR DM

Web Unit Mining (UnitSet)

◆ Based on WebKB dataset ◆ JASIST 2006 · CIKM 2003

⬇ UnitSet

UnitSet is the dataset used in the Web Unit Mining project and is derived from the WebKB dataset from CMU. It supports research on finding and classifying subgraphs of web pages, and on mining homepage relationships at the web-unit level. Please cite one of the papers below when using this dataset.

Cite this dataset

Aixin Sun, Ee-Peng Lim. Web Unit Based Mining of Homepage Relationships. JASIST 57(3):394–407, February 2006

PDF

Aixin Sun, Ee-Peng Lim. Web Unit Mining: Finding and Classifying Subgraphs of Web Pages. CIKM 2003 — Pages 108–115. New Orleans, LA.

PDF