Digital Intelligence Research Cluster
(incorporating Digital Library, Information Retrieval, Natural Language Processing & Human Computer Interaction research)

This research cluster focuses on developing intelligent text processing and retrieval technologies, and integrating them into an advanced search agent system and text mining tool bench. The search agent system being developed will function as a meta-search engine to perform intelligent retrieval in Web search engines, digital libraries and textual databases, and perform intelligent analysis and presentation of the information retrieved from these sources. The text mining tool bench will be developed to be used by social science researchers to perform computer-assisted content analysis of text.

Most search engines merely retrieve potentially relevant documents and display them in rank order, without performing sophisticated query processing, search strategy development, analysis of search results, or user modeling. The search agent to be developed by this research cluster will have the following capabilities:

1. Query processing and formulation

  • Collaborative querying: this technology seeks to improve a user’s search query by mining queries submitted by previous users and the search results retrieved by those queries.
  • Knowledge-based query expansion and query categorization: a knowledge base of keyword-subject heading association and keyword-subject classification association has been developed by mining 16 years of Library of Congress book records. This can now be used to expand a user’s query with related Library of Congress subject headings and related keywords, as well as identify the subject area of a query.
  • Expert system for Boolean search strategy formulation: an expert system for formulating a search strategy for Boolean retrieval systems has earlier been developed. This can be integrated into a search agent to apply to a larger number of Boolean retrieval systems.
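
The knowledge-based query expansion described above can be illustrated with a minimal sketch. It assumes a handful of invented book records (title keywords plus subject headings) standing in for the Library of Congress data, and it uses a plain co-occurrence count as the association strength; the cluster's actual knowledge base may be built quite differently. The names `RECORDS`, `build_associations` and `expand_query` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical book records: (title keywords, subject headings).
RECORDS = [
    ({"digital", "libraries", "design"}, {"Digital libraries"}),
    ({"digital", "libraries", "metadata"}, {"Digital libraries", "Metadata"}),
    ({"text", "retrieval", "metadata"}, {"Information retrieval", "Metadata"}),
    ({"web", "retrieval", "search"}, {"Information retrieval"}),
]

def build_associations(records):
    """Count how often each keyword co-occurs with each subject heading."""
    assoc = defaultdict(Counter)
    for keywords, headings in records:
        for kw in keywords:
            for heading in headings:
                assoc[kw][heading] += 1
    return assoc

def expand_query(query, assoc, top_n=2):
    """Return the subject headings most strongly associated with the query terms."""
    scores = Counter()
    for term in query.lower().split():
        scores.update(assoc.get(term, {}))
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [heading for heading, _ in ranked[:top_n]]

ASSOC = build_associations(RECORDS)
print(expand_query("digital metadata", ASSOC))
# -> ['Digital libraries', 'Metadata']
```

The top-ranked heading can also serve as a rough guess at the subject area of the query, which is the query categorization use mentioned above.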

2. Search result processing and mining

  • Collaborative filtering: further filtering, clustering and re-ranking of the retrieved documents by mining the queries previously submitted by other users and their search results.
  • Multi-document summarization: this summarizes a set of research abstracts retrieved by a search engine into a single summary, highlighting common and unique research concepts, methods and findings across the abstracts retrieved.
  • Information extraction: whereas search engines retrieve whole documents, information extraction technology analyzes the text to identify and extract particular facts, e.g. names of terrorists, treatments for a particular disease, etc. A technology is being developed to help users to develop linguistic patterns for extracting different types of facts.
  • Automatic text clustering and categorization: search engines mainly display a ranked list of the documents retrieved, though more advanced search engines and search agents cluster and categorize documents into subject groups. Several types of text categorization technology are being developed for topical categorization, genre categorization, sentiment categorization and document clustering, which can usefully be incorporated into the search agent to help users to zoom in on documents of interest.
  • Link and network analysis: Link and network analysis technology maps out networks of researchers, documents and concepts, and can be used to identify important documents as well as related documents.
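
As one illustration of how link analysis can identify important documents, the sketch below scores the nodes of a small citation network by power iteration. PageRank is used here only as a familiar stand-in; the text does not say which link analysis method the cluster uses. The citation graph and the `pagerank` helper are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score nodes by power iteration over a dict of node -> list of cited nodes."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A document that cites nothing spreads its score evenly.
                for t in nodes:
                    new_rank[t] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical citation graph: each paper lists the papers it cites.
CITATIONS = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": ["paper_d"],
    "paper_d": [],
}
SCORES = pagerank(CITATIONS)
for paper in sorted(SCORES, key=SCORES.get, reverse=True):
    print(paper, round(SCORES[paper], 3))
```

The same function applies unchanged to networks of researchers or concepts, since it only needs a mapping from each node to the nodes it links to.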

3. Search result presentation and interface design

Several types of human-computer interaction and interface design studies are being carried out. Specific types of interfaces being developed include:

  • Mobile interfaces for retrieving information on small-screen mobile devices
  • Multi-level interactive interface for displaying multi-document summaries
  • Visualization interface for displaying document/concept networks and social networks
  • Children’s interfaces
  • Virtual reality interface
  • Information retrieval interfaces for subjective relevance judgment and processing.

4. User modeling and profiling

Current research in user mental models, subjective relevance, and children’s information processing can be used to develop user profiling and personalization technologies for more personalized information retrieval and processing in the search agent.

5. Personal information management

Advanced search agents should provide facilities for the user to archive and manage the documents and information retrieved. Current research in Web annotation and digital archiving will develop technologies to allow users to annotate and archive documents. These techniques can also help to automatically seek and retrieve related documents on the Internet to augment the document sets. In addition, the text categorization and link analysis technologies mentioned earlier will help users to organize the archived documents.

The text mining and digital intelligence technologies developed in this research cluster are potentially useful for computer-assisted content analysis in social science research. We propose to integrate the technologies into a text mining tool bench with a unified Web interface tailored for social science researchers.

 
   Staff Members
A/P Christopher Khoo
A/P Dion Goh
Prof Schubert Foo
A/P Theng Yin Leng
Ast/P Na Jin Cheon
Dr Paul Wu Horng Jyh
Ast/P Chang Yun Ke
A/P Ravi Sharma
 
  Research Projects and Grants
G-Portal: A Digital Library Infrastructure for Distributed Geospatial Information
GeogDL: A Digital Library for Geography Examination Resources
MobiTOP: A System for the Mobile Tagging of Objects and People
Collaborative Querying in Web-based and Mobile Environments
ACRC Digital Library
An Information Retrieval Portal
A Digital Library of Historical Resources
Generating Executable Cognitive User Models
Design and Development of a Suite of Usability Engineering Tools for Digital Libraries on Mobile Environment and the Web
Bootstrapping a Machine Translation Dictionary for Cross-Language Information Retrieval
Intelligent Search Agent for Information Extraction and Synthesis on the Web
Automatic Multi-document Summarization of Research Abstracts
Mining of Disease-Treatment Information in a Medical Database to Support Evidence-Based Medicine
Automatic Identification of News Frames Using Machine-Learning
Automatic Sentiment Analysis & Categorization
Text Annotation and Encoding Tool for Content and Linguistic Analysis
Web Archiving
 
  Postgraduate Student Projects
M.A.Sc. and Ph.D. Projects
Completed Theses

Title of Project: G-Portal: A Digital Library Infrastructure for Distributed Geospatial Information

Investigators: A/P Lim Ee Peng (School of Computer Engineering), A/P Dion Goh, A/P Theng Yin Leng

Funding: SingAREN

Description: G-Portal is an ongoing digital library project involving the School of Computer Engineering at Nanyang Technological University and staff at the Division of Information Studies. The aims of the project include the identification, classification and organisation of geospatial and georeferenced content on the Web, and the provision of digital services such as searching and visualisation. In addition, authorised users may contribute resources so that G-Portal becomes a common environment for knowledge sharing.

Research areas that this project addresses include:

  • the development of a reusable software architecture for building geospatial digital library applications
  • usability issues related to designing interfaces for access to geospatial information
  • querying of geospatial data
  • classification of geospatial information
  • knowledge sharing and community building

Title of Project: GeogDL: A Digital Library for Geography Examination Resources

Investigators: A/P Dion Goh, A/P Theng Yin Leng, A/P Lim Ee Peng (School of Computer Engineering)

Funding: SingAREN

Description: GeogDL is a digital library application built on top of G-Portal. The aim of this project is to assist students in revising for the GCE 'O' Level Geography Examination, an annual national examination conducted by the Ministry of Education in Singapore. The digital library contains past-year examination questions and solutions, supplemented with additional geographical content for students to explore.

Research issues being addressed include:

  • metadata models for describing educational content
  • user interface design
  • collaborative environments for authoring and sharing of information

Title of Project: MobiTOP: A System for the Mobile Tagging of Objects and People.

Investigators: A/P Dion Goh, A/P Theng Yin Leng, A/P Lim Ee Peng (SCE), Ast/P Sun Aixin (SCE), A/P Kalyani Chatterjea (NIE), Ast/P Chang Chew Hung (NIE)

Funding: A*STAR

Description: An A*STAR funded project to develop techniques for the creation, management, analysis and discovery of mobile tags, which are media-rich information applied to real-world objects and people. Research areas include user profiling, tag modeling and recommendation, and user interface design. These deliverables will culminate in the implementation of a mobile tagging system known as MobiTOP (Mobile Tagging of Objects and People). Working with pedagogy experts, the system will be deployed and tested in the context of geography education. The project draws upon earlier G-Portal work on geospatial data management and visualization.


Title of Project: Collaborative Querying in Web-based and Mobile Environments

Investigators: A/P Dion Goh & Prof Schubert Foo

Funding: NTU AcRF

The objectives of the project are to design and implement tools and techniques to support collaboration in information retrieval environments. Known also as collaborative querying, this approach aims to assist users in formulating queries to meet their information needs by harnessing other users’ expert knowledge or search experience. The project will: 1. Develop and evaluate algorithms for collaborative querying and mining of query logs using supervised and unsupervised machine learning techniques; 2. Identify information needs from query logs; 3. Design and evaluate user interfaces for collaborative querying. A collaborative querying system for Web and mobile environments will be implemented, including a suite of tools for automatic preprocessing of query logs, mining of queries for collaborative querying, information retrieval functions, and user interfaces. User evaluation will also be conducted to ensure that the system is both useful and usable.
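
To make the idea of mining query logs for collaborative querying concrete, here is a minimal sketch that recommends past queries similar to the current one. It assumes a toy query log that records the set of result URLs for each query, and it scores similarity as an equal-weight mix of term overlap and result overlap (both Jaccard coefficients). The log, the weights, the threshold and the function names are illustrative assumptions, not the project's actual algorithm.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical query log: query text -> set of result URLs it retrieved.
QUERY_LOG = {
    "digital library usability": {"u1", "u2", "u3"},
    "evaluating digital libraries": {"u2", "u3", "u4"},
    "usability testing tools": {"u3", "u5"},
    "chinese text segmentation": {"u8", "u9"},
}

def related_queries(query, results, log, weight=0.5, threshold=0.2):
    """Rank logged queries by a weighted mix of term and result overlap."""
    terms = set(query.lower().split())
    scored = []
    for past_query, past_results in log.items():
        if past_query == query:
            continue
        term_sim = jaccard(terms, set(past_query.split()))
        result_sim = jaccard(results, past_results)
        score = weight * term_sim + (1 - weight) * result_sim
        if score >= threshold:
            scored.append((round(score, 3), past_query))
    # Highest score first; ties broken alphabetically.
    return [q for _, q in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

SUGGESTIONS = related_queries("digital library usability", {"u1", "u2", "u3"}, QUERY_LOG)
print(SUGGESTIONS)
# -> ['evaluating digital libraries', 'usability testing tools']
```

Including result overlap lets the sketch relate queries that share few words but retrieve the same documents, which is one way other users' search experience can be reused.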


Title of Project: An Information Retrieval Portal

Investigators: Prof Schubert Foo and A/P Dion Goh

Description: Currently, information retrieval resources are scattered across various web sites, making it difficult for researchers to access them efficiently. In addition, while Java is fast becoming a popular language among developers, there are very few web sites offering Java-based information retrieval source code. This project thus aims to develop a portal devoted to information retrieval, with emphasis on source code in the Java programming language.

The project uses a Java-based open source portal solution named JetSpeed that is part of the Apache project. Consequently, while the creation of the portal is a major goal, this project also aims to build a comprehensive, extensible portal infrastructure based on JetSpeed that is reusable across various domains.

Work on the project includes:

  • Identifying areas of improvement in JetSpeed
  • Implementation of a document publication and review system
  • Development of annotation and rating/voting systems
  • Development of an extensible architecture for interfacing with different full text retrieval engines

Title of Project: A Digital Library of Historical Resources

Investigators: The Division of Information Studies and National Archives of Singapore

Description: This is a Division-wide project that is being conducted in collaboration with the National Archives of Singapore. The project seeks to build a Web-based digital library of Singapore's history, containing historical multimedia resources obtained from the NAS. Such resources are broadly categorised into textual documents, images, audio and video.

In addition to delivering a system for public use, this project will also utilise these multimedia resources as a test-bed for conducting exploratory research and building advanced systems in a variety of areas. These areas are intentionally broad to leverage on the strengths of the Division, and include:

  • Digital library architectures
  • Information organisation and metadata
  • Information retrieval algorithms and engines
  • Information exploration environments
  • Authoring and publishing systems for user-contributed resources
  • Online exhibitions
  • E-learning systems
  • Usability studies

Title of Project: Generating Executable Cognitive User Models.

Investigators: A/P Theng Yin Leng

To reduce the need for extensive and time-consuming testing of ubiquitous learning systems with real users, a tool is being developed to automatically generate executable cognitive user models that simulate a real user’s behaviour, as a cost-effective means to rapidly iterate and test system designs and detect usability problems in web-based systems. Executable cognitive user models are software agents that simulate real end-users’ behaviour, as well as predict end-users’ performance. The objectives of the project are: 1. To investigate the potential of embedding theories and models of human cognition and artificial intelligence in a tool for constructing executable cognitive user models; 2. To specify the requirements of such a tool for an effective and practical evaluation of web-based systems; 3. To determine how executable cognitive user models can be investigated using software agent technologies throughout the design process; and 4. To investigate how executable cognitive user models can be effectively combined with user testing to achieve the best results.


Title of Project: Design and Development of a Suite of Usability Engineering Tools for Digital Libraries on Mobile Environment and the Web

Investigators: A/P Theng Yin Leng & A/P Dion Goh

Funding: NTU AcRF

Description: Institutions are spending millions of dollars implementing digital libraries (DLs) and Web portals. However, many studies have found the usability and effectiveness of current DLs and portals to be poor. Although some research has been conducted over the last few years on understanding user needs of text-based and geospatial DLs, little work has been done to make the usability evaluation process of DLs less cumbersome and tedious. Better tools and techniques are needed to help DL designers evaluate their systems in ways that will improve usability and enhance users' experience of DL collections and products. This project investigates usability engineering techniques, a combination of qualitative and quantitative techniques, applicable not only to text-based DLs but also to geospatial DLs, on the Web as well as in mobile environments. DLs of universities, public libraries and national libraries have large user populations, numbering in the tens or hundreds of thousands. Improvements in DL design can have a major organisational, national and international impact.

Collaborators: In recognition of its importance, this proposal has the support of the NTU Library and the National Library Board (NLB). Two research centres at NTU, the Centre for Human Factors and Ergonomics (CHFE, MPE) and the Centre for Advanced Computer Information Systems (CAIS, SCE), together with the University of Waikato (New Zealand), are internal and external collaborators working with the project team to exploit the potential of applying this research to the mobile environment, which is fast becoming a popular platform for systems delivering "on-demand" use.


Title of Project: Bootstrapping a Machine Translation Dictionary for Cross-Language Information Retrieval Using A Comparable Corpus

Investigators: A/P Christopher Khoo & A/P Chan Syin (School of Computer Engineering)

Description: In a multilingual information retrieval system (e.g. multilingual Web search engines), cross-language searching capability which permits the user to specify queries in the user's native language but retrieve documents in other languages is essential. Other researchers have developed translation dictionaries for cross-language retrieval by performing statistical analyses of parallel corpora -- document collections in which each document in one language has a sentence-by-sentence translation in a second language. This study aims to develop a method for constructing a translation dictionary in a situation where there is no parallel corpus, but there are nevertheless documents in both languages reporting the same event, e.g. news articles in different language newspapers reporting the same event.

This study seeks to develop a method for bootstrapping English-Chinese and English-Malay translation dictionaries using a training sample of manually paired English-Chinese and English-Malay documents. The system first analyses the set of manually paired English-Other Language articles to construct a preliminary translation dictionary, and then uses this preliminary dictionary to identify other pairs of English-Other Language articles. It then performs "self-learning" by analysing these new pairs of articles to improve its dictionary. Cross-language retrieval experiments will be carried out to test the effectiveness of such a dictionary.
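
The bootstrapping loop described above can be sketched in a few lines. The sketch assumes toy, invented English-Malay "articles" of a few words each, and it scores candidate translations with the Dice coefficient over document co-occurrence; the study's actual statistics and pairing method are not given in the text, so every name and number below is hypothetical.

```python
from collections import Counter
from itertools import product

def build_dictionary(pairs):
    """Map each source word to the target word with the highest Dice score.

    Dice = 2 * (pairs containing both words) / (pairs with source + pairs with target).
    """
    src_df, tgt_df, joint = Counter(), Counter(), Counter()
    for src_doc, tgt_doc in pairs:
        src_words, tgt_words = set(src_doc.split()), set(tgt_doc.split())
        src_df.update(src_words)
        tgt_df.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    best = {}
    for (s, t), both in joint.items():
        dice = 2 * both / (src_df[s] + tgt_df[t])
        if s not in best or dice > best[s][1]:
            best[s] = (t, dice)
    return {s: t for s, (t, _) in best.items()}

def match_score(src_doc, tgt_doc, dictionary):
    """Fraction of source words whose dictionary translation appears in the target."""
    src_words, tgt_words = set(src_doc.split()), set(tgt_doc.split())
    hits = sum(1 for w in src_words if dictionary.get(w) in tgt_words)
    return hits / len(src_words) if src_words else 0.0

# Step 1: toy seed pairs standing in for manually paired news articles.
SEED_PAIRS = [
    ("rain floods city", "hujan banjir bandar"),
    ("rain stops", "hujan berhenti"),
    ("city election", "bandar pilihanraya"),
    ("floods damage", "banjir kerosakan"),
]
DICTIONARY = build_dictionary(SEED_PAIRS)

# Step 2: use the preliminary dictionary to pair a new English article
# with the most likely of several unpaired Malay articles.
NEW_ENGLISH = "rain damage city"
CANDIDATES = ["pilihanraya berhenti", "hujan kerosakan bandar"]
BEST = max(CANDIDATES, key=lambda doc: match_score(NEW_ENGLISH, doc, DICTIONARY))

# Step 3 ("self-learning"): add the new pair and rebuild the dictionary.
DICTIONARY = build_dictionary(SEED_PAIRS + [(NEW_ENGLISH, BEST)])
print(BEST, "|", DICTIONARY["rain"])
```

In practice steps 2 and 3 would repeat over a large comparable corpus, with a confidence threshold on `match_score` so that doubtful pairs do not pollute the dictionary.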


Title of Project: ACRC Digital Library

Investigators: A/P Dion Goh & Prof Schubert Foo

Funding: WSCI and NTU Library

The ACRC plans to transform itself into an important regional hub and one-stop center housing quality resources in the specialized areas of Media, Communication and Information. The purpose of the project is to develop a digital library for the ACRC to host its electronic collection, which includes grey literature and published literature in media, communication and information, and to use the digital library as a platform for conducting research in areas such as information retrieval, information extraction, information organization, data/text mining, collaborative systems, knowledge sharing, etc. The digital library will also provide a platform to support knowledge sharing and publishing by capturing, preserving and communicating the intellectual output of SCI’s faculty staff and researchers. Such a DL system can be further exploited to distribute SCI’s digital works over the Web through a sophisticated search and retrieval system. Availability and easy accessibility of ACRC sources for local, regional and international users would certainly enhance the image of NTU in general, and SCI in particular.


Title of Project: Intelligent Search Agent for Information Extraction and Synthesis on the Web

Investigators: A/P Chris Khoo, A/P Dion Goh & A/P Chan Syin (School of Computer Engineering)

Funding: NTU AcRF

A project to develop a prototype intelligent search agent that performs information extraction and synthesis on the Web. Most Web search engines and intelligent search agents merely identify potentially relevant documents on the Web without actually extracting the relevant information from the text of the documents. Information extraction systems developed so far require large training sets, are usable only by experts and take a long time to train. The study seeks to develop an intelligent information extraction system that can be trained by ordinary users using a small number of examples to extract relevant information from multiple Web sites and integrate the information into a multi-document summary to aid in knowledge discovery and knowledge acquisition.


Title of Project: Automatic Multi-document Summarization of Research Abstracts

Investigators: A/P Chris Khoo, A/P Dion Goh & Dr Paul Wu

Funding: NTU/SCI RCC

The objective of this study is to develop a method for automatic summarization of sets of sociology abstracts that might be retrieved by a digital library system or search engine in response to a user’s query. The purpose of the multi-document summarizer is to present an overall summary of the set of documents, highlighting the important concepts and relations found in them. The method includes an automatic analysis of the discourse structure of sociology abstracts, both at the macro-level (between sentences and sections) and the micro-level (within sentences). The automatic summarizer focuses on the extraction of variables and semantic relationships between variables expressed in the text, and the integration of the extracted information into a coherent summary.


Title of Project: Mining of Disease-Treatment Information in a Medical Database to Support Evidence-Based Medicine

Investigators: A/P Chris Khoo, Ast/P Na Jin Cheon & A/P Chan Syin (School of Computer Engineering)

Funding:

This project seeks to extend automatic information extraction technology and apply it to the medical domain to extract disease-treatment information from medical abstracts to support evidence-based medicine and knowledge discovery. Current information extraction systems make use of linguistic patterns and pattern matching to identify the pieces of information to extract from unstructured text. The extraction patterns are often constructed automatically by applying a supervised learning technique on a set of manually annotated training text. This project seeks to develop a technique to construct the information extraction patterns without manual annotation of text by performing text mining, automatic text annotation and pseudo-supervised learning. The objectives of the project are:

  1. To develop an effective method to mine information extraction patterns in a medical database
  2. To develop a method to construct information extraction patterns using pseudo-supervised learning and automated annotation of training text
  3. To develop a disease-treatment ontology to model and represent treatment information found in medical abstracts, and to summarize the information to support evidence-based medicine.
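
To show what a linguistic extraction pattern looks like in practice, the sketch below applies two hand-written regular-expression patterns to an invented sample abstract. The project aims to construct such patterns automatically; the patterns, the sample text and the `extract_pairs` helper here are illustrative assumptions only.

```python
import re

# Hand-written illustrative patterns; the project aims to learn patterns automatically.
PATTERNS = [
    # "<treatment> is/was effective in/for treating <disease>"
    re.compile(
        r"(?P<treatment>[A-Za-z]+) (?:is|was) effective (?:in|for) treating "
        r"(?P<disease>[a-z0-9 ]+?)[.,;]"
    ),
    # "treatment of <disease> with <treatment>"
    re.compile(r"treatment of (?P<disease>[a-z0-9 ]+?) with (?P<treatment>[A-Za-z]+)"),
]

def extract_pairs(text):
    """Return (disease, treatment) pairs matched by any pattern, in pattern order."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            pairs.append((match.group("disease").strip(), match.group("treatment").lower()))
    return pairs

# Invented sample text, not taken from any medical database.
ABSTRACT = (
    "Metformin was effective in treating type 2 diabetes, with few side effects. "
    "We also review the treatment of hypertension with amlodipine in older adults."
)
print(extract_pairs(ABSTRACT))
# -> [('type 2 diabetes', 'metformin'), ('hypertension', 'amlodipine')]
```

Writing patterns like these by hand does not scale across the many ways treatments are reported, which is the motivation for mining them from the database instead.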

Title of Project: Automatic Identification of News Frames Using Machine-Learning

Investigators: Ast/P Na Jin Cheon & A/P Chris Khoo

Funding: NTU/SCI RCC

This project will develop techniques and a software tool for automatic news frame analysis: automatically analyzing news articles and categorizing them into one of several pre-defined news frames. News frame analysis is a kind of content analysis of news articles to identify how the news is framed, including the perspective from which the events are reported, how information is selected and organized in the news article, and how the information is expressed. News frame analysis is intellectual work usually performed by human analysts. The tremendous number of news articles to be analyzed makes manual news frame categorization a difficult and tedious task. This project thus seeks to develop a method for automatic news frame categorization using machine-learning and text mining techniques.


Title of Project: Automatic Sentiment Analysis & Categorization

Investigators: A/P Chris Khoo & Ast/P Na Jin Cheon

The objective of the project is to develop techniques for automated or computer-assisted sentiment analysis of various genres of text. Sentiment refers to a person’s feeling, emotion or attitude toward a subject, and can cover a variety of emotional dispositions (e.g. anger, admiration, dislike, eagerness, etc.). The appraisal theory (Rothery, 1997; Martin, 1995), which is based on the principles of Systemic Functional Linguistics, is adopted as a framework for the study for its clear explication of how sentiment is expressed in language. It divides appraisal into Attitude, Engagement and Graduation, with Attitude further divided into Affect (emotion), Judgment (ethical/social evaluation) and Appreciation (aesthetic assessment). Current work is focused on:

  • automatic categorization of product reviews into positive (favorable/recommended) versus negative (unfavorable/not recommended) sentiment
  • development of a sentiment meta-search engine to identify documents and document snippets reporting product reviews and categorizing them into positive and negative reviews
  • automatic sentiment analysis of political news articles using a framework based on the appraisal theory.
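
The first item above, positive versus negative categorization of product reviews, can be illustrated with a small bag-of-words classifier. The sketch uses multinomial Naive Bayes with add-one smoothing purely as a familiar baseline; it is not stated to be the project's method. The four training reviews are invented, and the code assumes both classes appear in the training data.

```python
import math
from collections import Counter

def train(labelled_reviews):
    """Collect per-class word counts and document counts."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in labelled_reviews:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the higher log-probability under add-one smoothing."""
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("pos", "neg"):
        total_words = sum(word_counts[label].values())
        # Class prior (assumes every class has at least one training review).
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented training reviews for illustration.
TRAINING = [
    ("great camera and excellent battery", "pos"),
    ("excellent screen highly recommended", "pos"),
    ("poor battery and terrible screen", "neg"),
    ("terrible support not recommended", "neg"),
]
MODEL = train(TRAINING)
print(classify("excellent camera", *MODEL))   # -> pos
print(classify("terrible battery", *MODEL))   # -> neg
```

A word-count model like this ignores negation ("not recommended") and the finer Attitude, Engagement and Graduation distinctions of appraisal theory, which is where the framework-based analysis described above goes further.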

Title of Project: Text Annotation and Encoding Tool for Content and Linguistic Analysis

Investigators: Dr Paul Wu

Funding: NTU/SCI RCC

The purpose of the project is to develop a Web-based software tool and graphical interface to enable researchers to mark-up and annotate text, encode the annotation in an XML format, store the annotation for further processing, and display the annotation in a number of visual formats. The text annotation tool will be designed to be general and powerful enough to handle most types of content analysis and linguistic analysis. The tool will handle several independent layers of annotation, hierarchical annotation (where primitive units are grouped to form more complex units), and overlapping annotations. Such an annotation tool will be useful for many types of research – content analysis, linguistic analysis, text analysis, creating training documents for text mining and information extraction, etc. A powerful text annotation tool, grounded on a good representation formalism, is needed because a deeper level of content analysis involves a deeper level of linguistic coding. The validation of content analysis results also requires evidence presented in linguistic coding.
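
One common way to support independent layers, hierarchical units and overlapping annotations in XML is stand-off annotation, where each annotation refers to the text by character offsets instead of wrapping it in nested tags. The sketch below is a minimal illustration of that idea; the element and attribute names (`document`, `layer`, `span`, `parts`) are hypothetical and do not describe the project's actual encoding.

```python
import xml.etree.ElementTree as ET

def build_annotation_xml(text, annotations):
    """Encode annotations as stand-off XML that points into the text by character offsets."""
    root = ET.Element("document")
    ET.SubElement(root, "text").text = text
    layers = {}
    for ann in annotations:
        layer = layers.get(ann["layer"])
        if layer is None:
            layer = ET.SubElement(root, "layer", name=ann["layer"])
            layers[ann["layer"]] = layer
        attrs = {"id": ann["id"], "type": ann["type"],
                 "start": str(ann["start"]), "end": str(ann["end"])}
        if "parts" in ann:
            # Hierarchical annotation: a complex unit lists its component units.
            attrs["parts"] = " ".join(ann["parts"])
        ET.SubElement(layer, "span", attrs)
    return root

def covered_text(root, span_id):
    """Return the stretch of text that a given annotation refers to."""
    text = root.findtext("text")
    span = root.find(f".//span[@id='{span_id}']")
    return text[int(span.get("start")):int(span.get("end"))]

SENTENCE = "The minister praised the new policy."
ANNOTATIONS = [
    {"layer": "linguistic", "id": "s1", "type": "subject", "start": 0, "end": 12},
    {"layer": "linguistic", "id": "s2", "type": "verb", "start": 13, "end": 20},
    {"layer": "linguistic", "id": "s3", "type": "clause", "start": 0, "end": 35,
     "parts": ["s1", "s2"]},
    # A content-analysis span that overlaps the linguistic units above.
    {"layer": "content", "id": "c1", "type": "positive_evaluation", "start": 13, "end": 35},
]
DOC = build_annotation_xml(SENTENCE, ANNOTATIONS)
print(ET.tostring(DOC, encoding="unicode"))
print(covered_text(DOC, "c1"))   # -> praised the new policy
```

Because spans never nest inside one another in the markup, overlapping annotations from different layers cannot produce ill-formed XML.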


Title of Project: Web Archiving

Investigators: Dr Paul Wu

Funding: National Library Board

The Internet has led to a proliferation of online publications and communities worldwide. Because digital media are fragile, new approaches are needed to preserve this material for future generations, a task that is imperative for capturing a record of contemporary digital culture and heritage. This project develops a digital repository and a Web annotation and cataloguing system for Web archives. By applying an intuitive mechanism, evidence of the subject matter and contextual information will be captured as metadata. The metadata further serves as evidence for monitoring substantial changes to websites. In sum, the objectives of the project are twofold:
  • Evidence-based cataloguing: allow users to catalogue website collections effectively for archival and preservation purposes, reducing the turn-around time and producing verifiable catalogue data/metadata, thus increasing the quality of the catalogue.
  • Dynamic Web content monitoring: minimize the manual effort required to maintain the catalogue data/metadata of the Web archives by automatically verifying and monitoring changes to websites, filtering out trivial changes and raising alerts only for substantive ones that need attention.
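
A simple way to separate trivial from substantive changes is to compare the visible text of two snapshots of a page and raise an alert only when their similarity drops below a threshold. The sketch below does this with Python's standard `difflib` module; the 0.9 threshold, the crude tag stripping and the sample pages are assumptions for illustration, not the project's actual monitoring method.

```python
import difflib
import re

def visible_text(html):
    """Crudely strip tags and collapse whitespace (a real system would use an HTML parser)."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())

def is_substantive_change(old_html, new_html, threshold=0.9):
    """Flag a change when the similarity of the visible text falls below the threshold."""
    matcher = difflib.SequenceMatcher(None, visible_text(old_html), visible_text(new_html))
    return matcher.ratio() < threshold

# Invented snapshots of an archived page.
OLD = "<html><body><h1>Annual Report</h1><p>Published by the library in 2005.</p></body></html>"
RESTYLED = ("<html><body><h1 class='t'>Annual Report</h1>\n"
            "<p>Published by the library in 2005.</p></body></html>")
REPLACED = "<html><body><h1>Site Closed</h1><p>This page has moved to a new address.</p></body></html>"

print(is_substantive_change(OLD, RESTYLED))   # -> False (markup changed, text did not)
print(is_substantive_change(OLD, REPLACED))   # -> True
```

Comparing against the text recorded at cataloguing time also gives a cataloguer a concrete reason for each alert, in keeping with the evidence-based approach described above.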


Postgraduate Student Projects

M.A.Sc. and Ph.D. Projects

Automatic Sentiment Analysis of News Articles
Student: Armineh Nourbaksh (M.A.Sc. student)
Supervisor: A/P Christopher Khoo & Ast/P Na Jin Cheon

Automatic Information Extraction and Text Mining in Medical Abstracts
Student: Wang Wei (M.A.Sc. student)
Supervisor: A/P Christopher Khoo & Ast/P Na Jin Cheon

Concept-based Information Retrieval
Student: Yin Ming
Supervisor: A/P Dion Goh and A/P Lim Ee Peng
 
Completed Theses

Collaborative Querying through the Mining of Query Logs
Student: Fu Lin (PhD, 2006)
Supervisor: A/P Dion Goh and Prof Schubert Foo

Automatic Multi-Document Summarization Using a Variable-Based Framework
Student: Ou Shiyan (PhD, June 2006)
Supervisor: A/P Christopher Khoo and A/P Dion Goh

An Intelligent Monitoring Service for Web Monitoring
Student: Tan Bing (M.A.Sc., 2001)
Supervisor: Prof Schubert Foo

Chinese Text Segmentation for Information Retrieval
Student: Li Hui (M.A.Sc., 2000)
Supervisor: Prof Schubert Foo

Developing a New Statistical Method for Chinese Text Segmentation
Student: Dai Yubin (M.A.Sc., 2000)
Supervisor: A/P Christopher Khoo

Automatic Extraction of Cause-Effect Information from Medical Abstracts
Student: Niu Yun (M.A.Sc., 2000)
Supervisor: A/P Chan Syin & A/P Christopher Khoo

Combining Multiple Sources of Evidence for Information Retrieval
Student: Xi Wensi (M.A.Sc., 2000)
Supervisor: A/P Lim Ee Peng & A/P Christopher Khoo

Enhancing Play-out Performance for Internet Video Communications
Student: Yip See Wai (M.Phil., May 1999)
Supervisor: Prof Schubert Foo

Chinese Text Retrieval System
Student: Lim Hong Koon (M.Phil., May 1999)
Supervisor: Prof Schubert Foo

An Intelligent Web-based Helpdesk for Customer Service Support
Student: Liu Shigong (M.A.Sc., May 1999)
Supervisor: Prof Schubert Foo

Evaluation of Web-Based Online Catalogue Interfaces: A Cognitive Approach
Student: Cheng Lu (M.A.Sc., May 1999)
Supervisor: A/P Christopher Khoo