Text summarisation in progress: a literature review

Introduction

Overview of Text Summarization

Several taxonomies have been proposed to categorize the factors that influence summaries. One well-known taxonomy, introduced by Spärck Jones (1999), considers three classes of context factors: input, purpose, and output. Input factors pertain to characteristics of the source text, such as genre, language, and register. Purpose factors concern what the summary is for, and output factors concern the form of the summary itself.

Another taxonomy, suggested by Mani and Maybury (1999), classifies summarization systems by the level at which they process the text when generating summaries: surface, entity, or discourse.

A common challenge in building such taxonomies is that many systems combine characteristics from several categories, so they cannot be assigned cleanly to a single class.

Common Approaches to Summary Generation

The summarization process typically involves three main subtasks: topic identification, topic interpretation, and summary generation. However, many approaches focus only on the first two stages and produce extracts, that is, summaries built from sentences taken directly from the documents.

Statistical-Based Approaches

These approaches identify the main topic of a document by analyzing the frequency of words. Sentences are then scored based on the weights assigned to words, determining their relevance.
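The frequency-based scoring described above can be sketched as follows. This is a minimal illustration rather than any specific published system: the tiny stopword list, the averaging of word weights, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "on", "it"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(sentences):
    """Score each sentence by the average document frequency of its content words."""
    words = [w for s in sentences for w in tokenize(s) if w not in STOPWORDS]
    freq = Counter(words)
    scores = []
    for s in sentences:
        content = [w for w in tokenize(s) if w not in STOPWORDS]
        score = sum(freq[w] for w in content) / len(content) if content else 0.0
        scores.append(score)
    return scores


def extract(sentences, k=1):
    """Return the k highest-scoring sentences in their original order."""
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "Summarization systems rank sentences.",
    "Frequent words make sentences rank higher.",
    "The weather was pleasant.",
]
summary = extract(sentences, k=1)
```

Averaging over the content words, instead of summing, keeps long sentences from winning purely because of their length; other normalizations are equally plausible.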

Topic-Based Approaches

Relevance is determined based on the phrases or words contained in a sentence. Phrases like "in conclusion" or "the aim of this paper" may indicate relevant information.
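A cue-phrase scorer of this kind might look like the sketch below. The two phrases are the ones quoted above; the weights, the function name, and the sample sentences are hypothetical and only serve the illustration.

```python
# Cue phrases taken from the text above; the weights are hypothetical.
CUE_PHRASES = {
    "in conclusion": 2.0,
    "the aim of this paper": 2.0,
}


def cue_score(sentence, cues=CUE_PHRASES):
    """Sum the weights of all cue phrases that occur in the sentence."""
    lowered = sentence.lower()
    return sum(weight for phrase, weight in cues.items() if phrase in lowered)


sentences = [
    "The aim of this paper is to review summarization methods.",
    "We collected documents from several sources.",
    "In conclusion, the methods differ in their input requirements.",
]
scores = [cue_score(s) for s in sentences]
# Sentences containing at least one cue phrase are treated as relevant.
relevant = [s for s, sc in zip(sentences, scores) if sc > 0]
```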

Graph-Based Approaches

Graph nodes represent text elements, while edges connect these elements. This method is effective for representing relationships between words or sentences.
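As a rough sketch of the idea, the code below treats sentences as nodes, weights each edge by word overlap, and ranks nodes with a PageRank-style power iteration. The text does not name a particular similarity measure or ranking algorithm, so the Jaccard overlap, the damping factor, and all function names are assumptions.

```python
import re


def tokens(sentence):
    """Return the set of lowercase alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def build_graph(sentences):
    """Return a symmetric matrix of Jaccard word-overlap weights between sentences."""
    toks = [tokens(s) for s in sentences]
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = toks[i] | toks[j]
            w = len(toks[i] & toks[j]) / len(union) if union else 0.0
            weights[i][j] = weights[j][i] = w
    return weights


def rank_nodes(weights, damping=0.85, iterations=50):
    """Run a PageRank-style power iteration over the weighted graph."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional to the edge weight.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


sentences = [
    "Graphs model relations between sentences.",
    "Edges link sentences that share words.",
    "Shared words create edges between sentences.",
    "Lunch was served at noon.",
]
scores = rank_nodes(build_graph(sentences))
best = max(range(len(sentences)), key=lambda i: scores[i])
```

In this toy input the third sentence shares the most vocabulary with the others and so receives the highest score, while the unrelated last sentence, which has no edges, receives the lowest.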

Discourse-Based Approaches

Cohesion and coherence pose significant challenges for text summarization. Techniques such as lexical chains, which identify sequences of semantically related words, help detect the main topics of a document.

Machine Learning-Based Approaches

These approaches, including neural networks like RankNet, require large training corpora to yield conclusive results. Typically, the corpus consists of human-written summaries or annotated source documents indicating important sentences.

Relevant Conferences and Workshops

Text Analysis Conference (TAC) is a notable event in the field of text summarization.

Text Summarization in the Current Context

New types of summaries have emerged in recent years, including personalized, update, and sentiment-based summaries. These summaries are tailored directly to user requirements, allowing individuals to specify the type of information they seek.

Personalized Summaries:

These summaries deliver content that aligns with a user's specific interests. By selecting sentences most relevant to a user model, they can be customized to reflect personal preferences. This could include information from a person's online presence, such as their personal webpage or digital publications.
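One simple way to picture the selection step is to represent the user model as a set of weighted interest terms and to rank sentences by how strongly they match it. The sketch below assumes exactly that; the dictionary-based user model, the weights, and the function names are hypothetical.

```python
import re


def tokens(text):
    """Return the lowercase alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def personalize(sentences, user_model, k=1):
    """Return the k sentences with the highest total interest weight, in source order."""

    def relevance(sentence):
        return sum(user_model.get(w, 0.0) for w in tokens(sentence))

    ranked = sorted(
        range(len(sentences)), key=lambda i: relevance(sentences[i]), reverse=True
    )
    return [sentences[i] for i in sorted(ranked[:k])]


# Hypothetical user model: interest terms mapped to weights.
user_model = {"football": 1.0, "league": 0.5}
sentences = [
    "The city council approved a new budget.",
    "The football league announced its schedule.",
    "Rain is expected over the weekend.",
]
personal_summary = personalize(sentences, user_model, k=1)
```

In practice the interest terms could be derived from the sources mentioned above, such as a personal webpage, rather than written by hand.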

Update Summaries:

Designed for users who already have background knowledge of a topic, update summaries prioritize recent events related to that subject. They define a history, that is, the information the user is assumed to know already, and exclude sentences that are similar to it, so that the summary focuses on the latest information.
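The history-based filtering can be sketched as a novelty check: a candidate sentence is kept only if it is not too similar to anything in the history. The Jaccard similarity, the 0.5 threshold, and the function names below are assumptions chosen for illustration.

```python
import re


def tokens(sentence):
    """Return the set of lowercase alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def novel_sentences(history, candidates, threshold=0.5):
    """Keep candidates whose similarity to every already-known sentence is below threshold."""
    seen = [tokens(h) for h in history]
    kept = []
    for sentence in candidates:
        toks = tokens(sentence)
        if all(jaccard(toks, h) < threshold for h in seen):
            kept.append(sentence)
            seen.append(toks)  # kept sentences also count as known from now on
    return kept


history = ["The bridge closed on Monday for repairs."]
candidates = [
    "The bridge closed for repairs on Monday.",
    "Engineers expect the bridge to reopen on Friday.",
]
update = novel_sentences(history, candidates)
```

The first candidate repeats the history word for word in a different order and is dropped; the second shares only a few words with it and is kept.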

Sentiment-Based Summaries:

Before generating these summaries, opinions are detected and classified based on subjectivity and polarity. By comparing sentence sentiment with entity sentiment, these summaries select content that closely aligns with user preferences.

Survey Summaries:

Survey summaries offer a comprehensive overview of a given topic or entity.

Abstractive Summaries:

Most approaches have traditionally adopted an extractive paradigm, producing summaries composed of relevant sentences taken from the source. Abstractive summaries aim to overcome the limitations of extraction by generating new text from the relevant fragments that have been identified. Methods such as sentence compression, sentence fusion, and natural language generation are used for this purpose.

New Scenarios for Text Summarization:

Beyond traditional domains like newswire and scientific documents, text summarization now extends to literary texts, patent claims, image captioning, and Web 2.0 textual genres. In literary text summarization, summaries can be objective or interpretative, capturing events or the author's ideas, respectively. For patent claims and image captioning, summaries are tailored to the unique characteristics of the content. Additionally, sentiment analysis plays a crucial role in summarizing content from Web 2.0 genres, reflecting the evolving landscape of text summarization.

Combining Text Summarization with Intelligent Systems

The most common approach to integrating Information Retrieval (IR) and Text Summarization (TS) involves retrieving documents related to a topic and generating a summary based on these documents.

Latent Semantic Analysis (LSA) is one technique used in this setting. For example, the QCS system integrates an IR module that retrieves documents from a static collection rather than directly from the Internet.

Combining Text Summarization with Question Answering

Another integration involves combining text summarization with question answering systems.

Combining Text Summarization with Text Classification

Text Classification (TC), also known as text categorization, automatically sorts documents into predefined categories. In the context of web pages, a summary retains the relevant information from a page and discards the rest, which reduces noise and aids the classification process.

Text Summarization for Noise Filtering

Summaries can also be used for filtering noise in large datasets, helping to identify and focus on essential information.

Text Summarization Evaluation

Evaluation methods for text summarization can be broadly categorized as intrinsic or extrinsic. Intrinsic evaluation assesses the summary itself, in terms of its quality or informativeness, while extrinsic evaluation measures how well the summary supports another task, such as retrieval or classification.

Informativeness Evaluation

Evaluation metrics such as precision, recall, and F-measure are adapted to assess summary informativeness by comparing system-generated summaries to human-written ones.
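As an illustration of how these metrics are adapted, the sketch below computes precision, recall, and F-measure from the unigrams shared by a system summary and a single human-written reference. Using unigrams as the unit of comparison and a balanced F-measure are assumptions; the function names are hypothetical.

```python
import re
from collections import Counter


def unigrams(text):
    """Count the lowercase alphabetic tokens of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def overlap_scores(system_summary, reference_summary):
    """Return (precision, recall, F1) based on clipped unigram overlap."""
    sys_counts = unigrams(system_summary)
    ref_counts = unigrams(reference_summary)
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((sys_counts & ref_counts).values())
    sys_total = sum(sys_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / sys_total if sys_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = overlap_scores(
    "the cat sat on the mat",
    "the cat lay on the mat today",
)
```

Here five of the six system words also appear in the seven-word reference, so precision is 5/6, recall is 5/7, and the F-measure is 10/13.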

Quality Evaluation

Quality evaluation criteria cover linguistic aspects such as grammaticality, non-redundancy, referential clarity, focus, and structure and coherence. These criteria assess the overall quality of a summary without comparing it to a reference summary.

Limitations of Existing Evaluation Methods

Current evaluation methods mostly focus on intrinsic evaluation, and there is a need for improved methods for assessing extrinsic factors.

Future Directions in Text Summarization

Future advancements in text summarization will focus on multi-document and multilingual summarization to address the abundance of information across various languages. Additionally, there is a growing need to explore summarization beyond traditional text inputs, such as meetings or videos, and to present outputs in formats other than text.

Reference

For exclusively academic purposes, this document is an extract from the source:

Lloret, E.; Palomar, M. (2012). “Text summarisation in progress: a literature review”. Artificial Intelligence Review, 37, 1-41. https://doi.org/10.1007/s10462-011-9216-z
