About TAPAS


"TAPAS"是Google Research开发的一个表格解析模型。 "TAPAS"是"TAble PArSing"的首字母缩写,表明该模型的专长。TAPAS独特之处在于,它被设计为理解表格作为一种结构化的数据形式,并且能够执行需要结合自然语言理解和表格数据推理的任务。

"TAPAS"基于BERT(来自Transformers的双向编码器表示)架构,这是Google开发的一种流行且非常有效的语言模型。像BERT一样,TAPAS是一个基于Transformer的模型,但有所不同:它将表格视为一种语言。TAPAS按单元格、行逐行阅读整个表格,包括列标题,然后将单元格的内容与所提出的问题关联起来,使TAPAS能够执行需要理解表格内容的任务。

TAPAS can be used for a variety of table-related tasks, including but not limited to:

  • Table-based question answering: Given a natural language question referring to a table, TAPAS can select the cell or cells that contain the answer.
  • Table filling: Given a table with some missing cells, TAPAS can predict the missing values.
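  • Table summarization: TAPAS can support generating a text summary of the table's content; because TAPAS is an encoder-only model, the text itself would come from a separate generation component.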

The figure below shows the TAPAS architecture.

[Figure: TAPAS architecture]

TAPAS for Enterprises


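Given its ability to understand and reason over tabular data, TAPAS may be useful in many business scenarios that involve tables, such as financial statements, product catalogs, and project management schedules. The following are some possible applications of TAPAS in an enterprise setting: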

  • Data analysis and reporting: TAPAS can be used to extract insights from financial reports, sales data, customer demographics, and other types of business data that are often represented in tabular form. For instance, you could ask TAPAS questions like "Which product had the highest sales last quarter?" or "What is the total revenue generated from Region X?"
  • Customer support: TAPAS can be employed to pull data from tables in a database to answer customer queries. For example, a customer might ask, "When is my product due for delivery?" TAPAS could then find the relevant delivery schedule and provide the requested information.
  • Automated document processing: Many business documents, such as invoices, contracts, and technical specifications, contain tables. TAPAS can extract and process this information, making document processing more efficient.
  • Business intelligence: TAPAS can be used to build more intuitive business intelligence tools. Users can ask natural language questions about their data, and TAPAS can parse the tables in the database to provide the answers.
  • Data quality management: TAPAS can potentially be used to detect errors or anomalies in tabular data, by checking for inconsistencies between related cells or identifying cells that do not fit the expected patterns.
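A question such as "What is the total revenue generated from Region X?" needs more than cell selection: the selected cells have to be combined. TAPAS checkpoints fine-tuned for aggregation predict an operator (NONE, SUM, AVERAGE, or COUNT) alongside the selected cells, and the final number is computed outside the model. The sketch below shows only that last step; the function name `apply_aggregation` is hypothetical, and the operator and cells are hard-coded, made-up stand-ins for model output.

```python
# Hypothetical post-processing step: the operator and the selected cells are
# hard-coded stand-ins for what a fine-tuned model would predict.

def apply_aggregation(operator, cells):
    """Combine selected cell strings according to an aggregation operator."""
    if operator == "NONE":
        return ", ".join(cells)
    if operator == "COUNT":
        return len(cells)
    # Strip thousands separators before converting to numbers.
    values = [float(cell.replace(",", "")) for cell in cells]
    if operator == "SUM":
        return sum(values)
    if operator == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"unknown operator: {operator}")


# "What is the total revenue generated from Region X?"
selected = ["1,200", "800", "500"]  # made-up revenue cells for Region X
print(apply_aggregation("SUM", selected))  # prints 2500.0
```

Keeping the arithmetic outside the model means the result can be audited against the selected cells, which matters for reporting use cases.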

TAPAS has been fine-tuned on several datasets:

  • SQA: Sequential Question Answering, by Microsoft
  • WTQ: WikiTableQuestions, by Stanford University
  • WikiSQL: by Salesforce

General Steps of a TAPAS Solution


The general process may involve several steps:

  • Data extraction: First, the relevant information from the tables needs to be extracted and prepared for processing. This could involve deciding which columns or rows are relevant to the text to be generated.
  • Text generation: The extracted data is then processed by the pre-trained language model, which generates the corresponding text. The model might take the table's headers and data values as inputs and generate a sentence or paragraph that accurately represents that information in natural language.
  • Fine-tuning: Often, a pre-trained model is fine-tuned on a specific task to optimize its performance. For table-to-text generation, the model could be fine-tuned on a dataset of tables and corresponding text descriptions.
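The data extraction step above can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `linearize` and the "name: value" output format are assumptions, and the language model that would consume the result is not shown.

```python
# Hypothetical data-extraction helper: keeps the relevant columns and turns
# each row into a line of "header: value" pairs for a text-generation model.

def linearize(header, rows, keep_columns):
    """Keep only the chosen columns and render each row as 'name: value' pairs."""
    indices = [header.index(column) for column in keep_columns]
    lines = []
    for row in rows:
        lines.append("; ".join(f"{header[i]}: {row[i]}" for i in indices))
    return "\n".join(lines)


header = ["Actors", "Age", "Number of movies"]
rows = [["Brad Pitt", 56, 87], ["George Clooney", 59, 69]]
print(linearize(header, rows, ["Actors", "Number of movies"]))
```

Pairing each value with its header keeps the column meaning attached to the data once the table structure is gone.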

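Pre-trained models have shown remarkable results across a range of tasks, and their application to table-to-text generation is promising. Challenges remain, however, such as generating text that accurately represents complex tables, or handling tables with missing or erroneous data.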


Fine-Tuning with a Business Dataset


The general process for fine-tuning with a business dataset may include the following steps:

  • Step 1: Choose one of the three ways in which a business can use TAPAS
    [Figure: the three ways of using TAPAS]
  • Step 2: Prepare the data in the SQA format
  • Step 3: Convert the data into tensors using TapasTokenizer
  • Step 4: Train (fine-tune) the model
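Steps 2 to 4 can be sketched with the Hugging Face `transformers` library. Several points are assumptions rather than anything stated above: the column names follow the published SQA TSV files, coordinates and answers are stored as the string form of Python lists, the helper names (`sqa_row`, `write_sqa_tsv`, `training_step`) are hypothetical, and `google/tapas-base` is used as the starting checkpoint. `training_step` needs `transformers`, `torch`, and `pandas` installed and downloads model weights, so it is defined here but not called.

```python
import csv
import io

# Column layout assumed to match the SQA TSV files.
SQA_COLUMNS = ["id", "annotator", "position", "question",
               "table_file", "answer_coordinates", "answer_text"]


def sqa_row(example_id, position, question, table_file, coordinates, answers):
    """Step 2: build one SQA-style record.

    `coordinates` are (row, column) pairs counted from 0, header row excluded.
    """
    return {
        "id": example_id,
        "annotator": 0,
        "position": position,
        "question": question,
        "table_file": table_file,
        "answer_coordinates": str(coordinates),
        "answer_text": str(answers),
    }


def write_sqa_tsv(records):
    """Serialize records as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SQA_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def training_step(table, question, coordinates, answers):
    """Steps 3 and 4 for a single example (one forward and backward pass)."""
    import pandas as pd
    from transformers import TapasForQuestionAnswering, TapasTokenizer

    name = "google/tapas-base"  # assumed starting checkpoint
    tokenizer = TapasTokenizer.from_pretrained(name)
    model = TapasForQuestionAnswering.from_pretrained(name)
    # Step 3: TapasTokenizer expects a DataFrame whose cells are all strings.
    inputs = tokenizer(
        table=pd.DataFrame(table).astype(str),
        queries=[question],
        answer_coordinates=[coordinates],
        answer_text=[answers],
        padding="max_length",
        return_tensors="pt",
    )
    # Step 4: the tokenizer output includes `labels`, so the model returns a loss.
    outputs = model(**inputs)
    outputs.loss.backward()  # an optimizer step would follow in a real loop
    return outputs.loss.item()


record = sqa_row("ex-0", 0, "How many movies has George Clooney played in?",
                 "actors.csv", [(2, 2)], ["69"])
print(write_sqa_tsv([record]))
```

A real fine-tuning run would batch many such examples, wrap them in a dataset and data loader, and repeat the forward and backward pass with an optimizer over several epochs.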

Example


TAPAS is designed to interpret tabular data, structured in a format similar to a spreadsheet or database table, and to answer questions about it.

Actors               Age   Number of movies
Brad Pitt            56    87
Leonardo Di Caprio   45    53
George Clooney       59    69

The following are results from TAPAS:

How many movies has George Clooney played in? -- 69
How old is Brad Pitt? -- 56
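Queries like these can be run with the `table-question-answering` pipeline from the Hugging Face `transformers` library. The sketch below is one possible setup, not necessarily the one used to produce the results above: the checkpoint `google/tapas-base-finetuned-wtq` and the helper names `rows_to_columns` and `ask` are assumptions. Calling `ask` requires `transformers`, `torch`, and `pandas` and downloads model weights, so the call is left commented out.

```python
def rows_to_columns(header, rows):
    """Convert row lists into a dict of string columns, a shape the pipeline accepts."""
    return {name: [str(row[i]) for row in rows] for i, name in enumerate(header)}


def ask(table, queries, model_name="google/tapas-base-finetuned-wtq"):
    """Answer each query against the table with a fine-tuned TAPAS checkpoint."""
    from transformers import pipeline  # imported lazily: heavy dependency

    tqa = pipeline("table-question-answering", model=model_name)
    return [tqa(table=table, query=query)["answer"] for query in queries]


header = ["Actors", "Age", "Number of movies"]
rows = [
    ["Brad Pitt", 56, 87],
    ["Leonardo Di Caprio", 45, 53],
    ["George Clooney", 59, 69],
]
table = rows_to_columns(header, rows)

# Uncomment to query the model (downloads weights on first use):
# answers = ask(table, ["How many movies has George Clooney played in?",
#                       "How old is Brad Pitt?"])
```

With a checkpoint that predicts aggregation, the returned answer string may carry an operator prefix in addition to the selected cell values.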

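By enabling language models to understand and interact with structured tabular data, which makes up an important share of the information in the business world, TAPAS extends what language models can do.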