September 18, 2022
Our new open-source model is here: from table to text with Narratable
By Sofía Sánchez González
When we talk about Natural Language Processing (NLP), we’re used to the term data-to-text. But at Narrativa we’ve taken it to another level: we’ve created the first BLOOM-based open-source model that converts data from a table into text. Introducing Narratable.
How does Narratable work?
It’s straightforward. Let’s imagine, for example, that we have a Wikipedia table with structured data on Leo Messi’s performance in the Champions League. There are several rows and columns indicating the number of games played, the yellow and red cards, the number of minutes on the pitch, the goals he has scored, and so on.
Now let’s suppose that we would like to convert some of these figures into a paragraph that explains the table in an easy and simple way. Remember that our model describes a set of adjacent cells and rows. Then, we just give instructions to our model: “Narratable, I want you to tell me how many minutes, goals and cards Messi has achieved in the Champions League.”
Our model will build a sentence around these data points with the following output: Messi has played 15,356 minutes in the Champions League and has scored 126 goals. He has also obtained 1 red and 4 yellow cards.
If we only want the number of goals, we say so, and our model will generate a single sentence referring to the goals.
Narratable can be very useful for companies with a large amount of data and tables. It would be much quicker for employees to locate information and present it in a written report.
The origin of the model
Our innovation team, led by Manuel Romero, has achieved this feat after overcoming many obstacles. At first, to create Narratable, we fine-tuned a T5 model. But there was a problem: T5 models can only encode 512 tokens. This is a big limitation since, as you can imagine, most tables, once linearized and encoded, can run to hundreds or even thousands of tokens.
If Narratable could only read 512 tokens, it would lack context and wouldn’t answer our questions. :(
BLOOM to the rescue
What was the solution? To use BLOOM, the largest open source language model to date. Here you can find more information about how it works and what it took to train it.
BLOOM is a generative and multilingual model, perfect for Narratable. We use a scaled-down version of BLOOM with 0.5 billion parameters, a size that is manageable on most hardware. In addition, it can encode 2,048 context tokens, four times what T5 gave us. This lets the model take in far more of the table's rows and columns at once, which gives it much greater context.
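To make the difference between the two limits concrete, here is a rough sketch. It assumes a made-up table of 100 rows with 6 cells each, and it counts whitespace-separated pieces as a crude stand-in for tokens. Real subword tokenizers usually produce more tokens than this, so treat the numbers as an optimistic illustration, not a measurement of Narratable.

```python
# Context limits mentioned in this post.
T5_LIMIT = 512
BLOOM_LIMIT = 2048


def rough_token_count(text: str) -> int:
    """Crude proxy: count whitespace-separated pieces.

    This is NOT a real tokenizer; subword tokenizers typically
    return a higher count for the same text.
    """
    return len(text.split())


def fits(text: str, limit: int) -> bool:
    """Return True if the rough token count is within the limit."""
    return rough_token_count(text) <= limit


# Hypothetical linearized table: 100 identical rows of 6 cells each.
row = "row: " + " | ".join(f"cell{i}" for i in range(6))
table_text = " ".join([row] * 100)

print(rough_token_count(table_text))
print(fits(table_text, T5_LIMIT), fits(table_text, BLOOM_LIMIT))
```

Even with this generous estimate, the example table is more than twice the size a 512-token model can read, while it fits comfortably in 2,048.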
What dataset was used in the training?
A dataset of 121,000 training examples was used to train Narratable; we have created a kind of markup language to teach it to distinguish between rows and columns in tables and what information in the table we wanted to use. Once we defined that process, we took all the training set tables, linearized them (converted them to text) and voilà! We have our Narratable model!
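What does linearizing a table look like? Here is a minimal sketch of the idea. The tag names (table, row, cell, highlight) and the function linearize_table are hypothetical, chosen only for illustration; they are not the markup we actually trained Narratable with. The table reuses the Messi figures from the example above.

```python
from typing import Sequence, Set, Tuple


def linearize_table(
    header: Sequence[str],
    rows: Sequence[Sequence[str]],
    highlighted: Set[Tuple[int, int]],
) -> str:
    """Flatten a table into a single tagged string.

    The tag names are hypothetical, not Narratable's actual markup.
    `highlighted` holds (row_index, column_index) pairs for the cells
    the generated text should talk about.
    """
    parts = ["<table>"]
    for r, row in enumerate(rows):
        parts.append("<row>")
        for c, value in enumerate(row):
            # Mark the cells we want described; leave the rest as context.
            tag = "highlight" if (r, c) in highlighted else "cell"
            parts.append(f"<{tag}> {header[c]}: {value} </{tag}>")
        parts.append("</row>")
    parts.append("</table>")
    return " ".join(parts)


header = ["Competition", "Minutes", "Goals", "Yellow cards", "Red cards"]
rows = [["Champions League", "15,356", "126", "4", "1"]]

# Ask only about goals: highlight row 0, column 2.
linearized = linearize_table(header, rows, {(0, 2)})
print(linearized)
```

The row and cell tags keep the table structure visible in plain text, and the highlight tag plays the role of telling the model which information we want it to use.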
What languages does Narratable work in?
At the moment, just in English. It’s the language with which we have trained Narratable, but we have not ruled out expanding into other languages such as Spanish.
Where can I try it?
As you know, Narrativa has a Hugging Face profile where we upload all our open source models. You can try and use them freely. Click on this link and you’ll be able to see all the possibilities offered by Narratable.
In the model card we have included more instructions, including a video in case you have any questions.
How does it relate to regulatory submissions?
Regulatory submissions, like clinical study reports (CSRs), are based on millions of data points and inputs that could easily involve thousands of patients. Using NLP to create text from tables could help medical writers produce patient safety narratives or describe Tables, Listings, and Figures (TLFs) within a few minutes and with minimal effort. So, what does this translate into?
Simply put, medical writers will have more time to focus on tasks that require critical thinking rather than repetitive ones. It also means faster regulatory submission times and reduced costs, as reviews, quality checks, and verifications will be minimized or eliminated.
About Narrativa
Narrativa is an internationally recognized content services company that uses its proprietary artificial intelligence and machine learning platforms to build and deploy digital content solutions for enterprises. Its technology suite consists of data extraction, data analysis, natural language processing (NLP) and natural language generation (NLG) tools that work together seamlessly to power a lineup of smart content creation, automated business intelligence reporting and process optimization products for a variety of industries.
Contact us to learn more about our solutions!