The explosion in popularity and accessibility of large language models (LLMs) has opened up exciting possibilities for reducing human effort in manual, repetitive processing tasks. In this series, we'll dive into how several foundation models tackle both simple and complex, domain-specific natural language processing tasks like summarization, segmentation, and classification of customer conversations. We'll explore their potential and shortcomings in solving these specialized challenges by benchmarking several proprietary and open-weight models: Anthropic's Claude Haiku, Claude Sonnet, and Claude Sonnet 3.5; Meta's Llama 3-70B Instruct; and Mistral AI's Mixtral 8x7B Instruct.
Summarization: Distilling the Essence
The summarization task involved translating complex, often grammatically incorrect inputs into concise English while trimming redundancies and unnecessary filler words. Models were instructed to retain the original tone of the input while keeping as much relevant information (defined by the use case) as possible. For this task, we used few-shot prompting to show the LLMs a handful of desired outcomes, and each model received its own fine-tuned prompt to optimize its outputs. In general, the models handled summarization effectively: they were able to use context to expand acronyms, correct misspelled words, and handle domain-specific language.
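To make the few-shot setup concrete, here's a minimal sketch of how such a prompt can be assembled. The instruction text and example pairs below are illustrative stand-ins we wrote for this sketch, not the actual prompts used in the benchmark:

```python
# Illustrative few-shot prompt assembly. The instruction and example
# pairs are hypothetical, not the benchmark's real prompts.

FEW_SHOT_EXAMPLES = [
    {
        "input": "pkg arrived late & box was dmged, want replacment asap!!",
        "summary": (
            "The customer's package arrived late and the box was damaged; "
            "they are requesting a replacement as soon as possible."
        ),
    },
    {
        "input": "cant login 2 my acct since yesterday pls fix",
        "summary": (
            "The customer has been unable to log in to their account since "
            "yesterday and is asking for a fix."
        ),
    },
]

INSTRUCTION = (
    "Summarize the customer message in concise English. Preserve the "
    "original tone, expand acronyms, correct misspellings, and keep all "
    "relevant details (amounts, order numbers, dates)."
)

def build_few_shot_prompt(message: str) -> str:
    """Combine the instruction, example pairs, and target message into one prompt."""
    parts = [INSTRUCTION, ""]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Input: {ex['input']}")
        parts.append(f"Summary: {ex['summary']}")
        parts.append("")  # blank line between examples
    parts.append(f"Input: {message}")
    parts.append("Summary:")
    return "\n".join(parts)

prompt = build_few_shot_prompt("I ordered two bags but only got one refund.")
```

In practice, the resulting string (or an equivalent message list) is sent to each provider's chat API, with per-model wording tweaks as described above.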
Let's break down how each model performed:
Claude Haiku: The Little Engine That Could
As one of the simpler models in our lineup, Haiku surprised us! Its summaries were readable and captured most of the key information we were after. This suggests a good balance between compression and information retention.
Claude Sonnet: The Wordsmiths
Both Sonnet 3 and Sonnet 3.5 excelled at summarization. Responses from these models were the most verbose, as they retained the most information from the original input. This can be attributed to a greater capacity to understand and retain context, leading to more detailed summaries.
Llama3: The Minimalist
Summaries from Llama3 were the most concise. The model left out the most information but captured the main ideas of the input fairly well. However, we didn't find it optimal for tasks that required more detailed summary outputs, which might suggest a stronger bias toward compression.
Mixtral: The Wild Card
Responses from the Mixtral model fell short of the others. Across a large dataset of inputs, Mixtral's responses occasionally included its own commentary (hello, hallucination!) and sometimes failed to remove extraneous information. This could be due to the model overfitting to certain patterns in its training data or struggling to distinguish summarization from open-ended generation. The failure to remove extraneous information could also indicate less effective attention over the context.
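When commentary leaks into summaries at scale, a cheap post-processing check can flag offending outputs for review. The heuristic below is a sketch we're adding for illustration (the marker phrases are our own guesses, not ones drawn from the benchmark data):

```python
# Hypothetical heuristic for flagging summaries that contain model
# commentary instead of pure summary text. Marker phrases are examples.

COMMENTARY_MARKERS = (
    "here is a summary",
    "here's a summary",
    "i have summarized",
    "as an ai",
    "hope this helps",
)

def has_model_commentary(summary: str) -> bool:
    """Return True if the summary appears to contain meta-commentary."""
    text = summary.lower()
    return any(marker in text for marker in COMMENTARY_MARKERS)
```

A check like this won't catch every hallucinated aside, but it's a cheap first-pass filter before routing flagged outputs to human review or a retry prompt.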
A Real-World Example
To put these models through their paces, we fed them customer feedback from retail stores. Here's a taste of what we got:
Input
"I ordered two Louis Vuitton bags on XXXX one for $1139 and one for $799 it was order number FG0H1299 and then I returend both of them but only got a refund of $799. When I called into the support serivce the customer service was horrible and they made me sound like a criminal. Just extremely unprofessional there should really be a more efficient system for processing refusnds than this. and I need my refund NOW. Sent from my iPhone"
Claude Haiku
"The customer ordered two Louis Vuitton bags from XXXX, one for $1139 and one for $799, with order number FG0H1299. The customer returned both bags but only received a refund of $799. When the customer called customer service, they were treated unprofessionally and made to feel like a criminal. The customer is requesting a more efficient system for processing refunds and the remaining $1139 refund."
Claude Sonnet
"The customer ordered two Louis Vuitton bags from XXXX, one for $1139 and one for $799, with order number FG0H1299. They returned both bags but only received a refund for $799. When they called customer service, the representative was unprofessional and made the customer feel like a criminal. The customer believes there should be a more efficient system for processing refunds and demands their remaining refund immediately."
Claude Sonnet 3.5
"Customer ordered two Louis Vuitton bags from XXXX, one priced at $1139 and another at $799, with order number FG0H1299. They returned both bags but only received a refund for $799. When contacting customer support about the missing refund, the customer experienced poor service and felt they were treated unprofessionally, as if they were a criminal. The customer expresses frustration with the inefficient refund processing system and urgently requests their remaining refund."
Llama3
"Customer ordered two Louis Vuitton bags from XXXX, totaling $1139 and $799, with order number FG0H1299. After returning both bags, they only received a refund of $799. When they contacted customer support, they experienced horrible service and were treated unprofessionally."
Mixtral
"Customer claims they returned two Louis Vuitton bags to XXXX, order number FG0H1299, but only received a refund of $799 instead of the expected $1938. They experienced poor customer service when they called the support line, describing it as unprofessional and making them feel like a criminal."
While the quality differences were subtle, they were there. Most models performed well, but each had its own flavor.
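One way to make the conciseness-versus-retention trade-off concrete is a simple word-level compression ratio. This is a sketch we're adding for illustration, not a metric from the benchmark itself, and the strings below are shortened stand-ins for the full texts above:

```python
def compression_ratio(original: str, summary: str) -> float:
    """Summary-to-input length ratio in words; lower means more compression."""
    return len(summary.split()) / len(original.split())

# Shortened stand-ins for the customer input and the Llama3 output above.
original = (
    "I ordered two Louis Vuitton bags on XXXX one for $1139 and one for $799 "
    "it was order number FG0H1299 and then I returend both of them but only "
    "got a refund of $799."
)
llama3 = (
    "Customer ordered two Louis Vuitton bags from XXXX, totaling $1139 and "
    "$799, with order number FG0H1299. After returning both bags, they only "
    "received a refund of $799."
)

ratio = compression_ratio(original, llama3)
```

Paired with a retention check (e.g., whether amounts and order numbers survive), a ratio like this gives a quick, if crude, way to compare models at scale.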
The Takeaway
Here's what we learned: When it comes to language models, bigger isn't always better! Efficient design and targeted training can make smaller models punch above their weight, as Claude Haiku showed us. Each model we tested had its own way of balancing conciseness and information retention, reflecting the unique objectives and architectures behind them.
These differences highlight a crucial point: Choosing the right model for your specific task is key. The best fit isn't always the biggest or most complex model, but the one that aligns best with what you're trying to achieve.
At Spring Labs, we use a mix of basic NLP techniques alongside fine-tuned, transformer-based language models (various flavors of BERT). We bring in small and large language models when we need that extra push in reasoning and comprehension. This hybrid approach helps us strike a balance between accuracy, speed, and cost, ensuring our conversational intelligence engine is both reliable and scalable. Stay tuned for our upcoming posts, where we'll explore how these models tackle more complex tasks like segmentation, classification, entity recognition, and tagging in the financial services domain.