In our last blog post, we explored how different LLMs distill information through summarization. Today, we're diving deeper into our benchmarking series, examining how these foundational models tackle more complex tasks: segmentation and classification of customer conversations. We'll continue our evaluation of Anthropic's Claude Haiku, Claude Sonnet, and Claude Sonnet 3.5; Meta's Llama 3-70B Instruct; and Mistral AI's Mixtral 8x7B Instruct.
Segmentation: Extracting the Essence
The segmentation task required models to extract high-level main ideas from complex inputs, producing short sentence fragments that describe the core issues raised in the conversation. The prompts instructed models to remove filler words and input-specific details such as merchant and brand names, customer-specific information, prices, and dates. As with the summarization task, we used few-shot prompting with applicable examples to guide the LLMs.
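To make this concrete, here is a minimal sketch of what a few-shot segmentation prompt along these lines might look like. The example pair, instruction wording, and prompt layout are illustrative assumptions, not our production prompt:

```python
# Minimal sketch of few-shot segmentation prompting. The example pair and the
# instruction text are hypothetical stand-ins for the prompts we actually used.

FEW_SHOT_EXAMPLES = [
    {
        "input": (
            "I bought a blender from BlendCo on 3/2 for $89 and it arrived cracked. "
            "Support kept me on hold for an hour and never offered a replacement."
        ),
        "segments": "- Item arrived damaged.\n- Long hold time and no resolution from support.",
    },
]

INSTRUCTIONS = (
    "Extract the high-level main ideas from the customer message as short sentence fragments. "
    "Remove filler words and input-specific details such as merchant or brand names, "
    "customer information, prices, and dates."
)

def build_segmentation_prompt(customer_message: str) -> str:
    """Assemble instructions, few-shot examples, and the new input into one prompt."""
    parts = [INSTRUCTIONS, ""]
    for ex in FEW_SHOT_EXAMPLES:
        parts += [f"Input: {ex['input']}", f"Segments:\n{ex['segments']}", ""]
    parts += [f"Input: {customer_message}", "Segments:"]
    return "\n".join(parts)

# The resulting string would then be sent to each model under evaluation, e.g.:
# response = call_llm(model="llama3-70b-instruct", prompt=build_segmentation_prompt(message))
# (call_llm is a hypothetical client wrapper.)
```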
However, this task proved more challenging for the LLMs to conceptualize than summarization, resulting in greater variance in performance. Let's break down how each model fared:
Claude Family: The Verbose Bunch
Claude Haiku, Sonnet, and Sonnet 3.5 produced long segmented summaries. Given their propensity for verbosity, they frequently included information not directly relevant to the main points of the conversation. They were also prone to including conversation-specific details such as brand names, despite explicit instructions to omit them.
Llama 3: The Concise Performer
Llama 3's concise nature made it an effective candidate for segmentation. It was able to distill inputs down into their main ideas, generalizing customer feedback in a high-level manner as instructed.
Mixtral: The Mixed Bag
Mixtral's segmented summaries were mixed (pun intended!) and generally fell between the Claude models and Llama 3: not as verbose as the Claude outputs, but more detailed than Llama 3's. However, Mixtral repeatedly hallucinated and included response preambles and other extraneous information.
A Real-World Example
To illustrate these differences, let's look at how each model segmented the same customer feedback we used in our summarization task:
Input:
- I ordered two Louis Vuitton bags on XXXX one for $1139 and one for $799 it was order number FG0H1299 and then I returend both of them but only got a refund of $799. When I called into the support serivce the customer service was horrible and they made me sound like a criminal. Just extremely unprofessional there should really be a more efficient system for processing refunds than this. and I need my refund NOW. Sent from my iPhone
Claude Haiku:
- Incorrect refund amount received for Louis Vuitton bags.
- Poor customer service experience when contacting about refund.
Claude Sonnet:
- Customer did not receive full refund for returned items.
- Customer service was unprofessional and accusatory.
- Inefficient refund processing system.
Claude Sonnet 3.5:
- Customer only received partial refund for returned items.
- Customer service was unprofessional and accusatory.
- Refund processing system is inefficient.
- Urgent request for remaining refund.
Llama 3:
- Refund amount incorrect.
- Poor customer service experience.
Mixtral:
- Customer returned two Louis Vuitton bags but received incorrect refund amount.
- Customer experienced poor customer service when calling support line.
Given that the goal of this task was to extract concise, high-level summaries, Llama 3 outperformed the rest of the models.
Classification: Sorting Through the Noise
The classification task involved evaluating messy data to sort inputs into different categories. We implemented this as a graph: the model sequentially evaluated the presence of individual risk signals (each a binary classification problem) under various conditions to determine whether an input satisfied the criteria for a given category. However, the data was unclean, which made it challenging to establish clear criteria differentiating one category from another.
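As a rough illustration of that structure, here is a minimal sketch of sequential, graph-style risk-signal checks. The questions, category names, and the `ask` callable are hypothetical placeholders rather than our actual graph:

```python
from typing import Callable

# Sketch of a sequential, graph-style classification flow: each node asks the
# model a yes/no question about one risk signal, and the answer determines
# which node is evaluated next. `ask` is any yes/no LLM call (for example, a
# wrapper around one of the foundational models being benchmarked).

def classify(conversation: str, ask: Callable[[str, str], bool]) -> str:
    if ask("Does the customer report a billing or refund problem?", conversation):
        if ask("Does the customer say the refund amount was wrong?", conversation):
            return "refund_discrepancy"
        return "billing_issue"
    if ask("Does the customer describe rude or unprofessional service?", conversation):
        return "service_complaint"
    return "other"

# Example usage, assuming a hypothetical call_llm helper that returns "YES"/"NO":
# category = classify(feedback, ask=lambda question, text: call_llm(question, text) == "YES")
```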
Initially, we employed zero-shot Chain-of-Thought (CoT) prompting for the classification tasks, with varying degrees of success. Under zero-shot prompting, both Claude Haiku and Claude Sonnet exhibited conservative behavior, often categorizing all inputs uniformly into a single class.
To improve performance, we switched to few-shot CoT prompting, providing the models with examples to help them better distinguish between categories. This adjustment led to higher accuracy, likely because the examples enabled the models to identify and apply more nuanced criteria. Even with few-shot CoT prompting, however, model behavior remained variable. Here's how each model performed:
Mixtral had the highest rate of uniform classification, resulting in the lowest accuracy. In general, Claude Sonnet and Llama 3 were best able to assess subtle differences between categories and accurately sort inputs into them.
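For concreteness, here is a minimal sketch of what a single few-shot CoT check for one risk signal might look like. The worked example, reasoning text, and signal wording are hypothetical, not our production prompt:

```python
# Minimal sketch of a few-shot CoT prompt for one binary risk-signal check.
# The worked example and the signal definition are hypothetical placeholders.

COT_EXAMPLE = (
    "Conversation: I sent the item back two weeks ago and was only refunded half of what I paid.\n"
    "Reasoning: The customer returned an item and says the refunded amount is less than the "
    "purchase amount, so a refund discrepancy is present.\n"
    "Answer: YES"
)

def build_cot_prompt(conversation: str) -> str:
    """Combine the signal definition, a worked example, and the new input."""
    return (
        "Decide whether the conversation contains a refund discrepancy. "
        "Think step by step, then answer YES or NO.\n\n"
        f"{COT_EXAMPLE}\n\n"
        f"Conversation: {conversation}\n"
        "Reasoning:"
    )
```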
The Takeaway
Our benchmarking highlights the importance of selecting the right model and tuning prompts to achieve optimal results in natural language processing tasks. While more verbose models like Claude Sonnet excel at detail-heavy tasks, more concise models such as Llama 3 may offer the most cost-effective solution for tasks requiring brief, high-level outputs.
The choice of model and prompting technique significantly impacts the performance of LLMs in summarization, segmentation, and classification tasks. As LLM technology continues to evolve, so too will their applications and effectiveness in tackling increasingly complex data challenges.
At Spring Labs, we often face problems that demand a level of accuracy existing foundational models alone can't deliver. To reach that level of reliability, we use a mix of basic NLP techniques alongside fine-tuned, transformer-based language models, and we bring in small and large language models when a task needs an extra push in reasoning and comprehension. This hybrid approach helps us strike a balance between accuracy, speed, and cost, ensuring our conversational intelligence engine is both reliable and scalable.
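As a rough sketch of what that kind of escalation can look like in practice (the rule, confidence threshold, stub models, and labels below are illustrative assumptions, not our actual pipeline):

```python
import re
from typing import Optional, Tuple

# Illustrative escalation sketch: cheap rule-based checks first, then a
# fine-tuned small model, falling back to a large LLM only when extra
# reasoning is needed. All thresholds, labels, and stubs are hypothetical.

def rule_based_label(text: str) -> Optional[str]:
    # Basic NLP / pattern matching handles the easy, unambiguous cases.
    if re.search(r"\brefund\b", text, re.IGNORECASE):
        return "refund_issue"
    return None

def small_model_classify(text: str) -> Tuple[str, float]:
    raise NotImplementedError  # a fine-tuned transformer classifier would go here

def large_llm_classify(text: str) -> str:
    raise NotImplementedError  # a large foundational model would go here

def classify_with_escalation(text: str) -> str:
    label = rule_based_label(text)
    if label is not None:
        return label
    label, confidence = small_model_classify(text)
    if confidence >= 0.9:  # hypothetical confidence threshold
        return label
    return large_llm_classify(text)
```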
Stay tuned for our upcoming posts, where we'll explore how leveraging various foundational models combined with complex but effective cognitive architectures can be used to tackle even more complex tasks like entity recognition, tagging, and profiling in the financial services domain.