Conversation Sentiment Analysis Doesn't Work the Way You Think

Last updated: June 24, 2026

A customer named Marcus got on a call with us a few weeks ago. He likes the product. He'd just asked to extend his trial so more of his team could use it. Partway through, working out how to get more out of it, he said this: "I feel like that defeats the purpose. And I can't figure out how to better use this system."

That's not an unhappy customer. That's an engaged one telling us exactly where to make the product better. It's the most useful sentence in the call, and the kind of nuance any good account owner wants surfaced.

So we ran Marcus's call through a conversation sentiment analysis tool, the kind that scores every line from negative to positive. It rated that sentence a +0.99. On a scale where +1 is the happiest a sentence can be, "I can't figure out how to use this" came back as the single most positive thing Marcus said the entire call.

That is the whole problem in one number. Not that the customer was upset. He wasn't. The tool simply couldn't tell the difference between real feedback and small talk.

For years, conversation sentiment analysis has been sold as a must-have layer on top of meeting recordings and call transcripts. Score the sentiment, watch the trend, catch the unhappy customer before they churn. It sounds obviously useful. We decided to actually test whether it still is. It isn't. And the reason it isn't tells you something bigger about how AI changed what meeting data is for.

What conversation sentiment analysis actually is

Conversation sentiment analysis is the systematic evaluation of the emotions, attitudes, and opinions expressed in support channels, sales calls, and other business communications. It measures emotional tone in customer interactions to reveal satisfaction levels and friction points. The promise is a read on the whole conversation, so teams can interpret how a customer feels rather than just score isolated text. Two methods have done most of the work over the years.

The first is word scoring. Every word gets a value, roughly from -1 to +1. "Terrible" is negative. "Great" is positive. You add up the words in a sentence and normalize. The second is emotional clustering. Instead of a single positive-to-negative axis, words get mapped to emotions like joy, anger, fear, and trust, and the tool reports which emotions show up most.
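Here is a minimal Python sketch of the word-scoring method, to make the mechanics concrete. The lexicon and its weights are invented for illustration and do not come from any real tool, and averaging the matched words is just one assumed way to normalize. The test sentence is a paraphrase stitched together from the phrases quoted in this article, not a verbatim line from the call.

```python
import re

# Hypothetical toy lexicon: words and weights are made up for illustration.
TOY_LEXICON = {
    "valuable": 0.8,
    "smart": 0.7,
    "great": 0.8,
    "terrible": -0.9,
    "defeats": -0.4,
}

def score_sentence(text: str, lexicon: dict[str, float] = TOY_LEXICON) -> float:
    """Average the weights of the words found in the lexicon.

    Words missing from the lexicon are ignored, so a sentence with no
    matches scores 0.0 (neutral). The result stays within [-1, +1].
    """
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

# A paraphrase assembled from phrases quoted in this article.
polite_criticism = (
    "The insights are really valuable and very smart, but that defeats "
    "the purpose, and I can't figure out how to better use this system."
)
print(score_sentence(polite_criticism) > 0)  # the complaint still scores positive
```

Even in this toy version, two polite adjectives outvote the one mildly negative word, and "can't figure out" contributes nothing because none of its words are in the list.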

Both share one trait that matters more than any other. They count words. They do not understand them. Sentiment analysis is sold as a way to pull emotional tone out of text, using text analytics, AI, and natural language processing to decide whether language is positive, negative, or neutral. With these two methods, that decision comes down to which words appear. The tool that scored Marcus's complaint at +0.99 did it because the sentence around the criticism was polite. He said the insights were "really valuable" and "very smart" right before he said he couldn't figure out how to use the thing. The word counter saw "valuable" and "smart" and stopped thinking.

Why it used to make sense

Here is the fair version of the argument for sentiment analysis. There was a time it earned its place.

Before large language models, no machine could read a transcript the way a person does. If you had ten thousand support chats or sales calls and wanted any signal about how customers felt, natural language processing (NLP) and machine learning were the only scalable way to get one. A blunt instrument beats nothing. You could sort calls by score, find the angriest ones, and at least point a human in the right direction. Labeling conversations positive, negative, or neutral gave teams a rough signal on customer satisfaction, agent performance, and overall service quality.

That world is gone. It went the way of caller ID and voicemail. Useful in its moment. An extra step you no longer need once something better arrives.

Why it breaks on real calls

The thing nobody accounts for is how people actually talk in business. Real conversations are full of sarcasm, slang, negation, and other nuances that simple models miss.

Customers rarely say "this is bad." They say "I love the idea, I just wonder if maybe it could be a little more granular." That kind of phrasing can carry mixed sentiment or even neutral sentiment on the surface while still signaling an underlying emotion. Prospects don't say "you lost." They say "you've been great, we're just also looking at one other option." Real criticism comes wrapped in cushioning. "Really valuable." "Very cool." "Appreciate it." "Would be awesome."

A word counter reads the cushioning as the message. Rule-based systems overreact to negative words or miss negation and context. An LLM reads the whole sentence and knows the cushioning is just manners. It knows "defeats the purpose" is the point and the praise is the wrapping paper.

To see how wide the gap gets, we pulled three real calls and scored every line.

We tested it on real calls

We picked three calls that were going well, because that's the point. These weren't blowups. Marcus was an engaged customer extending his trial and asking how to do more. Daniel was a serious prospect running a careful evaluation, who told us our recording was the best of anything he'd tried. Tom was on the call to sign. Good conversations, all three. The kind you'd be glad to have. But each one had a moment of real substance buried in the friendly back-and-forth, and that's the part you can't afford to lose. These are exactly the kind of customer touchpoints where sentiment analysis tools are supposed to track customer emotions and surface sentiment trends.

Across all three calls, 184 lines of conversation, exactly one line scored negative. One. And it was a throwaway aside. Fine, you might think. They were positive calls. But "positive" isn't the same as "nothing important happened," and that's where the tool falls down. Sentiment analysis is supposed to use natural language processing (NLP) to sort what customers say into sentiment categories, in real time if needed. Here it missed the only lines that mattered.

Marcus's call averaged +0.46. His most useful line, the one about wanting to get more out of the product, scored +0.99. The best criticism in the call was rated its happiest moment, which means a tool watching for what to fix would have looked right past it.

Daniel liked us. He was genuinely weighing a move to Grain and said our recording beat the alternatives. He also mentioned, plainly, that our price was "twice as expensive" as one competitor and that the market felt like it was moving that way. Those are the two sentences that decide the deal. They scored +0.88 and +0.96, tagged with joy and trust. The score told us everything was great. The sentences told us what we had to address to win him over.

Tom was the clearest case. He was signing. He wanted access to our beta features. A genuinely happy customer. And in the middle of all that, he gave us a gift: he told us that a competitor was catching up on features he cared about. That's the most valuable thing a closing customer can tell you. It scored +0.84, lost in the warmth of a deal closing.

The emotion layer was no better. The most common emotions it found across the three calls were anticipation, trust, and joy. Almost half the lines came back with no detectable emotion at all, because real speech doesn't use the dictionary's emotion words. Basic emotion detection is meant to catch states like anger or gratitude, but real speech often carries complex emotions that simple labels flatten or miss. People raising a concern in a meeting say "I just want to make sure I understand," not "I am worried."
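A rough sketch of why so many lines come back empty, assuming the emotion layer works by looking words up in a list. The lexicon below is hypothetical and far smaller than anything a real tool would ship, but the failure mode is the same at any size.

```python
import re
from collections import Counter

# Hypothetical emotion lexicon: entries are invented for illustration.
TOY_EMOTIONS = {
    "worried": "fear",
    "afraid": "fear",
    "angry": "anger",
    "thrilled": "joy",
    "excited": "anticipation",
    "reliable": "trust",
}

def detect_emotions(text: str) -> Counter:
    """Count the emotion labels of any lexicon words present in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(TOY_EMOTIONS[w] for w in words if w in TOY_EMOTIONS)

print(detect_emotions("I am worried."))  # one match, labeled fear
print(detect_emotions("I just want to make sure I understand."))  # no matches at all
```

The second sentence is the one people actually say in meetings, and a lookup has nothing to match it against.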

Here's the pattern. On three good calls, the few lines that actually mattered (the price question, the competitor catching up, the feature gap to close) were the exact lines the tool rated most positive. It didn't flag bad customers, because these weren't bad customers. It flagged nothing at all, and in doing so it buried the handful of moments a team would most want to act on. It also failed to reveal any emotional trend across those interactions that would point a team to what needs work. A score that reads every healthy conversation as uniformly great isn't telling you anything. The substance was in the nuance, and nuance is the one thing it can't see.

So the scores were wrong. But a defender could say the scores aren't the product. Nobody reads a spreadsheet of polarity numbers. The scores get fed to something else. That is exactly where the real test is.

The real test: does sentiment data help an AI agent?

In modern contact centers, sentiment analysis software is often sold as a real-time way to analyze sentiment, flag rising frustration, and improve agent performance. You feed the scores to a model and ask it to summarize the account, flag the risks, tell you who's unhappy. So we ran that test directly. Does handing an AI agent the sentiment data produce a better result than just handing it the raw transcript, and deliver the promised actionable insights?

We set it up to be fair, and to stop ourselves from getting the answer we wanted. We wrote down the list of true critical moments in each call before generating anything. We locked the scoring rule in advance. We ran three conditions: the raw transcript, the transcript with sentiment scores appended, and a placebo with random numbers in the same format as the scores. Twenty-seven summaries in total, each written by a separate AI agent that saw only its own transcript and a neutral instruction. A different model graded them blind, never knowing which summary came from which condition.
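For anyone who wants to try a similar comparison, the three conditions can be sketched roughly like this. The tag format, the per-line placement of scores, and the function names are all assumptions for illustration; the article does not specify how the scores were attached.

```python
import random

def tag(score: float) -> str:
    """Format a score as a trailing tag. The tag format is an assumption."""
    return f" [sentiment={score:+.2f}]"

def build_conditions(lines, scores, seed=0):
    """Return the three input variants for one call.

    lines  : transcript lines as strings
    scores : one sentiment score per line, in [-1, +1]
    """
    if len(lines) != len(scores):
        raise ValueError("need exactly one score per line")
    rng = random.Random(seed)  # seeded so the placebo is reproducible
    raw = list(lines)
    enriched = [line + tag(s) for line, s in zip(lines, scores)]
    # Placebo: same tag format, but the numbers carry no information.
    placebo = [line + tag(rng.uniform(-1.0, 1.0)) for line in lines]
    return {"raw": raw, "sentiment": enriched, "placebo": placebo}

# Placeholder lines and scores, not data from the calls described here.
conditions = build_conditions(
    ["Speaker A: first line", "Speaker B: second line"], [0.5, -0.25]
)
for name, variant in conditions.items():
    print(name, variant)
```

The placebo is what makes the comparison meaningful. If the sentiment condition cannot beat random numbers in the same format, the scores are not contributing information.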

The agent did best on the raw transcript and worst with sentiment attached.

Raw transcripts captured 4.44 of the critical moments on average, with the highest quality scores. The placebo, random numbers in the same format, came in second at 4.33. The sentiment-enriched version came in last, at 4.11. The gaps are small, but the takeaway is clean: adding sentiment data didn't help the agent at all, and it didn't even keep pace with random noise in the same slot. In this test, the extra layer did not improve the agent's read of the call. More broadly, that undercuts the idea that these tools reliably surface customer satisfaction signals or emotional trends in a way that helps downstream systems.
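A quick arithmetic check on how small those gaps are. This assumes the 27 summaries split evenly into nine per condition and that each summary captured a whole number of moments. The totals below are inferred from the reported averages under those assumptions. They are not raw results.

```python
SUMMARIES_PER_CONDITION = 27 // 3  # assumption: summaries split evenly across conditions

# Totals inferred from the reported means; not the test's raw data.
inferred_totals = {"raw": 40, "placebo": 39, "sentiment": 37}

means = {
    name: total / SUMMARIES_PER_CONDITION
    for name, total in inferred_totals.items()
}
for name, mean in means.items():
    print(f"{name}: {mean:.2f}")

gap = means["raw"] - means["sentiment"]
print(f"raw minus sentiment: {gap:.2f} moments per summary")
```

On that reading, the full spread between the best and worst condition is three captured moments across nine summaries each, which is why this is a demonstration of direction and not of magnitude.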

The most telling moment was Tom. The thing the agents dropped most often was his tip that a competitor was catching up, the single most useful thing a customer who's signing can hand you. Across nine summaries of that call, only one caught it, and it was working from the raw transcript. The sentiment runs missed it, and one even flipped it into "the customer thinks we're ahead." The wrong score didn't sit there harmlessly. It nudged the model away from the one detail worth acting on, instead of helping detect negative turns early enough to intervene before issues escalate.

I want to be honest about the limits. Three calls, all polite business conversations, one model writing and one model grading. This is a clean demonstration, not a census. But the direction held every way we cut it: sentiment scores never came out ahead. In other words, in this test the scores did not turn subjective emotion into a signal that improved the summaries.

What you actually want instead

Skip the score. Ask the model for the answer.

Here is the difference in practice. Sentiment analysis gives you a broad label like "negative sentiment detected." Aspect-based sentiment analysis goes further and targets sentiment toward specific attributes of a product, such as pricing or reporting, so you get "negative sentiment detected on pricing." That is more useful, but it still only goes part of the way. An LLM reading the same call gives you "the customer thinks our pricing is uncompetitive against Fathom and won't move without CEO sign-off." What you really need is the underlying reason and intent, not a sentiment label. One is a mood ring. The other is something you can take into your next conversation.

The scores were always a proxy for understanding. We used them because understanding didn't scale. Now it does. Keeping the proxy means inserting a lossy translation step in front of the one tool that reads the original better than the translation ever could.

Think about what you'd actually do with each one. A dashboard tells you the sentiment on the ACME account dropped twelve points this quarter. Now what? You still have to go read the calls to find out why. The score sent you back to the source. Compare that to asking an AI that has the full record: "What changed with ACME this quarter?" and getting "Their champion left in March, the new lead prefers a competitor's reporting, and they've raised the same integration gap on three calls since." That is the difference between a vague score and real insight into what customers love or dislike, the kind that lets you spot product gaps or service issues. One answer makes you do the work. The other does the work. You were never going to act on the number. You were going to act on the story behind it.
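In code, "ask the model for the answer" can be as plain as handing it the full record and a direct question. The sketch below only assembles the prompt. The prompt wording is illustrative, and `ask_model` is a hypothetical stand-in for whatever function sends text to your LLM of choice, not a real library call.

```python
from typing import Callable

def build_account_prompt(account: str, question: str, transcripts: list[str]) -> str:
    """Assemble the full call record and a plain-language question into one prompt."""
    record = "\n\n".join(
        f"--- Call {i} ---\n{text}" for i, text in enumerate(transcripts, start=1)
    )
    return (
        f"You are reviewing every recorded call with the account {account}.\n"
        "Answer using only what the transcripts say, "
        "and name the calls you relied on.\n\n"
        f"{record}\n\n"
        f"Question: {question}"
    )

def ask_about_account(
    ask_model: Callable[[str], str],  # hypothetical: sends a prompt to an LLM, returns text
    account: str,
    question: str,
    transcripts: list[str],
) -> str:
    """Send the assembled prompt through whatever model function the caller supplies."""
    return ask_model(build_account_prompt(account, question, transcripts))
```

Note what is missing: there is no scoring step between the transcripts and the model. The model reads the original words.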

If you're choosing sentiment analysis tools or an AI notetaker

This matters when you're evaluating tools. Sentiment analysis still shows up on a lot of feature checklists, and many vendors promise coverage across multiple channels. A colored bar, a trend line, a number that goes up and to the right. It looks like insight.

Ask the harder question instead. Not "does it score sentiment," but "can it tell me what's actually happening, in language I can act on, across every conversation we've had with this account?" Ask, too, what input data it works from, whether that is calls alone or also review sites, survey responses, and social media platforms. A tool that scores moods and a tool that understands conversations are not the same product, even when the demo makes them look alike. The first hands you a feeling. The second hands you an answer. One of those changes what you do on Monday.

The feature to want isn't a sentiment score. It's a system that has read the whole conversation and can talk to you about it, with PII redaction and encryption behind it. Don't judge it on a flashy demo alone.

One call was never the right question anyway

There's a deeper issue under all of this. A sentiment score looks at one line, or one call, on its own. The things that actually matter show up when you aggregate signals across chat, email, calls, and social media by channel and customer segments instead of reading one exchange in isolation.

A single complaint might be nothing. The same complaint in three straight calls points to a recurring issue, the kind that drives customer churn, generates repeat contacts and longer resolutions, and reveals unmet customer expectations. Teams often watch sentiment by week or month and correlate it with retention or revenue, but the useful part is still the pattern inside the conversations themselves. A concern raised in one meeting and quietly dropped in the next tells a story no per-line score can hold. The real question is never "what was the sentiment of this call." It's "what is happening with this account across everything they've told us." That's synthesis across the full record, and it's exactly what a word counter can't do and an AI working from complete context can.
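The cross-call patterns described above can be sketched in a few lines, assuming each call has already been reduced to a set of issue labels. How those labels get extracted (for example, by an LLM reading each transcript) is outside this sketch, and the function names and example labels are hypothetical.

```python
def recurring_issues(calls: list[set[str]], min_streak: int = 3) -> set[str]:
    """Return issues raised in at least `min_streak` consecutive calls.

    calls: one set of issue labels per call, oldest first.
    """
    longest: dict[str, int] = {}
    current: dict[str, int] = {}
    for issues in calls:
        # Extend streaks for issues raised again; all others reset to zero.
        current = {issue: current.get(issue, 0) + 1 for issue in issues}
        for issue, streak in current.items():
            longest[issue] = max(longest.get(issue, 0), streak)
    return {issue for issue, streak in longest.items() if streak >= min_streak}

def dropped_issues(calls: list[set[str]]) -> set[str]:
    """Return issues raised in some earlier call but absent from the latest one."""
    if len(calls) < 2:
        return set()
    earlier = set().union(*calls[:-1])
    return earlier - calls[-1]

# Illustrative labels only, oldest call first.
account_calls = [
    {"integration gap", "pricing"},
    {"integration gap"},
    {"integration gap", "reporting"},
]
print(recurring_issues(account_calls))  # the issue raised three calls running
print(dropped_issues(account_calls))    # raised earlier, missing from the last call
```

Neither function looks at a score. Both depend on knowing what was said in each call, which is the part a per-line number throws away.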

This is why we capture everything, internal and external, and connect data from social media mentions and other touchpoints rather than scoring calls one at a time. Monitoring sentiment trends in real time also makes it easier to catch declines early and set alerts when sentiment falls below a normal baseline for a product area or account. The value isn't in the label on a single conversation. It's in the thread that runs through all of them.

Where sentiment still has a claim

Tone of voice and facial expression carry signals that text can't. AI-powered sentiment analysis on audio evaluates tone, pitch, and pace during calls, often for coaching. Transcript-only scoring can only see words. Someone can say "sounds great" while their face says the opposite, and a transcript will never catch it. Sentiment read from audio or video inflection, not words, is a real and separate question.

But that lives outside the transcript, it's prone to guessing, and it's not what anyone means when they sell you text-based conversation sentiment analysis. Some analysis work also spans call transcripts, chat logs, surveys, and social media posts to determine emotional tone, but that is different from scoring transcript words alone. It's a reason to keep an open mind about cameras and microphones. It's not a reason to keep scoring words.

We went looking for the strongest case that sentiment analysis still earns its keep. We ran it on the calls where it should have shined, fed it to the AI that was supposed to need it, and tested it against our own hope that it would work. Nothing in these transcript examples suggests that hybrid or ML-based sentiment scoring would fix the core problem.

On three calls we were glad to have, it buried the one thing in each we actually needed to hear.