Observed vs reported behaviour: why users say one thing and do another

A founder I know once spent three weeks building a feature his customers had asked for in interviews.

Eight users had described the same workflow. They had explained the frustration, named the workaround, and said the new feature would help. A few even said they would pay more if it existed.

He shipped it on a Friday. By the next Friday, six of those eight users had not opened it. The other two tried it once, for under a minute, then went back to the workaround.

His message to me was short: “I don’t understand. They told me exactly what they wanted.”

They probably did. They were not lying. They were doing what people do in a research conversation: answering a reflective question from inside a calm, socially acceptable version of their own life. The product decision, though, depended on a different system entirely: what they would actually do on a busy Tuesday, with a deadline, a half-finished spreadsheet, and a habit that already worked well enough.

That gap between what users report and what users do is one of the most expensive biases in founder-led research. If you are building a B2B SaaS product, you will pay for it in features people request but never adopt, prices people endorse but never pay, and onboarding changes people praise but still abandon.

The fix is not to stop talking to users. The fix is to stop treating every answer as behavioural evidence.

What the say-do gap is

The say-do gap is the difference between what people say they do, want, like, or intend to do, and what they actually do when the moment arrives.

Researchers use several overlapping names for this problem: the intention-behaviour gap, stated versus revealed preference, self-report bias, recall bias, social-desirability bias. The labels vary by field. The practical lesson is the same: self-report is useful, but it is not the same kind of evidence as observed behaviour.

The behaviour-change literature makes the gap visible at scale. A commonly cited estimate traces back to Paschal Sheeran’s 2002 review of intention-behaviour relations: as the National Cancer Institute’s implementation-intentions summary reports it, goal intentions explained about 28% of the variance in behaviour across 422 studies. That is broad psychology evidence, much of it from health behaviour, not a SaaS product-adoption study. But it is a useful warning label: stated intent explains some future action, not most of it.
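To make the 28% figure concrete, here is a small arithmetic sketch. It assumes the figure is a squared correlation (the usual meaning of "variance explained"), so the implied correlation is its square root; the variable names are illustrative only.

```python
import math

# Figure quoted above: share of variance in behaviour explained by intentions.
variance_explained = 0.28

# Assumption: "variance explained" is r squared, so r is its square root.
correlation = math.sqrt(variance_explained)

# Whatever intentions do not explain is left to habit, context, and everything else.
unexplained = 1 - variance_explained

print(f"implied correlation r = {correlation:.2f}")
print(f"variance left unexplained = {unexplained:.0%}")
```

Read this way, a correlation of roughly 0.53 is far from nothing, but about 72% of the variation in what people do is still not accounted for by what they said they intended.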

User research has its own version of the same rule. Jakob Nielsen’s first rule of usability is blunt: watch what people do, do not rely on what they say they do, and be especially sceptical of what they predict they might do later. People can be thoughtful, articulate, and sincere while still being wrong about their future behaviour.

That is the part founders struggle with. A clear, confident answer feels reliable. In practice, it may only be a clear, confident reconstruction.

Why reported behaviour drifts away from real behaviour

There are four common reasons.

First, people answer from memory, and memory is a reconstruction. When a user says, “I usually export the report on Fridays,” they may be compressing a messy set of events into the story that feels typical. They forget the week they skipped it, the teammate who did it instead, the times they opened the page and gave up, and the manual step that has become so habitual it no longer feels worth mentioning.

This is why diary studies can be useful. NN/g’s guidance on diary-study accuracy points to the value of collecting data closer to the moment of behaviour instead of asking participants to reconstruct everything later. A diary is not perfect, but it reduces the distance between action and report.

Second, people want to sound coherent. A product interview is not a raw dump of experience. It is a conversation where the participant is trying to make sense to another person. They smooth contradictions, compress timelines, and explain actions that may not have had a clear reason at the time.

Third, people want to be socially acceptable. NN/g’s article on the Hawthorne effect and observer bias explains that people change behaviour when they know they are being observed, and that user research often carries social-desirability pressure. Participants may try harder, read more carefully, or soften criticism because someone is watching.

Fourth, future questions invite aspirational answers. Teresa Torres makes this point clearly in her story-based interview guidance: when you ask whether someone would use a solution, they often answer as their ideal self. Of course they would use the workflow that saves time. Of course they would eat better, go to the gym, document the process, invite the team, and review the dashboard every Monday. Reality arrives later.

The problem is not that people are dishonest. It is that interviews create a reflective setting, and many product decisions depend on habitual behaviour.

A famous warning from market research

The New Coke story is overused because it is useful.

The Coca-Cola Company says the new formula was chosen after taste tests with nearly 200,000 consumers. In that narrow setting, the new formula performed well. The company launched it in April 1985. The backlash was immediate enough that the company says calls to its consumer hotline rose to 1,500 a day by June, compared with 400 a day before the change. The original formula returned after 79 days.

The lesson is not “research failed.” The taste tests answered the question they were designed to answer: which drink do people prefer in a sip test? The business decision depended on a larger question: what happens when you replace a familiar product with emotional, social, and habitual meaning?

Founders repeat a smaller version of this mistake constantly.

You ask, “Would this feature help?” and the user answers that question. The better question was: “What would have to change in your existing workflow for this to earn a place in it?”

You ask, “Would you pay for this?” and the prospect answers from a neat budget fiction. The better question was: “What did you actually pay for the last time this problem was painful enough to solve?”

You ask, “Do you like the new onboarding?” and the customer gives a verdict. The better question was: “What happened from signup to the first moment of value?”

The answer is only as useful as the frame that produced it.

What interviews are still good for

The answer is not to replace every interview with analytics.

Analytics tells you what happened. It rarely tells you what the moment meant. A funnel can show that 38% of invited teammates never finish setup. It cannot tell you whether they were confused, lacked permission, did not trust the invite, thought someone else owned the task, or had no reason to care.

Interviews are where you learn the meaning behind behaviour. They give you the story, the sequence, the trade-off, the words, and the emotional weight. The mistake is asking interviews to do the job of observation.

This is why the strongest method is often a combination: observe behaviour, then ask about the specific moment you observed.

NN/g’s contextual inquiry guidance is built on that idea. You watch someone work in context and ask questions while the work is happening. You do not ask them to summarise a workflow from memory. You see the workflow, then ask what they are doing, what they expected, and what made them choose the next step.

For a SaaS founder, the practical version can be simple:

Watch a user complete a real task on screen share.
Ask them to narrate what they are doing, without turning it into a demo.
Pull recent product analytics before the interview and ask about one specific event.
Run a diary study for workflows that happen over days or weeks.
Use support tickets, search logs, cancellation notes, and activation events to choose which stories to ask about.
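For the analytics step in that list, a minimal sketch of how you might pick one specific event to ask about. The event log layout, the event names, and both helper functions are hypothetical; a real product would read these rows from its own analytics store.

```python
from datetime import datetime

# Hypothetical event log rows: (user_id, event_name, timestamp).
events = [
    ("u1", "report_exported", datetime(2024, 5, 3, 16, 40)),
    ("u1", "dashboard_opened", datetime(2024, 5, 6, 9, 15)),
    ("u2", "teammate_invited", datetime(2024, 5, 2, 11, 5)),
]


def latest_event(events, user_id):
    """Return the most recent (event_name, timestamp) for one user, or None."""
    own = [(name, ts) for uid, name, ts in events if uid == user_id]
    return max(own, key=lambda pair: pair[1], default=None)


def interview_prompt(events, user_id):
    """Turn the latest event into a past-tense opening question."""
    found = latest_event(events, user_id)
    if found is None:
        return None  # no recorded behaviour to anchor the conversation
    name, ts = found
    return (
        f"On {ts:%A %d %B} you triggered '{name}'. "
        "What were you trying to get done?"
    )


print(interview_prompt(events, "u1"))
```

The design choice is deliberate: the question names a dated event the participant actually produced, so the answer has to be a story about that moment and not a general opinion.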

The point is not to worship behaviour and ignore words. The point is to make the words answerable to something that happened.

Better questions for closing the gap

Paul Adams’ Intercom essay on asking customers what you want to hear gives the cleanest practical move: stop asking about future usage and ask about recent usage.

Here is the founder version.

Weak question	Better question
“Would you use this?”	“Tell me about the last time you needed to do this.”
“Would you pay for this?”	“What have you paid for recently to solve something similar?”
“Do you like the dashboard?”	“The last time you opened the dashboard, what were you trying to understand?”
“How often do you export data?”	“When did you last export data? What happened next?”
“What features matter most?”	“What is the most recent thing you tried to do that did not work the way you expected?”
“Why did you cancel?”	“What was happening in the weeks before you decided not to continue?”

Rob Fitzpatrick’s The Mom Test reduces this to one rule: chase history, not hypotheticals.

History is still imperfect. People can misremember what happened. They can rationalise a decision afterwards. They can make themselves sound more deliberate than they were. But history gives you something to inspect: a date, a task, a person, a tool, a workaround, a cost.

Hypotheticals give you a pleasing fog.

Design the research around behaviour first

If you want to avoid the say-do gap, start before the interview guide.

Begin with the behaviour that made you curious.

For example:

Users invite teammates but the teammates do not activate.
Trial users export data once, then never return.
Churned customers cite price after weeks of declining usage.
Power users keep a spreadsheet even though the product has the same report.
Prospects request SSO but closed-lost notes show procurement risk.

Each behaviour suggests better recruiting and better questions.

If teammates do not activate, talk to invited teammates within a week of the invite. Ask them to walk through the invite, what they thought it meant, what they did next, and what else was happening that day.

If trial users export once and disappear, talk to people who exported in the last few days. Ask what the export was for, where the file went, who used it, and what they did afterwards.

If churned customers cite price after usage drops, do not start with price. Start with the last useful session, the first week usage declined, and the replacement workflow. Price may still matter. But it may be the label attached to a value gap that started much earlier.
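Finding "the first week usage declined" is something you can do before the call. A rough sketch, assuming you already have weekly session counts per account; the function name and the 50% threshold are illustrative assumptions, not an established rule.

```python
def first_declining_week(weekly_sessions, drop_ratio=0.5):
    """Return the index of the first week whose session count falls below
    drop_ratio times the average of all earlier weeks, or None if none does."""
    for i in range(1, len(weekly_sessions)):
        baseline = sum(weekly_sessions[:i]) / i  # average of the weeks so far
        if baseline > 0 and weekly_sessions[i] < drop_ratio * baseline:
            return i
    return None


# Hypothetical account: steady use for three weeks, then a slide to zero.
sessions = [12, 11, 13, 5, 2, 0]
decline_week = first_declining_week(sessions)
print(decline_week)
```

With the index in hand, the interview can open on that week ("what was going on around then?") instead of on the cancellation itself.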

The research question is not “what do users think?” It is “what behaviour are we trying to understand, and what story sits underneath it?”

Where Maren fits

Maren is built around this distinction.

She is not useful because she turns opinions into facts. Nothing does. She is useful because she can keep a conversation anchored in specific events, ask patient follow-ups, and avoid treating the first reported preference as the finding.

When a participant says, “I would use that”, Maren can ask, “Can you walk me through the last time that problem came up?” When they say, “The dashboard is confusing”, she can ask, “When did that happen most recently, and what were you trying to find?” When a churned customer says, “It was too expensive”, she can ask what happened in the weeks before the cancellation and what they were comparing the price with.

That discipline matters because the say-do gap is not closed by one clever question. It is closed by a research posture: recent stories over broad summaries, behaviour over aspiration, concrete moments over polite verdicts.

Maren also helps with scale. A founder can run five careful calls. It is much harder to run fifty careful follow-ups after a launch, cancellation spike, or onboarding change. AI interviews do not remove the need for human judgement, but they can make it more realistic to collect enough specific stories to see whether a pattern is real.

A short checklist before your next interview

Before you ask a user what they think, check five things.

Do you know the behaviour you are trying to explain? If not, start with analytics, support notes, or recent product events.

Are you recruiting people who lived that behaviour recently? Someone who used the feature yesterday is usually a better source than a power user with a general opinion.

Can every future-tense question become a past-tense question? “Would you” should almost always become “when did you last”.

Can the participant show you the work? A screen share, diary entry, artefact, spreadsheet, support thread, or product event can anchor the story.

Will you compare the interview with behaviour afterwards? If the transcript says one thing and the data says another, do not average them. Investigate the gap.

That last point is where the best insights often live. A user says onboarding was fine, but activation data shows a long stall. A churned account says budget, but usage disappeared six weeks earlier. A buyer says integrations drove the decision, but every internal note mentions trust.

The mismatch is not a nuisance. It is the research.
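The last checklist item, comparing the interview with behaviour, can be sketched as a simple flagging pass. Everything here is hypothetical: the record fields, the reason labels, and the four-week threshold are assumptions chosen to mirror the churn example above.

```python
from dataclasses import dataclass


@dataclass
class ChurnedAccount:
    name: str
    stated_reason: str             # coded from the interview transcript
    weeks_idle_before_cancel: int  # taken from product usage data


def needs_follow_up(account, idle_threshold=4):
    """Flag accounts that blame money but stopped using the product long before."""
    says_money = account.stated_reason in {"price", "budget"}
    return says_money and account.weeks_idle_before_cancel >= idle_threshold


accounts = [
    ChurnedAccount("acme", "budget", 6),           # said budget, idle for six weeks
    ChurnedAccount("globex", "budget", 0),         # said budget, active until the end
    ChurnedAccount("initech", "missing feature", 8),
]

to_investigate = [a.name for a in accounts if needs_follow_up(a)]
print(to_investigate)
```

The flag does not say the participant was wrong. It only marks the accounts where the transcript and the usage data disagree, which is where the follow-up conversation should start.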

The short version

Users are not unreliable because they are careless or dishonest. They are unreliable when we ask them questions their memory, social instincts, and future-prediction machinery are bad at answering.

So ask different questions.

Ask about the last time. Ask for the sequence. Ask what they did next. Ask what they paid for, copied, abandoned, delegated, ignored, or worked around. Watch the work when you can. Pair the story with the behaviour that prompted it.

Interviews are still one of the best ways to understand why something happened. Just do not confuse a reported intention with a future action.

The useful answer usually sounds less like a verdict and more like a small scene from last Tuesday. Build your research to find that scene.