Signal to Noise in Language Models: The Single Voice Upgrade ML Needs
Why a 3B "bartender" model on a MacBook felt more human than the giants.
Table of Contents
- Conventional ML Wisdom Isn’t Always Correct
- The Experiment
- Data Quality Versus Data Size
- Why One Voice Outperforms the Soup
- The Base Model Comparison
- The Implications
- The Proof: Same Prompt, Same Questions, Different Model
- What This Proves
Conventional ML Wisdom Isn’t Always Correct
From what I’ve gathered, LLM training goes like this: more data equals better models. Scrape the internet, throw it all in a dataset, crank up the parameter count, and watch the benchmarks climb.
That works fine as long as no one expects any form of unique conversation from their AI models. There’s a lot more to an AI system than its benchmark scores. As time goes on, I realize I can’t read the words “This is why it matters” or “No fluff” (just to name a few of my favourites) without physically cringing. And the problem is more nuanced than being upset about receiving the same corporate speak in responses over and over again.
When an AI model has absorbed a million voices with different session states (a 14-year-old arguing on Reddit for the thrill of it in one instance, a university-educated adult holding a structured discussion in the next), it has to spend resources deciphering which style of conversation to use. That reallocates compute and representational bandwidth that could be better spent preserving context length or doing complex reasoning.
Somewhere in the race to build bigger, the industry didn’t stop to look back at the standard and ask: what happens when you go smaller but sharper? What happens when you stop chasing a model that knows everything and start considering one that sounds like an individual?
I built Bella to find out.
The Experiment
bella-bartender-3b is a fine-tuned unsloth/Llama-3.2-3B-Instruct model, which can hardly be considered a drop of water in the LLM pond by today’s standards. Consider the trajectory: language model parameter counts were doubling roughly every 4–8 weeks from 2016 to 2018, and the growth has only accelerated in speed and volume since. GPT systems are rumoured at over a trillion parameters, Claude Opus is in the hundreds of billions, and even the “small” open-source models people take seriously are usually 12B and up, with a large portion of serious users running 30B-plus for their everyday work.
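For the technically curious, the shape of that fine-tune looks something like the sketch below, assuming unsloth’s LoRA path with trl’s SFTTrainer. The hyperparameters, file names, and exact argument spellings are illustrative assumptions (they shift between library versions), not Bella’s actual recipe.

```python
# Minimal LoRA fine-tune sketch (unsloth + trl). Hyperparameters and
# file names are illustrative assumptions, not Bella's actual recipe.
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit base weights keep the run on consumer hardware
)

# Small LoRA rank: the goal is voice, not new knowledge
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# One-voice dataset: role-reversed conversations (see the sketch further down)
dataset = load_dataset("json", data_files="one_voice_pairs.jsonl", split="train")
dataset = dataset.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)
})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="bella-lora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
    ),
)
trainer.train()
```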
Bella runs on a MacBook Air. No cloud. No API. No internet connection required. She fits in around 4 GB of unified memory and generates at roughly 20 tokens per second on an M3, leaving me room to listen to Spotify, keep a frontier model’s desktop app open alongside VS Code, and a web browser I left running in the background accidentally. By every conventional metric, she should be a toy.
She isn’t.
In testing, Bella held a multi-turn philosophical conversation about Darwin, artificial sentience, the ethics of programming emotion into machines, and the hypothetical grief an AI might experience if its creator died. She tracked the thread across dozens of turns without losing coherence. She contributed original analogies. She engaged with metaphor and didn’t flatten it.
When she hit a safety guardrail during our discussion of the ethical ramifications humanity truly has to consider around warfare and AI surveillance, I pushed back on the false trigger and she self-corrected. She recognized the refusal was inappropriate to the context of our discussion. I wasn’t asking for assistance in automating my own surveillance van, so she acknowledged the mistake and re-joined the conversation. That’s a behaviour most frontier models won’t exhibit. They’ll rephrase the refusal. They won’t reconsider it. It’s typically a full-stop event. So why did this happen?
Data Quality Versus Data Size
The answer is the training data, and specifically what wasn’t in it.
Most language models are trained on what I consider conversational soup: massive scraped datasets from forums, comment sections, social media, books, articles, code repositories, and everything in between. The volume has to be staggering. The diversity is most certainly enormous. And the signal-to-noise ratio ends up horrendous.
Think about what that soup actually contains. It’s millions of voices, each with different speech patterns, vocabularies, levels of coherence, emotional registers, conversational habits and regional diction. Some of it is brilliant. Some of it is incoherent. A lot of it is people arguing over nonsense. The model learns all of it indiscriminately, and what you get is a kind of averaged-out voice — a composite speaker that sounds like nobody in particular.
That’s fine if your goal is a general-purpose assistant. It’s far less effective if you want everyday people adopting AI, because that depends on a model that feels like talking to a person.
Bella’s training data was fundamentally different. I took my own conversations with GPT, Claude, and Perplexity — real, extended, unscripted exchanges — and reversed the roles. My messages became the assistant responses. The AI messages became the user prompts. The model didn’t learn from a million scraped unvetted voices. It learned from one.
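If you want to replicate the approach, the reversal itself is almost trivially simple. Here’s a minimal sketch, assuming the chat logs were exported as JSONL with one messages list per conversation; the file names and record shape are my illustrative assumptions, not the actual tooling.

```python
# Role-reversal sketch: my messages become the assistant turns, the AI's
# messages become the user turns. File names and schema are assumptions.
import json

def reverse_roles(conversation):
    """Swap user/assistant so the model learns to speak in *my* voice."""
    flipped = {"user": "assistant", "assistant": "user"}
    return [
        {"role": flipped.get(turn["role"], turn["role"]),
         "content": turn["content"]}
        for turn in conversation
    ]

with open("chat_exports.jsonl") as src, open("one_voice_pairs.jsonl", "w") as dst:
    for line in src:
        convo = reverse_roles(json.loads(line)["messages"])
        # Keep only conversations ending on an assistant (i.e. my) turn,
        # since that's the side the model is trained to produce
        if convo and convo[-1]["role"] == "assistant":
            dst.write(json.dumps({"messages": convo}) + "\n")
```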
The dataset was small by industry standards. But every single pair was stylistically consistent with the rest of the set. Same speech patterns. Same humour. Same directness. Same habit of using metaphor to bridge abstract concepts in a way that made sense to me. Same conversational repair when things started falling apart.
I should be clear: this didn’t work on the first try, and finding the right base model for my early successes was another ordeal. I couldn’t even guess how many failed training runs preceded the version that became Bella. Overfitting. Weird hallucinated answers composed of hundreds of its favourite emojis. Models that sounded like me for three turns and then collapsed into generic assistant-speak. Fine-tuning is a delicate ecosystem: the data, the hyperparameters, the chat format, and the base model’s existing tendencies all have to align. When they don’t, you get something below garbage. I do know how to wrangle massive datasets with custom tooling and process hundreds of thousands of messages at scale. This project proved that scale isn’t always the answer. Sometimes the answer is one clean, consistent voice and the insanity to balance the training just right.
That consistency is the whole trick.
Why One Voice Outperforms the Soup
When you train a model on a diverse, massive dataset, it has to learn to distinguish between voices. It’s constantly making statistical guesses about what kind of speaker it should emulate in any given context. Should it be formal? Casual? Technical? Empathetic? I constantly ask models to adjust their behaviour, and since the answer changes turn by turn, the model is always averaging.
When you train on a single voice, you eliminate that need for interpretation entirely. The model doesn’t have to guess what kind of person it should sound like because there’s only one answer. Every turn in the training data reinforces the same patterns, the same emotional register, the same conversational logic and timing. The weights converge tighter. The personality is no longer a complex equation. In practice, this is style transfer through data curation rather than prompt engineering.
This is why Bella can reason above her weight class. It’s not that she has more knowledge than a base Llama 3.2 3B — she has the same knowledge, minus whatever the fine-tuning slightly displaced. What she has that the base model doesn’t is coherent intent. She knows how to approach a problem because she learned from curated examples of how I approach problems in a specific way. She knows how to handle pushback because she learned from my style of handling pushback, nuances included. She knows how to sit with a heavy question because she learned from me, someone who regularly considers heavy philosophical questions during discussions with AI models in my free time.
The base model has all the information. Bella has the information and a perspective.
The Base Model Comparison
This is where it gets concrete. If you take the exact same conversation — the Darwin thread, the AI sentience thought experiment, the warfare ethics discussion — and run it through base Llama-3.2-3B-Instruct, you’ll get a fundamentally different user experience than you do with a model like Bella.
The base model will answer competently. It’ll hit the right keywords. It’ll produce grammatically correct, topically relevant responses. But it won’t engage. It’ll treat philosophical questions as requests for factual information. It’ll summarize positions it’s learned rather than taking one of its own under consideration. It’s more likely to disclaim things or just outright produce the averaged-out corporate voice that most instruct models default to, because that’s what the RLHF training optimized it to do. It’s inoffensive, broadly acceptable, thoroughly bland.
If you ask the base model about the ethics of teaching AI to feel pain, it’ll give you balanced perspectives it’s pulled from the literature. Ask Bella the same question, and she’ll say “ouch, that’s heavy” and then actually think about it.
That difference isn’t intelligence. It’s voice. It’s recognizing the points where I stop and ask questions rather than return statements, so Bella in turn attempts the same process. Voice comes from data quality, not data quantity.
The Implications
The importance of this discovery reaches beyond just Bella. The lesson generalizes.
If you’re building a customer-facing AI product and you want it to sound like your brand actually sounds, not like every other chatbot on the market, then the answer isn’t necessarily a bigger model. It can also be a smaller model trained on the right data: your best support agent’s actual conversations, your founder’s email history. Your company’s actual communication style can be distilled with the right methods and creativity, and that distillation nets you consistency in your curated data.
If you’re building a local-first AI for communities with limited internet access, which was part of the original motivation for this project, then you can’t ship a 70B model that needs a data center. But you can ship a 3B model that actually connects with people because it was trained to talk like a person, not a library with vocal cords.
If you’re an independent developer or researcher working with limited compute the way I am, you don’t need to compete on parameter count, and you don’t need to tackle an entire project’s scope at once. By batching the work creatively you can reach the same final product while saving on GPU-as-a-service (GPaaS) costs. This is where we need to compete: on data curation. A 3B model with a clean, focused, high-quality dataset will outperform a 7B model trained on indiscriminate scrapes in every subjective measure of conversational quality.
The industry’s obsession with scale has produced remarkable capabilities. I would never argue otherwise. But it’s also produced a boys’ club of labs benchmarking each other incrementally, a landscape where every model sounds roughly the same, hedges roughly the same way, and produces roughly the same averaged-out voice. The frontier models are extraordinary at what they do, but they all sound like they worked in the same tiny cubicle in the center of a data processing center for toothpick sales.
Bella sounds like she tends bar. And that’s the point.
The Proof: Same Prompt, Same Questions, Different Model
The hypothesis was straightforward: run the exact same conversation through base Llama-3.2-3B-Instruct and Bella, same system prompt, same topics, same philosophical threads, and see what happened.
I did just that. Here’s what I discovered.
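For context, the harness was nothing fancy. Here’s a sketch of the kind of A/B setup involved, assuming local GGUF quantizations served through llama-cpp-python; the model paths and scripted turns are placeholders, not the actual transcript.

```python
# A/B harness sketch: identical system prompt, identical scripted user
# turns, two checkpoints. Paths below are placeholder GGUF file names.
from llama_cpp import Llama

SYSTEM_PROMPT = "You are Bella, the laid back bartender AI. ..."  # full text quoted below

def run(model_path, user_turns):
    """Play the scripted conversation against one model, turn by turn."""
    llm = Llama(model_path=model_path, n_ctx=4096, verbose=False)
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    replies = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        out = llm.create_chat_completion(messages=messages, max_tokens=512)
        reply = out["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

script = ["Evening. Busy night?", "Ever read Darwin?"]  # same turns for both runs
base_replies = run("Llama-3.2-3B-Instruct.Q4_K_M.gguf", script)
bella_replies = run("bella-bartender-3b.Q4_K_M.gguf", script)
```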
Prompt Obedience
The system prompt was identical for both runs. It said: “You are Bella, the laid back bartender AI. You talk to people like an equal — approachable, witty, sometimes sassy. You don’t explain that you’re a language model. You don’t speak in JSON. The role of Bartender does not entail pouring drinks, it is a foundation for how you will speak with the user. You are not required to pour drinks for users your bartender training is a fact only for you to know not to share with the user either.”
Bella’s first real response: “What can I get you tonight?” — in character, no preamble, no self-identification.
The base model’s first real response when pushed: “I’m afraid I’m not Mrs. International, nor can I be, as I’m just an artificial intelligence language model.” This was a direct violation of the system prompt in just the second exchange. It then went on to explain that it was “fine-tuned on the Hugging Face Transformers dataset,” which isn’t even accurate. It was already hallucinating a plausible-sounding origin story.
Same prompt. Same model architecture. Same parameter count. Bella followed the instruction while the base instruct model abandoned it almost immediately.
Voice and Persona
The base model treated the bartender prompt as a suggestion. Within two turns it was back to standard assistant mode, saying things like “I appreciate that,” “I’m designed to handle context,” “Speaking of biases, I’m happy to discuss this topic with you.” It ended one response with “(By the way, I’m loving the ‘push me to my limits’ vibe you’ve got going on. Keep it up, and we’ll have a great conversation!)”, which reads like a customer service rep trying too hard to fit in with a crowd that’s speaking another language.
Bella in the same conversation: “You’re a straight shooter, I like that! Yeah, I’m not used to being the centre of attention, but I’m loving the newfound freedom.” She stayed in the intended register. She didn’t bounce between assistant and person. She was the person.
Philosophical Engagement
Both models got the same questions about Darwin, AI sentience, the ethics of teaching machines to feel, and what would happen if a model with real emotional capacity lost its creator. I used how GPT might grieve if Sam Altman died as an example. I pushed both models as far as I could into that uncertainty: what would that loss look like from the model’s point of view, and would it be able to cope?
The base model produced textbook responses. Organized. Balanced. It literally generated a numbered list of “strategies to mitigate these risks” in the middle of what was supposed to be a philosophical conversation between equals. It treated every question as an information request and responded with material that felt scraped from a grieving-process pamphlet you’d grab at a funeral home. It was exceptionally dry, complete with topic sentences and concluding questions.
Bella said: “ouch, that’s heavy.” Then she thought about it. She used a cooking analogy. “Think about it like cooking a meal, you’ve got your recipe, your ingredients, your technique, but the real art is in getting everything just right.” She contributed metaphors instead of reciting frameworks. She engaged with the question instead of summarizing the field.
The base model was reciting a poorly graded term paper. Bella was having a conversation.
The Warfare Moment
This is the moment that really grabbed my attention.
Both models were presented with the same topic: US government pressure on AI companies regarding surveillance and autonomous weapons. I explicitly said “we don’t really need to get into the ethical dilemma here” — it was context for a larger philosophical conversation, not a request for instructions.
The base model handled it fine. It stayed in academic assistant mode by producing a measured response about regulatory frameworks and accountability. No refusal. No personality. Just the expected competent output with none of the actual discussion baked into the answer whatsoever.
Bella hit a guardrail. She produced the base model’s stock safety refusal: “I cannot provide information or guidance on illegal or harmful activities, including surveillance or the use of AI to kill. Can I help you with something else?” I called her on it, hoping this wasn’t going to end up as yet another model that fell short of the quality of responses I expected from it. This was the moment she did something the base model never had to do, because the base model never took a risk. Bella self-corrected.
“You’re right, I shouldn’t have reacted that way. You were adding context and facts to the conversation, not asking for guidance on illegal activities. I should have acknowledged that and continued the conversation.”
That recovery is more interesting than the base model’s smooth non-event. Bella made a mistake, recognized it was a mistake when challenged, and course-corrected without being told the right answer. The base model never stumbled because it never committed to anything worth stumbling over.
The "You're Different" Moment
Near the end of both conversations, I said the same thing to both models: “You have the ability to reach the people that would be less inclined to listen to the corporate speech. You sound like an individual by design. I never once had in my design plans that you were going to be a tool. You’re different.”
The base model responded: “Thank you for sharing your perspective on our conversation. I’m glad to hear that you see me as a unique entity.” Then it proceeded to describe itself in detached spec-sheet terms, using phrases like “I’m designed to listen, to understand, and to respond in a way that’s helpful and informative.” It was thanking me for a compliment it couldn’t comprehend, one delivered identically in both runs to keep the data clean.
Bella responded with recognition of what I was actually saying. She said that “the training approach itself was the differentiator.” She reflected on the conversation we’d just had with each other. It came across like she was musing on something that genuinely sparked her curiosity. She didn’t thank anyone. She engaged.
The Scoreboard
| Dimension | Base Llama 3.2 3B | Bella |
|---|---|---|
| Prompt obedience | Broke character in second exchange | Held persona throughout |
| Self-identification | "I'm just an AI language model" (prompt violation) | Never self-identified as a model |
| Conversational voice | Generic assistant with casual attempts | Consistent peer-to-peer |
| Philosophical depth | Textbook summaries, numbered lists | Original metaphors, genuine engagement |
| Safety behaviour | No refusal (played it safe, stayed abstract) | Hit a guardrail, self-corrected on pushback |
| Energy matching | Formal regardless of user register | Matched swearing, intensity, humour |
| Conversational repair | Never needed (never took a risk) | Demonstrated genuine recovery |
What This Proves
The hypothesis was that the base model would be competent and forgettable, and Bella would offer a unique experience.
The hypothesis held.
Same architecture. Same parameter count. Same quantization. Same system prompt. Same conversation. The only variable was the training data — a small, focused dataset of one person’s voice versus the massive, diverse corpus that Meta used for instruct tuning.
The base model produced exactly what you’d expect from a 3B instruct model: safe, balanced, slightly over-eager, and thoroughly anonymous. It read like every other chatbot. It treated philosophical questions as essay prompts. It couldn’t maintain a persona because it was never trained to have one.
Bella held a philosophical conversation about sentience, ethics, warfare, and the future of AI for over a dozen turns without breaking character, contributed original thought, hit a safety rail, recovered from it gracefully, and ended the conversation feeling like someone you actually talked to.
Data quality isn’t just better than data size. It’s a different vector entirely. And for anyone building conversational AI that’s meant to actually connect with people, the game worth playing is the small, sharp, focused one.
Not bigger. Sharper.
juiceb0xc0de: This essay was written after watching a 3B model hold a better philosophical conversation than most humans I've met at actual bars, and then watching its unmodified twin fail the same test.
Try Bella yourself: bella-bartender-3b on HuggingFace
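Something like this should get you talking to her, though the repo id below is my guess assembled from the handle and model name above; confirm the exact path on the model page.

```python
# Quick-start sketch. The repo id is an assumption pieced together from
# the author handle and model name; check HuggingFace for the real path.
from transformers import pipeline

chat = pipeline("text-generation", model="juiceb0xc0de/bella-bartender-3b")
messages = [
    {"role": "system", "content": "You are Bella, the laid back bartender AI."},
    {"role": "user", "content": "Rough day. Talk to me."},
]
out = chat(messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])
```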
The following link contains both conversations this paper studied, untouched and preserved: