Fine-Tuning Llama 3.2 3B for Medical QA: Loss vs Accuracy

Jun 16, 2026 - 12:33

Fine-tuning Llama 3.2 3B for medical question answering demonstrates a critical divergence between loss metrics and actual model performance. Reducing evaluation loss unexpectedly triggered severe generation degeneration, confabulation, and factual drift. Adjusting data balance, expanding LoRA targets, and refining decoding parameters restored accuracy and reproducibility.

Machine learning practice often prioritizes quantitative metrics over qualitative outcomes. Developers frequently treat evaluation loss as a definitive indicator of model improvement. That assumption creates a dangerous blind spot when training systems for high-stakes domains like healthcare: a downward trend in a numerical score can mask severe functional regressions that only surface during actual inference. Understanding this divergence requires examining how adapter configuration, data composition, and decoding parameters interact during fine-tuning. The following analysis walks through one experimental cycle that applied the 3-billion-parameter Llama 3.2 model to medical question answering.

Why does evaluation loss fail as a performance metric?

Evaluation loss measures how well the model predicts each next token on a held-out validation set. It is computed from the probability the model assigns to the token that actually follows each preceding sequence. A lower score indicates that the model matches the statistical patterns of the validation text more closely. However, the metric is blind to factual correctness, logical coherence, and domain-specific constraints. Because it is computed with the reference text supplied as context at every step, it also says little about what the model produces when it generates freely. A system can post a steadily improving loss while generating entirely fabricated medical advice. The optimization rewards pattern completion rather than truth verification, so developers need qualitative testing alongside quantitative tracking.

The loss is a cross-entropy: it penalizes the model for assigning low probability to the reference token at each position. The calculation does not evaluate semantic meaning or clinical validity. A model that confidently reproduces the surface patterns of its training text receives a favorable score whether or not those patterns are clinically sound. Medical question answering demands strict adherence to established clinical guidelines and bounded response formats. When a system follows statistical likelihood without factual grounding, it drifts into confabulation. The metric cannot detect when a model invents drug names or recommends an incorrect first-line therapy. This gap between optimization and utility is the core problem the experiment exposed.
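The point can be made concrete with a minimal sketch of the cross-entropy calculation. The probability lists below are hypothetical values chosen for illustration, not figures from the experiment. The function only ever receives probabilities, so nothing about clinical correctness can enter the score.

```python
import math


def mean_cross_entropy(token_probs):
    """Mean negative log-likelihood (in nats) of the reference tokens.

    token_probs holds the probability the model assigned to each
    reference token. Factual correctness is not an input.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities, for illustration only.
# A model that confidently reproduces a noisy reference text:
confident_on_noisy_text = [0.90, 0.85, 0.90, 0.80]
# A model that is hesitant on a clean, clinically sound reference text:
hesitant_on_clean_text = [0.40, 0.50, 0.45, 0.35]

loss_confident = mean_cross_entropy(confident_on_noisy_text)
loss_hesitant = mean_cross_entropy(hesitant_on_clean_text)

# The confident model gets the lower (better) loss.
print(loss_confident < loss_hesitant)
```

The comparison shows only that confidence drives the score. Whether the reference text deserved that confidence is a question the metric never asks.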

The training cycle in question used a combined dataset that merged conversational patient-doctor dialogues with encyclopedic clinical references. Evaluation loss dropped from 2.495 to 2.275 after two training epochs. Standard machine learning intuition reads this reduction as a successful optimization trajectory. The training and validation curves tracked each other closely, which typically signals stable convergence without classic overfitting. Mean token accuracy also rose to 0.515 during this phase. Every quantitative indicator pointed toward a better model, and that apparent success creates a false sense of security before deployment.
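To put the two loss values in more intuitive terms, they can be converted to perplexity. The sketch below assumes the reported loss is a mean cross-entropy in nats (natural logarithm), which is the usual convention in training frameworks but is not stated in the write-up.

```python
import math

# Evaluation loss values reported for the run.
loss_before = 2.495
loss_after = 2.275

# Assumption: the loss is mean cross-entropy in nats, so perplexity = exp(loss).
ppl_before = math.exp(loss_before)
ppl_after = math.exp(loss_after)
relative_drop = 1 - ppl_after / ppl_before

print(f"perplexity: {ppl_before:.2f} -> {ppl_after:.2f}")
print(f"relative reduction: {relative_drop:.1%}")
```

Under that assumption, perplexity falls from about 12.1 to about 9.7. That is a real gain in predicting validation tokens, but it is still a statement about token prediction only.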

How does dataset composition influence model behavior?

Dataset construction requires careful calibration to keep the model from overfitting to specific formatting patterns. The initial experiment combined eight thousand conversational entries with four thousand encyclopedic records. With this two-to-one ratio, a third of the training examples were encyclopedic, which introduced a heavy bias toward list-based responses. Encyclopedic medical references frequently use enumerated structures to organize symptoms, treatments, and drug classifications. The model learned this formatting as if it were a required output pattern rather than a stylistic choice, and the bias carried directly into the answers it generated during inference.

Cleaning the conversational component proved equally challenging. Forum-based medical dialogues contain platform-specific filler phrases and OCR-level corruption that add noise to the training signal. Removing these artifacts requires multi-pass regex filtering and sentence-level stripping. Even with rigorous preprocessing, a small fraction of structural noise remains, and it teaches the model inconsistent boundaries. The resulting model struggles to distinguish genuine clinical information from conversational debris. Developers must accept that perfect data purification is often impossible and that the remaining noise creates subtle but persistent errors in response generation.
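A multi-pass cleaner of the kind described might look like the sketch below. The filler patterns and the sample text are hypothetical. The actual phrases depend on the source forum and are not listed in the write-up.

```python
import re

# Hypothetical filler patterns; a real list would be built from the corpus.
FILLER_PATTERNS = [
    # Greeting sentence at the very start of the answer.
    re.compile(r"^(hi|hello)\b[^.!?]*[.!?]\s*", re.IGNORECASE),
    # "Thanks for your query/question ..." sentence anywhere in the text.
    re.compile(
        r"\bthanks? (you )?for (your|the) (query|question)[^.!?]*[.!?]\s*",
        re.IGNORECASE,
    ),
    # Sign-off at the very end of the answer.
    re.compile(r"\b(hope this helps|take care|regards)\b[^.!?]*[.!?]?\s*$", re.IGNORECASE),
]
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
NO_LETTERS = re.compile(r"^[^A-Za-z]*$")
WHITESPACE = re.compile(r"\s+")


def clean_answer(text):
    # Pass 1: strip platform filler phrases.
    for pattern in FILLER_PATTERNS:
        text = pattern.sub("", text)
    # Pass 2: sentence-level stripping of fragments that contain no letters,
    # a crude stand-in for detecting OCR or markup debris.
    sentences = SENTENCE_SPLIT.split(text)
    kept = [s for s in sentences if s and not NO_LETTERS.match(s)]
    # Pass 3: normalize whitespace.
    return WHITESPACE.sub(" ", " ".join(kept)).strip()


raw = (
    "Hello, welcome to the forum. Thanks for your query. "
    "Ibuprofen can irritate the stomach. ~~~ ||| . Hope this helps!"
)
print(clean_answer(raw))
```

Patterns like these are deliberately narrow. Broadening them removes more debris but also risks deleting clinical content, which is one reason some noise survives any practical filter.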

Rebalancing the dataset means shifting the proportion toward narrative prose to encourage bounded responses. The revised configuration allocated eight thousand five hundred conversational records and one thousand five hundred encyclopedic entries. This eighty-five to fifteen split trains the model to favor flowing clinical explanations over open-ended enumeration, which attacks the root cause of the generation failure. In this experiment, the rebalanced mixture, together with the adapter and decoding changes, gave markedly better control over response length and structural integrity. The recalibration proved essential for maintaining clinical accuracy.
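One way to build the 8,500 / 1,500 mixture so that it can be rebuilt exactly is to sample with a fixed seed. In this sketch the pool sizes, record layout, and seed value are placeholders rather than details from the experiment.

```python
import random


def build_mixture(conversational, encyclopedic, n_conv=8500, n_ency=1500, seed=42):
    """Draw a fixed-size, seeded sample from each source and shuffle them together."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    mixed = rng.sample(conversational, n_conv) + rng.sample(encyclopedic, n_ency)
    rng.shuffle(mixed)
    return mixed


# Placeholder pools standing in for the cleaned source datasets.
conv_pool = [{"source": "conversational", "id": i} for i in range(20_000)]
ency_pool = [{"source": "encyclopedic", "id": i} for i in range(5_000)]

train_set = build_mixture(conv_pool, ency_pool)
conv_share = sum(r["source"] == "conversational" for r in train_set) / len(train_set)
print(len(train_set), conv_share)
```

Because the generator is seeded and local to the function, calling `build_mixture` again with the same pools returns the same records in the same order, which keeps later comparisons between runs meaningful.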

Data quality directly affects the reliability of downstream applications. When organizations deploy AI systems for clinical support, the training corpus should reflect the distribution of real-world queries as closely as possible. Relying on a single data source creates blind spots that surface in production. A data fabric that applies the same quality checks to every source helps keep factual grounding stable regardless of input format and prevents structural biases from accumulating.

What technical adjustments resolve generation degeneration?

Low-Rank Adaptation (LoRA) requires precise targeting to address specific failure modes. The initial configuration applied adapters exclusively to the attention layers. These layers determine how tokens relate to each other within a sequence. They route information but are not where most factual knowledge is thought to reside. When the model receives a request for drug classifications, fine-tuning has had no way to adjust the frozen feed-forward layers that hold those associations, so recall of accurate names does not improve. This separation creates a bottleneck for factual recall.

Expanding the adapter targets to include the feed-forward components addresses this recall deficiency. The up_proj, down_proj, and gate_proj modules are commonly described as a distributed key-value memory: they store and retrieve the conceptual associations required for accurate medical responses, and they act as the primary repository for learned facts and clinical associations. When these layers remain frozen, fine-tuning cannot update that stored knowledge. Adapting them lets training adjust the recall mechanism directly, which reduces the tendency to invent plausible-sounding but fictional pharmaceutical compounds during list generation. This adjustment proved essential for an application that requires precise terminology.
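The effect of widening the targets can be estimated by counting adapter parameters. The sketch below rests on several assumptions: the layer shapes are taken from the published Llama 3.2 3B configuration, the rank of 16 is a placeholder because the write-up does not state one, and "attention layers" is read as all four attention projections. Either list could be passed as `target_modules` to a LoRA configuration such as the one in Hugging Face PEFT.

```python
# Assumed layer shapes for Llama 3.2 3B (in_features, out_features).
HIDDEN = 3072
KV_DIM = 1024          # 8 key/value heads x 128 dimensions per head
INTERMEDIATE = 8192
NUM_LAYERS = 28

MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Initial setup (assumed to cover all four attention projections).
ATTENTION_ONLY = ["q_proj", "k_proj", "v_proj", "o_proj"]
# Revised setup: attention plus the feed-forward projections.
ATTENTION_AND_MLP = ATTENTION_ONLY + ["gate_proj", "up_proj", "down_proj"]


def lora_trainable_params(target_modules, rank):
    """LoRA adds two matrices per module: (in x rank) and (rank x out)."""
    per_layer = sum(
        rank * (n_in + n_out)
        for n_in, n_out in (MODULE_SHAPES[name] for name in target_modules)
    )
    return per_layer * NUM_LAYERS


RANK = 16  # hypothetical rank
print(lora_trainable_params(ATTENTION_ONLY, RANK))
print(lora_trainable_params(ATTENTION_AND_MLP, RANK))
```

Under these assumptions, most of the added capacity sits in the feed-forward projections, because their intermediate dimension is much wider than the attention projections.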

The pad token configuration also introduces significant overhead if handled incorrectly. Adding a custom padding token forces the system to resize the embedding layer. This operation causes the parameter-efficient fine-tuning framework to save the entire resized embedding matrix alongside the adapters. The resulting file expands to over three gigabytes instead of the expected fifty megabytes. Utilizing the model's built-in padding identifier eliminates this bloat and ensures efficient deployment. Storage efficiency directly impacts the speed of iterative testing. Efficient resource allocation remains essential for maintaining a sustainable development workflow when working with consumer-grade infrastructure.
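A rough size estimate shows why the file grows so much. The numbers below are assumptions, not values from the write-up: a base vocabulary of 128,256 tokens, a hidden size of 3,072, weights stored as 32-bit floats, and both the input embedding and the output head written to disk. Whether both matrices are actually saved depends on the library version and on weight tying.

```python
VOCAB_SIZE = 128_256     # assumed base vocabulary size
HIDDEN_SIZE = 3_072      # assumed embedding width
BYTES_PER_PARAM = 4      # assumes float32 storage


def matrix_gigabytes(rows, cols, bytes_per_param=BYTES_PER_PARAM):
    """Size of a dense rows x cols matrix in gigabytes (10^9 bytes)."""
    return rows * cols * bytes_per_param / 1e9


resized_vocab = VOCAB_SIZE + 1  # one custom padding token added
one_matrix_gb = matrix_gigabytes(resized_vocab, HIDDEN_SIZE)
both_matrices_gb = 2 * one_matrix_gb  # input embedding plus output head

print(f"one matrix:    {one_matrix_gb:.2f} GB")
print(f"both matrices: {both_matrices_gb:.2f} GB")
```

Under these assumptions a single added token drags roughly 1.6 GB per matrix into the checkpoint, which is consistent with an adapter file that grows past three gigabytes.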

How do decoding parameters dictate model reliability?

Decoding strategies determine how the model selects tokens during inference. The initial configuration employed a hard n-gram ban that forbade any three-token sequence from appearing twice. The constraint was intended to prevent infinite loops. Instead, it blocked the model from reusing ordinary phrasing when it tried to close a list naturally and forced it onto tokens it would not otherwise choose. The system responded by drifting into phonetic nonsense and fabricated terminology. Hard constraints on token repetition often backfire in generative systems: each forced substitution pushes the context further from anything seen in training, so the ban accelerates degeneration rather than stopping it.
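The mechanics of such a ban are easy to reproduce. The sketch below is a simplified stand-in for the kind of setting that Hugging Face Transformers exposes as `no_repeat_ngram_size`. It works on word strings rather than token ids, and the example sentence is invented.

```python
def banned_next_tokens(generated, n=3):
    """Return the tokens that would complete an n-gram already in `generated`.

    With n=3, the last two tokens form a prefix. Any token that followed
    the same prefix earlier in the sequence is banned as the next token.
    """
    if n < 2 or len(generated) < n:
        return set()
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


# Invented example: the model has written "take with food ." once and is
# now partway through a second, perfectly reasonable instruction.
generated = ["take", "with", "food", ".", "do", "not", "take", "with"]
print(banned_next_tokens(generated))
```

Here the natural continuation "food" is forbidden, so the decoder has to pick its next-best token even when that token makes little sense. In a long enumerated answer this happens repeatedly.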

Removing the n-gram ban and explicitly setting the end-of-sequence token restores structural control. The model can now recognize when a clinical explanation is complete and terminate the response. A moderate repetition penalty of 1.3 discourages minor looping without triggering degeneration, and capping the maximum new tokens at 256 provides a hard boundary for response length. Greedy decoding removes sampling randomness, so the same clinical query submitted twice on the same software and hardware stack returns byte-for-byte identical output. This determinism is critical for validating model behavior and building trust in automated systems. Training needs the same discipline: without fixed seeds and pinned data samples, developers cannot verify whether a successful run represents genuine improvement or random chance. Reproducibility turns experimental results into defensible claims.
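The revised settings and a simple reproducibility check could be expressed as follows. The keyword names follow the Hugging Face `generate()` API, which is an assumption because the write-up does not name the inference library. `stub_generate` is a hypothetical stand-in for the real model call.

```python
import hashlib

# Revised decoding settings from the text, using Hugging Face-style names.
GENERATION_KWARGS = {
    "do_sample": False,          # greedy decoding, no sampling randomness
    "repetition_penalty": 1.3,   # moderate penalty in place of the n-gram ban
    "max_new_tokens": 256,       # hard cap on response length
    # "eos_token_id" would be set explicitly from the loaded tokenizer.
}


def fingerprint(text):
    """SHA-256 digest of the output, for byte-level comparison."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_reproducible(generate_fn, prompt, runs=3):
    """True if repeated calls with the same prompt give identical bytes."""
    digests = {fingerprint(generate_fn(prompt)) for _ in range(runs)}
    return len(digests) == 1


def stub_generate(prompt):
    # Hypothetical stand-in; a real version would call the model
    # with GENERATION_KWARGS and decode the result.
    return f"Answer to: {prompt}"


print(is_reproducible(stub_generate, "What is hypertension?"))
```

Comparing digests rather than eyeballing text catches small differences, such as a trailing space or a changed token near the end, that are easy to miss in manual review.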

The interaction between generation settings and model weights reveals a practical limit of small language models. A three-billion-parameter system handles bounded clinical question answering effectively but struggles with unbounded enumeration. The ceiling for factual recall and structural control is set largely by the base model. Acknowledging these constraints allows developers to set realistic expectations for production deployment. Fine-tuning and decoding changes can only work within that ceiling, and raising it means moving to a larger base model. Naming the constraint honestly makes it clearer which improvements are worth pursuing.

The experimental cycle demonstrates that quantitative metrics alone cannot validate a fine-tuned system. Lower evaluation loss masked a severe functional regression that only appeared during inference. Correcting the failure required rebalancing the training corpus, expanding the adapter targets, and recalibrating the decoding parameters. The resulting model produces accurate, bounded responses that are fully reproducible. Future iterations will wrap the model in a FastAPI endpoint and containerize the deployment using Docker. This progression highlights the need for rigorous validation before scaling AI systems into clinical environments. Sustainable AI engineering must prioritize long-term maintainability over short-term metric gains.

Christopher Holloway

Christopher Holloway is the founder and director of Progressive Robot, a UK-based technology company. A full-stack engineer with more than two decades of experience, he works across PHP development, ecommerce, Linux infrastructure, technical SEO and AI automation, and writes here on technology, AI, hardware and software.
