Starter choices with finetuning
Get started with finetuning LLMs by picking a use case and data.
Finetuning large language models can be a powerful way to tailor their capabilities to specific tasks and datasets. This guide will walk you through the initial steps of finetuning LLMs, including selecting a use case, gathering the appropriate data, choosing a base model, and evaluating the success of your finetuning efforts. By following these steps, you can ensure that your finetuning project is well-scoped, manageable, and aligned with your goals.
This is a high-level overview before we dive into the code examples, but it's important to get these decisions right before you start coding. Your use case is only as good as your data, and you'll need to choose a base model that is appropriate for your use case.
🔍 Quick Assessment Questions
Before starting your finetuning project, ask:
Can you define success with numbers?
✅ "95% accuracy in extracting order IDs"
❌ "Better customer satisfaction"
Is your data ready?
✅ "We have 1000 labeled support tickets"
❌ "We could manually label some emails"
Is the task consistent?
✅ "Convert email to 5 specific fields"
❌ "Respond naturally to customers"
Can a human verify correctness?
✅ "Check if extracted date matches document"
❌ "Evaluate if response is creative"
Picking a use case
In general, pick something small and self-contained; the smaller, the better. It should be something that isn't easily solvable by other (non-LLM) means, since in that case you'd be better off just solving it that way, but it also shouldn't veer too far in the direction of 'magic'. Your LLM use case, in other words, should be something where you can test whether it is actually handling the task you give it.
For example, a general use case of "answer all customer support emails" is almost certainly too vague, whereas something like "triage incoming customer support queries and extract relevant information as per some pre-defined checklist or schema" is much more realistic.
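To make that concrete, here's a minimal sketch of what such a pre-defined schema might look like, using Pydantic. The specific fields (issue_type, urgency, order_id, and so on) are hypothetical placeholders for whatever your own checklist contains.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class IssueType(str, Enum):
    # Hypothetical triage categories; replace with your own checklist
    BILLING = "billing"
    TECHNICAL = "technical"
    ACCOUNT = "account"
    OTHER = "other"


class SupportTriage(BaseModel):
    """Structured output the finetuned model should produce for each query."""

    issue_type: IssueType
    urgency: int  # e.g. 1 (low) to 5 (critical)
    product: Optional[str] = None
    order_id: Optional[str] = None
    summary: str  # one-sentence restatement of the request


# A well-scoped target: the model's raw output can be parsed and validated
example = SupportTriage.model_validate_json(
    '{"issue_type": "billing", "urgency": 3, "order_id": "A-1042", '
    '"summary": "Customer was double-charged for order A-1042."}'
)
print(example.issue_type, example.urgency)
```

Having a schema like this also gives you a built-in pass/fail signal: either the model's output validates against it or it doesn't.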
It's also worth picking something where you can find out whether this is the right approach within a short amount of time. If your use case depends on generating or annotating lots of data, or on organizing and sorting pre-existing data, it makes for a less ideal starter project than one where you can repurpose data that already exists within your organization.
Picking data for your use case
The data needed for your use case follows directly from the use case you choose, but ideally it should already be close to what you need. Annotating and manually transforming data takes time when it is too far removed from your specific use case, so try to minimize that work as much as you possibly can.
A couple of examples of where you might be able to reuse pre-existing data:
- You might have examples of customer support email responses for a specific, well-defined technical topic that comes up often but requires a custom response rather than just a pro-forma reply.
- You might have manually extracted metadata from customer or business data, with hundreds or (ideally) thousands of examples.
In terms of data volume, a good rule of thumb is that you probably want somewhere on the order of hundreds to thousands of examples for a result that will be rewarding to work on.
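Once you have that data, you'll typically reshape it into instruction/response (or prompt/completion) pairs before training. A minimal sketch, assuming hypothetical field names and a local JSONL file:

```python
import json

# Hypothetical raw records repurposed from existing support data
raw_records = [
    {
        "email": "Hi, I was charged twice for order A-1042. Please help!",
        "labels": {"issue_type": "billing", "urgency": 3, "order_id": "A-1042"},
    },
    # ... hundreds to thousands more of these
]

# Convert each record into an instruction/response pair for finetuning
with open("train.jsonl", "w") as f:
    for record in raw_records:
        pair = {
            "instruction": "Extract the triage fields from this support email as JSON.",
            "input": record["email"],
            "output": json.dumps(record["labels"]),
        }
        f.write(json.dumps(pair) + "\n")
```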
🎯 Good vs Not-So-Good Use Cases
✅ Good use cases:

| Use Case | Why It Works | Example | Data Needed |
|---|---|---|---|
| Structured Data Extraction | Clear inputs/outputs, easily measurable accuracy | Extracting order details from customer emails (order_id, issue_type, priority) | 500-1000 annotated emails |
| Domain-Specific Classification | Well-defined categories, objective evaluation | Categorizing support tickets by department (Billing/Technical/Account) | 1000+ labeled examples per category |
| Standardized Response Generation | Consistent format, verifiable accuracy | Generating technical troubleshooting responses from documentation | 500+ pairs of queries and approved responses |
| Form/Document Parsing | Structured output, clear success metrics | Extracting fields from invoices (date, amount, vendor) | 300+ annotated documents |
| Code Comment Generation | Specific domain, measurable quality | Generating docstrings for Python functions | 1000+ function/docstring pairs |

❌ Not-so-good use cases:

| Use Case | Why It's Challenging | Better Alternative |
|---|---|---|
| Open-ended Chat | Hard to measure success, inconsistent format | Use instruction tuning or prompt engineering instead |
| Creative Writing | Subjective quality, no clear metrics | Focus on specific formats/templates rather than open creativity |
| General Knowledge QA | Too broad, hard to validate accuracy | Narrow down to a specific knowledge domain or use RAG |
| Complex Decision Making | Multiple dependencies, hard to verify | Break down into smaller, measurable subtasks |
| Real-time Content Generation | Consistency issues, timing constraints | Use templating or hybrid approaches |
As you can see, the challenging use cases tend to be the more open-ended or creative ones. With LLMs and finetuning, the real skill is scoping your use case down to something that is small and manageable but where you can still make meaningful progress.
📊 Success Indicators
You can get a sense of how well-scoped your use case is by considering the following indicators:
| Indicator | ✅ Good sign | ❌ Warning sign |
|---|---|---|
| Task Scope | "Extract purchase date from receipts" | "Handle all customer inquiries" |
| Output Format | Structured JSON, fixed fields | Free-form text, variable length |
| Data Availability | 500+ examples ready to use | "We'll need to create examples" |
| Evaluation Method | Field-by-field accuracy metrics | "Users will tell us if it's good" |
| Business Impact | "Save 10 hours of manual data entry" | "Make our AI more human-like" |
You'll want to pick a use case that shows mostly good signs across these indicators and where you can reasonably expect to measure success in a timely manner.
Picking a base model
In these early stages, picking the right model probably won't be the most significant choice you make: if you stick to tried-and-tested base models, you'll usually be able to get a sense of how well the LLM can align itself to your particular task. Choosing from the Llama 3.1 8B or Mistral 7B families is probably the best option to start with.
As to whether to go with a base model or one that has been instruction-tuned, this depends a little on your use case. If your use case is in the area of structured data extraction (highly recommended to start with something well-scoped like this) then you're advised to use the base model as it is more likely to align to this kind of text generation. If you're looking for something that more resembles a chat-style interface, then an instruction-tuned model is probably more likely to give you results that suit your purposes. In the end you'll probably want to try both out to confirm this, but this rule of thumb should give you a sense of what to start with.
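As a rough sketch of what "trying both" looks like in practice with the Hugging Face transformers library (the model identifiers below are the Hub names for a base and an instruction-tuned Llama 3.1 variant; substitute whichever family you settle on):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model: a reasonable default for structured data extraction tasks
base_id = "meta-llama/Llama-3.1-8B"
# Instruction-tuned variant: usually better for chat-style interactions
instruct_id = "meta-llama/Llama-3.1-8B-Instruct"

model_id = base_id  # try both and compare on your own evaluation set

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```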
📊 Quick Model Selection Matrix
| Model | Best For | Resources | Characteristics | When to Choose |
|---|---|---|---|---|
| Llama 3.1 8B | Structured data extraction, classification, code generation | 16GB GPU RAM, mid-range compute | 8 billion parameters, strong logical reasoning, efficient inference | When you need a balance of performance and resource efficiency |
| Llama 3.1 70B | Complex reasoning, technical content, longer outputs | 80GB GPU RAM, high compute | 70 billion parameters, advanced reasoning, more nuanced outputs, higher accuracy | When accuracy is critical and substantial resources are available |
| Mistral 7B | General text generation, dialogue, summarization | 16GB GPU RAM, mid-range compute | 7.3 billion parameters, strong instruction following, good context handling, efficient training | When you need reliable instruction following with moderate resources |
| Phi-2 | Lightweight tasks, quick experimentation, educational use | 8GB GPU RAM, low compute | 2.7 billion parameters, fast training, smaller footprint, good for prototyping | When resources are limited or for rapid prototyping |
🎯 Task-Specific Recommendations
Remember: Start with the smallest model that meets your needs - you can always scale up if necessary!
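If the GPU RAM figures in the matrix above are a stretch for your hardware, one common workaround (sketched here as one option among several, not a required step) is to load the model with 4-bit quantization via bitsandbytes and finetune with parameter-efficient adapters such as LoRA:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization shrinks an 8B model to a few GB of GPU memory,
# at some cost in accuracy; typically combined with LoRA/QLoRA adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```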
How to evaluate success
Part of the work of scoping your use case down is to make it easier to define whether the project has been successful or not. We cover evaluation in a separate section, but the important thing to remember here is that if you can't specify some kind of scale for how well the LLM addresses your problem, it will be hard to know whether you should continue with the work, and equally hard to know whether specific tweaks and changes are pushing you in the right direction.
In the early stages, you'll rely on so-called 'vibes'-based checks: you'll try out some queries or tasks and see whether the response is roughly what you'd expect or way off the mark. Beyond that, though, you'll want a more precise measurement of success, so the extent to which you can scope the use case down determines how well you can measure it.
A use case which is simply to function as a customer-support chatbot is really hard to measure. Which aspects of this task should we track and which should we classify as some kind of failure scenario? In the case of structured data extraction, we can do much more fine-grained measurement of exactly which parts of the data extraction are difficult for the LLM and how they improve (or degrade) when we change certain parameters, and so on.
For structured data extraction, you might measure:
Accuracy of extracted fields against a test dataset
Precision and recall for specific field types
Processing time per document
Error rates on edge cases
These are all covered in more detail in the evaluation section.
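As a small illustration of the first of these, here's what a field-by-field accuracy check might look like; the helper function and field names are illustrative rather than part of any particular library:

```python
def field_accuracy(
    predictions: list[dict], references: list[dict], fields: list[str]
) -> dict:
    """Per-field exact-match accuracy over a test dataset (illustrative helper)."""
    correct = {field: 0 for field in fields}
    for pred, ref in zip(predictions, references):
        for field in fields:
            if pred.get(field) == ref.get(field):
                correct[field] += 1
    return {field: correct[field] / len(references) for field in fields}


# Hypothetical model outputs vs. ground-truth annotations
preds = [{"order_id": "A-1042", "issue_type": "billing", "priority": "high"}]
refs = [{"order_id": "A-1042", "issue_type": "billing", "priority": "medium"}]

print(field_accuracy(preds, refs, ["order_id", "issue_type", "priority"]))
# {'order_id': 1.0, 'issue_type': 1.0, 'priority': 0.0}
```

A per-field breakdown like this makes it easy to see exactly which parts of the extraction the model struggles with as you iterate.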
Next steps
Now that you have a clear understanding of how to scope your finetuning project, select appropriate data, and evaluate results, you're ready to dive into the technical implementation. In the next section, we'll walk through a practical example of finetuning using the Accelerate library, showing you how to implement these concepts in code.