There is no fixed percentage that applies everywhere; what matters is whether each split contains enough data to do its job. So yes, there is a guideline, but it is not a universal percentage rule.
For audio datasets, the best practice is:
Use a split that matches the kind of “unseen” audio you want the model to handle, and keep validation and test independent.
Your example of 85% train / 5% validation / 10% test is a perfectly reasonable starting point, but only if the validation and test portions are large enough to be meaningful, representative, and cleanly separated from the training data. Google’s ML guidance is explicit that there are no fixed percentage requirements for train, validation, and test sets, and that good holdout sets should be representative, statistically meaningful, and free of duplicates from training. (Google for Developers)
What the three splits are for
The training set is the part the model learns from.
The validation set is the part you use during development to make decisions, such as hyperparameter tuning, early stopping, augmentation choices, threshold selection, model selection, and preprocessing choices.
The test set is the final untouched set you use only after development is finished, to estimate how well the system will work on new data. Google describes this as the normal train → validate → adjust → test workflow, and scikit-learn states plainly that learning parameters and testing on the same data is a methodological mistake. (Google for Developers)
So is 85 / 5 / 10 a good split?
It is a valid starting split, not a law.
The important question is not “Is 85 / 5 / 10 the standard?” The important question is:
Do the 5% and 10% sets contain enough diversity and enough examples to tell you something reliable?
If yes, then 85 / 5 / 10 can work well. If not, then even a “standard-looking” split is weak. Google’s guidance does not prescribe a single ratio. It emphasizes instead that the holdout sets must be large enough to be statistically meaningful and must resemble real-world data. (Google for Developers)
Why audio is different from many other data types
With audio, the biggest danger is usually not the exact percentage. It is leakage.
Two files can be technically different files and still be too closely related for fair evaluation. That happens when:
- the same speaker appears in train and test,
- many clips are cut from the same source recording,
- the same session, room, or microphone appears across splits,
- augmented variants of the same clip are spread across train, validation, and test,
- or repeated utterances appear in more than one split. (scikit-learn)
This is exactly why scikit-learn provides GroupKFold and StratifiedGroupKFold: they keep groups non-overlapping across splits instead of pretending every row is independent. (scikit-learn)
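To make this concrete, here is a minimal sketch of group-aware splitting with scikit-learn’s GroupKFold. The clip features, labels, and speaker IDs are invented placeholders:

```python
from sklearn.model_selection import GroupKFold

# Toy per-clip features, labels, and speaker IDs (all invented).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = [0, 0, 1, 1, 0, 1]
speakers = ["spk1", "spk1", "spk2", "spk2", "spk3", "spk3"]

gkf = GroupKFold(n_splits=3)
n_folds = 0
for train_idx, test_idx in gkf.split(X, y, groups=speakers):
    train_spk = {speakers[i] for i in train_idx}
    test_spk = {speakers[i] for i in test_idx}
    # A speaker never appears on both sides of the same split.
    assert train_spk.isdisjoint(test_spk)
    n_folds += 1
```

The splitter moves whole groups, so every clip from a given speaker lands on exactly one side of each fold.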
What strong audio benchmarks do
Well-designed audio benchmarks usually do not just random-shuffle files.
For example, the CHiME challenge states that its evaluation data is disjoint from training and development, with no overlap in participants or rooms, and warns that participant overlap in a dev set can encourage overfitting. ESC-50 uses predefined folds so that clips from the same original source stay in the same fold. VoxCeleb explicitly uses disjoint speakers between development and test. These are not arbitrary details. They are the reason the benchmark is credible. (CHiME Challenges and Workshops)
The right way to think about splitting audio
Do not start with percentages.
Start with this question:
What does “unseen” mean for my real use case?
If the model must work on new speakers, then split by speaker.
If it must work on new sessions or meetings, split by session or meeting.
If it must work on new environments, split by room, device, or environment.
If your clips are cut from long recordings, split by source recording, not by clip. That is the same logic used in grouped cross-validation and in audio benchmarks such as ESC-50 and CHiME. (scikit-learn)
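As a small illustration of “split by source recording, not by clip,” here is a leakage check under an assumed, hypothetical `<recording>_<clip>.mp3` naming scheme (the filenames are made up):

```python
# Hypothetical split contents; each clip is cut from a longer recording.
train_files = ["rec01_a.mp3", "rec01_b.mp3", "rec02_a.mp3"]
test_files = ["rec03_a.mp3", "rec03_b.mp3"]

def recording_id(filename):
    # Assumes a hypothetical "<recording>_<clip>.mp3" naming scheme.
    return filename.split("_")[0]

train_recs = {recording_id(f) for f in train_files}
test_recs = {recording_id(f) for f in test_files}

# A non-empty intersection would mean clips from one source
# recording ended up on both sides of the split.
leaked = train_recs & test_recs
assert not leaked, f"source recordings in both splits: {leaked}"
```

The same check works for any group key: swap `recording_id` for a speaker, session, or device lookup.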
A practical recommendation
If you are building an audio dataset from scratch and do not yet have a special reason to do otherwise, this is a strong default:
- use your proposed 85 / 5 / 10 or a similar three-way split,
- but make the split by speaker, session, or source recording, not by raw file name alone,
- and check that each split is still representative in class balance, duration, noise conditions, and recording conditions. (Google for Developers)
If the dataset is small, a fixed three-way split may waste too much data. In that case, a cleaner approach is often to keep one final holdout test set and use grouped cross-validation on the rest during development. scikit-learn’s cross-validation guidance supports this logic, especially when you need model selection without contaminating the final test estimate. (scikit-learn)
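A sketch of that small-data pattern, one group-aware holdout plus grouped cross-validation on the rest, using synthetic data and a trivial placeholder classifier:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))            # placeholder features
y = rng.integers(0, 2, size=120)         # placeholder labels
groups = rng.integers(0, 12, size=120)   # hypothetical speaker IDs

# Carve off a final test set by group; never touch it during development.
holdout = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
dev_idx, test_idx = next(holdout.split(X, y, groups))

# Grouped cross-validation on the development portion only.
scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"),
    X[dev_idx], y[dev_idx],
    groups=groups[dev_idx],
    cv=GroupKFold(n_splits=4),
)
```

All model selection happens inside the grouped folds; `test_idx` is used once, at the end, for the final estimate.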
Your bonus question: can test and evaluation use the same MP3 files?
For a proper setup, no.
If by “evaluation” you mean validation/dev, then using the same MP3 files for validation and test defeats the purpose of having both sets. The validation set is used repeatedly while you make decisions. Once you use those files to choose model settings, they are no longer an independent final test. Google explicitly says validation and test sets can “wear out” with repeated use, and scikit-learn warns that test information leaking into fitting or preprocessing produces overly optimistic scores. (Google for Developers)
So the clean rule is:
- training teaches the model,
- validation helps you choose,
- test gives the final unbiased check. (Google for Developers)
If validation and test are the same files, then in reality you have only one holdout set, not two separate ones. That is acceptable only as a compromise when data is scarce, and then it should be described honestly as a single holdout setup, not as a full train/validation/test design. (Google for Developers)
In audio, “same files” is broader than it sounds
This point matters a lot.
Even if the MP3 filenames are different, the split can still be weak if validation and test contain:
- different cuts from the same long recording,
- noisy and clean versions of the same utterance,
- augmented variants of the same clip,
- repeated recordings from the same speaker in the same session,
- or material that is nearly identical except for trivial transformations. Google recommends removing duplicates from validation and test, and scikit-learn recommends splitting before preprocessing to avoid leakage. ESC-50’s fold design is a concrete example of keeping related clips together. (Google for Developers)
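A basic first-pass check is hashing file contents across splits. This only catches byte-identical files; near-duplicates such as re-encodes, trims, or augmented copies need audio-level comparison instead. The file contents below are stand-in bytes, not real MP3 data:

```python
import hashlib

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Stand-ins for the raw bytes of MP3 files in each split.
val_files = {"a.mp3": b"\x00\x01\x02", "b.mp3": b"\x03\x04"}
test_files = {"c.mp3": b"\x00\x01\x02", "d.mp3": b"\x05\x06"}

val_hashes = {content_hash(b) for b in val_files.values()}
test_hashes = {content_hash(b) for b in test_files.values()}

# Any shared hash means the same file appears in both splits
# under a different name.
dupes = val_hashes & test_hashes
print(len(dupes))  # one byte-identical file shared across splits
```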
The shortest useful answer
Yes, 85 / 5 / 10 can be a good guideline.
No, it is not a universal rule.
For audio, how you split matters more than the exact percentages.
And validation and test should not be the same MP3 files if you want an honest final result. (Google for Developers)