There is no fixed percentage that applies everywhere; what matters is whether each split contains enough data to do its job. So yes, there is a guideline, but it is not a universal percentage rule.
For audio datasets, the best practice is:
Use a split that matches the kind of “unseen” audio you want the model to handle, and keep validation and test independent.
Your example of 85% train / 5% validation / 10% test is a perfectly reasonable starting point, but only if the validation and test portions are large enough to be meaningful, representative, and cleanly separated from the training data. Google’s ML guidance is explicit that there are no fixed percentage requirements for train, validation, and test sets, and that good holdout sets should be representative, statistically meaningful, and free of duplicates from training. (Google for Developers)
What the three splits are for
The training set is the part the model learns from.
The validation set is the part you use during development to make decisions, such as hyperparameter tuning, early stopping, augmentation choices, threshold selection, model selection, and preprocessing choices.
The test set is the final untouched set you use only after development is finished, to estimate how well the system will work on new data. Google describes this as the normal train → validate → adjust → test workflow, and scikit-learn states plainly that learning parameters and testing on the same data is a methodological mistake. (Google for Developers)
So is 85 / 5 / 10 a good split?
It is a valid starting split, not a law.
The important question is not “Is 85 / 5 / 10 the standard?” The important question is:
Do the 5% and 10% sets contain enough diversity and enough examples to tell you something reliable?
If yes, then 85 / 5 / 10 can work well. If not, then even a “standard-looking” split is weak. Google’s guidance does not prescribe a single ratio. It emphasizes instead that the holdout sets must be large enough to be statistically meaningful and must resemble real-world data. (Google for Developers)
Why audio is different from many other data types
With audio, the biggest danger is usually not the exact percentage. It is leakage.
Two files can be technically different files and still be too closely related for fair evaluation. That happens when:
- the same speaker appears in train and test,
- many clips are cut from the same source recording,
- the same session, room, or microphone appears across splits,
- augmented variants of the same clip are spread across train, validation, and test,
- or repeated utterances appear in more than one split. (scikit-learn)
This is exactly why scikit-learn provides GroupKFold and StratifiedGroupKFold: they keep groups non-overlapping across splits instead of pretending every row is independent. (scikit-learn)
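To make this concrete, here is a minimal sketch of group-aware splitting with scikit-learn’s GroupKFold. The clip features, labels, and speaker IDs are invented placeholders:

```python
from sklearn.model_selection import GroupKFold

# Toy per-clip features, labels, and speaker IDs (all invented).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = [0, 0, 1, 1, 0, 1]
speakers = ["spk1", "spk1", "spk2", "spk2", "spk3", "spk3"]

gkf = GroupKFold(n_splits=3)
n_folds = 0
for train_idx, test_idx in gkf.split(X, y, groups=speakers):
    train_spk = {speakers[i] for i in train_idx}
    test_spk = {speakers[i] for i in test_idx}
    # A speaker never appears on both sides of the same split.
    assert train_spk.isdisjoint(test_spk)
    n_folds += 1
```

The splitter moves whole groups, so every clip from a given speaker lands on exactly one side of each fold.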
What strong audio benchmarks do
Well-designed audio benchmarks usually do not just random-shuffle files.
For example, the CHiME challenge states that its evaluation data is disjoint from training and development, with no overlap in participants or rooms, and warns that participant overlap in a dev set can encourage overfitting. ESC-50 uses predefined folds so that clips from the same original source stay in the same fold. VoxCeleb explicitly uses disjoint speakers between development and test. These are not arbitrary details. They are the reason the benchmark is credible. (CHiME Challenges and Workshops)
The right way to think about splitting audio
Do not start with percentages.
Start with this question:
What does “unseen” mean for my real use case?
If the model must work on new speakers, then split by speaker.
If it must work on new sessions or meetings, split by session or meeting.
If it must work on new environments, split by room, device, or environment.
If your clips are cut from long recordings, split by source recording, not by clip. That is the same logic used in grouped cross-validation and in audio benchmarks such as ESC-50 and CHiME. (scikit-learn)
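As a small illustration of “split by source recording, not by clip,” here is a leakage check under an assumed, hypothetical `<recording>_<clip>.mp3` naming scheme (the filenames are made up):

```python
# Hypothetical split contents; each clip is cut from a longer recording.
train_files = ["rec01_a.mp3", "rec01_b.mp3", "rec02_a.mp3"]
test_files = ["rec03_a.mp3", "rec03_b.mp3"]

def recording_id(filename):
    # Assumes a hypothetical "<recording>_<clip>.mp3" naming scheme.
    return filename.split("_")[0]

train_recs = {recording_id(f) for f in train_files}
test_recs = {recording_id(f) for f in test_files}

# A non-empty intersection would mean clips from one source
# recording ended up on both sides of the split.
leaked = train_recs & test_recs
assert not leaked, f"source recordings in both splits: {leaked}"
```

The same check works for any group key: swap `recording_id` for a speaker, session, or device lookup.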
A practical recommendation
If you are building an audio dataset from scratch and do not yet have a special reason to do otherwise, this is a strong default:
- use your proposed 85 / 5 / 10 or a similar three-way split,
- but make the split by speaker, session, or source recording, not by raw file name alone,
- and check that each split is still representative in class balance, duration, noise conditions, and recording conditions. (Google for Developers)
If the dataset is small, a fixed three-way split may waste too much data. In that case, a cleaner approach is often to keep one final holdout test set and use grouped cross-validation on the rest during development. scikit-learn’s cross-validation guidance supports this logic, especially when you need model selection without contaminating the final test estimate. (scikit-learn)
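A sketch of that small-data pattern, one group-aware holdout plus grouped cross-validation on the rest, using synthetic data and a trivial placeholder classifier:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))            # placeholder features
y = rng.integers(0, 2, size=120)         # placeholder labels
groups = rng.integers(0, 12, size=120)   # hypothetical speaker IDs

# Carve off a final test set by group; never touch it during development.
holdout = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
dev_idx, test_idx = next(holdout.split(X, y, groups))

# Grouped cross-validation on the development portion only.
scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"),
    X[dev_idx], y[dev_idx],
    groups=groups[dev_idx],
    cv=GroupKFold(n_splits=4),
)
```

All model selection happens inside the grouped folds; `test_idx` is used once, at the end, for the final estimate.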
Your bonus question: can test and evaluation use the same MP3 files?
For a proper setup, no.
If by “evaluation” you mean validation/dev, then using the same MP3 files for validation and test defeats the purpose of having both sets. The validation set is used repeatedly while you make decisions. Once you use those files to choose model settings, they are no longer an independent final test. Google explicitly says validation and test sets can “wear out” with repeated use, and scikit-learn warns that test information leaking into fitting or preprocessing produces overly optimistic scores. (Google for Developers)
So the clean rule is:
- training teaches the model,
- validation helps you choose,
- test gives the final unbiased check. (Google for Developers)
If validation and test are the same files, then in reality you have only one holdout set, not two separate ones. That is acceptable only as a compromise when data is scarce, and then it should be described honestly as a single holdout setup, not as a full train/validation/test design. (Google for Developers)
In audio, “same files” is broader than it sounds
This point matters a lot.
Even if the MP3 filenames are different, the split can still be weak if validation and test contain:
- different cuts from the same long recording,
- noisy and clean versions of the same utterance,
- augmented variants of the same clip,
- repeated recordings from the same speaker in the same session,
- or material that is nearly identical except for trivial transformations. Google recommends removing duplicates from validation and test, and scikit-learn recommends splitting before preprocessing to avoid leakage. ESC-50’s fold design is a concrete example of keeping related clips together. (Google for Developers)
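A basic first-pass check is hashing file contents across splits. This only catches byte-identical files; near-duplicates such as re-encodes, trims, or augmented copies need audio-level comparison instead. The file contents below are stand-in bytes, not real MP3 data:

```python
import hashlib

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Stand-ins for the raw bytes of MP3 files in each split.
val_files = {"a.mp3": b"\x00\x01\x02", "b.mp3": b"\x03\x04"}
test_files = {"c.mp3": b"\x00\x01\x02", "d.mp3": b"\x05\x06"}

val_hashes = {content_hash(b) for b in val_files.values()}
test_hashes = {content_hash(b) for b in test_files.values()}

# Any shared hash means the same file appears in both splits
# under a different name.
dupes = val_hashes & test_hashes
print(len(dupes))  # one byte-identical file shared across splits
```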
The shortest useful answer
Yes, 85 / 5 / 10 can be a good guideline.
No, it is not a universal rule.
For audio, how you split matters more than the exact percentages.
And validation and test should not be the same MP3 files if you want an honest final result. (Google for Developers)