The Hitchhiker's Guide to Hyperparameter Tuning

This post shares the Taboola team's experience automating hyperparameter tuning when training deep learning models. Starting from a simple script and working up to automatically generated experiments, they gradually refined the tuning workflow and improved model accuracy. The article walks through the script's evolution, covering JSON-based experiment configuration, metric collection, result storage, date-range selection and control over the number of training epochs, and discusses how techniques such as random search fit into hyperparameter tuning. Through this automation, the Taboola team significantly improved model performance and deepened their understanding of their models.

🤔 **Early version of the script:** Experiments are specified in a simple JSON format covering the model architecture, date ranges and parameter ranges, and are run automatically through Jenkins, making them easy to manage and track.

📊 **Metric collection and cloud storage:** Metrics such as MSE, loss and training time are collected during training and saved to Google Cloud Storage, so results can be inspected and analyzed at any time.

🗓️ **Date-range selection:** To avoid overfitting, the script picks different date ranges for training at random, helping ensure the model generalizes.

⏳ **Controlling the number of epochs:** By analyzing the loss curves on TensorBoard, the team settled on a suitable number of training epochs, avoiding over-training and keeping training efficient.

🎲 **Random search:** After building up intuition, the team extended the script to support random search, further improving the efficiency of hyperparameter tuning and finding better hyperparameter combinations.

This post was originally published by me at the Taboola engineering blog.

Now that more than a year has passed since our first deep learning project emerged, we have had to keep moving forward and delivering the best models we can.

Doing so has involved a lot of research, trying out different models, from ones as simple as bag-of-words, LSTM and CNN, to more advanced ones such as attention, MDN and multi-task learning.

Even the simplest model we tried has many hyperparameters, and tuning these might be even more important than the actual architecture we ended up using - in terms of the model's accuracy.

Although there's a lot of active research in the field of hyperparameter tuning (see 1, 2, 3), implementing this tuning process has evaded the spotlight.

If you go around and ask people how they tune their models, their most likely answer will be "just write a script that does it for you". Well, that's easier said than done...

Apparently, there are a few things you should keep in mind when implementing such a script.

Here, at Taboola, we implemented a hyperparameter tuning script. Let me share with you the things we learned along the way...

Let's start simple

Sometimes using scikit-learn for hyperparameter tuning might be enough - at least for personal projects.

For long term projects, when you need to keep track of the experiments you've performed, and the variety of different architectures you try keeps increasing, it might not suffice.

Bare bones

The first version of the script was simple, but encompassed most of our needs. The requirements were:

Easy to run

You are going to run this script many times. So, it should be as easy as possible to specify what experiments you want to run.

We ended up with the following JSON format:

{    "architecture": "lstm-attention",    "date-range": [["2017-01-01", "2017-04-01"]],    "parameters": {        "num_of_attentions": [1, 2, 3],        "attention_hidden_layer_size": [32, 128, 256],        "attention_regularization": [0.1, 0.01, 0.001]    }}

The script randomly generates experiments out of this JSON.
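
To make this concrete, here is a minimal sketch of what such a generator might look like. The field names mirror the JSON format above, but the code itself (including the experiments.json file name and the number of experiments) is illustrative, not Taboola's actual script.

```python
import json
import random

def generate_experiments(config_path, n_experiments=10):
    """Randomly draw hyperparameter combinations from a JSON spec like the one above."""
    with open(config_path) as f:
        spec = json.load(f)

    experiments = []
    for _ in range(n_experiments):
        # Pick one value per hyperparameter, independently for each experiment.
        params = {name: random.choice(values)
                  for name, values in spec["parameters"].items()}
        experiments.append({
            "architecture": spec["architecture"],
            "date_range": random.choice(spec["date-range"]),
            "parameters": params,
        })
    return experiments

if __name__ == "__main__":
    for exp in generate_experiments("experiments.json", n_experiments=3):
        print(exp)
```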

We created a Jenkins job that runs the script on one of our machines with GPUs, thus freeing us from the need to use SSH.

Enrich experiments with metrics

Our training process generates many metrics, such as MSE, loss and training time. You can choose whichever you want, and they will show in the results.

Save results to the cloud

Results are saved as a CSV file in Google Cloud Storage, which enables us to launch the script from any machine, and watch the results from our laptops.

The results are continuously uploaded, so we don't have to wait for all the experiments to finish to start inspecting the results.

The models themselves are also saved.
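
As a rough illustration of the continuous upload, here is a sketch using the google-cloud-storage Python client. The bucket and object names are placeholders, and the real script likely structures this differently.

```python
import csv
import io

from google.cloud import storage  # pip install google-cloud-storage

RESULTS = []  # one dict per finished experiment

def upload_results(bucket_name="my-experiments-bucket",
                   blob_name="hyperparam-results.csv"):
    """Re-upload the full results CSV after every experiment,
    so partial results are visible before the whole run finishes."""
    fieldnames = sorted({key for row in RESULTS for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(RESULTS)

    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(blob_name).upload_from_string(buffer.getvalue(),
                                              content_type="text/csv")

def record_experiment(params, metrics):
    """Merge hyperparameters and training metrics (MSE, loss, training time)
    into a single flat row and push the updated CSV to the cloud."""
    RESULTS.append({**params, **metrics})
    upload_results()
```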

So what have we learned so far?

This first implementation was the most important one.

Being simple, the script didn't do anything smart for you. It didn't specify which experiments to perform - you had to manually define the JSON input for every run.

After running the script many times, you start to understand which values work better than others. It's one of the most important things we got from the script - a more in depth understanding of our models.

Using this script for the first time got us a big improvement - the MSE dropped by more than 10%.

Are we sure it's real?

When you run hundreds of experiments, the best ones usually have negligible differences. How can you know it's statistically significant?

One way to tackle this problem, assuming you have enough data, is to train the same model on several date-ranges. If one model is better than the others in all of the date-ranges, you can be more confident it's real.

Let's say today is June, and you run the script; it will pick three distinct date-ranges from the preceding months, and every experiment will be trained on each of them.

Note that the script chooses a new set of date-ranges if you run it on a different month. This is important, since otherwise you could accidentally overfit your models due to extensive hyperparameter tuning.
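
One simple way to get a set of date-ranges that stays fixed within a month but changes between months is to seed the random generator with the current year and month. This is only a sketch of the idea; the window length and look-back period are made-up values.

```python
import datetime
import random

def pick_date_ranges(n_ranges=3, window_days=90, history_days=365):
    """Pick n_ranges training windows that stay fixed for a whole calendar
    month, but change when the script runs in a different month."""
    today = datetime.date.today()
    rng = random.Random(f"{today.year}-{today.month}")  # same seed all month

    ranges = []
    for _ in range(n_ranges):
        # The window must end by today and start within the look-back period.
        start_offset = rng.randint(window_days, history_days)
        start = today - datetime.timedelta(days=start_offset)
        end = start + datetime.timedelta(days=window_days)
        ranges.append((start.isoformat(), end.isoformat()))
    return ranges

print(pick_date_ranges())
```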

I don't have time for this

Who has the time to run every experiment three times? It's nice that you can get reliable results, but it means you'll end up running fewer different experiments.

The next version of the script tackled this problem by supporting a new mode of operation: a quick mode that trains each experiment only once, on a shorter date-range and for fewer epochs.

But are the results correlated with what we would get if we used more data and epochs?

To answer this, we performed some experiments, each using a different amount of data ranging from one week to three months. We found out that one month had a good tradeoff between MSE and training time.

(Did you notice we used the hyperparameter tuning script to tune the hyperparameter tuning script itself? How cool is that?)

To investigate what the right number of epochs would be, we analyzed the MSE on TensorBoard.

Each plot in the graph represents a different model trained on a different amount of data. After 20 epochs all of the models have almost converged, so it's safe to stop there.
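
Putting the two findings together, the quick mode boils down to one month of data and a hard cap of roughly 20 epochs. Here is a sketch of how that might plug into the date-range helper above; the reliable-mode numbers are illustrative placeholders.

```python
# Quick mode trades coverage for turnaround time: one date-range of roughly a
# month and at most 20 epochs, versus three longer date-ranges otherwise.
QUICK_MODE = {"num_date_ranges": 1, "window_days": 30, "max_epochs": 20}
RELIABLE_MODE = {"num_date_ranges": 3, "window_days": 90, "max_epochs": 100}  # illustrative

def build_run_config(quick=True):
    """Return the date-ranges and epoch budget for a single experiment,
    reusing the pick_date_ranges sketch from above."""
    mode = QUICK_MODE if quick else RELIABLE_MODE
    return {
        "date_ranges": pick_date_ranges(n_ranges=mode["num_date_ranges"],
                                        window_days=mode["window_days"]),
        "max_epochs": mode["max_epochs"],
    }
```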

C'mon script, do the job for me!

At this point we decided the script should choose the hyperparameter values for you. We started with the learning rate related hyperparameters: initial learning rate, decay factor, number of epochs with no improvement for early stopping, etc.

Why learning rate?

These hyperparameters have an outsized effect on whether training converges at all, and they apply to every architecture we train, so they were a natural place to start.

The new mode of the script uses hard-coded ranges of values that are reasonable for our models.

Then came the next demand: give the script just an architecture name, and let it figure out all the hyperparameters on its own.

The next version of the script did just that: provided with an architecture name, it automatically generates experiments for you. You don't need to specify any hyperparameter values.
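
A sketch of how that could work: per-architecture generation driven by the hard-coded learning-rate ranges mentioned above. The specific ranges, distributions and names here are illustrative, not the ones Taboola uses.

```python
import math
import random

# Hard-coded, architecture-agnostic ranges for the learning-rate related
# hyperparameters (the values here are illustrative).
LEARNING_RATE_SPACE = {
    "initial_learning_rate": ("log-uniform", 1e-4, 1e-1),
    "decay_factor": ("uniform", 0.7, 1.0),
    "early_stopping_patience": ("choice", [3, 5, 10]),
}

def sample(spec):
    """Draw a single value according to the (distribution, ...) spec."""
    kind = spec[0]
    if kind == "log-uniform":
        low, high = spec[1], spec[2]
        return math.exp(random.uniform(math.log(low), math.log(high)))
    if kind == "uniform":
        return random.uniform(spec[1], spec[2])
    return random.choice(spec[1])

def generate_experiment(architecture):
    """Given only an architecture name, produce a full experiment definition."""
    return {
        "architecture": architecture,
        "parameters": {name: sample(spec)
                       for name, spec in LEARNING_RATE_SPACE.items()},
    }

print(generate_experiment("lstm-attention"))
```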

Let's go random

Although grid search can be worse than random search in some cases, it is easier to analyze: every value is used by multiple experiments, so it's easy to spot trends.

Since we had already gained intuition on what values work better, it was time to implement random search. Doing so helped the script find better hyperparameters.
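
To make the difference concrete, here is a minimal sketch of grid search versus random search over the same toy search space. The space reuses parameter names from the JSON example above; in practice random search pays off most when you sample from continuous ranges rather than short lists.

```python
import itertools
import random

SEARCH_SPACE = {
    "attention_hidden_layer_size": [32, 128, 256],
    "attention_regularization": [0.1, 0.01, 0.001],
}

def grid_search(space):
    """Every combination: easy to spot trends, but the number of runs
    explodes as you add hyperparameters."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))

def random_search(space, n_trials=5):
    """Independent random draws: for the same budget, each individual
    hyperparameter gets covered more densely, which helps when only a
    few hyperparameters really matter."""
    for _ in range(n_trials):
        yield {name: random.choice(values) for name, values in space.items()}

print(list(grid_search(SEARCH_SPACE)))
print(list(random_search(SEARCH_SPACE)))
```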

Some final thoughts

In the research phase of any machine learning project, hyperparameter tuning can be done manually. However, when you want to take the project to the next level, it's highly effective to automate the process. In this post I described some of the small touches we added to the automation process. Some might be useful for you, some might not. Drop a line in the comments if you have found other tricks that proved helpful.
