Is Qwen3-0.6B Really Better than BERT at Text Classification?

 

This article compares Qwen3-0.6B and BERT on text classification over the AG_News dataset. The experiments show that Qwen3-0.6B's results vary by configuration: classification through a linear head works best, beating both BERT and SFT-based classification. The article also measures training time and inference speed for each model, and examines how Think mode affects Qwen3-0.6B. The results offer a reference point for using small models in text classification.

🚀 The experiment pits BERT against Qwen3-0.6B on AG_News text classification, probing the potential of small models for this task.

📊 Qwen3-0.6B with a linear classification head reaches a test-set F1 of 0.949, ahead of BERT's 0.945 and SFT classification's 0.941, showing it is competitive at text classification.

⏱️ Training and inference times are compared across models: Qwen3-0.6B SFT classification takes the longest to train, while BERT infers fastest, a useful reference for real-time workloads.

🧠 Qwen3-0.6B is analyzed in both Think and No Think modes: Think mode is slightly more accurate but markedly slower at inference.


Preface

I recently came across an interesting question on Zhihu: what practical value does a small model like Qwen3-0.6B actually have? Reading through the answers, some pointed to the edge-device advantages of small models (low latency); some argued that small models mainly exist so other researchers can verify scaling laws (the rich spread of model sizes in the Qwen2.5 series gave the open-source community a basis for validating methods); others noted that few-shot performance at 4B or 7B is already strong, and that directly calling a larger LLM also solves the problem well. What caught my attention was one answer claiming that small models punch above their weight in vector search, named entity recognition (NER), and text classification, with BERT as the usual point of comparison.

For Chinese text classification, when TextCNN or FastText falls short, the next stop is usually the BERT family and its variants (RoBERTa, etc.). Yet Encoder-Only models trained primarily on Chinese corpora remain scarce (even the recently released ModernBERT is trained mostly on English and code), so Chinese text classification still largely means fine-tuning from bert-base-chinese, six years after BERT's release. Can a Decoder-Only LLM beat a much smaller BERT at text classification? I set up an experiment to find out.

If you'd rather skip the experimental details, jump straight to the Conclusions and Limitations sections at the end.

Experimental Setup

| Model | Parameters | Training approach |
| --- | --- | --- |
| google-bert/bert-base-cased | 0.1B | Add a linear classification head whose output dimension equals the number of classes |
| Qwen/Qwen3-0.6B | 0.6B | Construct a prompt and run SFT |

A sample record from AG_News (label 3 corresponds to Science/Technology):

```json
{
  "text": "New iPad released Just like every other September, this one is no different. Apple is planning to release a bigger, heavier, fatter iPad that...",
  "label": 3
}
```
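
For reference, AG_News can be pulled straight from the Hugging Face Hub. A minimal loading sketch (assuming the public `ag_news` dataset id; verify names against your `datasets` version):

```python
# Minimal sketch: load AG_News from the Hugging Face Hub.
from datasets import load_dataset

dataset = load_dataset("ag_news")  # splits: train (120,000) and test (7,600)
print(dataset["train"][0])         # {'text': '...', 'label': 2}
print(dataset["train"].features["label"].names)
# ['World', 'Sports', 'Business', 'Sci/Tech']
```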

BERT Training Details

| Parameter | Value |
| --- | --- |
| lr_scheduler_type (LR decay schedule) | cosine |
| learning_rate | 1.0e-5 |
| per_device_train_batch_size | 64 |
| gradient_accumulation_steps | 1 |
| per_device_eval_batch_size | 256 |
| num_train_epochs | 3 |
| weight_decay | 1e-6 |
| eval_steps (evaluation frequency) | 0.05 |
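
For reference, a minimal sketch of this setup with the Hugging Face `Trainer` (the output directory is a placeholder and metric computation is omitted; `dataset` comes from the loading sketch above):

```python
# Minimal sketch of the BERT fine-tuning configuration in the table above.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=4)  # randomly initialized linear head

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = dataset["train"].map(tokenize, batched=True)
eval_ds = dataset["test"].map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-agnews",   # hypothetical output path
    lr_scheduler_type="cosine",
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=1,
    per_device_eval_batch_size=256,
    num_train_epochs=3,
    weight_decay=1e-6,
    eval_strategy="steps",      # `evaluation_strategy` on older transformers
    eval_steps=0.05,            # a float < 1 is a fraction of total training steps
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds,
                  processing_class=tokenizer)  # `tokenizer=` on older versions
trainer.train()
```

The evaluation trace during training:
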
| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| 282 | 0.274700 | 0.263394 | 0.909737 | 0.910311 | 0.909737 | 0.909676 |
| 564 | 0.207800 | 0.222230 | 0.922237 | 0.922701 | 0.922237 | 0.922246 |
| 846 | 0.199600 | 0.204222 | 0.931579 | 0.932552 | 0.931579 | 0.931510 |
| 1128 | 0.215600 | 0.191824 | 0.934605 | 0.935274 | 0.934605 | 0.934737 |
| 1410 | 0.190500 | 0.192846 | 0.932763 | 0.934421 | 0.932763 | 0.932937 |
| 1692 | 0.193300 | 0.180665 | 0.937895 | 0.938941 | 0.937895 | 0.937849 |
| 1974 | 0.143000 | 0.180497 | 0.940526 | 0.940945 | 0.940526 | 0.940636 |
| 2256 | 0.141500 | 0.177630 | 0.941711 | 0.941988 | 0.941711 | 0.941644 |
| 2538 | 0.147100 | 0.173602 | 0.943947 | 0.944022 | 0.943947 | 0.943908 |
| 2820 | 0.131600 | 0.176895 | 0.940658 | 0.941790 | 0.940658 | 0.940683 |
| 3102 | 0.152800 | 0.170928 | 0.945000 | 0.945140 | 0.945000 | 0.944925 |
| 3384 | 0.140000 | 0.169215 | 0.944474 | 0.944766 | 0.944474 | 0.944399 |
| 3666 | 0.149900 | 0.168865 | 0.944474 | 0.944538 | 0.944474 | 0.944483 |
| 3948 | 0.112000 | 0.172459 | 0.946184 | 0.946142 | 0.946184 | 0.946159 |
| 4230 | 0.124000 | 0.172826 | 0.945000 | 0.945254 | 0.945000 | 0.944924 |
| 4512 | 0.122300 | 0.171583 | 0.944737 | 0.944925 | 0.944737 | 0.944708 |
| 4794 | 0.104400 | 0.171969 | 0.944868 | 0.945059 | 0.944868 | 0.944854 |
| 5076 | 0.117500 | 0.171504 | 0.945395 | 0.945502 | 0.945395 | 0.945363 |
| 5358 | 0.099800 | 0.171761 | 0.945263 | 0.945510 | 0.945263 | 0.945232 |

Qwen3 Training Details

Linear-Head Classification

| Parameter | Value |
| --- | --- |
| lr_scheduler_type (LR decay schedule) | cosine |
| learning_rate | 1.0e-5 |
| per_device_train_batch_size | 8 |
| gradient_accumulation_steps | 8 |
| per_device_eval_batch_size | 16 |
| num_train_epochs | 1 |
| weight_decay | 1.0e-6 |
| eval_steps (evaluation frequency) | 0.05 |
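
The linear head can be attached the same way as for BERT; a sketch, assuming your transformers version provides sequence-classification support for Qwen3 (for decoder-only models the head reads the hidden state of the last non-padding token, so the pad token must be set):

```python
# Minimal sketch: Qwen3-0.6B with a randomly initialized classification head.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-0.6B", num_labels=4)
model.config.pad_token_id = tokenizer.pad_token_id  # required for batched inputs
# Training then follows the same Trainer recipe as the BERT sketch,
# with the hyperparameters from the table above.
```

Evaluation during training:
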
| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| 94 | 0.281800 | 0.243619 | 0.918158 | 0.918180 | 0.918158 | 0.917893 |
| 188 | 0.224100 | 0.220015 | 0.924211 | 0.925216 | 0.924211 | 0.924289 |
| 282 | 0.197700 | 0.236405 | 0.919211 | 0.920127 | 0.919211 | 0.919257 |
| 376 | 0.182800 | 0.243235 | 0.920132 | 0.925368 | 0.920132 | 0.919136 |
| 470 | 0.191500 | 0.207864 | 0.928289 | 0.929563 | 0.928289 | 0.928304 |
| 564 | 0.208400 | 0.192414 | 0.935658 | 0.935668 | 0.935658 | 0.935647 |
| 658 | 0.201900 | 0.191506 | 0.938553 | 0.938695 | 0.938553 | 0.938607 |
| 752 | 0.191900 | 0.179849 | 0.937500 | 0.937417 | 0.937500 | 0.937378 |
| 846 | 0.156100 | 0.177319 | 0.938684 | 0.938983 | 0.938684 | 0.938653 |
| 940 | 0.159900 | 0.177048 | 0.938289 | 0.939433 | 0.938289 | 0.938175 |
| 1034 | 0.159100 | 0.172280 | 0.943553 | 0.943725 | 0.943553 | 0.943455 |
| 1128 | 0.117000 | 0.168742 | 0.943026 | 0.942911 | 0.943026 | 0.942949 |
| 1222 | 0.151500 | 0.164628 | 0.943421 | 0.944371 | 0.943421 | 0.943503 |
| 1316 | 0.143600 | 0.158676 | 0.945921 | 0.946856 | 0.945921 | 0.945965 |
| 1410 | 0.183200 | 0.154356 | 0.946184 | 0.946708 | 0.946184 | 0.946221 |
| 1504 | 0.159400 | 0.153549 | 0.947763 | 0.947847 | 0.947763 | 0.947771 |
| 1598 | 0.147100 | 0.152530 | 0.948553 | 0.948609 | 0.948553 | 0.948539 |
| 1692 | 0.161400 | 0.151299 | 0.949079 | 0.949216 | 0.949079 | 0.949029 |
| 1786 | 0.150500 | 0.151270 | 0.948421 | 0.948572 | 0.948421 | 0.948363 |

SFT Classification

The prompt template and training target:

```python
prompt = """Please read the following news article and determine its category from the options below.

Article:
{news_article}

Question: What is the most appropriate category for this news article?
A. World
B. Sports
C. Business
D. Science/Technology

Answer:/no_think"""

answer = "<think>\n\n</think>\n\n{answer_text}"
```

| Model | Accuracy (Think) | Accuracy (No Think) |
| --- | --- | --- |
| Qwen3-0.6B | 0.7997 | 0.7898 |
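
The `/no_think` marker at the end of the prompt is Qwen3's soft switch for disabling thinking; the same toggle is available when building inputs through the chat template. A sketch, assuming Qwen3's documented `enable_thinking` argument (here `news_text` is a placeholder article string):

```python
# Sketch: toggling Qwen3's Think / No Think mode at template time.
messages = [{"role": "user", "content": prompt.format(news_article=news_text)}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # False => No Think; True => Think
)
```
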
A converted training sample:

```json
{
  "instruction": "Please read the following news article and determine its category from the options below.\n\nArticle:\nWall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.\n\nQuestion: What is the most appropriate category for this news article?\nA. World\nB. Sports\nC. Business\nD. Science/Technology\n\nAnswer:/no_think",
  "output": "<think>\n\n</think>\n\nC"
}
```
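
A sketch of producing these records from AG_News (the output file name `agnews_train.json` is hypothetical, matching the `agnews_train` dataset entry registered in the config below; `dataset` and `prompt` come from the earlier sketches):

```python
# Sketch: convert AG_News rows into instruction/output records for LLaMA-Factory.
import json

LETTERS = ["A", "B", "C", "D"]  # labels 0-3: World, Sports, Business, Sci/Tech

records = [
    {
        "instruction": prompt.format(news_article=row["text"]),
        "output": "<think>\n\n</think>\n\n" + LETTERS[row["label"]],
    }
    for row in dataset["train"]
]

with open("agnews_train.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```
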
The LLaMA-Factory training config:

```yaml
### model
model_name_or_path: model/Qwen3-0.6B

### method
stage: sft
do_train: true
finetuning_type: full

### dataset
dataset: agnews_train
template: qwen3
cutoff_len: 512
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: Qwen3-0.6B-Agnews
save_strategy: steps
logging_strategy: steps
logging_steps: 0.01
save_steps: 0.2
plot_loss: true
report_to: tensorboard
overwrite_output_dir: true

### train
per_device_train_batch_size: 12
gradient_accumulation_steps: 8
learning_rate: 1.2e-5
warmup_ratio: 0.01
num_train_epochs: 1
lr_scheduler_type: cosine
bf16: true
```
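
With LLaMA-Factory installed, a config like this is typically launched with `llamafactory-cli train <config>.yaml`; the exact invocation may differ across versions.
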
Metrics by training step:

| Step | Training Loss | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| 250 | 0.026 | 0.912 | 0.917 | 0.912 | 0.912 |
| 500 | 0.027 | 0.924 | 0.924 | 0.924 | 0.924 |
| 750 | 0.022 | 0.937 | 0.937 | 0.937 | 0.937 |
| 1000 | 0.022 | 0.941 | 0.941 | 0.941 | 0.941 |
| 1250 | 0.023 | 0.940 | 0.940 | 0.940 | 0.940 |
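
Unlike the linear-head models, which score logits directly, the SFT metrics require generating text and parsing out the option letter. A sketch of the scoring, assuming `generated_texts` holds the decoded test-set outputs and answers follow the training format above (recall matching accuracy in the tables is consistent with weighted averaging):

```python
# Sketch: score SFT generations against the AG_News test labels.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

LETTER_TO_LABEL = {"A": 0, "B": 1, "C": 2, "D": 3}

def parse_prediction(text: str) -> int:
    answer = text.split("</think>")[-1].strip()
    return LETTER_TO_LABEL.get(answer[:1], -1)  # -1 marks unparseable output

y_true = [row["label"] for row in dataset["test"]]
y_pred = [parse_prediction(t) for t in generated_texts]  # generated_texts: assumed

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
```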

BERT vs. Qwen3-0.6B: Training Time

| Model | Epochs | Training time | Inference time | Total |
| --- | --- | --- | --- | --- |
| BERT | 3 | 35 min | - | 0.58 h |
| Qwen3-0.6B (linear head) | 1 | 52 min | - | 0.86 h |
| Qwen3-0.6B (SFT) | 1 | 62 min | 30 min | 1.5 h |

BERT vs. Qwen3-0.6B: RPS Test

| Model | Inference engine | Max output tokens | RPS |
| --- | --- | --- | --- |
| BERT | HF | - | 60.3 |
| Qwen3-0.6B (SFT) | HF | 8 | 13.2 |
| Qwen3-0.6B (SFT) | vLLM | 8 | 27.1 |
| Qwen3-0.6B (linear head) | HF | - | 38.1 |
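
For the vLLM row, throughput can be measured with offline batched generation; a minimal sketch (the fine-tuned model path is a placeholder and `prompts` is the list of formatted classification prompts):

```python
# Sketch: batched inference with vLLM, capping output at 8 tokens as above.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen3-0.6B-Agnews")  # hypothetical local checkpoint path
params = SamplingParams(temperature=0.0, max_tokens=8)

outputs = llm.generate(prompts, params)
texts = [o.outputs[0].text for o in outputs]
```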

Conclusions

On AG_News, Qwen3-0.6B with a linear classification head reached the best test F1 (0.949), slightly ahead of BERT (0.945) and Qwen3-0.6B SFT classification (0.941). The margin is small and comes at a cost: BERT finishes in 0.58 h total versus 0.86 h (linear head) and 1.5 h (SFT), and serves 60.3 RPS versus 38.1 (linear head, HF) and 27.1 (SFT, vLLM), so BERT remains the pragmatic choice for latency-sensitive, high-throughput scenarios. For Qwen3-0.6B itself, Think mode buys a small accuracy gain over No Think (0.7997 vs. 0.7898 before fine-tuning) at a significant increase in inference time.

Limitations

