This document provides a comprehensive reference for all training and evaluation parameters available in ERNIEKit. It covers:

- Basic model configuration and training setup
- Evaluation metrics and strategies
- Performance optimization techniques
- Distributed training configurations
- Memory optimization options
- Checkpoint saving strategies
- Acceleration methods
- Mixed precision training settings
- Specialized configurations for SFT, LoRA, DPO, and FP8 training

Each parameter is documented with its type, default value, and a detailed description to help developers configure their training jobs correctly.
## 1. General Configuration

### 1.1 Basic Configuration
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name_or_path | str | Required | Model name or local path used to load the model and tokenizer |
| hidden_dropout_prob | float | 0.0 | Dropout probability for hidden layers |
| attention_probs_dropout_prob | float | 0.0 | Dropout probability for attention layers |
| dropout_warmup_steps | int | 0 | Warmup steps for dropout: the dropout probability increases linearly during warmup and is disabled afterward. Set to 0 to disable dropout. |
| weight_quantize_algo | str | Required | Model quantization algorithm. Options: `weight_only_mix` (expert weights as int4, other linear layers as int8), `weight_only_int8` (all linear layers as int8), or `fp8_linear` |
| output_dir | str | Required | Directory for saved model files, checkpoints, tokenizers, and evaluation results |
| logging_steps | int | Required | Logging interval in steps; decrease for more frequent updates |
| logging_dir | str | Required | Log directory (defaults to `output_dir` if unspecified) |
| do_eval | bool | False | Enable model evaluation |
| do_train | bool | False | Enable training |
| disable_tqdm | bool | False | Disable the tqdm progress bar used to estimate total training time |
|  |  |  | Enable greedy token-based packing: instead of sampling sequentially, a global buffer of samples is maintained and greedily packed into sequences to maximize token utilization and minimize padding |
| dataloader_num_workers | int | 1 | Number of dataloader subprocesses (0 to disable multiprocessing) |
| distributed_dataloader | int | 0 | Use a distributed dataloader for large datasets |
| moe_multimodal_dispatch_use_allgather | str | v2-alltoall-unpad | Optimize the MoE layer with allgather + unpad |
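As an illustration, the basic parameters above might be combined in a training configuration file like the following sketch. The model path, output directory, and all values are placeholders, not recommended settings:

```yaml
# Hypothetical example: basic training configuration.
model_name_or_path: ./models/my-ernie-checkpoint  # placeholder local path
output_dir: ./output/run1        # holds checkpoints, tokenizer, eval results
do_train: true                   # enable training
do_eval: true                    # enable evaluation
logging_steps: 10                # log every 10 steps
dataloader_num_workers: 1        # one dataloader subprocess
disable_tqdm: false              # keep the tqdm progress bar
```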
### 1.8 Mixed Precision (Recommended Defaults)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| bf16 | bool | False | Enable BF16 training |
| fp16_opt_level | str | O1 | AMP optimization level (`O2` converts parameters to float16/bfloat16) |
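A minimal sketch of enabling mixed precision with the two parameters above, assuming BF16-capable hardware (values are illustrative, not prescriptive):

```yaml
# Hypothetical example: BF16 mixed-precision settings.
bf16: true           # train in bfloat16
fp16_opt_level: O2   # O2 casts parameters to float16/bfloat16
```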