
# Training Configuration

This document provides a comprehensive reference for all training and evaluation parameters available in ERNIEKit. It covers:

- Basic model configuration and training setup
- Evaluation metrics and strategies
- Performance optimization techniques
- Distributed training configurations
- Memory optimization options
- Checkpoint saving strategies
- Acceleration methods
- Mixed precision training settings
- Specialized configurations for SFT, LoRA, DPO, and FP8 training

Each parameter is documented with its type, default value, and a detailed description to help developers properly configure their training jobs.

## 1. General Configuration

### 1.1 Basic Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name_or_path | str | Required | Model name or local model path for the model and tokenizer |
| hidden_dropout_prob | float | 0.0 | Dropout probability for hidden layers |
| attention_probs_dropout_prob | float | 0.0 | Dropout probability for attention layers |
| dropout_warmup_steps | int | 0 | Warmup steps for dropout: the dropout probability increases linearly during warmup and dropout is disabled afterward. Set to 0 to disable dropout. |
| weight_quantize_algo | str | Required | Model quantization algorithm. Options: weight_only_mix (expert weights as int4, other linear layers as int8), weight_only_int8 (all linear layers as int8), or fp8_linear |
| output_dir | str | Required | Directory to save model files, checkpoints, tokenizers, and evaluation results |
| logging_steps | int | Required | Logging interval in steps. Decrease for more frequent updates. |
| logging_dir | str | Required | Log directory (defaults to output_dir if unspecified) |
| do_eval | bool | False | Enable model evaluation |
| do_train | bool | False | Enable training |
| disable_tqdm | bool | False | Disable the tqdm progress bar used to estimate total training time |
| continue_training | bool | True | Load pretrained weights to continue training |
| from_hf_hub | bool | False | Download the model from the Hugging Face Hub |
| from_aistudio | bool | False | Download the model from AI Studio |
| from_modelscope | bool | False | Download the model from ModelScope |
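
These basic options usually sit at the top of an ERNIEKit YAML config. The sketch below is illustrative only; the model name, directories, and chosen values are placeholder assumptions, not prescribed defaults:

```yaml
# Illustrative basic block of a training config (model name and paths are placeholders)
model_name_or_path: baidu/ERNIE-4.5-0.3B-Paddle   # hub name or local model directory
output_dir: ./output/sft_run                      # checkpoints, tokenizer, eval results
logging_dir: ./output/sft_run/logs                # falls back to output_dir if unspecified
logging_steps: 10
do_train: true
do_eval: true
continue_training: true                           # start from the pretrained weights
from_hf_hub: false                                # set true to download from the Hugging Face Hub
```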

### 1.2 Evaluation

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| per_device_eval_batch_size | int | Required | Evaluation batch size (micro batch size) |
| eval_dataset_path | str | Required | Path to the evaluation dataset (see sft-eval.jsonl) |
| eval_dataset_prob | str | 1.0 | Evaluation dataset sampling probability |
| eval_dataset_type | str | erniekit | Evaluation dataset type |
| eval_steps | int | Required | Evaluation interval in steps |
| evaluation_strategy | str | "steps" | Evaluation strategy. "steps" enables periodic evaluation |
| max_evaluate_steps | int | 1 | Maximum steps per evaluation (if positive) |
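
As an example, the block below enables periodic evaluation every 100 steps; the dataset path and step counts are assumed values for illustration:

```yaml
# Illustrative evaluation block (dataset path and step counts are assumptions)
do_eval: true
evaluation_strategy: steps          # evaluate every eval_steps optimizer steps
eval_steps: 100
per_device_eval_batch_size: 1
eval_dataset_path: ./data/sft-eval.jsonl
eval_dataset_prob: "1.0"
eval_dataset_type: erniekit
max_evaluate_steps: 50              # cap the number of batches per evaluation pass
```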

### 1.3 Training Performance

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| train_dataset_path | str | Required | Training dataset path (see sft-train.jsonl) |
| train_dataset_prob | str | 1.0 | Training dataset sampling probability |
| train_dataset_type | str | erniekit | Training dataset type |
| max_steps | int | Required | Maximum training steps (overrides num_train_epochs if set) |
| num_train_epochs | int | Required | Training epochs |
| per_device_train_batch_size | int | Required | Training batch size (micro batch size). Global batch size = data-parallel/sharding degree × micro batch size × gradient_accumulation_steps |
| gradient_accumulation_steps | int | Required | Gradient accumulation steps |
| weight_decay | float | 0.0 | AdamW optimizer weight decay |
| seed | int | 42 | Random seed |
| max_seq_len | int | Required | Maximum token length. Reduce if OOM occurs when increasing the global batch size. |
| learning_rate | float | Required | Learning rate (SFT: 3e-5, DPO: 1e-6, SFT-LoRA: 3e-4, DPO-LoRA: 1e-5) |
| warmup_steps | int | Required | Warmup steps (typically 1%-10% of max_steps) |
| lr_scheduler_type | str | linear | Learning rate scheduler (linear/cosine/polynomial/constant/constant_with_warmup) |
| min_lr | float | 0.0 | Minimum learning rate (cosine scheduler only) |
| layerwise_lr_decay_bound | float | 1.0 | Layerwise LR decay factor in (0, 1]. 1 means no decay. |
| random_shuffle | bool | True | Enable dataset shuffling |
| num_cycles | float | 0.5 | Cosine scheduler: number of waves |
| lr_end | float | 1e-7 | Polynomial scheduler: final LR |
| power | float | 1.0 | Polynomial scheduler: power |
| adam_beta1 | float | 0.9 | AdamW beta1 |
| adam_beta2 | float | 0.999 | AdamW beta2 |
| adam_epsilon | float | 1e-8 | AdamW epsilon |
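
The example below shows how the global batch size arithmetic works in practice: with 8 data-parallel/sharding ranks, one micro batch per device, and 8 accumulation steps, each optimizer step sees 8 × 1 × 8 = 64 sequences. All concrete values are illustrative assumptions:

```yaml
# Illustrative training block: with 8 DP/sharding ranks,
# global batch size = 8 ranks * 1 micro batch * 8 accumulation steps = 64 sequences
train_dataset_path: ./data/sft-train.jsonl
train_dataset_prob: "1.0"
train_dataset_type: erniekit
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
max_steps: 2000
max_seq_len: 8192
learning_rate: 3.0e-5               # SFT full-parameter starting point from the table above
warmup_steps: 100                   # about 5% of max_steps
lr_scheduler_type: cosine
min_lr: 1.0e-6
weight_decay: 0.0
seed: 42
```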

### 1.4 Distributed Training

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tensor_parallel_degree | int | Required | Tensor parallelism degree |
| tensor_parallel_config | str | Required | Recommended: "sync_param sync_grad sync_moment" |
| tensor_parallel_output | bool | True | Enable parallel output for the last Transformer layer to save memory |
| pipeline_parallel_degree | int | Required | Pipeline parallelism degree |
| pipeline_parallel_config | str | Required | Recommended: "disable_partial_send_recv enable_clear_every_step_cache enable_delay_scale_loss enable_overlap_p2p_comm best_unbalanced_scheduler" |
| pp_seg_method | str | Required | Pipeline layer segmentation method |
| virtual_pp_degree | int | 1 | Virtual pipeline degree (effective when pipeline_parallel_degree > 1) |
| add_tail_layers | int | 0 | Add EmptyLayers after DecodeLayer for virtual pipeline requirements |
| sharding_parallel_degree | int | Required | Sharding parallelism degree |
| sharding_parallel_config | str | Required | Recommended: "enable_stage1_overlap enable_release_grads" |
| sharding | str | Required | Sharding stage (stage1: optimizer, stage2: gradients, stage3: parameters) |
| sequence_parallel | bool | True | Enable sequence parallelism |
| moe_group | str | "dummy" | MoE communication group ("mp" for training, "dummy" for inference) |
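
A hybrid-parallel layout might look like the sketch below. The degrees shown (TP 2, PP 2, sharding 2 on an 8-GPU node) are an assumed example, not a recommendation; pick degrees that match your hardware:

```yaml
# Illustrative hybrid-parallel layout for an 8-GPU node (assumed degrees)
tensor_parallel_degree: 2
tensor_parallel_config: sync_param sync_grad sync_moment
tensor_parallel_output: true
pipeline_parallel_degree: 2
pipeline_parallel_config: disable_partial_send_recv enable_clear_every_step_cache enable_delay_scale_loss enable_overlap_p2p_comm best_unbalanced_scheduler
virtual_pp_degree: 1
sharding_parallel_degree: 2
sharding: stage1                    # shard optimizer states only
sharding_parallel_config: enable_stage1_overlap enable_release_grads
sequence_parallel: true
moe_group: mp                       # "mp" for training, "dummy" for inference
```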

### 1.5 Memory Optimization

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| release_grads | bool | False | Release gradients after each iteration to reduce peak memory |
| use_sparse_head_and_loss_fn | bool | False | Use sparse LM head and loss function |
| use_fused_head_and_loss_fn | bool | False | Fuse the LM head and CrossEntropyLoss to save memory |
| use_attn_mask_startend_row_indices | bool | True | Use sparse mask representation with start row indices |
| recompute_use_reentrant | bool | False | Recompute implementation (PyLayer if True, hooks if False) |
| recompute | bool | False | Enable gradient checkpointing |
| recompute_granularity | str | "full" | Recompute granularity ("full"/"full_attn"/"core_attn") |
| offload_optim | bool | False | Offload optimizer to CPU |
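
If a run hits out-of-memory errors, these switches are commonly combined as in the sketch below; which ones to enable, and the speed cost you accept, is a judgment call rather than a fixed recipe:

```yaml
# Illustrative memory-saving block for an OOM-prone run (assumed selection)
recompute: true
recompute_granularity: full         # "full_attn"/"core_attn" save less memory but run faster
recompute_use_reentrant: false
release_grads: true
use_fused_head_and_loss_fn: true
use_sparse_head_and_loss_fn: true
offload_optim: true                 # move optimizer states to CPU at the cost of transfer time
```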

### 1.6 Checkpoint

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| save_steps | int | Required | Checkpoint save interval (when save_strategy == "steps") |
| save_strategy | str | "no" | Checkpoint save strategy |
| unified_checkpoint | bool | True | Use the unified checkpoint format |
| unified_checkpoint_config | str | "" | See Unified Checkpoint |
| disable_ckpt_quant | bool | False | See Unified Checkpoint |
| ignore_save_lr_and_optim | bool | False | Skip saving optimizer states |
| ignore_load_lr_and_optim | bool | False | Skip loading optimizer states |
| save_total_limit | int | None | Maximum number of checkpoints to keep |
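
For example, step-based checkpointing with a retention limit could be configured as follows; the interval and limit are assumed values:

```yaml
# Illustrative checkpoint block: save every 500 steps, keep the 3 most recent
save_strategy: steps
save_steps: 500
save_total_limit: 3
unified_checkpoint: true
ignore_save_lr_and_optim: false     # set true to drop optimizer states and shrink checkpoints
```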

### 1.7 Acceleration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| use_flash_attention | bool | True | Enable FlashAttention |
| use_sparse_flash_attn | bool | True | Enable FlashMask (requires use_attn_mask_startend_row_indices) |
| fuse_rope | bool | False | Fuse rotary position embedding |
| fuse_linear | bool | False | Fuse linear operations |
| greedy_intokens | bool | True | Enable greedy token-based packing. Instead of sequential sampling, a global buffer of samples is maintained and greedily packed into sequences to maximize token utilization and minimize padding. |
| dataloader_num_workers | int | 1 | Dataloader subprocess count (0 to disable) |
| distributed_dataloader | int | 0 | Use distributed dataloader for large datasets |
| moe_multimodal_dispatch_use_allgather | str | v2-alltoall-unpad | Optimize MoE layer with allgather+unpad |
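
A typical acceleration block is sketched below; the selection of switches is an assumption, and the FlashMask dependency on the start/end row-index mask comes from the table above:

```yaml
# Illustrative acceleration block (assumed selection of switches)
use_flash_attention: true
use_attn_mask_startend_row_indices: true
use_sparse_flash_attn: true         # FlashMask requires the sparse mask representation above
fuse_rope: true
fuse_linear: true
greedy_intokens: true               # pack samples greedily to minimize padding
dataloader_num_workers: 1
```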

### 1.8 Mixed Precision (Recommended Defaults)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| bf16 | bool | False | Enable BF16 training |
| fp16_opt_level | str | O1 | AMP level (O2 converts params to float16/bfloat16) |
| scale_loss | int | 2 ** 15 | Loss scaling factor for float16 |
| amp_custom_white_list | str | Required | AMP O2 whitelist (e.g., "lookup_table flash_attn matmul") |
| amp_custom_black_list | str | Required | AMP O2 blacklist (e.g., "reduce_sum elementwise_div") |
| amp_master_grad | bool | False | Maintain float32 gradients for AMP O2 |
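
A BF16 O2 setup with float32 master gradients might look like the sketch below; the white/black lists simply reuse the examples from the table and are assumptions, not mandatory values:

```yaml
# Illustrative BF16 / AMP O2 block (list contents are the examples from the table above)
bf16: true
fp16_opt_level: O2
amp_master_grad: true               # keep float32 gradients alongside low-precision parameters
amp_custom_white_list: lookup_table flash_attn matmul
amp_custom_black_list: reduce_sum elementwise_div
```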

## 2. Specialized Configurations

### 2.1 SFT

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_samples_each_epoch | int | 6000000 | Virtual epoch size (keeping the default is recommended) |

### 2.2 LoRA

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| lora_rank | int | 8 | LoRA rank (typical: 8/16/32; higher improves quality but increases memory) |
| lora_alpha | float | -1 | LoRA scaling factor (scaling = alpha/rank, or alpha/sqrt(rank) with rslora) |
| rslora | bool | False | Enable rslora scaling (recommended for rank ≥ 64) |
| lora_plus_scale | float | 1 | LoRA+ learning rate multiplier (recommended: 4-16) |
| rslora_plus | bool | False | Enhanced LoRA (improves performance but may cause forgetting) |
| lora | bool | False | Enable LoRA training |
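
For illustration, the scaling arithmetic with the assumed values below works out to scaling = lora_alpha / lora_rank = 64 / 32 = 2.0; with rslora enabled it would instead be 64 / sqrt(32) ≈ 11.3:

```yaml
# Illustrative LoRA block: scaling = lora_alpha / lora_rank = 64 / 32 = 2.0
lora: true
lora_rank: 32
lora_alpha: 64
rslora: false                       # with rslora, scaling = 64 / sqrt(32) ≈ 11.3
lora_plus_scale: 4                  # LoRA+ learning rate multiplier (assumed value)
learning_rate: 3.0e-4               # SFT-LoRA starting point from section 1.3
```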

### 2.3 DPO

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| beta | float | 0.1 | DPO loss temperature |
| simpo_gamma | float | 0.5 | SimPO loss gamma |
| offset_alpha | float | 0.0 | Score-based DPO loss offset |
| max_prompt_len | int | 2048 | Maximum prompt length (truncated beyond max_seq_len - 10) |
| loss_type | str | sigmoid | Preference loss type (sigmoid/ipo/kto_pair) |
| pref_loss_ratio | float | 1.0 | Preference loss weight |
| sft_loss_ratio | float | 0.0 | Chosen-data SFT loss weight |
| label_smoothing | float | 0.0 | Label smoothing for sigmoid loss |
| reference_free | bool | False | Disable the reference model |
| ref_model_update_steps | int | -1 | Reference model update interval (-1 to disable) |
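
A plain sigmoid-loss DPO run with a small auxiliary SFT term on the chosen responses could be sketched as below; the ratios and learning rate are assumed illustrative values:

```yaml
# Illustrative DPO block (assumed values; sigmoid preference loss plus a small SFT term)
beta: 0.1
loss_type: sigmoid
pref_loss_ratio: 1.0
sft_loss_ratio: 0.1                 # auxiliary SFT loss on the chosen responses
label_smoothing: 0.0
max_prompt_len: 2048
reference_free: false
ref_model_update_steps: -1          # keep the reference model frozen
learning_rate: 1.0e-6               # DPO full-parameter starting point from section 1.3
```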

### 2.4 FP8 Training

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| apply_hadamard | bool | True | Use Hadamard transform for FP8 precision |
| use_lowprecision_moment | bool | False | Use BF16 optimizer momentum (recommended for FP8) |
| tensorwise_offload_optimizer | bool | False | Offload optimizer to reduce memory |
| apply_online_actscale_step | int | 200 | Dynamic quantization scale steps |
| optim_shard_num | int | 1 | Split optimizer state files during saving to avoid OOM. Works only when unified_checkpoint_config includes ignore_merge_optimizer. |
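
Combining these switches with the fp8_linear quantization algorithm from section 1.1, an FP8 run might be sketched as below; the specific values are assumptions:

```yaml
# Illustrative FP8 training block (assumed selection of switches)
weight_quantize_algo: fp8_linear    # quantization algorithm from section 1.1
apply_hadamard: true
use_lowprecision_moment: true       # BF16 optimizer momentum, recommended for FP8
apply_online_actscale_step: 200
tensorwise_offload_optimizer: true
optim_shard_num: 1
```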