Packing is a technique used to optimize batch processing by combining multiple short input sequences into a single longer sequence before feeding them into the LLM. This reduces padding overhead and improves hardware utilization (e.g., GPU/TPU efficiency).
The greedy intokens strategy is a token-level packing optimization that greedily fills the available token budget (e.g., the maximum sequence length) when grouping samples into packed sequences. It fits as many samples as possible within the constraint, minimizing wasted capacity.
| packing | greedy_intokens | Packing Strategy |
|---|---|---|
| false | any | No packing |
| true | false | Packing enabled without the greedy intokens strategy |
| true | true | Packing enabled with the greedy intokens strategy |
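As a rough illustration of what greedy packing does, the sketch below (not ERNIEKit's actual implementation; the function name and details are assumptions) places each sequence into the first pack that still has enough token budget, so short sequences share a row instead of each being padded to the maximum length:

```python
def greedy_pack(lengths, max_seq_len):
    """Greedily pack sequence lengths into bins of capacity max_seq_len."""
    packs = []      # each pack is a list of sequence lengths
    remaining = []  # unused token budget per pack
    for length in sorted(lengths, reverse=True):
        for i, free in enumerate(remaining):
            if length <= free:          # first pack with room wins
                packs[i].append(length)
                remaining[i] -= length
                break
        else:                           # no existing pack fits: open a new one
            packs.append([length])
            remaining.append(max_seq_len - length)
    return packs

# Three short sequences fit into two packs of budget 8
# instead of three padded rows.
print(greedy_pack([5, 3, 4], max_seq_len=8))  # → [[5, 3], [4]]
```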
Currently, four data sampling strategies are supported: `random`, `concat`, `interleave_under`, and `interleave_over`.
| Data Sampling Strategy | Applicable Scenarios | Limitations | Description |
|---|---|---|---|
| random | The dataset is extremely large and strict data proportioning is required | max_steps > 0 | In random mode, a fixed-size sample pool of `num_samples_each_epoch` is constructed based on the input dataset `probs`, and the data loader randomly acquires data from this pool. |
| concat | Need to train on all data in the datasets | None | In concat mode, the input dataset `probs` are not used. Instead, multiple datasets are directly concatenated, so the dataset size equals the total size of the input multi-source datasets. When `max_steps = -1`, setting `num_train_epochs` allows a complete traversal of the input datasets for `num_train_epochs` rounds. |
| interleave_under | When small datasets are important but have limited samples | None | The interleave strategy cross-concatenates multiple datasets according to the data proportions. `interleave_under` undersamples: sampling stops as soon as one of the datasets is exhausted. |
| interleave_over | When small datasets are important but have limited samples | None | The interleave strategy cross-concatenates multiple datasets according to the data proportions. `interleave_over` oversamples: sampling stops only after all datasets have been exhausted. |
- Note: `num_samples_each_epoch` only works with the `random` data sampling strategy.
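Ignoring the per-dataset probabilities for brevity, the difference between the two interleave modes can be sketched as follows (illustrative only, not ERNIEKit internals):

```python
def interleave_under(datasets):
    # Undersampling: stop as soon as the shortest dataset is exhausted.
    return [sample for group in zip(*datasets) for sample in group]

def interleave_over(datasets):
    # Oversampling: cycle the shorter datasets until the longest is exhausted.
    longest = max(len(d) for d in datasets)
    out = []
    for i in range(longest):
        for d in datasets:
            out.append(d[i % len(d)])
    return out

big, small = ["b1", "b2", "b3"], ["s1"]
print(interleave_under([big, small]))  # → ['b1', 's1']
print(interleave_over([big, small]))   # → ['b1', 's1', 'b2', 's1', 'b3', 's1']
```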
ERNIEKit currently supports reading local datasets and downloading specified Hugging Face datasets, in two formats: `erniekit` and `alpaca`.
- CLI: Modify the following fields in the YAML config file:
  - Set `train_dataset_path`/`eval_dataset_path` to the absolute or relative path of your local dataset file
  - Set `train_dataset_type`/`eval_dataset_type` to the dataset format (`erniekit`/`alpaca`)
  - Set `train_dataset_prob`/`eval_dataset_prob` to the multi-source dataset mixing probabilities
# single-source
train_dataset_type: "erniekit"
train_dataset_path: "./examples/data/sft-train.jsonl"
train_dataset_prob: "1.0"

# multi-source
train_dataset_type: "erniekit,erniekit"
train_dataset_path: "./examples/data/sft-train1.jsonl,./examples/data/sft-train2.jsonl"
train_dataset_prob: "0.8,0.2"
- WebUI:
  - Under `Set Custom Dataset`, input the local file path in `Dataset Path`
  - Select the corresponding format (`erniekit`/`alpaca`) in `Optional Data Type`
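The comma-separated multi-source fields above pair one type, path, and probability per dataset. A minimal sketch of how such fields could be parsed into weighted sources (illustrative; the function name and normalization step are assumptions):

```python
def parse_multi_source(path_field, prob_field):
    """Split comma-separated dataset paths and pair each with a weight."""
    paths = [p.strip() for p in path_field.split(",")]
    probs = [float(p) for p in prob_field.split(",")]
    assert len(paths) == len(probs), "one probability per dataset path"
    total = sum(probs)  # normalize so the weights sum to 1
    return [(path, p / total) for path, p in zip(paths, probs)]

sources = parse_multi_source(
    "./examples/data/sft-train1.jsonl,./examples/data/sft-train2.jsonl",
    "0.8,0.2",
)
print(sources)
```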
- CLI: Modify the following fields in the YAML config file:
  - Set `train_dataset_path`/`eval_dataset_path` to the Hugging Face repo ID
  - Set `train_dataset_type`/`eval_dataset_type` to `alpaca`
  - Set `train_dataset_prob`/`eval_dataset_prob` to the multi-source dataset mixing probabilities
# single-source
train_dataset_type: "alpaca"
train_dataset_path: "BelleGroup/train_2M_CN"
train_dataset_prob: "1.0"

# multi-source
train_dataset_type: "alpaca,alpaca"
train_dataset_path: "llamafactory/alpaca_gpt4_zh,BelleGroup/train_2M_CN"
train_dataset_prob: "0.8,0.2"
- WebUI:
  - Under `Set Built-in Dataset`, select the dataset name in `Dataset Selection`
  - The system will automatically configure the path and type, then download and read from Hugging Face

The supported Hugging Face datasets are:
| Dataset Name | Type | Format | File | File Format |
|---|---|---|---|---|
| llamafactory/alpaca_en | sft | alpaca | alpaca_data_en_52k.json | json |
| llamafactory/alpaca_zh | sft | alpaca | alpaca_data_zh_51k.json | json |
| llamafactory/alpaca_gpt4_en | sft | alpaca | alpaca_gpt4_data_en.json | json |
| llamafactory/alpaca_gpt4_zh | sft | alpaca | alpaca_gpt4_data_zh.json | json |
| BelleGroup/train_2M_CN | sft | alpaca | train_2M_CN.json | jsonl |
| BelleGroup/train_1M_CN | sft | alpaca | Belle_open_source_1M.json | jsonl |
| BelleGroup/train_0.5M_CN | sft | alpaca | Belle_open_source_0.5M.json | jsonl |
| BelleGroup/generated_chat_0.4M | sft | alpaca | generated_chat_0.4M.json | jsonl |
| BelleGroup/school_math_0.25M | sft | alpaca | school_math_0.25M.json | jsonl |
| sahil2801/CodeAlpaca-20k | sft | alpaca | code_alpaca_20k.json | json |
| TIGER-Lab/MathInstruct | sft | alpaca | MathInstruct.json | json |
| YeungNLP/firefly-train-1.1M | sft | alpaca | firefly-train-1.1M.jsonl | jsonl |
| suolyer/webqa | sft | alpaca | train.json | jsonl |
| zxbsmk/webnovel_cn | sft | alpaca | novel_cn_token512_50k.json | json |
| AstraMindAI/SFT-Nectar | sft | alpaca | sft_data_structured.json | json |
| hfl/stem_zh_instruction | sft | alpaca | bio_50282.json | jsonl |
| llamafactory/OpenO1-SFT | sft | alpaca | OpenO1-SFT-Pro.jsonl | jsonl |
| Congliu/Chinese-DeepSeek-R1-Distill-data-110k-SFT | sft | alpaca | distill_r1_110k_sft.jsonl | jsonl |
| mayflowergmbh/oasst_de | sft | alpaca | oasst_de.json | json |
| mayflowergmbh/dolly-15k_de | sft | alpaca | dolly_de.json | json |
| mayflowergmbh/alpaca-gpt4_de | sft | alpaca | alpaca_gpt4_data_de.json | json |
| mayflowergmbh/openschnabeltier_de | sft | alpaca | openschnabeltier.json | json |
| mayflowergmbh/evol-instruct_de | sft | alpaca | evol_instruct_de.json | json |
| mayflowergmbh/dolphin_de | sft | alpaca | dolphin.json | json |
| mayflowergmbh/booksum_de | sft | alpaca | booksum.json | json |
| mayflowergmbh/airoboros-3.0_de | sft | alpaca | airoboros_3.json | json |
| mayflowergmbh/ultra-chat_de | sft | alpaca | ultra_chat_german.json | json |
| Intel/orca_dpo_pairs | dpo | alpaca | orca_rlhf.jsonl | jsonl |
| shibing624/sharegpt_gpt4 | sft | sharegpt | sharegpt_gpt4.jsonl | jsonl |
| llamafactory/lima | sft | sharegpt | lima.json | json |
| Open-Orca/SlimOrca | sft | sharegpt | oo-labeled_correct.gpt4.sharegpt.jsonl | jsonl |
| totally-not-an-llm/sharegpt-hyperfiltered-3k | sft | sharegpt | sharegptclean_final.json | json |
| m-a-p/neo_sft_phase2 | sft | sharegpt | neo_sft_phase2.json | json |
| llamafactory/DPO-En-Zh-20k | sft | sharegpt | dpo_zh.json | json |
| avemio/German-RAG-DPO-ShareGPT-HESSIAN-AI | dpo | sharegpt | qa-with-multiple-references/DPO_equally-distributed-wikipedia-trainingdata-qa-with-multiple-references_id-over-800k-under-1000k_sharegpt.jsonl | jsonl |
We provide demo data for quick testing. You can either use these samples or train with your own data.
Required fields for SFT:
- `system` (optional): System configuration
- `src`: User conversation content
- `tgt`: System response content
- `label` (optional): Training flag (1 = include in training, 0 = exclude)
Notes:
- `src` and `tgt` are List objects supporting multi-turn conversations
- Each training sample is in JSON format, with multiple samples separated by newlines
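A minimal validation of one sample against these rules might look like this (illustrative sketch, not ERNIEKit's actual loader):

```python
import json

def validate_sft_sample(line):
    """Parse one jsonl line and check the erniekit SFT field rules."""
    sample = json.loads(line)
    src, tgt = sample["src"], sample["tgt"]
    assert isinstance(src, list) and isinstance(tgt, list)
    assert len(src) == len(tgt), "one response per user turn"
    if "label" in sample:  # optional per-turn training flags
        assert len(sample["label"]) == len(tgt)
        assert set(sample["label"]) <= {0, 1}
    return sample

sample = validate_sft_sample(
    '{"src": ["hi", "thanks"], "tgt": ["hello", "welcome"], "label": [0, 1]}'
)
print(sample["tgt"])  # → ['hello', 'welcome']
```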
{
"system": "你是一个生活小助理",
"src": [
"我们如何在日常生活中减少用水?",
"还有别的建议吗?"
],
"tgt": [
"1. 使用节水装置,如节水淋浴喷头和水龙头。",
"2. 使用水箱或水桶收集家庭废水,例如洗碗和洗浴。 \n3. 在社区中提高节水意识。 \n4. 检查水管和灌溉系统的漏水情况,并及时修复它们。 \n5. 洗澡时间缩短,使用低流量淋浴头节约用水。 \n6. 收集雨水,用于园艺或其他非饮用目的。 \n7. 刷牙或擦手时关掉水龙头。 \n8. 减少浇水草坪的时间。 \n9. 尽可能多地重复使用灰水(来自洗衣机、浴室水槽和淋浴的水)。 \n10. 只购买能源效率高的洗碗机和洗衣机。"
],
"label": [0, 1]
}

We provide demo data for quick testing. You can either use these samples or train with your own data.
Required fields for DPO:
- `system` (optional): System configuration
- `src`: User conversation content (first item = question 1, second item = question 2, etc.)
- `tgt`: System response content (one fewer item than `src`)
- `response`: Contains the chosen/rejected responses (must contain an odd number of strings)
- `sort`: Differentiates chosen from rejected (lower value = rejected, higher value = chosen)
- Each training sample is in JSON format, with multiple samples separated by newlines
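The `sort` convention (lower value = rejected, higher = chosen) can be sketched with a small helper (illustrative only):

```python
def split_chosen_rejected(sample):
    """Pair each response with its sort value; lowest loses, highest wins."""
    ranked = sorted(zip(sample["sort"], sample["response"]))
    return ranked[-1][1], ranked[0][1]  # (chosen, rejected)

chosen, rejected = split_chosen_rejected(
    {"response": [["answer A"], ["answer B"]], "sort": [1, 0]}
)
print(chosen, rejected)  # → ['answer A'] ['answer B']
```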
{
"system": "你是一个生活小助理",
"src": [
"你好。",
"哪一个富含蛋白质,床还是墙?"
],
"tgt": ["你好呀,我是你的生活小助理。"],
"response": [
[
"床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。"
],
[
"对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"
]
],
"sort": [
1,
0
]
}

We provide demo data for quick training. Please download the image or video data as needed and unzip it into the demo data directory. You can either use these samples or train with your own data.
Required fields for SFT VL:
- `text_info`: The list of text data; each element contains a `text` and a `tag`
  - `text`: The text content of the user question or system response
  - `tag`: The mask tag (`no_mask` = include in training, `mask` = exclude)
- `image_info`: The list of image data; each element contains an `image_url` and a `matched_text_index`
  - `image_url`: The URL to download the image online, or the path to access the image locally
  - `matched_text_index`: The index of the matched text in `text_info`
    - Default: `matched_text_index=0`, meaning the image is matched with the first text and will be placed before it
- `is_system` (optional): The system flag (1 = system configuration, 0 = no system configuration)
  - If `is_system=1`, `text_info[0]` is the system configuration
Notes:
- Each training sample is in JSON format, with multiple samples separated by newlines
- Video data is supported by replacing `image_info` with `video_info`
  - The `image_url` can be a video URL or a video path
- Please ensure that `mask` items and `no_mask` items alternate in `text_info`
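The alternation rule in the last note can be checked with a small helper (illustrative sketch, not part of ERNIEKit):

```python
def tags_alternate(text_info):
    """True if consecutive text_info items flip between mask and no_mask."""
    tags = [item["tag"] for item in text_info]
    return all(a != b for a, b in zip(tags, tags[1:]))

print(tags_alternate([{"tag": "mask"}, {"tag": "no_mask"}]))  # → True
print(tags_alternate([{"tag": "mask"}, {"tag": "mask"}]))     # → False
```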
Here is a multi-image example of an SFT VL dataset:
{
"image_info": [
{"matched_text_index": 0, "image_url": "./DoclingMatix/218/0.png"},
{"matched_text_index": 0, "image_url": "./DoclingMatix/218/1.png"}
],
"text_info": [
{"text": "What is the purpose of the resolution discussed in the text?", "tag": "mask"},
{"text": "The purpose of the resolution is to approve the redevelopment contract of the Philadelphia Redevelopment Authority for the redevelopment and urban renewal of a portion of the Haddington Urban Renewal Area, Unit Nos. 2 and 3, and to authorize the Redevelopment Authority to execute the redevelopment contract with Danielle M. Carson-Varns.", "tag": "no_mask"},
{"text": "Who introduced Resolution No. 160204 to the City Council?", "tag": "mask"},
{"text": "Councilmember Blackwell introduced Resolution No. 160204 to the City Council.", "tag": "no_mask"},
...
]
}

Here is a video example of an SFT VL dataset:
{
"video_info": [
{"matched_text_index": 0, "image_url": "./NExTVideo/1027/4789497818.mp4"}
],
"text_info": [
{"text": "how does the man sit on the grass?\nA. kneel\nB. one leg in the air\nC. sitting on bicycle seat\nD. legs spread out\nE. squatting down\n Answer with the option's letter from the given choices directly.", "tag": "mask"},
{"text": "D", "tag": "no_mask"}
]
}

Here is a system configuration example of an SFT VL dataset:
{
"is_system": 1,
"text_info": [
{"text": "Your role as ...", "tag": "mask"},
{"text": "好的", "tag": "no_mask"},
{"text": "What is written...", "tag": "mask"},
{"text": "<think>So I've got...", "tag": "no_mask"},
...
],
"image_info": [...]
}

Required fields for SFT VL Function Call:
- `text_info`: The list of text data; each element contains a `text`, a `tag`, and a `tool_response`
  - `text`: The text content of the user question or system response
  - `tag`: The mask tag (`no_mask` = include in training, `mask` = exclude)
  - `tool_response`: `true` = the role is tool, `false` = the role is user; only valid when `tag` is `mask`
  - `tool_calls`: The tool calls information
- `image_info`: The list of image data; each element contains an `image_url` and a `matched_text_index`
  - `image_url`: The URL to download the image online, or the path to access the image locally
  - `matched_text_index`: The index of the matched text in `text_info`
    - Default: `matched_text_index=0`, meaning the image is matched with the first text and will be placed before it
- `is_system` (optional): The system flag (1 = system configuration, 0 = no system configuration)
  - If `is_system=1`, `text_info[0]` is the system configuration
- `tools`: The tools information
Notes:
- Each training sample is in JSON format, with multiple samples separated by newlines
- Video data is supported by replacing `image_info` with `video_info`
  - The `image_url` can be a video URL or a video path
- Please ensure that `mask` items and `no_mask` items alternate in `text_info`
Here is a multi-image function call example of an SFT VL dataset:
{
"image_info": [
{"matched_text_index": 0, "image_url": "000000213579.jpg"},
{"matched_text_index": 2, "image_url": "000000213579_0_493_518_786.jpg"},
{"matched_text_index": 4, "image_url": "000000213579_468_391_670_575.jpg"}
],
"text_info": [
{"text": "Is the fork on the left side of the cup?", "tag": "mask", "tool_response": false},
{"text": "\n<think>\nI need to check the relative positions of the fork and the cup in the image. I'll first locate both objects and then determine if the fork is on the left side of the cup.\n\nI will zoom in on the plate with the cake and the fork to get a better view of their position.\n</think>\n\n", "tool_calls": [{"name": "image_zoom_in_tool", "arguments": "{\"bbox_2d\": [0, 493, 518, 786]}"}], "tag": "no_mask"},
{"text": "", "tag": "mask", "tool_response": true},
{"text": "\n<think>\nOkay, now I see the fork clearly. It's positioned on the left side of the plate, next to a slice of cake. The plate itself is very colorful with sections of green, red, blue, and yellow. Now that I have confirmed the location of the fork, I need to find the cup to determine their relative positions. I will now zoom in on the person holding the cup to get a better look.\n</think>\n\n", "tool_calls": [{"name": "image_zoom_in_tool", "arguments": "{\"bbox_2d\": [468, 391, 670, 575]}"}], "tag": "no_mask"},
{"text": "", "tag": "mask", "tool_response": true},
{"text": "\n<think>\nBased on the visual analysis of the image, I can determine the relative positions of the fork and the cup.\n\n- The fork is located on a colorful plate, positioned to the left of a slice of cake.\n- The cup, which has a blue design with a cartoon character, is being held by a person.\n- When viewing the overall scene, the plate with the fork is situated to the left of the person holding the cup.\n\nTherefore, yes, the fork is on the left side of the cup.\n</think>\n\nYes, the fork is on the left side of the cup.", "tool_calls": [], "tag": "no_mask"}
],
"tools": [
{
"type": "function",
"function": {
"name": "image_zoom_in_tool",
"description": "Zoom in on a specific region of an image by cropping it based on a bounding box (bbox) and an optional object label.",
"parameters": {
"type": "object",
"properties": {
"bbox_2d": {
"type": "array",
"items": {"type": "number"},
"minItems": 4,
"maxItems": 4,
"description": "The bounding box of the region to zoom in, as [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner, and the values of x1, y1, x2, y2 are all normalized to the range 0–1000 based on the original image dimensions."
},
"label": {
"type": "string",
"description": "The name or label of the object in the specified bounding box (optional)."
}
},
"required": ["bbox_2d"]
},
"strict": false
}
}
]
}

The `messages` format is used for training thinking models and for function call training:
Demo data for thinking models:
[
{"messages": [{"role": "system", "content": "You are a good coder."}, {"role": "user", "content": "Given an integer array nums and an integer target value target, find two integers in the array whose sum equals the target value target, and return their array indices. You may assume that each input will have exactly one solution, and you may not use the same element twice. You can return the answer in any order. Example 1: Input: nums = [2,7,11,15], target = 9\nOutput: [0,1]\nExplanation: Because nums[0] + nums[1] == 9, return [0, 1]."}, {"role": "assistant", "content": "<think>We are going to use a hash map (dictionary) to store the numbers we have seen so far along with their indices.\n For each number in the array, we calculate the complement (target - current number).\n If the complement exists in the hash map, that means we have found the two numbers that add up to the target.\n We then return the current index and the index of the complement from the hash map.\n Since we are guaranteed exactly one solution, we can return immediately when we find it.\n</think>\nTo solve this problem efficiently, we can use a hash map to store each number's index as we iterate through the array. For each number, we calculate its complement (target minus the current number). If the complement exists in the hash map, we immediately return the current index and the complement's index. This approach ensures we find the solution in linear time.\n\n### Approach\n1. **Initialization**: Create an empty dictionary to store numbers and their indices.\n2. **Iterate through the array**: For each number in the array:\n - Calculate the complement (target - current number).\n - Check if the complement exists in the dictionary.\n - If found, return the current index and the complement's index.\n - Otherwise, add the current number and its index to the dictionary.\n3. 
**Result**: Since the problem guarantees exactly one solution, we will always find it during iteration.\n\n### Solution Code\n```python\ndef twoSum(nums, target):\n num_map = {}\n for i, num in enumerate(nums):\n complement = target - num\n if complement in num_map:\n return [num_map[complement], i]\n num_map[num] = i\n```\n\n### Explanation\n- **Initialization**: The dictionary `num_map` will track each number encountered and its index.\n- **Iteration**: For each number `num` at index `i`:\n - Compute `complement = target - num`.\n - If `complement` is in `num_map`, return `[num_map[complement], i]` since these two numbers sum to the target.\n - Otherwise, store `num` with its index in `num_map` for future checks.\n- **Efficiency**: This approach runs in O(n) time, where n is the number of elements, as each dictionary operation (insertion and lookup) is O(1) on average. The space complexity is O(n) to store the dictionary.\n"}]}
]

- Note: The demo datasets provided in `examples/data/sft_think-train.jsonl` and `examples/data/sft_think-eval.jsonl` are derived from the OpenCodeReasoning dataset published by NVIDIA. This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Demo data for function call training:
[
{"messages": [{"role": "user", "content": "I'm feeling a bit down. Can you tell me a joke to cheer me up?"}, {"role": "assistant", "content": "<think>Okay, let me try to figure out how to approach this. The user is feeling down and asks for a joke to cheer up. I need to connect this request to the appropriate function call. Looking at the available tools, there's a function called get_random_joke which is designed exactly for this purpose. Since the user's main need is to feel better, providing a joke makes sense. The function doesn't require any parameters, so it's straightforward to call it without any additional arguments.\n</think>", "tool_calls": [{"type": "function", "function": {"name": "get_random_joke", "arguments": {}}}]}, {"role": "tool", "content": [{"joke": "Why don't scientists trust atoms? Because they make up everything!"}]}, {"role": "assistant", "content": "Sure, here's a joke for you: \"Why don't scientists trust atoms? Because they make up everything!\" I hope that brings a smile to your face."}], "tools": [{"type": "function", "function": {"name": "get_random_joke", "description": "Get a random joke", "parameters": {"type": "object", "properties": {}, "required": []}}}, {"type": "function", "function": {"name": "generate_random_number", "description": "Generate a random number within a specified range", "parameters": {"type": "object", "properties": {"min": {"type": "number", "description": "The minimum value of the range"}, "max": {"type": "number", "description": "The maximum value of the range"}}, "required": ["min", "max"]}}}]}
]

Required fields for SFT:
- `instruction`: A clear task directive (e.g., "Translate the following Chinese text to English")
- `input`: Task-specific input content (may be empty for tasks like "Write a poem")
- `output`: The expected model response
Supports json and jsonl file formats:
- json: All data in a single JSON array:
[
{"instruction":"instructionA", "input":"inputA", "output":"outputA"},
{"instruction":"instructionB", "input":"inputB", "output":"outputB"},
{"instruction":"instructionC", "input":"inputC", "output":"outputC"}
]

- jsonl: Each line contains one JSON object:
{"instruction":"instructionA", "input":"inputA", "output":"outputA"}
{"instruction":"instructionB", "input":"inputB", "output":"outputB"}
{"instruction":"instructionC", "input":"inputC", "output":"outputC"}

Field Mapping Between alpaca and erniekit:
| alpaca | erniekit | Mapping |
|---|---|---|
| instruction, input | src | src[-1] = instruction + input |
| output | tgt | tgt[-1] = output |
| history | src, tgt | history = zip(src[:-1], tgt[:-1]) |
| system | system | system=system |
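Following the table, an alpaca record can be folded into erniekit's `src`/`tgt` lists roughly like this (illustrative sketch; joining `instruction` and `input` with a newline is an assumption about the exact concatenation):

```python
def alpaca_to_erniekit(record):
    """Convert one alpaca record to the erniekit multi-turn layout."""
    history = record.get("history", [])
    src = [turn[0] for turn in history]  # earlier user turns
    tgt = [turn[1] for turn in history]  # earlier responses
    # src[-1] = instruction + input (newline join is an assumption)
    src.append((record["instruction"] + "\n" + record.get("input", "")).strip())
    tgt.append(record["output"])         # tgt[-1] = output
    out = {"src": src, "tgt": tgt}
    if "system" in record:
        out["system"] = record["system"]  # system maps unchanged
    return out

converted = alpaca_to_erniekit({
    "instruction": "Translate to English:",
    "input": "你好",
    "output": "Hello",
    "history": [["Hi", "Hello!"]],
})
print(converted["src"])  # → ['Hi', 'Translate to English:\n你好']
```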
Required fields for DPO
- `system` (optional): System configuration
- `question`: User question
- `chosen`: The higher-quality output selected by human annotators
- `rejected`: The lower-quality output for the same question
Supports json and jsonl file formats:
- json: All data in a single JSON array:
[
{"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"},
{"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"},
{"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"}
]

- jsonl: Each line contains one JSON object:
{"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"}
{"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"}
{"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"}
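Both layouts described above can be read with one small loader (illustrative sketch, not ERNIEKit's actual reader):

```python
import json

def load_samples(text):
    """Accept either a single JSON array or one JSON object per line."""
    text = text.strip()
    if text.startswith("["):  # json: one array holding every sample
        return json.loads(text)
    # jsonl: one JSON object per non-empty line
    return [json.loads(line) for line in text.splitlines() if line.strip()]

as_json = '[{"question": "q1"}, {"question": "q2"}]'
as_jsonl = '{"question": "q1"}\n{"question": "q2"}'
# Both layouts decode to the same list of samples.
assert load_samples(as_json) == load_samples(as_jsonl)
```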