
ERNIEKit Data Packing Strategy

Packing is a technique used to optimize batch processing by combining multiple short input sequences into a single longer sequence before feeding them into the LLM. This reduces padding overhead and improves hardware utilization (e.g., GPU/TPU efficiency).

The greedy intokens strategy is a token-level optimization that greedily fills the available token budget (e.g., the maximum sequence length) when building packed batches. Each packed sequence carries as many tokens as possible within the constraint, minimizing wasted capacity.

| packing | greedy_intokens | Packing Strategy |
| --- | --- | --- |
| false | any | No packing |
| true | false | Packing is enabled without the greedy intokens strategy |
| true | true | Greedy intokens packing is enabled |
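The greedy packing idea can be sketched as follows. This is a simplified, hypothetical illustration, not the actual ERNIEKit implementation; `greedy_pack` and its arguments are invented names:

```python
# Illustrative sketch of greedy packing (hypothetical code, not the
# actual ERNIEKit implementation). Sequences are appended to the
# current pack until adding the next one would exceed max_seq_len.
def greedy_pack(seq_lens, max_seq_len):
    """Group sequence lengths into packs whose totals fit max_seq_len."""
    packs, current, used = [], [], 0
    for n in seq_lens:
        if current and used + n > max_seq_len:
            packs.append(current)          # current pack is full: flush it
            current, used = [], 0
        current.append(n)                  # oversize sequences still form their own pack
        used += n
    if current:
        packs.append(current)
    return packs

# Five short sequences fit into two packs instead of five padded rows.
print(greedy_pack([300, 500, 200, 900, 100], max_seq_len=1024))
# → [[300, 500, 200], [900, 100]]
```

With padding, the five sequences above would occupy five rows of 1024 tokens each; packed, they occupy two.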

ERNIEKit Data Sampling Strategy

Currently, four data sampling strategies are supported: random, concat, interleave_under, and interleave_over.

| Data Sampling Strategy | Applicable Scenarios | Limitations | Description |
| --- | --- | --- | --- |
| random | The dataset is extremely large and strict data proportioning is required | max_steps > 0 | Based on the input dataset probs, a fixed-size sample pool of num_samples_each_epoch is constructed, and the data loader randomly draws data from this pool. |
| concat | Need to train on all data in the datasets | None | The input dataset probs are not used; multiple datasets are directly concatenated, so the dataset size equals the total size of the input multi-source datasets. When max_steps = -1, setting num_train_epochs traverses the input datasets completely for num_train_epochs rounds. |
| interleave_under | Small datasets are important but have limited samples | None | The interleave strategy cross-concatenates multiple datasets according to data proportioning. interleave_under indicates undersampling: sampling stops as soon as one of the datasets is exhausted. |
| interleave_over | Small datasets are important but have limited samples | None | The interleave strategy cross-concatenates multiple datasets according to data proportioning. interleave_over indicates oversampling: sampling stops only after all datasets have been exhausted. |
  • Note: num_samples_each_epoch only takes effect with the random data sampling strategy.
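The difference between the two interleave modes can be sketched as a toy implementation. This is illustrative only, not ERNIEKit code; `interleave` and its arguments are hypothetical names, and the datasets are assumed non-empty:

```python
import random

# Illustrative sketch (not the ERNIEKit implementation) of the two
# interleave modes: both draw from datasets in proportion to `probs`,
# but differ in when sampling stops.
def interleave(datasets, probs, mode, seed=0):
    rng = random.Random(seed)
    iters = [iter(d) for d in datasets]
    exhausted = [False] * len(datasets)
    out = []
    while True:
        i = rng.choices(range(len(datasets)), weights=probs)[0]
        try:
            out.append(next(iters[i]))
        except StopIteration:
            exhausted[i] = True
            if mode == "under":           # stop as soon as one source runs dry
                return out
            if all(exhausted):            # "over": stop when every source has run dry
                return out
            iters[i] = iter(datasets[i])  # "over": restart (oversample) this source
            out.append(next(iters[i]))
```

In "over" mode every dataset is fully consumed at least once, with smaller sources repeated; in "under" mode the run ends the moment any source is exhausted.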

ERNIEKit Data Format Specification

ERNIEKit currently supports reading local datasets and downloading specified Hugging Face datasets in two formats: erniekit and alpaca.

Local Datasets

  • CLI: Modify the following fields in the YAML config file:
    • Set train_dataset_path / eval_dataset_path to the absolute or relative path of your local dataset file
    • Set train_dataset_type / eval_dataset_type to the dataset format (erniekit/alpaca)
    • Set train_dataset_prob / eval_dataset_prob for multi-source dataset mixing probabilities
# single-source
train_dataset_type: "erniekit"
train_dataset_path: "./examples/data/sft-train.jsonl"
train_dataset_prob: "1.0"

# multi-source
train_dataset_type: "erniekit,erniekit"
train_dataset_path: "./examples/data/sft-train1.jsonl,./examples/data/sft-train2.jsonl"
train_dataset_prob: "0.8,0.2"
  • WebUI:
    • Under Set Custom Dataset, input the local file path in Dataset Path
    • Select the corresponding format (erniekit/alpaca) in Optional Data Type
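Since the multi-source settings are comma-separated strings, a small helper can split and normalize them. The function below is a hypothetical sketch (not part of ERNIEKit), reusing the values from the YAML keys above:

```python
# Hypothetical helper (not ERNIEKit code) that splits the comma-separated
# multi-source settings and normalizes the mixing probabilities to sum to 1.0.
def parse_multi_source(dataset_type, dataset_path, dataset_prob):
    types = dataset_type.split(",")
    paths = dataset_path.split(",")
    probs = [float(p) for p in dataset_prob.split(",")]
    if not (len(types) == len(paths) == len(probs)):
        raise ValueError("type/path/prob lists must have the same length")
    total = sum(probs)
    return list(zip(types, paths, (p / total for p in probs)))
```

For the multi-source example above, this yields two (type, path, prob) triples with probabilities 0.8 and 0.2.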

Hugging Face Datasets

  • CLI: Modify the following fields in the YAML config file:
    • Set train_dataset_path / eval_dataset_path to the Hugging Face repo ID
    • Set train_dataset_type / eval_dataset_type to alpaca
    • Set train_dataset_prob / eval_dataset_prob for multi-source dataset mixing probabilities
# single-source
train_dataset_type: "alpaca"
train_dataset_path: "BelleGroup/train_2M_CN"
train_dataset_prob: "1.0"

# multi-source
train_dataset_type: "alpaca,alpaca"
train_dataset_path: "llamafactory/alpaca_gpt4_zh,BelleGroup/train_2M_CN"
train_dataset_prob: "0.8,0.2"
  • WebUI:
    • Under Set Built-in Dataset, select the dataset name in Dataset Selection
    • The system will automatically configure the path and type, then download and read from Hugging Face

The supported Hugging Face datasets are listed in the table below.

Supported Hugging Face Datasets

| Dataset Name | Type | Format | File | File Format |
| --- | --- | --- | --- | --- |
| llamafactory/alpaca_en | sft | alpaca | alpaca_data_en_52k.json | json |
| llamafactory/alpaca_zh | sft | alpaca | alpaca_data_zh_51k.json | json |
| llamafactory/alpaca_gpt4_en | sft | alpaca | alpaca_gpt4_data_en.json | json |
| llamafactory/alpaca_gpt4_zh | sft | alpaca | alpaca_gpt4_data_zh.json | json |
| BelleGroup/train_2M_CN | sft | alpaca | train_2M_CN.json | jsonl |
| BelleGroup/train_1M_CN | sft | alpaca | Belle_open_source_1M.json | jsonl |
| BelleGroup/train_0.5M_CN | sft | alpaca | Belle_open_source_0.5M.json | jsonl |
| BelleGroup/generated_chat_0.4M | sft | alpaca | generated_chat_0.4M.json | jsonl |
| BelleGroup/school_math_0.25M | sft | alpaca | school_math_0.25M.json | jsonl |
| sahil2801/CodeAlpaca-20k | sft | alpaca | code_alpaca_20k.json | json |
| TIGER-Lab/MathInstruct | sft | alpaca | MathInstruct.json | json |
| YeungNLP/firefly-train-1.1M | sft | alpaca | firefly-train-1.1M.jsonl | jsonl |
| suolyer/webqa | sft | alpaca | train.json | jsonl |
| zxbsmk/webnovel_cn | sft | alpaca | novel_cn_token512_50k.json | json |
| AstraMindAI/SFT-Nectar | sft | alpaca | sft_data_structured.json | json |
| hfl/stem_zh_instruction | sft | alpaca | bio_50282.json | jsonl |
| llamafactory/OpenO1-SFT | sft | alpaca | OpenO1-SFT-Pro.jsonl | jsonl |
| Congliu/Chinese-DeepSeek-R1-Distill-data-110k-SFT | sft | alpaca | distill_r1_110k_sft.jsonl | jsonl |
| mayflowergmbh/oasst_de | sft | alpaca | oasst_de.json | json |
| mayflowergmbh/dolly-15k_de | sft | alpaca | dolly_de.json | json |
| mayflowergmbh/alpaca-gpt4_de | sft | alpaca | alpaca_gpt4_data_de.json | json |
| mayflowergmbh/openschnabeltier_de | sft | alpaca | openschnabeltier.json | json |
| mayflowergmbh/evol-instruct_de | sft | alpaca | evol_instruct_de.json | json |
| mayflowergmbh/dolphin_de | sft | alpaca | dolphin.json | json |
| mayflowergmbh/booksum_de | sft | alpaca | booksum.json | json |
| mayflowergmbh/airoboros-3.0_de | sft | alpaca | airoboros_3.json | json |
| mayflowergmbh/ultra-chat_de | sft | alpaca | ultra_chat_german.json | json |
| Intel/orca_dpo_pairs | dpo | alpaca | orca_rlhf.jsonl | jsonl |
| shibing624/sharegpt_gpt4 | sft | sharegpt | sharegpt_gpt4.jsonl | jsonl |
| llamafactory/lima | sft | sharegpt | lima.json | json |
| Open-Orca/SlimOrca | sft | sharegpt | oo-labeled_correct.gpt4.sharegpt.jsonl | jsonl |
| totally-not-an-llm/sharegpt-hyperfiltered-3k | sft | sharegpt | sharegptclean_final.json | json |
| m-a-p/neo_sft_phase2 | sft | sharegpt | neo_sft_phase2.json | json |
| llamafactory/DPO-En-Zh-20k | sft | sharegpt | dpo_zh.json | json |
| avemio/German-RAG-DPO-ShareGPT-HESSIAN-AI | dpo | sharegpt | qa-with-multiple-references/DPO_equally-distributed-wikipedia-trainingdata-qa-with-multiple-references_id-over-800k-under-1000k_sharegpt.jsonl | jsonl |

erniekit Data Format

SFT Dataset

We provide demo data for quick testing. You can either use these samples or train with your own data.

Required fields for SFT:

  • system(optional): System configuration
  • src: User conversation content
  • tgt: System response content
  • label(optional): Training flag (1=include in training, 0=exclude)

Notes:

  • src and tgt are List objects supporting multi-turn conversations
  • Each training sample is in JSON format, with multiple samples separated by newlines
{
    "system": "你是一个生活小助理",
    "src": [
        "我们如何在日常生活中减少用水?",
        "还有别的建议吗?"
    ],
    "tgt": [
        "1. 使用节水装置,如节水淋浴喷头和水龙头。",
        "2. 使用水箱或水桶收集家庭废水,例如洗碗和洗浴。 \n3. 在社区中提高节水意识。 \n4. 检查水管和灌溉系统的漏水情况,并及时修复它们。 \n5. 洗澡时间缩短,使用低流量淋浴头节约用水。 \n6. 收集雨水,用于园艺或其他非饮用目的。 \n7. 刷牙或擦手时关掉水龙头。 \n8. 减少浇水草坪的时间。 \n9. 尽可能多地重复使用灰水(来自洗衣机、浴室水槽和淋浴的水)。 \n10. 只购买能源效率高的洗碗机和洗衣机。"
    ],
    "label": [0, 1]
}
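A quick structural check of the schema above can be sketched as follows (a hypothetical validator, not part of ERNIEKit):

```python
import json

# Minimal sanity check (an assumption, not ERNIEKit code) for the
# erniekit SFT schema: src and tgt must be equal-length lists, and
# label, when present, must align with the turns and contain only 0/1.
def validate_sft_sample(line):
    sample = json.loads(line)
    assert isinstance(sample["src"], list) and isinstance(sample["tgt"], list)
    assert len(sample["src"]) == len(sample["tgt"]), "one tgt per src turn"
    if "label" in sample:
        assert len(sample["label"]) == len(sample["tgt"])
        assert all(v in (0, 1) for v in sample["label"])
    return sample
```

Running every line of a training file through such a check before training catches malformed samples (e.g., a trailing comma or mismatched turn counts) early.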

DPO Dataset

We provide demo data for quick testing. You can either use these samples or train with your own data.

Required fields for DPO:

  • system(optional): System configuration
  • src: User conversation content (first item=question1, second=question2, etc.)
  • tgt: System response content (one fewer item than src)
  • response: Contains the chosen/rejected responses (each response list must contain an odd number of strings)
  • sort: Differentiates chosen/rejected (lower value=rejected, higher=chosen)
  • Each training sample is in JSON format, with multiple samples separated by newlines
{
    "system": "你是一个生活小助理",
    "src": [
        "你好。",
        "哪一个富含蛋白质,床还是墙?"
    ],
    "tgt": ["你好呀,我是你的生活小助理。"],
    "response": [
        [
            "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。"
        ],
        [
            "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"
        ]
    ],
    "sort": [
        1,
        0
    ]
}
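Given the sort convention above (lower value = rejected, higher = chosen), the chosen/rejected pair can be recovered as sketched below (an illustrative helper, not ERNIEKit code):

```python
# Illustrative only: given the erniekit DPO fields above, recover which
# response list is chosen and which is rejected from `sort`
# (higher value = chosen, lower value = rejected).
def split_by_sort(sample):
    pairs = sorted(zip(sample["sort"], sample["response"]), key=lambda p: p[0])
    rejected, chosen = pairs[0][1], pairs[-1][1]
    return chosen, rejected
```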

SFT VL Dataset

We provide demo data for quick training; please download the image or video data as needed and unzip it into the demo data directory. You can either use these samples or train with your own data.

Required fields for SFT VL:

  • text_info: The list of text data, each element contains a text and a tag
    • text: The text content from User question or System response
    • tag: The mask tag (no_mask=include in training, mask=exclude)
  • image_info: The list of image data, each element contains an image_url and a matched_text_index
    • image_url: The URL used to download the image online, or the path used to access it locally
    • matched_text_index: The index of the matched text in text_info
      • Default: matched_text_index=0 means the image is matched with the first text and will be placed before the first text
  • is_system(optional): The system flag (1=system configuration, 0=no system configuration)
    • system configuration = text_info[0] if is_system=1

Notes:

  • Each training sample is in JSON format, with multiple samples separated by newlines
  • Video data is supported by replacing the image_info with video_info
    • the image_url can be a video url or video path
  • Please ensure that mask items and no_mask items alternate in the text_info

Here is a multi-image example of SFT VL dataset:

{
    "image_info": [
        {"matched_text_index": 0, "image_url": "./DoclingMatix/218/0.png"},
        {"matched_text_index": 0, "image_url": "./DoclingMatix/218/1.png"}
    ],
    "text_info": [
        {"text": "What is the purpose of the resolution discussed in the text?", "tag": "mask"},
        {"text": "The purpose of the resolution is to approve the redevelopment contract of the Philadelphia Redevelopment Authority for the redevelopment and urban renewal of a portion of the Haddington Urban Renewal Area, Unit Nos. 2 and 3, and to authorize the Redevelopment Authority to execute the redevelopment contract with Danielle M. Carson-Varns.", "tag": "no_mask"},
        {"text": "Who introduced Resolution No. 160204 to the City Council?", "tag": "mask"},
        {"text": "Councilmember Blackwell introduced Resolution No. 160204 to the City Council.", "tag": "no_mask"},
        ...
    ]
}

Here is a video example of SFT VL dataset:

{
    "video_info": [
        {"matched_text_index": 0, "image_url": "./NExTVideo/1027/4789497818.mp4"}
    ],
    "text_info": [
        {"text": "how does the man sit on the grass?\nA. kneel\nB. one leg in the air\nC. sitting on bicycle seat\nD. legs spread out\nE. squatting down\n Answer with the option's letter from the given choices directly.", "tag": "mask"},
        {"text": "D", "tag": "no_mask"}
    ]
}

Here is a system configuration example of SFT VL dataset:

{
    "is_system": 1,
    "text_info": [
        {"text": "Your role as ...", "tag": "mask"},
        {"text": "好的", "tag": "no_mask"},
        {"text": "What is written...", "tag": "mask"},
        {"text": "<think>So I've got...", "tag": "no_mask"},
        ...
    ],
    "image_info": [...]
}
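The constraints above (valid matched_text_index values, alternating mask/no_mask tags) can be checked with a sketch like the following (a hypothetical validator, not part of ERNIEKit):

```python
# Hedged sketch (not ERNIEKit code) of the SFT VL constraints: every
# matched_text_index must point into text_info, and mask / no_mask
# tags must alternate.
def validate_vl_sample(sample):
    texts = sample["text_info"]
    media = sample.get("image_info", []) + sample.get("video_info", [])
    for item in media:
        assert 0 <= item["matched_text_index"] < len(texts), "index out of range"
    tags = [t["tag"] for t in texts]
    assert all(a != b for a, b in zip(tags, tags[1:])), "tags must alternate"
```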

SFT VL Dataset For function call

Required fields for SFT VL Function Call:

  • text_info: The list of text data, each element contains a text, a tag, and optionally tool_response and tool_calls
    • text: The text content from User question or System response
    • tag: The mask tag (no_mask=include in training, mask=exclude)
    • tool_response: true=role is tool, false=role is user; only valid when tag is mask
    • tool_calls: The tool calls information
  • image_info: The list of image data, each element contains an image_url and a matched_text_index
    • image_url: The URL used to download the image online, or the path used to access it locally
    • matched_text_index: The index of the matched text in text_info
      • Default: matched_text_index=0 means the image is matched with the first text and will be placed before the first text
  • is_system(optional): The system flag (1=system configuration, 0=no system configuration)
    • system configuration = text_info[0] if is_system=1
  • tools: The tools information

Notes:

  • Each training sample is in JSON format, with multiple samples separated by newlines
  • Video data is supported by replacing the image_info with video_info
    • the image_url can be a video url or video path
  • Please ensure that mask items and no_mask items alternate in the text_info

Here is a multi-image example of the SFT VL function call dataset:

{
    "image_info": [
        {"matched_text_index": 0, "image_url": "000000213579.jpg"},
        {"matched_text_index": 2, "image_url": "000000213579_0_493_518_786.jpg"},
        {"matched_text_index": 4, "image_url": "000000213579_468_391_670_575.jpg"}
    ],
    "text_info": [
        {"text": "Is the fork on the left side of the cup?", "tag": "mask", "tool_response": false},
        {"text": "\n<think>\nI need to check the relative positions of the fork and the cup in the image. I'll first locate both objects and then determine if the fork is on the left side of the cup.\n\nI will zoom in on the plate with the cake and the fork to get a better view of their position.\n</think>\n\n", "tool_calls": [{"name": "image_zoom_in_tool", "arguments": "{\"bbox_2d\": [0, 493, 518, 786]}"}], "tag": "no_mask"},
        {"text": "", "tag": "mask", "tool_response": true},
        {"text": "\n<think>\nOkay, now I see the fork clearly. It's positioned on the left side of the plate, next to a slice of cake. The plate itself is very colorful with sections of green, red, blue, and yellow. Now that I have confirmed the location of the fork, I need to find the cup to determine their relative positions. I will now zoom in on the person holding the cup to get a better look.\n</think>\n\n", "tool_calls": [{"name": "image_zoom_in_tool", "arguments": "{\"bbox_2d\": [468, 391, 670, 575]}"}], "tag": "no_mask"}, 
        {"text": "", "tag": "mask", "tool_response": true},
        {"text": "\n<think>\nBased on the visual analysis of the image, I can determine the relative positions of the fork and the cup.\n\n- The fork is located on a colorful plate, positioned to the left of a slice of cake.\n- The cup, which has a blue design with a cartoon character, is being held by a person.\n- When viewing the overall scene, the plate with the fork is situated to the left of the person holding the cup.\n\nTherefore, yes, the fork is on the left side of the cup.\n</think>\n\nYes, the fork is on the left side of the cup.", "tool_calls": [], "tag": "no_mask"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "image_zoom_in_tool",
                "description": "Zoom in on a specific region of an image by cropping it based on a bounding box (bbox) and an optional object label.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "bbox_2d": {
                            "type": "array",
                            "items": {"type": "number"},
                            "minItems": 4,
                            "maxItems": 4,
                            "description": "The bounding box of the region to zoom in, as [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner, and the values of x1, y1, x2, y2 are all normalized to the range 0–1000 based on the original image dimensions."
                        },
                        "label": {
                            "type": "string",
                            "description": "The name or label of the object in the specified bounding box (optional)."
                        }
                    },
                    "required": ["bbox_2d"]
                },
                "strict": false
            }
        }
    ]
}

messages Format

The messages format is used for thinking model training and function call training:

Demo data for thinking models:

[
    {"messages": [{"role": "system", "content": "You are a good coder."}, {"role": "user", "content": "Given an integer array nums and an integer target value target, find two integers in the array whose sum equals the target value target, and return their array indices. You may assume that each input will have exactly one solution, and you may not use the same element twice. You can return the answer in any order. Example 1: Input: nums = [2,7,11,15], target = 9\nOutput: [0,1]\nExplanation: Because nums[0] + nums[1] == 9, return [0, 1]."}, {"role": "assistant", "content": "<think>We are going to use a hash map (dictionary) to store the numbers we have seen so far along with their indices.\n For each number in the array, we calculate the complement (target - current number).\n If the complement exists in the hash map, that means we have found the two numbers that add up to the target.\n We then return the current index and the index of the complement from the hash map.\n Since we are guaranteed exactly one solution, we can return immediately when we find it.\n</think>\nTo solve this problem efficiently, we can use a hash map to store each number's index as we iterate through the array. For each number, we calculate its complement (target minus the current number). If the complement exists in the hash map, we immediately return the current index and the complement's index. This approach ensures we find the solution in linear time.\n\n### Approach\n1. **Initialization**: Create an empty dictionary to store numbers and their indices.\n2. **Iterate through the array**: For each number in the array:\n   - Calculate the complement (target - current number).\n   - Check if the complement exists in the dictionary.\n   - If found, return the current index and the complement's index.\n   - Otherwise, add the current number and its index to the dictionary.\n3. 
**Result**: Since the problem guarantees exactly one solution, we will always find it during iteration.\n\n### Solution Code\n```python\ndef twoSum(nums, target):\n    num_map = {}\n    for i, num in enumerate(nums):\n        complement = target - num\n        if complement in num_map:\n            return [num_map[complement], i]\n        num_map[num] = i\n```\n\n### Explanation\n- **Initialization**: The dictionary `num_map` will track each number encountered and its index.\n- **Iteration**: For each number `num` at index `i`:\n  - Compute `complement = target - num`.\n  - If `complement` is in `num_map`, return `[num_map[complement], i]` since these two numbers sum to the target.\n  - Otherwise, store `num` with its index in `num_map` for future checks.\n- **Efficiency**: This approach runs in O(n) time, where n is the number of elements, as each dictionary operation (insertion and lookup) is O(1) on average. The space complexity is O(n) to store the dictionary.\n"}]}
]
  • Note: The demo datasets provided in examples/data/sft_think-train.jsonl and examples/data/sft_think-eval.jsonl are derived from the OpenCodeReasoning dataset, published by NVIDIA. This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Demo data for function call training:

[
    {"messages": [{"role": "user", "content": "I'm feeling a bit down. Can you tell me a joke to cheer me up?"}, {"role": "assistant", "content": "<think>Okay, let me try to figure out how to approach this. The user is feeling down and asks for a joke to cheer up. I need to connect this request to the appropriate function call. Looking at the available tools, there's a function called get_random_joke which is designed exactly for this purpose. Since the user's main need is to feel better, providing a joke makes sense. The function doesn't require any parameters, so it's straightforward to call it without any additional arguments.\n</think>", "tool_calls": [{"type": "function", "function": {"name": "get_random_joke", "arguments": {}}}]}, {"role": "tool", "content": [{"joke": "Why don't scientists trust atoms? Because they make up everything!"}]}, {"role": "assistant", "content": "Sure, here's a joke for you: \"Why don't scientists trust atoms? Because they make up everything!\" I hope that brings a smile to your face."}], "tools": [{"type": "function", "function": {"name": "get_random_joke", "description": "Get a random joke", "parameters": {"type": "object", "properties": {}, "required": []}}}, {"type": "function", "function": {"name": "generate_random_number", "description": "Generate a random number within a specified range", "parameters": {"type": "object", "properties": {"min": {"type": "number", "description": "The minimum value of the range"}, "max": {"type": "number", "description": "The maximum value of the range"}}, "required": ["min", "max"]}}}]}
]
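A light structural check for the messages format can be sketched as follows. This is a hypothetical helper, not part of ERNIEKit; the role set and ordering rules are inferred from the examples above and may differ from the actual constraints:

```python
# Hypothetical check (an assumption based on the examples above, not
# ERNIEKit code): roles must be known, an optional system message must
# come first, and a sample ends with an assistant turn.
VALID_ROLES = {"system", "user", "assistant", "tool"}

def validate_messages(sample):
    messages = sample["messages"]
    assert messages, "at least one message required"
    for i, msg in enumerate(messages):
        assert msg["role"] in VALID_ROLES, f"unknown role: {msg['role']}"
        if msg["role"] == "system":
            assert i == 0, "system message must come first"
    assert messages[-1]["role"] == "assistant", "must end with an assistant turn"
```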

alpaca Format

SFT Dataset

Required fields for SFT

  • instruction: A clear task directive (e.g., "Translate the following Chinese text to English").
  • input: Task-specific input content (may be empty for tasks like "Write a poem").
  • output: The expected model response.

Supports json and jsonl file formats

  • json: All data in a single JSON array:
[
    {"instruction":"instructionA", "input":"inputA", "output":"outputA"},
    {"instruction":"instructionB", "input":"inputB", "output":"outputB"},
    {"instruction":"instructionC", "input":"inputC", "output":"outputC"}
]
  • jsonl: Each line contains one JSON object:
{"instruction":"instructionA", "input":"inputA", "output":"outputA"}
{"instruction":"instructionB", "input":"inputB", "output":"outputB"}
{"instruction":"instructionC", "input":"inputC", "output":"outputC"}
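Since both layouts are supported, a reader has to detect which one it is looking at. A minimal sketch (an assumption, not the ERNIEKit loader):

```python
import json

# Illustrative loader (not the ERNIEKit reader) that accepts both
# layouts: a single JSON array, or one JSON object per line.
def load_alpaca(path):
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        return json.loads(text)              # json: one array of samples
    return [json.loads(line)                 # jsonl: one object per line
            for line in text.splitlines() if line.strip()]
```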

Field Mapping Between alpaca and erniekit

| alpaca | erniekit | Mapping |
| --- | --- | --- |
| instruction, input | src | src[-1] = instruction + input |
| output | tgt | tgt[-1] = output |
| history | src, tgt | history = zip(src[:-1], tgt[:-1]) |
| system | system | system = system |
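The mapping table above can be sketched in code as follows (a hypothetical converter, not ERNIEKit's implementation):

```python
# Hypothetical converter (not ERNIEKit code) implementing the field
# mapping above: history pairs become the earlier turns, and the
# current instruction + input becomes the final src turn.
def alpaca_to_erniekit(sample):
    src, tgt = [], []
    for user_turn, model_turn in sample.get("history", []):
        src.append(user_turn)
        tgt.append(model_turn)
    src.append(sample["instruction"] + sample.get("input", ""))
    tgt.append(sample["output"])
    out = {"src": src, "tgt": tgt}
    if "system" in sample:
        out["system"] = sample["system"]
    return out
```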

DPO Dataset

Required fields for DPO

  • system(optional): System configuration
  • question: User question.
  • chosen: The higher-quality output selected by human annotators.
  • rejected: The lower-quality output for the same question.

Supports json and jsonl file formats

  • json: All data in a single JSON array:
[
    {"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"},
    {"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"},
    {"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"}
]
  • jsonl: Each line contains one JSON object:
{"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"}
{"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"}
{"system": "你是一个AI小助理", "question": "哪一个富含蛋白质,床还是墙?", "chosen": "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。", "rejected": "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"}
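The alpaca DPO fields map naturally onto the erniekit DPO format described earlier. A hypothetical converter (an assumption about the mapping, not ERNIEKit code) might look like:

```python
# Illustrative conversion (not ERNIEKit code) from alpaca DPO fields to
# the erniekit DPO format: the chosen response gets the higher sort
# value, the rejected one the lower.
def alpaca_dpo_to_erniekit(sample):
    out = {
        "src": [sample["question"]],
        "tgt": [],  # single-turn: tgt has one fewer item than src
        "response": [[sample["chosen"]], [sample["rejected"]]],
        "sort": [1, 0],
    }
    if "system" in sample:
        out["system"] = sample["system"]
    return out
```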