[DRAFT] adding generated and custom code for custom training #45951

Open
jayesh-tanna wants to merge 1 commit into feature/azure-ai-projects/2.0.2 from jatanna/trainingv1

@jayesh-tanna jayesh-tanna commented Mar 27, 2026

Description

TypeSpec pull request: Azure/azure-rest-api-specs#41619

Add Training Jobs support to azure-ai-projects SDK

Overview

This PR introduces CommandJob support under client.beta.training.jobs (sync) and
async_client.beta.training.jobs (async), enabling users to create, get, list, update,
cancel, and delete training jobs from the Azure AI Projects SDK without wrapping boilerplate.

Many of our customers currently use azure-ai-ml, so this surface is designed to feel familiar to them: same patterns, same mental model. That way, when they are ready to move to Azure AI Foundry, the migration is a small step rather than a full rewrite.


Design Choices

1. Flat CommandJob surface — no envelope required
Callers pass CommandJob directly to create_or_update and receive CommandJob back from
get/list. The SDK wraps/unwraps the Job(properties=...) wire envelope transparently.

2. Custom CommandJob subclass (model patch)
CommandJob extends the auto-generated _RestCommandJob and exposes read-only name and id
properties promoted from the outer Job envelope returned by the service.

3. _from_rest_object factory method
A classmethod on CommandJob constructs the flat model from any service response object,
with explicit ValueError/TypeError on unexpected shapes rather than silent None fields.
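
To illustrate design choices 1 to 3, here is a minimal, self-contained sketch of the wrap/unwrap idea. It does not use the real generated models: RestCommandJob, RestJob, and to_rest_object are hypothetical stand-ins, and the field set is reduced to two attributes. The actual patch may differ in structure and error messages.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RestCommandJob:
    """Hypothetical stand-in for the generated _RestCommandJob payload."""
    command: str
    compute: str


@dataclass
class RestJob:
    """Hypothetical stand-in for the Job envelope on the wire."""
    properties: Optional[Any] = None
    name: Optional[str] = None
    id: Optional[str] = None


class CommandJob(RestCommandJob):
    """Flat model: payload fields plus read-only name and id."""

    _name: Optional[str] = None
    _id: Optional[str] = None

    @property
    def name(self) -> Optional[str]:
        return self._name

    @property
    def id(self) -> Optional[str]:
        return self._id

    @classmethod
    def _from_rest_object(cls, obj: Any) -> "CommandJob":
        # Fail loudly on unexpected shapes instead of returning None fields.
        if not isinstance(obj, RestJob):
            raise TypeError(f"Expected a Job envelope, got {type(obj).__name__}")
        if obj.properties is None:
            raise ValueError("Job response is missing 'properties'")
        if not isinstance(obj.properties, RestCommandJob):
            raise TypeError(
                f"Expected a command job, got {type(obj.properties).__name__}"
            )
        job = cls(command=obj.properties.command, compute=obj.properties.compute)
        # Promote envelope-level identifiers onto the flat model.
        job._name, job._id = obj.name, obj.id
        return job


def to_rest_object(job: CommandJob) -> RestJob:
    """Wrap the flat model in the envelope before sending (hypothetical helper)."""
    return RestJob(properties=job)
```

With this shape, callers only ever hold a CommandJob; the envelope exists only at the HTTP boundary.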

4. CommandJobLimits.timeout accepts int, float, or timedelta
The patched CommandJobLimits.__init__ converts plain numeric seconds to timedelta before
forwarding to the generated model, eliminating a common serialization foot-gun.
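
The coercion can be sketched roughly as follows. This is a simplified stand-in class, not the patched model itself: coerce_timeout is a hypothetical helper name, and rejecting bool values is an assumption of this sketch rather than documented behaviour.

```python
from datetime import timedelta
from typing import Optional, Union

TimeoutLike = Union[int, float, timedelta]


def coerce_timeout(timeout: Optional[TimeoutLike]) -> Optional[timedelta]:
    """Convert plain numeric seconds to timedelta; pass timedelta/None through."""
    if timeout is None or isinstance(timeout, timedelta):
        return timeout
    # bool is a subclass of int; treating True as 1 second is almost
    # certainly a caller mistake, so this sketch rejects it (assumption).
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)):
        raise TypeError(
            f"timeout must be int, float, or timedelta, got {type(timeout).__name__}"
        )
    return timedelta(seconds=timeout)


class CommandJobLimits:
    """Simplified stand-in for the patched CommandJobLimits."""

    def __init__(self, *, timeout: Optional[TimeoutLike] = None) -> None:
        self.timeout = coerce_timeout(timeout)
```

Under this sketch, CommandJobLimits(timeout=7200) and CommandJobLimits(timeout=timedelta(hours=2)) store the same value.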

5. Auto-injection of Foundry-Features preview header
Every operation (list, get, create_or_update, begin_delete, begin_cancel) automatically injects
Foundry-Features: Jobs=V1Preview so callers never need to pass it manually as a custom header.

6. Automatic local-path resolution for code and inputs
If code or an input path is a local file or folder, the SDK transparently uploads it as a
dataset asset and swaps in the returned datastore URI before the request is sent.
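
The decision logic can be sketched as below. The upload step is represented by a caller-supplied callable (hypothetical; in the PR it goes through the datasets operations), and treating any string containing "://" as an already-remote URI is an assumption of this sketch.

```python
import os
from typing import Callable


def resolve_local_path(path: str, upload: Callable[[str], str]) -> str:
    """Return a remote URI for path, uploading it first if it is local.

    upload is a hypothetical callable that takes an absolute local path
    and returns the datastore URI of the uploaded asset.
    """
    # Assumption: anything with a URI scheme is already remote.
    if "://" in path:
        return path
    # A local file or folder is uploaded and replaced by the returned URI.
    if os.path.exists(path):
        return upload(os.path.abspath(path))
    # Anything else is left untouched for later resolution or validation.
    return path
```

In the sample below, code="./src" and path="./data/train.csv" would both take the upload branch if those paths exist on disk.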

7. Input validation before every create/update
create_or_update validates upfront that name, command, environment_image_reference, and compute
are non-empty, surfacing clear ValueErrors instead of opaque HTTP 400 responses.

8. Full async mirror (_patch_jobs_async.py)
All sync customizations are mirrored in the async operations class using async/await and
distributed_trace_async, including async dataset upload resolution for code and inputs.


Customizations Summary

| Customization | What it does |
| --- | --- |
| Flat CommandJob model with name and id properties | The service returns jobs wrapped in an outer Job envelope. We subclass the generated model to surface name and id directly on the object so callers never need to unwrap job.properties.name. |
| CommandJob._from_rest_object factory | Converts a raw service Job response into a flat CommandJob in one place, with typed error messages if the response shape is unexpected (missing properties, wrong job type). |
| Job envelope wrapping in create_or_update | The service wire format requires Job(properties=CommandJob(...)). The patch wraps the caller's flat CommandJob into the envelope automatically before the HTTP call, keeping the public API clean. |
| CommandJobLimits timeout coercion | Overrides __init__ to accept plain int/float seconds in addition to timedelta, converting them automatically. Removes a class of runtime serialization errors when callers pass numeric timeouts. |
| Foundry-Features preview header injection | Injects Foundry-Features: Jobs=V1Preview into every request via _inject_preview_header, so the preview feature flag is always active without callers needing to know about it. |
| Local-path auto-upload for code and inputs | Before a job is sent, any local file or folder in code or an input path is uploaded to a new dataset via DatasetsOperations, and the field is transparently replaced with the returned datastore URI. |
| Dataset name:version short-form resolution | An input URI in name:version or azureai:name:version form is resolved to a full datastore URI by fetching the existing dataset, removing the need for callers to look up URIs manually. |
| Pre-flight _validate guard | Checks that name, command, environment_image_reference, and compute are non-empty before any network call, giving callers an immediate ValueError with a clear message instead of a cryptic HTTP 400. |
| Async mirror of all sync customizations | Every sync customization (envelope wrap/unwrap, validation, path resolution, header injection) is duplicated with async/await in _patch_jobs_async.py so the async client has identical behaviour. |
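
As an illustration of the short-form resolution row, the parsing half could look roughly like the sketch below. The function name and the set of characters allowed in a name or version are assumptions; the lookup that turns (name, version) into a datastore URI is omitted because it requires a service call.

```python
import re
from typing import Optional, Tuple

# Assumption: names and versions use letters, digits, '_', '.', and '-'.
_SHORT_FORM = re.compile(
    r"^(?:azureai:)?(?P<name>[A-Za-z0-9_.-]+):(?P<version>[A-Za-z0-9_.-]+)$"
)


def parse_dataset_reference(uri: str) -> Optional[Tuple[str, str]]:
    """Return (name, version) for a short-form reference, else None.

    Full URIs such as azureai://datastores/... and local paths contain
    '/' and therefore never match the short form.
    """
    match = _SHORT_FORM.match(uri)
    if match is None:
        return None
    return match.group("name"), match.group("version")
```

A caller would resolve the input only when this returns a tuple, and leave full URIs and local paths to the other resolution steps.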

Pending / Future Work

  • command() factory function — Following the same pattern as azure-ai-ml's top-level
    command() function (see azure.ai.ml.entities._builders.command_func), a standalone
    command(*, command, environment, compute, inputs, outputs, ...) helper will be added so users
    can write job = command(...); client.beta.training.jobs.create_or_update(name, job) without
    constructing CommandJob directly.
  • Unit & live test coverage — Tests for the patch layer (validation, local-path resolution,
    header injection, _from_rest_object error paths, async equivalents) will be added in this PR
    in the next commit.

Sample code

# Assumes CommandJob, Input, Output, AssetTypes, InputOutputModes,
# JobResourceConfiguration, PyTorchDistribution, CommandJobLimits, and
# QueueSettings are imported, and that project_client and compute_id are defined.
job = CommandJob(
    command="python train.py --epochs 10 --lr 0.001 --output $AZUREML_MODEL_DIR/outputs",
    environment_image_reference="mcr.microsoft.com/azureml/minimal-ubuntu22.04-py39-cuda11.8-gpu-inference",
    compute=compute_id,
    display_name="Sample Command Job - Full",
    description="A sample job created via the Azure AI Projects SDK.",
    tags={"framework": "pytorch", "priority": "low", "team": "ai-platform"},
    properties={"experiment_id": "exp-42", "model_version": "1.0"},
    # Local folder: uploaded as a dataset asset before the request is sent.
    code="./src",
    environment_variables={
        "NCCL_DEBUG": "INFO",
        "PYTHONPATH": "/opt/conda/lib/python3.9/site-packages",
    },
    inputs={
        "training_data": Input(
            type=AssetTypes.URI_FILE,
            # Local file: also uploaded and replaced with a datastore URI.
            path="./data/train.csv",
            mode=InputOutputModes.READ_ONLY_MOUNT,
            description="CIFAR-10 training split",
        ),
    },
    outputs={
        "model_output": Output(
            type=AssetTypes.URI_FOLDER,
            path="azureai://datastores/workspaceblobstore/paths/outputs/cifar10-model/",
            mode=InputOutputModes.UPLOAD,
            asset_name="cifar10-trained-model",
            description="Trained CIFAR-10 model",
        ),
    },
    resources=JobResourceConfiguration(
        instance_count=2,
        instance_type="Standard_NC6s_v3",
        shm_size="8g",
        docker_args="--ipc=host",
        properties={"AISuperComputer": {"slaTier": "Premium", "priority": "high"}},
    ),
    distribution=PyTorchDistribution(process_count_per_instance=1),
    # Plain seconds are converted to timedelta by the patched CommandJobLimits.
    limits=CommandJobLimits(timeout=7200),
    queue_settings=QueueSettings(job_tier="Spot"),
    is_archived=False,
)
job = project_client.beta.training.jobs.create_or_update(name="job_name", body=job)
print(job)

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@jayesh-tanna jayesh-tanna changed the title adding generated and custom code for custom training [DRAFT] adding generated and custom code for custom training Mar 27, 2026