[DRAFT] adding generated and custom code for custom training by jayesh-tanna · Pull Request #45951 · Azure/azure-sdk-for-python

jayesh-tanna · 2026-03-27T00:57:24Z

Description

Typespec PullRequest: Azure/azure-rest-api-specs#41619

Add Training Jobs support to `azure-ai-projects` SDK

Overview

This PR introduces CommandJob support under client.beta.training.jobs (sync) and
async_client.beta.training.jobs (async), enabling users to create, get, list, update,
cancel, and delete training jobs from the Azure AI Projects SDK without wrapping boilerplate.

Many of our customers currently use azure-ai-ml, so this surface is designed to feel familiar to them: same patterns, same mental model. That way, when they are ready to move to Azure AI Foundry, the migration is a small step rather than a full rewrite.

Design Choices

1. Flat CommandJob surface — no envelope required
Callers pass CommandJob directly to create_or_update and receive CommandJob back from
get/list. The SDK wraps/unwraps the Job(properties=...) wire envelope transparently.

2. Custom CommandJob subclass (model patch)
CommandJob extends the auto-generated _RestCommandJob and exposes read-only name and id
properties promoted from the outer Job envelope returned by the service.

3. _from_rest_object factory method
A classmethod on CommandJob constructs the flat model from any service response object,
with explicit ValueError/TypeError on unexpected shapes rather than silent None fields.

4. CommandJobLimits.timeout accepts int, float, or timedelta
The patched CommandJobLimits.__init__ converts plain numeric seconds to timedelta before
forwarding to the generated model, eliminating a common serialization foot-gun.

5. Auto-injection of Foundry-Features preview header
Every operation (list, get, create_or_update, begin_delete, begin_cancel) automatically injects
Foundry-Features: Jobs=V1Preview so callers never need to pass it manually as a custom header.

6. Automatic local-path resolution for code and inputs
If code or an input path is a local file or folder, the SDK transparently uploads it as a
dataset asset and swaps in the returned datastore URI before the request is sent.

7. Input validation before every create/update
create_or_update validates name, command, environment_image_reference, and compute
are non-empty upfront, surfacing clear ValueErrors instead of opaque HTTP 400 responses.

8. Full async mirror (_patch_jobs_async.py)
All sync customizations are mirrored in the async operations class using async/await and
distributed_trace_async, including async dataset upload resolution for code and inputs.
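
To make design choices 1 to 3 concrete, here is a minimal sketch of the envelope wrap/unwrap idea. It does not use the real SDK types: `RestJob`, `FlatCommandJob`, and `wrap_in_envelope` are hypothetical stand-ins, and `name`/`id` are plain fields here rather than read-only properties. It only illustrates how a flat model could be built from the outer envelope with explicit errors on unexpected shapes.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RestJob:
    """Hypothetical stand-in for the outer Job wire envelope."""
    properties: Any = None
    name: Optional[str] = None
    id: Optional[str] = None


@dataclass
class FlatCommandJob:
    """Hypothetical stand-in for the flat CommandJob that callers work with."""
    command: str
    job_type: str = "Command"
    name: Optional[str] = None  # promoted from the envelope
    id: Optional[str] = None    # promoted from the envelope

    @classmethod
    def _from_rest_object(cls, obj: Any) -> "FlatCommandJob":
        # Fail loudly on unexpected shapes instead of returning None fields.
        if not isinstance(obj, RestJob):
            raise TypeError(f"Expected a Job envelope, got {type(obj).__name__}")
        props = obj.properties
        if props is None:
            raise ValueError("Job response is missing 'properties'")
        if getattr(props, "job_type", None) != "Command":
            raise ValueError("Job response is not a command job")
        return cls(command=props.command, name=obj.name, id=obj.id)


def wrap_in_envelope(job: FlatCommandJob) -> RestJob:
    """Wrap the caller's flat job into the envelope before the HTTP call."""
    return RestJob(properties=job)
```

In this sketch the caller only ever sees `FlatCommandJob`; the envelope exists solely between the wrap step and `_from_rest_object`.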

Customizations Summary

| Customization | What it does |
| --- | --- |
| Flat `CommandJob` model with `name` and `id` properties | The service returns jobs wrapped in an outer `Job` envelope. We subclass the generated model to surface `name` and `id` directly on the object so callers never need to unwrap `job.properties.name`. |
| `CommandJob._from_rest_object` factory | Converts a raw service `Job` response into a flat `CommandJob` in one place, with typed error messages if the response shape is unexpected (missing properties, wrong job type). |
| `Job` envelope wrapping in `create_or_update` | The service wire format requires `Job(properties=CommandJob(...))`. The patch wraps the caller's flat `CommandJob` into the envelope automatically before the HTTP call, keeping the public API clean. |
| `CommandJobLimits` timeout coercion | Overrides `__init__` to accept plain `int`/`float` seconds in addition to `timedelta`, converting them automatically. Removes a class of runtime serialization errors when callers pass numeric timeouts. |
| `Foundry-Features` preview header injection | Injects `Foundry-Features: Jobs=V1Preview` into every request from `_inject_preview_header`, so the preview feature flag is always active without callers needing to know about it. |
| Local-path auto-upload for `code` and `inputs` | Before sending a job, any local file or folder in `code` or input `path` is uploaded to a new dataset via `DatasetsOperations` and the field is replaced with the returned datastore URI transparently. |
| Dataset name:version short-form resolution | An input URI in `name:version` or `azureai:name:version` form is resolved to a full datastore URI by fetching the existing dataset, removing the need for callers to look up URIs manually. |
| Pre-flight `_validate` guard | Checks `name`, `command`, `environment_image_reference`, and `compute` are non-empty before any network call, giving callers an immediate `ValueError` with a clear message instead of a cryptic HTTP 400. |
| Async mirror of all sync customizations | Every sync customization (envelope wrap/unwrap, validation, path resolution, header injection) is duplicated with `async`/`await` in `_patch_jobs_async.py` so the async client has identical behaviour. |
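
Two of the smaller customizations, timeout coercion and the pre-flight guard, can be sketched in isolation. This is an illustration under assumptions, not the patched code: `coerce_timeout` and `validate_job` are hypothetical helper names, and the job is represented as a plain dict instead of the generated model.

```python
from datetime import timedelta
from typing import Mapping, Optional, Union


def coerce_timeout(timeout: Union[int, float, timedelta, None]) -> Optional[timedelta]:
    """Accept numeric seconds or a timedelta; always hand back a timedelta (or None)."""
    if timeout is None or isinstance(timeout, timedelta):
        return timeout
    if isinstance(timeout, (int, float)):
        return timedelta(seconds=timeout)
    raise TypeError(f"timeout must be int, float, or timedelta, got {type(timeout).__name__}")


# Fields the description says are checked before any network call.
REQUIRED_FIELDS = ("command", "environment_image_reference", "compute")


def validate_job(name: str, job: Mapping[str, object]) -> None:
    """Raise ValueError early rather than letting the service answer with HTTP 400."""
    if not isinstance(name, str) or not name.strip():
        raise ValueError("'name' must be a non-empty string")
    for field_name in REQUIRED_FIELDS:
        value = job.get(field_name)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"'{field_name}' must be a non-empty string")
```

With this shape, `coerce_timeout(7200)` and `coerce_timeout(timedelta(hours=2))` yield the same value, which is the equivalence the `CommandJobLimits` patch is described as providing.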

Pending / Future Work

command() factory function — Following the same pattern as azure-ai-ml's top-level
command() function (see azure.ai.ml.entities._builders.command_func), a standalone
command(*, command, environment, compute, inputs, outputs, ...) helper will be added so users
can write job = command(...); client.beta.training.jobs.create_or_update(name, job) without
constructing CommandJob directly.
Unit & live test coverage — Tests for the patch layer (validation, local-path resolution,
header injection, _from_rest_object error paths, async equivalents) will be added in this PR
in the next commit.
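
As a rough idea of what the planned `command()` helper could look like, here is a keyword-only sketch. Everything in it is an assumption made for illustration: `SketchCommandJob` is a stand-in for the real model, and mapping the `environment` argument onto `environment_image_reference` is a guess rather than something this PR specifies.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchCommandJob:
    """Hypothetical stand-in for CommandJob, reduced to a few fields."""
    command: str
    environment_image_reference: str
    compute: str
    inputs: Dict[str, Any] = field(default_factory=dict)
    outputs: Dict[str, Any] = field(default_factory=dict)


def command(
    *,
    command: str,
    environment: str,
    compute: str,
    inputs: Optional[Dict[str, Any]] = None,
    outputs: Optional[Dict[str, Any]] = None,
) -> SketchCommandJob:
    """Build a job from keyword arguments so callers need not construct the model directly."""
    return SketchCommandJob(
        command=command,
        environment_image_reference=environment,  # assumed mapping
        compute=compute,
        inputs=dict(inputs or {}),    # copy so the caller's dict is not shared
        outputs=dict(outputs or {}),
    )
```

The returned object would then be passed to `create_or_update` in the same way as a directly constructed `CommandJob`.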

Sample code

# Assumes CommandJob, Input, Output, AssetTypes, InputOutputModes, JobResourceConfiguration,
# PyTorchDistribution, CommandJobLimits, and QueueSettings are imported from the SDK, and that
# project_client (an authenticated client) and compute_id are already defined.
job = CommandJob(
    command="python train.py --epochs 10 --lr 0.001 --output $AZUREML_MODEL_DIR/outputs",
    environment_image_reference="mcr.microsoft.com/azureml/minimal-ubuntu22.04-py39-cuda11.8-gpu-inference",
    compute=compute_id,
    display_name="Sample Command Job - Full",
    description="A sample job created via the Azure AI Projects SDK.",
    tags={"framework": "pytorch", "priority": "low", "team": "ai-platform"},
    properties={"experiment_id": "exp-42", "model_version": "1.0"},
    # Local folder: uploaded as a dataset asset before the request is sent.
    code="./src",
    environment_variables={
        "NCCL_DEBUG": "INFO",
        "PYTHONPATH": "/opt/conda/lib/python3.9/site-packages",
    },
    inputs={
        "training_data": Input(
            type=AssetTypes.URI_FILE,
            # Local file: also resolved to a datastore URI automatically.
            path="./data/train.csv",
            mode=InputOutputModes.READ_ONLY_MOUNT,
            description="CIFAR-10 training split",
        ),
    },
    outputs={
        "model_output": Output(
            type=AssetTypes.URI_FOLDER,
            path="azureai://datastores/workspaceblobstore/paths/outputs/cifar10-model/",
            mode=InputOutputModes.UPLOAD,
            asset_name="cifar10-trained-model",
            description="Trained CIFAR-10 model",
        ),
    },
    resources=JobResourceConfiguration(
        instance_count=2,
        instance_type="Standard_NC6s_v3",
        shm_size="8g",
        docker_args="--ipc=host",
        properties={"AISuperComputer": {"slaTier": "Premium", "priority": "high"}},
    ),
    distribution=PyTorchDistribution(process_count_per_instance=1),
    # Plain numeric seconds; converted to timedelta by the patched CommandJobLimits.
    limits=CommandJobLimits(timeout=7200),
    queue_settings=QueueSettings(job_tier="Spot"),
    is_archived=False,
)
# The flat CommandJob is wrapped in the Job envelope by the SDK; a flat CommandJob is returned.
job = project_client.beta.training.jobs.create_or_update(name="job_name", body=job)
print(job)

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.

adding generated and custom code for custom training

947eebd

jayesh-tanna requested review from bobogogo1990, dargilco, glharper, howieleung, kingernupur, nick863, trangevi and trrwilson as code owners March 27, 2026 00:57

github-actions bot added the AI Projects label Mar 27, 2026

jayesh-tanna changed the title ~~adding generated and custom code for custom training~~ [DRAFT] adding generated and custom code for custom training Mar 27, 2026

jayesh-tanna wants to merge 1 commit into feature/azure-ai-projects/2.0.2 from jatanna/trainingv1
