TrainingClient API Reference

Overview

The TrainingClient is used for training ML models with forward/backward passes and optimization. You typically get a TrainingClient by calling service_client.create_lora_training_client() or service_client.create_training_client_from_state().
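
For reference, a minimal sketch of obtaining a client; the ServiceClient import path and constructor arguments below are assumptions for illustration, not the documented API:

from hpcai import ServiceClient  # assumed import path

service_client = ServiceClient(api_key="...")  # hypothetical constructor arguments
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-8B"  # hypothetical keyword argument; model name taken from the examples below
)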

Base URL: https://www.hpc-ai.com/finetunesdk

Table of Contents

  • forward()
  • forward_backward()
  • optim_step()
  • get_info()
  • get_tokenizer()
  • save_state()
  • load_state()
  • save_weights_for_sampler()
  • unload_model()

forward()

Description

Performs a forward pass on the model without computing gradients. This method sends the input data and the loss function name (currently only "cross_entropy" is supported) to the training service and returns an asynchronous handle (APIFuture) for retrieving the ForwardBackwardOutput. Because no gradients are computed or accumulated, forward() is suitable for evaluation, e.g. computing validation loss.

Signature

def forward(
    data: List[types.Datum],
    loss_fn: str = "cross_entropy"
) -> APIFuture[ForwardBackwardOutput]

Parameters

  • data (List[types.Datum], required): A list of Datum objects (from hpcai.types.Datum). Each Datum contains:
    - model_input: The model input, i.e. the input token ids.
    - loss_fn_inputs: A dictionary of tensors required by the specified loss function. For "cross_entropy" loss, this must contain "weights" (a loss mask applied to the per-token loss) and "target_tokens" (the shifted labels, matching the shape of weights and model_input).
  • loss_fn (str, default "cross_entropy"): The name of the loss function to use during the forward pass (currently only "cross_entropy" is supported).

Returns

APIFuture[ForwardBackwardOutput]: An asynchronous handle that resolves to ForwardBackwardOutput when .result() is called.

For "cross_entropy" loss, the handle returns a ForwardBackwardOutput object that contains:

  • loss_fn_outputs: A dictionary containing the log probability of each token (logprobs) and the masked per-token loss (elementwise_loss), both of the same shape as model_input
  • metrics: A dictionary of training-related metrics, including the mean training loss, the number of examples in the batch, and the current step number

Example

from hpcai import types

data = [
    types.Datum(
        model_input=types.ModelInput.from_ints([1, 2, 3, 4, 5]),
        loss_fn_inputs={
            "target_tokens": types.TensorData(
                data=[2, 3, 4, 5, 6],
                dtype="int64",
                shape=[5]
            ),
            "weights": types.TensorData(
                data=[1.0, 1.0, 1.0, 1.0, 1.0],
                dtype="float32",
                shape=[5]
            )
        }
    )
]

out = training_client.forward(data, loss_fn="cross_entropy")
res = out.result()

print(res)

Expected Output:

ForwardBackwardOutput(loss_fn_output_type='cross_entropy', loss_fn_outputs=[{'logprobs': TensorData(data=[-8.9375, -10.375, -1.2734375, -0.65234375, -1.0], dtype='float32', shape=[5]), 'elementwise_loss': TensorData(data=[8.9375, 10.375, 1.2734375, 0.65234375, 1.0], dtype='float32', shape=[5])}], metrics={'loss:mean': 4.447656154632568, 'num_examples:sum': 1.0, 'step:max': 0.0})
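
The fields of the resolved result can be read directly. A short sketch, assuming TensorData exposes its fields as attributes (consistent with the printed output above):

# Read metrics and per-token outputs from the resolved result
mean_loss = res.metrics["loss:mean"]
logprobs = res.loss_fn_outputs[0]["logprobs"].data  # per-token log probabilities
print(f"mean loss: {mean_loss:.4f}, first-token logprob: {logprobs[0]:.4f}")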

forward_backward()

Description

Performs a forward-backward pass on the model. This call invokes loss.backward(), and gradients accumulate across multiple forward_backward() calls until optim_step() is invoked, as shown in the sketch below. The method sends the input data and the loss function name (currently only "cross_entropy" is supported) to the training service and returns an asynchronous handle (APIFuture) for retrieving the ForwardBackwardOutput.
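
Because gradients accumulate until optim_step() is called, several micro-batches can be pushed through forward_backward() before a single optimizer update. A minimal sketch, assuming micro_batches is a list of List[types.Datum] batches prepared as in the example below:

# Accumulate gradients over several micro-batches, then apply one optimizer update
futures = [training_client.forward_backward(mb, loss_fn="cross_entropy") for mb in micro_batches]
results = [f.result() for f in futures]  # wait for every forward-backward pass to finish
training_client.optim_step(adam_params=types.AdamParams(learning_rate=1e-4)).result()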

Signature

def forward_backward(
    data: List[types.Datum],
    loss_fn: str = "cross_entropy"
) -> APIFuture[ForwardBackwardOutput]

Parameters

  • data (List[types.Datum], required): A list of Datum objects (from hpcai.types.Datum). Each Datum contains:
    - model_input: The model input, i.e. the input token ids.
    - loss_fn_inputs: A dictionary of tensors required by the specified loss function. For "cross_entropy" loss, this must contain "weights" (a loss mask applied to the per-token loss) and "target_tokens" (the shifted labels, matching the shape of weights and model_input).
  • loss_fn (str, default "cross_entropy"): The name of the loss function to use during the forward-backward pass (currently only "cross_entropy" is supported).

Returns

An asynchronous handle that resolves to ForwardBackwardOutput when .result() is called.

For "cross_entropy" loss, the handle returns a ForwardBackwardOutput object that contains:

  • loss_fn_outputs: A dictionary containing the log probability of each token (logprobs) and the masked per-token loss (elementwise_loss), both of the same shape as model_input
  • metrics: A dictionary of training-related metrics, including the mean training loss, the number of examples in the batch, and the current step number

Example

from hpcai import types

data = [
    types.Datum(
        model_input=types.ModelInput.from_ints([1, 2, 3, 4, 5]),
        loss_fn_inputs={
            "target_tokens": types.TensorData(
                data=[2, 3, 4, 5, 6],
                dtype="int64",
                shape=[5]
            ),
            "weights": types.TensorData(
                data=[1.0, 1.0, 1.0, 1.0, 1.0],
                dtype="float32",
                shape=[5]
            )
        }
    )
]

out = training_client.forward_backward(data, loss_fn="cross_entropy")
res = out.result()

print(res)

Expected Output:

ForwardBackwardOutput(loss_fn_output_type='cross_entropy', loss_fn_outputs=[{'logprobs': TensorData(data=[-8.9375, -10.375, -1.2734375, -0.65234375, -1.0], dtype='float32', shape=[5]), 'elementwise_loss': TensorData(data=[8.9375, 10.375, 1.2734375, 0.65234375, 1.0], dtype='float32', shape=[5])}], metrics={'loss:mean': 4.447656154632568, 'num_examples:sum': 1.0, 'step:max': 0.0})

optim_step()

Description

Executes a single optimization step using the Adam optimizer and returns an asynchronous handle (APIFuture) for retrieving the OptimStepResponse. This call applies the accumulated gradients to update the model parameters based on the provided Adam hyperparameters.
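
A typical training step pairs one or more forward_backward() calls with a single optim_step(). A minimal sketch, assuming data is a List[types.Datum] prepared as in the forward_backward() example above:

# One forward-backward pass followed by one parameter update, repeated per step
for step in range(3):
    fb = training_client.forward_backward(data, loss_fn="cross_entropy").result()
    training_client.optim_step(adam_params=types.AdamParams(learning_rate=1e-4)).result()
    print(f"step {step}: loss={fb.metrics['loss:mean']:.4f}")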

Signature

def optim_step(
    adam_params: types.AdamParams
) -> APIFuture[OptimStepResponse]

Parameters

  • adam_params (types.AdamParams): Configuration of the Adam optimizer, including:
    - learning_rate (float): Step size for parameter updates. Defaults to 0.0001.
    - beta1 (float): Exponential decay rate for the first-moment estimates. Defaults to 0.9.
    - beta2 (float): Exponential decay rate for the second-moment estimates. Defaults to 0.95.
    - eps (float): Small constant for numerical stability. Defaults to 1e-12.

Returns

An asynchronous handle (APIFuture) that resolves to an OptimStepResponse object containing optimization metrics:

  • step:max: The current step number
  • learning_rate:mean: The learning rate used in this step
  • recent_loss:mean: Mean training loss

Example

res = training_client.optim_step(
    adam_params=types.AdamParams(learning_rate=1e-4, beta1=0.9, beta2=0.95, eps=1e-12)
).result()

print(res)

Expected Output:

OptimStepResponse(metrics={'step:max': 1.0, 'learning_rate:mean': 0.0001, 'recent_loss:mean': 4.447656154632568})

get_info()

Description

Retrieves metadata and configuration details about the current training session, including the model_id (run id), model architecture information, and LoRA settings.

Signature

def get_info() -> GetInfoResponse

Parameters

None

Returns

Returns a GetInfoResponse object containing fields such as:

  • type: The response type, usually "get_info"
  • model_data: A ModelData object including architecture and model name
  • model_id: Unique identifier of the model/run
  • is_lora: Boolean indicating whether LoRA fine-tuning is enabled
  • lora_rank: The LoRA rank if LoRA is enabled
  • model_name: The base model name

Example

info = training_client.get_info()

print(info)

Expected Output:

GetInfoResponse(type='get_info', model_data=ModelData(arch='', model_name='Qwen/Qwen3-8B'), model_id='e5c88495-46e9-43df-9bf8-3185aceaa222', is_lora=True, lora_rank=16, model_name='Qwen/Qwen3-8B')
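
The response fields are plain attributes, consistent with the printed output above; for example, the run id can be kept for listing checkpoints later:

info = training_client.get_info()
run_id = info.model_id  # unique id of this run; useful for listing checkpoints later
print(run_id, info.is_lora, info.lora_rank)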

get_tokenizer()

Description

Returns a pre-trained tokenizer object from the Hugging Face Transformers library, corresponding to the model loaded for the current training run.

Signature

def get_tokenizer() -> PreTrainedTokenizer

Parameters

None

Returns

A PreTrainedTokenizer object from the Hugging Face Transformers library.

Example

tokenizer = training_client.get_tokenizer()

print(tokenizer.encode("Hello, world!"))

Expected Output:

[9707, 11, 1879, 0]
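
In practice, the tokenizer turns raw text into the token ids consumed by forward() and forward_backward(). A minimal sketch of building a next-token-prediction Datum from text, following the shift-by-one convention of the forward() example above:

from hpcai import types

tokenizer = training_client.get_tokenizer()
ids = tokenizer.encode("Hello, world!")

# Targets are the input ids shifted left by one; a weight of 1.0 keeps every position in the loss
seq_len = len(ids) - 1
datum = types.Datum(
    model_input=types.ModelInput.from_ints(ids[:-1]),
    loss_fn_inputs={
        "target_tokens": types.TensorData(data=ids[1:], dtype="int64", shape=[seq_len]),
        "weights": types.TensorData(data=[1.0] * seq_len, dtype="float32", shape=[seq_len])
    }
)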

save_state()

Description

Saves the trainable weights of the loaded model (e.g., if LoRA is enabled, only the LoRA adapter weights are saved) under a user-specified checkpoint_id for resuming training, and returns an asynchronous handle (APIFuture) for retrieving the SaveWeightsResponse.

Signature

def save_state(
    name: str
) -> APIFuture[SaveWeightsResponse]

Parameters

  • name (str): The user-specified checkpoint_id, used as the identifier of the saved weights.

Returns

An asynchronous handle (APIFuture) for retrieving the SaveWeightsResponse object, which contains the URL path of the saved weights in the format: "hpcai://{model_id}/weights/{checkpoint_id}_training"

Example

res = training_client.save_state("initial").result()

print(res)

Expected Output:

SaveWeightsResponse(path='hpcai://214c086a-f75b-49e1-9b1a-62ce2670baa7/weights/initial_training', type='save_weights')

load_state()

Description

Loads the trainable weights (e.g., if LoRA is enabled, only the LoRA adapter weights are loaded) of the checkpoint identified by the user-specified checkpoint_id, and returns an asynchronous handle (APIFuture) for retrieving the LoadWeightsResponse.

Signature

def load_state(
    path: str
) -> APIFuture[LoadWeightsResponse]

Parameters

  • path (str): The URL path of a saved checkpoint, in the format "hpcai://{model_id}/sampler_weights/{checkpoint_id}_sampler" (for inference) or "hpcai://{model_id}/weights/{checkpoint_id}_training" (for resuming training).

Returns

An asynchronous handle (APIFuture) for retrieving the LoadWeightsResponse object, which contains the URL path of the loaded weights on success.

Example

res = training_client.load_state('hpcai://e8b29733-9efa-476a-b62b-40b8f9c6d999/weights/initial_training').result()

print(res)

Expected Output:

LoadWeightsResponse(path='hpcai://e8b29733-9efa-476a-b62b-40b8f9c6d999/weights/initial_training', type='load_weights')
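
Together, save_state() and load_state() support a checkpoint-and-resume workflow within a session. A short sketch (the checkpoint name is arbitrary; to resume in a fresh session, pass the saved path to service_client.create_training_client_from_state() as noted in the overview):

# Save a training checkpoint, then roll back to it later
save_path = training_client.save_state("epoch_1").result().path
# ... further forward_backward()/optim_step() calls ...
training_client.load_state(save_path).result()  # restore the saved weights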

save_weights_for_sampler()

Description

Saves the trainable weights of the loaded model (e.g., if LoRA is enabled, only the LoRA adapter weights are saved) under a user-specified checkpoint_id for inference, and returns an asynchronous handle (APIFuture) for retrieving the SaveWeightsForSamplerResponse.

Signature

def save_weights_for_sampler(
    name: str
) -> APIFuture[SaveWeightsForSamplerResponse]

Parameters

  • name (str): The user-specified checkpoint_id, used as the identifier of the saved weights.

Returns

An asynchronous handle (APIFuture) for retrieving the SaveWeightsForSamplerResponse object, which contains the URL path of the saved weights in the format: "hpcai://{model_id}/sampler_weights/{checkpoint_id}_sampler"

Example

res = training_client.save_weights_for_sampler("initial").result()

print(res)

Expected Output:

SaveWeightsForSamplerResponse(path='hpcai://c857f372-1979-43e4-8afe-d0546ce38d98/sampler_weights/initial_sampler', type='save_weights_for_sampler')
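
A common pattern is to save inference-ready weights periodically and keep the returned paths for a downstream sampling/inference client (which is outside the scope of this reference). A sketch; num_epochs and the inner training loop are placeholders:

# Save inference-ready weights after each epoch and record their paths
num_epochs = 3  # illustrative
sampler_paths = []
for epoch in range(num_epochs):
    # ... forward_backward()/optim_step() calls for this epoch ...
    resp = training_client.save_weights_for_sampler(f"epoch_{epoch}").result()
    sampler_paths.append(resp.path)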

unload_model()

Description

Signals the end of the current training session and releases resources such as GPUs. Note that this API does not automatically save the model weights. It does not affect previously saved checkpoints; you can still download them after the model is unloaded.

Signature

def unload_model() -> APIFuture[UnloadModelResponse]

Parameters

None

Returns

An asynchronous handle (APIFuture) for retrieving the UnloadModelResponse object, which contains the model id of the unloaded model on success.

Example

res = training_client.unload_model().result()

print(res)

# rest_client: a separately obtained REST client; saved checkpoints remain listable after unloading
res = rest_client.list_checkpoints(training_client.model_id).result()

print(res)

Expected Output:

UnloadModelResponse(model_id='c857f372-1979-43e4-8afe-d0546ce38d98', type='unload_model')
CheckpointsListResponse(checkpoints=[Checkpoint(checkpoint_id='initial_sampler', checkpoint_type='sampler', time=datetime.datetime(2025, 11, 28, 1, 58, 36, tzinfo=TzInfo(0)), checkpoint_path='hpcai://e8b29733-9efa-476a-b62b-40b8f9c6d999/sampler_weights/initial_sampler'), Checkpoint(checkpoint_id='initial_training', checkpoint_type='training', time=datetime.datetime(2025, 11, 28, 1, 58, 25, tzinfo=TzInfo(0)), checkpoint_path='hpcai://e8b29733-9efa-476a-b62b-40b8f9c6d999/weights/initial_training')])