Einsum fails on Triton-ONNX-Runtime
When I export this model to ONNX and serve it on NVIDIA Triton with the ONNX Runtime backend, I get the following error:
2025-02-11 15:44:11.410228203 [E:onnxruntime:log, cuda_call.cc:123 CudaCall] CUBLAS failure 7: CUBLAS_STATUS_INVALID_VALUE ; GPU=0 ; hostname=9b6c766f16ea ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/math/einsum_utils/einsum_auxiliary_ops.cc ; line=54 ; expr=cublasGemmStridedBatchedHelper( static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cublas_handle_, CUBLAS_OP_N, CUBLAS_OP_N, static_cast<int>(N), static_cast<int>(M), static_cast<int>(K), &one, reinterpret_cast<const CudaT*>(input_2_data), static_cast<int>(N), static_cast<int64_t>(right_stride), reinterpret_cast<const CudaT*>(input_1_data), static_cast<int>(K), static_cast<int64_t>(left_stride), &zero, reinterpret_cast<CudaT*>(output_data), static_cast<int>(N), static_cast<int64_t>(output_stride), static_cast<int>(num_batches), static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cuda_ep_->GetDeviceProp(), static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cuda_ep_->UseTF32());
2025-02-11 15:44:11.432230090 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Einsum node. Name:'/Einsum' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cpu/math/einsum_utils/einsum_auxiliary_ops.cc:341 std::unique_ptr<onnxruntime::Tensor> onnxruntime::EinsumOp::MatMul(const onnxruntime::Tensor&, const gsl::span<const int64_t>&, const onnxruntime::Tensor&, const gsl::span<const int64_t>&, onnxruntime::AllocatorPtr, onnxruntime::concurrency::ThreadPool*, void*, DeviceHelpers::MatMul<T>&) [with T = float; onnxruntime::AllocatorPtr = std::shared_ptr<onnxruntime::IAllocator>; DeviceHelpers::MatMul<T> = std::function<onnxruntime::common::Status(const float*, const float*, float*, long unsigned int, long unsigned int, long unsigned int, long unsigned int, long unsigned int, long unsigned int, long unsigned int, onnxruntime::concurrency::ThreadPool*, void*)>] Einsum op: Exception during MatMul operation: CUBLAS failure 7: CUBLAS_STATUS_INVALID_VALUE ; GPU=0 ; hostname=9b6c766f16ea ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/math/einsum_utils/einsum_auxiliary_ops.cc ; line=54 ; expr=cublasGemmStridedBatchedHelper( static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cublas_handle_, CUBLAS_OP_N, CUBLAS_OP_N, static_cast<int>(N), static_cast<int>(M), static_cast<int>(K), &one, reinterpret_cast<const CudaT*>(input_2_data), static_cast<int>(N), static_cast<int64_t>(right_stride), reinterpret_cast<const CudaT*>(input_1_data), static_cast<int>(K), static_cast<int64_t>(left_stride), &zero, reinterpret_cast<CudaT*>(output_data), static_cast<int>(N), static_cast<int64_t>(output_stride), static_cast<int>(num_batches), static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cuda_ep_->GetDeviceProp(), static_cast<EinsumCudaAssets*>(einsum_cuda_assets)->cuda_ep_->UseTF32());
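To rule out the Triton backend itself, the same failure can usually be reproduced by running the exported model directly under onnxruntime-gpu. A minimal sketch, assuming the export is saved as model.onnx; the zero-filled feeds and the unit sizes substituted for dynamic dimensions are placeholders, so realistic values may be needed to hit the same code path:

import numpy as np
import onnxruntime as ort

# Load the exported model with the CUDA execution provider (CPU as fallback).
sess = ort.InferenceSession(
    "model.onnx",  # hypothetical path to the exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Build zero-filled dummy feeds from the graph's declared inputs,
# substituting 1 for every dynamic (non-integer) dimension.
dtype_map = {
    "tensor(int32)": np.int32,
    "tensor(int64)": np.int64,
    "tensor(float)": np.float32,
    "tensor(bool)": np.bool_,
}
feeds = {
    inp.name: np.zeros(
        [d if isinstance(d, int) else 1 for d in inp.shape],
        dtype=dtype_map[inp.type],
    )
    for inp in sess.get_inputs()
}

# If the CUDA Einsum kernel is at fault, this raises the same CUBLAS error.
outputs = sess.run(None, feeds)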
Has anyone been able to serve the ONNX model on Triton?
Thanks
I also want to deploy it via Triton. Did you figure out how to do so, and what config.pbtxt did you use?
Yes, I have it working.
I used torch to export to ONNX (torch.onnx.export), with torch.int32 as the input datatype.
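For reference, a rough sketch of that export call, assuming an already-loaded model object; the dummy shapes, dynamic-axis names, and opset version are assumptions to adapt to your own setup:

import torch

# model = ...  # your torch.nn.Module, loaded however you normally do

seq_len, num_spans = 16, 8
dummy_inputs = (
    torch.zeros(1, seq_len, dtype=torch.int32),       # input_ids
    torch.ones(1, seq_len, dtype=torch.int32),        # attention_mask
    torch.ones(1, seq_len, dtype=torch.int32),        # words_mask
    torch.tensor([[seq_len]], dtype=torch.int32),     # text_lengths
    torch.zeros(1, num_spans, 2, dtype=torch.int32),  # span_idx
    torch.ones(1, num_spans, dtype=torch.bool),       # span_mask
)

torch.onnx.export(
    model,
    dummy_inputs,
    "model.onnx",
    input_names=["input_ids", "attention_mask", "words_mask",
                 "text_lengths", "span_idx", "span_mask"],
    output_names=["logits"],
    # Mark batch and sequence/span dimensions as dynamic so Triton can batch.
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "words_mask": {0: "batch", 1: "seq"},
        "text_lengths": {0: "batch"},
        "span_idx": {0: "batch", 1: "spans"},
        "span_mask": {0: "batch", 1: "spans"},
        "logits": {0: "batch"},
    },
    opset_version=17,  # assumption; use whatever opset your model needs
)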
This is my config.pbtxt:
platform: "onnxruntime_onnx"
max_batch_size: 128
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "words_mask"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "text_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "span_idx"
    data_type: TYPE_INT32
    dims: [ -1, 2 ]
  },
  {
    name: "span_mask"
    data_type: TYPE_BOOL
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]
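And in case it helps, a hypothetical client-side sketch matching that config, using tritonclient over HTTP; the model name, server URL, and dummy values are assumptions:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

seq_len, num_spans = 16, 8
arrays = {
    "input_ids": np.zeros((1, seq_len), dtype=np.int32),
    "attention_mask": np.ones((1, seq_len), dtype=np.int32),
    "words_mask": np.ones((1, seq_len), dtype=np.int32),
    "text_lengths": np.array([[seq_len]], dtype=np.int32),
    "span_idx": np.zeros((1, num_spans, 2), dtype=np.int32),
    "span_mask": np.ones((1, num_spans), dtype=np.bool_),
}

inputs = []
for name, arr in arrays.items():
    # Triton datatype strings: INT32 for the int32 tensors, BOOL for span_mask.
    dtype = "BOOL" if arr.dtype == np.bool_ else "INT32"
    inp = httpclient.InferInput(name, list(arr.shape), dtype)
    inp.set_data_from_numpy(arr)
    inputs.append(inp)

result = client.infer(
    "my_model",  # hypothetical name of the model in the Triton repository
    inputs,
    outputs=[httpclient.InferRequestedOutput("logits")],
)
print(result.as_numpy("logits").shape)

Note that because max_batch_size is set, every request shape carries a leading batch dimension on top of the dims listed in the config.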
Thanks, that helped a lot in getting it running on our Triton server!
Can you share some of the configs and model code for the preprocessor and postprocessor models?