Changelog

0.13.1

  • fix: Add AutocastType to public API

0.13.0

  • new: Introducing custom_args in TensorConfig for custom runners, which allows setting up dynamic shapes for Torch-TensorRT compilation
  • new: autocast_dtype added to the Torch runner configuration to set the dtype for autocast (see the sketch after this list)
  • new: New version of ONNX Runtime 1.20 for Python versions >= 3.10
  • new: Use torch.compile path in heuristic search for max batch size
  • change: Removed TensorFlow dependencies for nav.jax.optimize
  • change: Removed PyTorch dependencies from nav.profile
  • change: Collect all Python packages in status instead of filtered list
  • change: Use default throughput cutoff threshold for max batch size heuristic when None provided in configuration
  • change: Updated default ONNX opset to 20 for Torch >= 2.5
  • fix: Exception raised with Python >= 3.11 due to wrong dataclass initialization
  • fix: Dropped an ExportOption setting that was removed in Torch 2.5
  • fix: Improved preprocessing stage in Torch based runners
  • fix: Warn when using autocast with bfloat16 in Torch
  • fix: Pass runner configuration to runners in nav.profile
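
A minimal plain-PyTorch sketch of the behaviour an autocast dtype setting controls; this illustrates torch.autocast itself rather than Model Navigator internals, and assumes a CUDA device:

```python
import torch

# Run the forward pass under torch.autocast with the requested dtype.
model = torch.nn.Linear(8, 4).cuda().eval()
sample = torch.randn(2, 8, device="cuda")

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(sample)

print(output.dtype)  # torch.bfloat16 for autocast-eligible ops such as Linear
```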

0.12.0

  • new: simple and detailed reporting of the optimization process
  • new: adjusted exporting TensorFlow SavedModel for Keras 3.x
  • new: inform the user when a wrapped module is not called during optimize
  • new: inform the user when a module uses a custom forward function
  • new: support for dynamic shapes in Torch ExportedProgram (see the sketch after this list)
  • new: use ExportedProgram for Torch-TensorRT conversion
  • new: support back-off policy during profiling to avoid reporting local minimum
  • new: automatically scale conversion batch size when modules have different batch sizes in scope of a single pipeline
  • change: TensorRT conversion max batch size search relies on saturating throughput for base formats
  • change: adjusted profiling configuration for throughput cutoff search
  • change: include optimized pipeline to list of examined variants during nav.profile
  • change: the performance command is not executed when correctness failed for a given format and runtime
  • change: the verify command is not executed when no verify function is provided
  • change: do not create a model copy before executing torch.compile
  • fix: pipelines sometimes obtain model and tensors on different devices during nav.profile
  • fix: extract graph from ExportedProgram for running inference
  • fix: runner configuration not propagated to pre-processing steps
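
For context, dynamic shapes in a Torch ExportedProgram are declared through torch.export's Dim markers; a minimal standalone PyTorch sketch, independent of Model Navigator's own configuration:

```python
import torch
from torch.export import Dim, export


class MatMul(torch.nn.Module):
    def forward(self, x):
        return x @ x.transpose(-1, -2)


# Mark the batch dimension as dynamic so the resulting ExportedProgram
# accepts varying batch sizes instead of the single traced shape.
batch = Dim("batch", min=2, max=64)
exported = export(MatMul(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
print(exported.graph_signature)
```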

0.11.0

  • new: Python 3.12 support
  • new: Improved logging
  • new: optimized in-place module can be stored to Triton model repository
  • new: multi-profile support for TensorRT model build and runtime
  • new: measure duration of each command executed in optimization pipeline
  • new: TensorRT-LLM model store generation for deployment on Triton Inference Server
  • change: filter unsupported runners instead of raising an error when running optimize
  • change: moved JAX support to the experimental module, with limited support
  • change: use autocast=True for Torch-based runners
  • change: use torch.inference_mode or torch.no_grad context in nav.profile measurements (see the sketch after this list)
  • change: use multiple strategies to select optimized runtime, defaults to [MaxThroughputAndMinLatencyStrategy, MinLatencyStrategy]
  • change: trt_profiles are not set automatically for module when using nav.optimize
  • fix: properly revert log level after torch onnx dynamo export
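
The measurement contexts mentioned above are standard PyTorch idioms; a minimal sketch of the fallback logic, illustrative only and not Model Navigator's internal code:

```python
import torch

model = torch.nn.Linear(16, 16).eval()
sample = torch.randn(1, 16)

# torch.inference_mode additionally disables view/version-counter bookkeeping,
# so measurements exclude autograd overhead; no_grad is the older fallback.
ctx = torch.inference_mode() if hasattr(torch, "inference_mode") else torch.no_grad()
with ctx:
    output = model(sample)

print(output.requires_grad)  # False under either context
```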

0.10.1

  • fix: Check if torch 2 is available before doing dynamo cleanup

0.10.0

  • new: inplace nav.Module accepts a batching flag, which overrides the config setting, and a precision flag, which allows setting the appropriate configuration for TensorRT (see the sketch after this list)
  • new: Allow setting the device when loading optimized modules using nav.load_optimized()
  • new: Add support for custom i/o names and dynamic shapes in Torch ONNX Dynamo path
  • new: Added nav.bundle.save and nav.bundle.load to save and load optimized models from cache
  • change: Improved optimize and profile status in inplace mode
  • change: Improved handling of defaults for ONNX Dynamo when executing nav.package.optimize
  • fix: Maintaining modules device in nav.profile()
  • fix: Add support for all precisions for TensorRT in nav.profile()
  • fix: Forward method not passed to other inplace modules.
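
A hypothetical sketch of the knobs named in this release; the flag names follow the entries above, but exact signatures (and the abbreviated flow) are assumptions:

```python
import torch
import model_navigator as nav

model = torch.nn.Linear(8, 4)

# batching overrides the config setting, precision tunes the TensorRT setup;
# the "fp16" value is illustrative only.
wrapped = nav.Module(model, batching=True, precision="fp16")

# Bundle the optimization cache and later restore it, choosing the target device;
# the positional path argument to nav.bundle.save is an assumption.
nav.bundle.save("linear_cache.nav")
nav.load_optimized(device="cuda")
```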

0.9.0

  • new: TensorRT Timing Tactics Cache Management - using timing tactics cache files for optimization performance improvements
  • new: Added throughput saturation verification in nav.profile() (enabled by default)
  • new: Allow overriding the Inplace cache dir through the MODEL_NAVIGATOR_DEFAULT_CACHE_DIR env variable (see the sketch after this list)
  • new: inplace nav.Module can now receive a function name to be used instead of __call__ in modules/submodules, which allows customizing modules with non-standard calls
  • fix: Torch Dynamo export and Torch Dynamo ONNX export
  • fix: measurement stabilization in nav.profile()
  • fix: inplace inference through Torch
  • fix: trt_profiles argument handling in ONNX to TRT conversion
  • fix: optimal shape configuration for batch size in Inplace API
  • change: Disable TensorRT profile builder
  • change: nav.optimize() does not override module configuration
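
A hedged sketch of the two additions above; the environment variable name comes from this entry, while the keyword used to point nav.Module at a non-standard entry point is an assumption:

```python
import os

# The cache dir override must be in place before Model Navigator reads it.
os.environ["MODEL_NAVIGATOR_DEFAULT_CACHE_DIR"] = "/tmp/navigator_cache"

import model_navigator as nav
import torch


class Encoder(torch.nn.Module):
    def encode(self, x):  # non-standard entry point instead of __call__
        return torch.relu(x)


# The forward_func keyword is hypothetical; the entry above only states that
# a function name can be supplied to nav.Module.
module = nav.Module(Encoder(), forward_func="encode")
```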

0.8.1

  • fix: Inference with TensorRT when model has input with empty shape
  • fix: Using stabilized runners when model has no batching
  • fix: Invalid dependencies for cuDNN - review known issues
  • fix: Make ONNX GraphSurgeon produce artifacts within the protobuf limit (2 GB)
  • change: Remove TensorRTCUDAGraph from default runners
  • change: updated ONNX package to 1.16

0.8.0

  • new: Allow selecting the device for the TensorRT runner
  • new: Add device output buffers to TensorRT runner
  • new: nav.profile added for profiling any Python function (see the sketch after this list)
  • change: API for Inplace optimization (breaking change)
  • fix: Passing inputs for Torch to ONNX export
  • fix: Parse args to kwargs in torchscript-trace export
  • fix: Lower peak memory usage when loading Torch inplace optimized model
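
A minimal sketch of profiling a plain Python callable; the (function, dataloader) signature is an assumption, as the entry above only states that nav.profile profiles arbitrary Python functions:

```python
import torch
import model_navigator as nav


def normalize(batch):
    # any plain Python callable can be profiled, not only wrapped modules
    return torch.nn.functional.normalize(batch, dim=-1)


dataloader = [torch.randn(8, 128) for _ in range(10)]

nav.profile(normalize, dataloader)
```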

0.7.7

  • change: Add input and output specs for Triton model repositories generated from packages

0.7.6

  • fix: Passing inputs for Torch to ONNX export
  • fix: Passing input data to OnnxCUDA runner

0.7.5

  • new: FP8 precision support for TensorRT
  • new: Support for autocast and inference mode configuration for Torch runners
  • new: Allow selecting the device for Torch and ONNX runners
  • new: Add support for default_model_filename in Triton model configuration
  • new: Detailed profiling of inference steps (pre- and postprocessing, memcpy and compute)
  • fix: JAX export and TensorRT conversion fails when custom workspace is used
  • fix: Missing max workspace size passed to TensorRT conversion
  • fix: Execution of TensorRT optimize raised an error while handling output metadata
  • fix: Limited Polygraphy version to work correctly with onnxruntime-gpu package

0.7.4

  • new: decoupled mode configuration in Triton Model Config
  • new: support for PyTorch ExportedProgram and ONNX dynamo export (see the sketch after this list)
  • new: added GraphSurgeon ONNX optimization
  • fix: compatibility of generating PyTriton model config through adapter
  • fix: installation of packages that are platform dependent
  • fix: update package config with model loaded from source
  • change: in the TensorRT runner, lazily convert tensors to Torch when TensorType.TORCH is the return type
  • change: move from Polygraphy CLI to Polygraphy Python API
  • change: removed Windows from support list
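
For reference, the two PyTorch export paths mentioned above; a minimal standalone sketch using the stock torch.onnx exporters (PyTorch 2.1+), independent of Model Navigator:

```python
import torch


class Scale(torch.nn.Module):
    def forward(self, x):
        return x * 2.0


model = Scale().eval()
sample = torch.randn(1, 3, 224, 224)

# Classic TorchScript-based exporter.
torch.onnx.export(model, (sample,), "scale_ts.onnx")

# Dynamo-based exporter that works from an ExportedProgram-style trace
# (available since PyTorch 2.1; the API surface has evolved in later releases).
onnx_program = torch.onnx.dynamo_export(model, sample)
onnx_program.save("scale_dynamo.onnx")
```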

0.7.3

  • new: Data dependent dynamic control flow support in nav.Module (multiple computation graphs per module)
  • new: Added find max batch size utility
  • new: Added utilities API documentation
  • new: Added Timer class for measuring execution time of models and Inplace modules
  • fix: Use wide range of shapes for TensorRT conversion
  • fix: Sorting of samples loaded from workspace
  • change: in Inplace, store one sample per module by default and store shape info for all samples
  • change: always execute export for all supported formats
  • Known issues and limitations:
    • nav.Module moves the original torch.nn.Module to the CPU; in case of weight sharing this might result in unexpected behaviour
    • For data dependent dynamic control flow (multiple computation graphs) nav.Module might copy the weights for each separate graph

0.7.2

  • fix: Obtaining inputs names from ONNX file for TensorRT conversion
  • change: Raise an exception instead of exiting with an error code when a required command has failed

0.7.1

  • fix: gather onnx input names based on model's forward signature
  • fix: do not run TensorRT max batch size search when max batch size is None
  • fix: use pytree metadata to flatten torch complex outputs

0.7.0

  • new: Inplace Optimize feature - optimize models directly in Python code (see the sketch after this list)
  • new: Non-tensor inputs and outputs support
  • new: Model warmup support in Triton model configuration
  • new: nav.tensorrt.optimize API added for testing and measuring performance of TensorRT models
  • new: Extended custom configs to pass arguments directly to export and conversion operations like torch.onnx.export or polygraphy convert
  • new: Collect GPU clock during model profiling
  • new: Add option to configure minimal trials and stabilization windows for performance verification and profiling
  • change: Navigator package version changed to 0.2.3. Custom configurations now use a trt_profiles list instead of a single value
  • change: Store separate reproduction scripts for runners used during correctness and profiling
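
A hedged sketch of the Inplace Optimize flow described above; nav.Module and nav.optimize are named in this changelog, but the exact signatures are assumptions:

```python
import torch
import model_navigator as nav

model = torch.nn.Linear(16, 16)
model = nav.Module(model)  # wrap the module so it can be optimized in place

dataloader = [torch.randn(2, 16) for _ in range(8)]
nav.optimize(lambda batch: model(batch), dataloader)

# After optimization, ordinary calls are routed to the selected optimized runtime.
output = model(torch.randn(2, 16))
```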

0.6.3

  • fix: Conditional imports of supported frameworks in export commands

0.6.2

  • new: Collect information about TensorRT shapes used during conversion
  • fix: Invalid link in documentation
  • change: Improved rendering documentation

0.6.1

  • fix: Add model from package to Triton model store with custom configs

0.6.0

  • new: Zero-copy runners for Torch, ONNX and TensorRT - omit H2D and D2H memory copies between runner executions
  • new: nav.package.profile API method to profile generated models on a provided dataloader
  • change: ProfilerConfig replaced with OptimizationProfile (see the sketch after this list):
    • new: OptimizationProfile impacts the conversion for TensorRT
    • new: batch_sizes and max_batch_size limit the max profile in TensorRT conversion
    • new: Allow providing a separate dataloader for profiling - only the first sample is used
  • new: allow running nav.package.optimize on an empty package - status generation only
  • new: use torch.inference_mode for inference runner when PyTorch 2.x is available
  • fix: Missing model in config when passing package generated during nav.{framework}.optimize directly to nav.package.optimize command
  • Other minor fixes and improvements
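
A hedged sketch of the new profile configuration; OptimizationProfile, batch_sizes and max_batch_size are named in this entry, while the top-level export and the optimization_profile keyword are assumptions about where they are exposed:

```python
import torch
import model_navigator as nav

profile = nav.OptimizationProfile(batch_sizes=[1, 8, 32], max_batch_size=32)

package = nav.torch.optimize(
    model=torch.nn.Linear(8, 8),
    dataloader=[torch.randn(2, 8) for _ in range(4)],
    optimization_profile=profile,  # keyword name is an assumption
)
```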

0.5.6

  • fix: Load samples as sorted to keep valid order
  • fix: Execute conversion when model already exists in path
  • Other minor fixes and improvements

0.5.5

  • new: Public nav.utilities module with UnpackedDataloader wrapper
  • new: Added support for strict flag in Torch custom config
  • new: Extended TensorRT custom config to support builder optimization level and hardware compatibility flags
  • fix: Invalid optimal shape calculation for odd values in max batch size

0.5.4

  • new: Custom implementation for ONNX and TensorRT runners
  • new: Use CUDA 12 for JAX in unit tests and functional tests
  • new: Step-by-step examples
  • new: Updated documentation
  • new: TensorRTCUDAGraph runner introduced with support for CUDA graphs
  • fix: Optimal shape not set correctly during adaptive conversion
  • fix: Find max batch size command for JAX
  • fix: Save stdout to logfiles in debug mode

0.5.3

  • fix: filter outputs using output_metadata in ONNX runners

0.5.2

  • new: Added Contributor License Agreement (CLA)
  • fix: Added missing --extra-index-url to the installation instruction for PyPI
  • fix: Updated wheel readme
  • fix: Do not run TorchScript export when only ONNX in target formats and ONNX extended export is disabled
  • fix: Log full traceback for ModelNavigatorUserInputError

0.5.1

  • fix: Using a relative workspace caused an error during ONNX to TensorRT conversion
  • fix: Added external weights in the package for the ONNX format
  • fix: bugfixes for functional tests

0.5.0

  • new: Support for PyTriton deployment
  • new: Support for Python models with python.optimize API (see the sketch after this list)
  • new: PyTorch 2 compile CPU and CUDA runners
  • new: Collect conversion max batch size in status
  • new: PyTorch runners with compile support
  • change: Improved handling of CUDA and CPU runners
  • change: Reduced the time to find the device max batch size by running it once as a separate pipeline
  • change: Stored the find max batch size result in a separate field in the status
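
A hedged sketch of optimizing a plain Python model; python.optimize is named in this entry, but the keyword names are assumptions:

```python
import numpy as np
import model_navigator as nav


def scale(batch):
    # a plain Python/NumPy model with no framework-specific graph
    return batch * 2.0


dataloader = [np.random.rand(2, 8).astype(np.float32) for _ in range(4)]

package = nav.python.optimize(model=scale, dataloader=dataloader)
```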

0.4.4

  • fix: when exporting a single-input model to SavedModel, unwrap the one-element list of inputs

0.4.3

  • fix: in Keras inference use model.predict(tensor) for single input models

0.4.2

  • fix: loading configuration for trt_profile from package
  • fix: missing reproduction scripts and logs inside package
  • fix: invalid model path in reproduction script for ONNX to TRT conversion
  • fix: collecting metadata from ONNX model in main thread during ONNX to TRT conversion

0.4.1

  • fix: when specified, use dynamic axes from custom OnnxConfig

0.4.0

  • new: optimize method that replaces export, performs a max batch size search, and improves profiling during the process
  • new: Introduced custom configs in optimize for better parametrization of export/conversion commands
  • new: Support for adding user runners for model correctness and profiling
  • new: Search for max possible batch size per format during conversion and profiling
  • new: API for creating Triton model store from Navigator Package and user provided models
  • change: Improved status structure for Navigator Package
  • deprecated: Optimize for Triton Inference Server support
  • deprecated: HuggingFace contrib module
  • Bug fixes and other improvements

0.3.8

  • Updated NVIDIA containers defaults to 22.11

0.3.7

  • Updated NVIDIA containers defaults to 22.10

0.3.6

  • Updated NVIDIA containers defaults to 22.09
  • Model Navigator Export API:
    • new: cast int64 input data to int32 in runner for Torch-TensorRT
    • new: cast 64-bit data samples to 32-bit values for TensorRT
    • new: verbose flag for logging export and conversion commands to console
    • new: debug flag to enable debug mode for export and conversion commands
    • change: logs from commands are streamed to console during command run
    • change: package load omits the log files and autogenerated scripts

0.3.5

  • Updated NVIDIA containers defaults to 22.08
  • Model Navigator Export API:
    • new: TRTExec runner uses use_cuda_graph=True by default
    • new: log a warning instead of raising an error when the dataloader dumps inputs with NaN or Inf values
    • new: enabled logging for command input parameters
    • fix: invalid use of Polygraphy TRT profile when trt_dynamic_axes is passed to export function

0.3.4

  • Updated NVIDIA containers defaults to 22.07
  • Model Navigator OTIS:
    • deprecated: TF32 precision for TensorRT from CLI options - will be removed in future versions
    • fix: TensorFlow module was imported when obtaining model signature during conversion
  • Model Navigator Export API:
    • new: Support for building framework containers with Model Navigator installed
    • new: Example for loading Navigator Package for reproducing the results
    • new: Create reproducing script for correctness and performance steps
    • new: TrtexecRunner for correctness and performance tests with trtexec tool
    • new: Use TF32 support by default for models with FP32 precision
    • new: Reset conversion parameters to defaults when using load for package
    • new: Testing all options for JAX export enable_xla and jit_compile parameters
    • change: Profiling stability improvements
    • change: Renamed the onnx_runtimes export function parameter to runtimes
    • deprecated: TF32 precision for TensorRT from available options - will be removed in future versions
    • fix: Do not save TF-TRT models to the .nav package
    • fix: Do not save TF-TRT models from the .nav package
    • fix: Correctly load .nav packages when _input_names or _output_names specified
    • fix: Adjust TF and TF-TRT model signatures to match input_names
    • fix: Save ONNX opset for CLI configuration inside package
    • fix: Reproduction scripts were missing for failing paths

0.3.3

  • Model Navigator Export API:
    • new: Improved handling of inputs and outputs metadata
    • new: Navigator Package version updated to 0.1.3
    • new: Backward compatibility with previous versions of Navigator Package
    • fix: Dynamic shapes for output shapes were read incorrectly

0.3.2

  • Updated NVIDIA containers defaults to 22.06
  • Model Navigator OTIS:
    • new: Perf Analyzer profiling data use base64 format for content
    • fix: Signature for TensorRT model when it has uint64 or int64 inputs and/or outputs defined
  • Model Navigator Export API:
    • new: Updated navigator package format to 0.1.1
    • new: Added Model Navigator version to status file
    • new: Add atol and rtol configuration to CLI config for model
    • new: Added experimental support for JAX models
    • new: In case of export or conversion failures prepare minimal scripts to reproduce errors
    • fix: Conversion parameters are not stored in Navigator Package for CLI execution

0.3.1

  • Updated NVIDIA containers defaults to 22.05
  • Model Navigator OTIS:
    • fix: Saving paths inside the Triton package status file
    • fix: An empty list of GPUs caused the process to run on CPU only
    • fix: Reading content from zipped Navigator Package
    • fix: When no GPU is available or the target device is set to CPU, optimize avoids running unsupported conversions in CLI
    • new: Converter accepts passing the target device kind to select CPU- or GPU-supported conversions
    • new: Added support for OpenVINO accelerator for ONNXRuntime
    • new: Added option --config-search-early-exit-enable for Model Analyzer early exit support in manual profiling mode
    • new: Added option --model-config-name to the select command. It allows picking a particular model configuration for deployment from the set of all configurations generated by Triton Model Analyzer, even if it's not the best-performing one.
    • removed: The --tensorrt-strict-types option has been removed due to deprecation of the functionality in upstream libraries.
  • Model Navigator Export API:
    • new: Added dynamic shapes support and trt dynamic shapes support for TensorFlow2 export
    • new: Improved per format logging
    • new: PyTorch to Torch-TRT precision selection added
    • new: Advanced profiling (measurement windows, configurable batch sizes)

0.3.0

  • Updated NVIDIA containers defaults to 22.04
  • Model Navigator Export API
    • Support for exporting models from TensorFlow2 and PyTorch source code to supported target formats
    • Support for conversion from ONNX to supported target formats
    • Support for exporting HuggingFace models
    • Conversion, Correctness and performance tests for exported models
    • Definition of package structure for storing all exported models and additional metadata
  • Model Navigator OTIS:
    • change: run command has been deprecated and may be removed in a future release
    • new: optimize command replaces run and produces an output *.triton.nav package
    • new: select selects the best-performing configuration from the *.triton.nav package and creates a Triton Inference Server model repository
    • new: Added support for using shared memory option for Perf Analyzer
  • Remove wkhtmltopdf package dependency

0.2.7

  • Updated NVIDIA containers defaults to 22.02
  • Removed support for Python 3.7
  • Triton Model configuration related:
    • Support dynamic batching without setting preferred batch size value
  • Profiling related:
    • Deprecated --config-search-max-preferred-batch-size flag as it is no longer supported in Triton Model Analyzer

0.2.6

  • Updated NVIDIA containers defaults to 22.01
  • Removed support for Python 3.6 due to EOL
  • Conversion related:
    • Added support for Torch-TensorRT conversion
  • Fixes and improvements
    • Processes inside containers started by Model Navigator now run without root privileges
    • Fix for volume mounts while running Triton Inference Server in a container started from another container
    • Fix for conversion of models without file extension on input and output paths
    • Fix using --model-format argument when input and output files have no extension
  • Known issues and limitations
    • missing support for stateful models (ex. time-series one)
    • no verification of conversion results for conversions: TF -> ONNX, TF->TF-TRT, TorchScript -> ONNX
    • possible to define a single profile for TensorRT
    • no custom ops support
    • Triton Inference Server stays in the background when the profile process is interrupted by the user
    • TF-TRT conversion lost outputs shapes info

0.2.5

  • Updated NVIDIA containers defaults to 21.12
  • Conversion related:
    • [Experimental] TF-TRT - fixed default dataset profile generation
  • Triton Model configuration related:
    • Fixed name for onnxruntime backend in Triton model deployment configuration
  • Known issues and limitations
    • missing support for stateful models (ex. time-series one)
    • no verification of conversion results for conversions: TF -> ONNX, TF->TF-TRT, TorchScript -> ONNX
    • possible to define a single profile for TensorRT
    • no custom ops support
    • Triton Inference Server stays in the background when the profile process is interrupted by the user
    • TF-TRT conversion lost outputs shapes info

0.2.4

  • Updated NVIDIA containers defaults to 21.10
  • Fixed generating profiling data when dtypes are not passed
  • Conversion related:
    • [Experimental] Added support for TF-TRT conversion
  • Triton Model configuration related:
    • Added the possibility to select the batching mode - default, dynamic and disabled options supported
  • Install dependencies from pip packages instead of wheels for Polygraphy and Triton Model Analyzer
  • fixes and improvements
  • Known issues and limitations
    • missing support for stateful models (ex. time-series one)
    • no verification of conversion results for conversions: TF -> ONNX, TF->TF-TRT, TorchScript -> ONNX
    • possible to define a single profile for TensorRT
    • no custom ops support
    • Triton Inference Server stays in the background when the profile process is interrupted by the user
    • TF-TRT conversion lost outputs shapes info

0.2.3

  • Updated NVIDIA containers defaults to 21.09
  • Improved naming of arguments specific to TensorRT conversion and acceleration, with backward compatibility
  • Use pip package for Triton Model Analyzer installation with minimal version 1.8.0
  • Fixed model_repository path so it is not relative to the <navigator_workspace> dir
  • Handle exit codes correctly from CLI commands
  • Support for using device IDs with the --gpus argument
  • Conversion related
    • Added support for precision modes to support multiple precisions during conversion to TensorRT
    • Added --tensorrt-sparse-weights flag for sparse weight optimization for TensorRT
    • Added --tensorrt-strict-types flag forcing TensorRT to choose tactics based on the layer precision
    • Added --tensorrt-explicit-precision flag enabling explicit precision mode
    • Fixed nan values appearing in relative tolerance during conversion to TensorRT
  • Triton Model configuration related
    • Removed default value for engine_count_per_device
    • Added the possibility to define Triton Custom Backend parameters with the triton_backend_parameters command
    • Added the possibility to define the max workspace size for the TensorRT backend accelerator using the tensorrt_max_workspace_size argument
  • Profiling related
    • Added config_search prefix to all profiling parameters (BREAKING CHANGE)
    • Added config_search_max_preferred_batch_size parameter
    • Added config_search_backend_parameters parameter
  • fixes and improvements
  • Known issues and limitations
    • missing support for stateful models (ex. time-series one)
    • missing support for models without batching support
    • no verification of conversion results for conversions: TF -> ONNX, TorchScript -> ONNX
    • possible to define a single profile for TensorRT

0.2.2

  • Updated NVIDIA containers defaults to 21.08
  • Known issues and limitations
    • missing support for stateful models (ex. time-series one)
    • missing support for models without batching support
    • no verification of conversion results for conversions: TF -> ONNX, TorchScript -> ONNX
    • possible to define a single profile for TensorRT

0.2.1

  • Fixed triton-model-config error when tensorrt_capture_cuda_graph flag is not passed
  • Dump Conversion Comparator inputs and outputs into JSON files
  • Added information to logs on the tolerance parameter values required to pass conversion verification
  • Use count_windows mode as the default option for Perf Analyzer
  • Added the possibility to define custom Docker images
  • Bugfixes
  • Known issues and limitations
    • missing support for stateful models (ex. time-series one)
    • missing support for models without batching support
    • no verification of conversion results for conversions: TF -> ONNX, TorchScript -> ONNX
    • possible to define a single profile for TensorRT
    • TensorRT backend acceleration not supported for ONNX Runtime in Triton Inference Server ver. 21.07

0.2.0

  • Comprehensive refactor of the command-line API in order to provide more gradual execution of pipeline steps
  • Versions of used external components:
    • Triton Model Analyzer: 21.05
    • tf2onnx: v1.8.5 (support for ONNX opset 13, tf 1.15 and 2.5)
    • Other component versions depend on the framework and Triton Inference Server container versions used. See the Triton Inference Server support matrix for a detailed summary.
  • Known issues and limitations
    • missing support for stateful models (ex. time-series one)
    • missing support for models without batching support
    • no verification of conversion results for conversions: TF -> ONNX, TorchScript -> ONNX
    • issues with TorchScript -> ONNX conversion due to an issue in PyTorch 1.8
      • affected NVIDIA PyTorch containers: 20.12, 21.02, 21.03
      • workaround: use PyTorch containers newer than 21.03
    • possible to define a single profile for TensorRT

0.1.1

  • documentation update

0.1.0

  • Release of main components:
    • Model Converter - converts the model to a set of variants optimized for inference or to be later optimized by the Triton Inference Server backend.
    • Model Repo Builder - sets up the Triton Inference Server Model Repository, including its configuration.
    • Model Analyzer - selects the optimal Triton Inference Server configuration based on the model's compute and memory requirements, available computation infrastructure, and model application constraints.
    • Helm Chart Generator - deploys Triton Inference Server and the model with the optimal configuration to the cloud.
  • Versions of used external components:
    • Triton Model Analyzer: 21.03+616e8a30
    • tf2onnx: v1.8.4 (support for ONNX opset 13, tf 1.15 and 2.4)
    • Other component versions depend on the framework and Triton Inference Server container versions used. Refer to the Triton Inference Server support matrix for a detailed summary.
  • Known issues
    • missing support for stateful models (ex. time-series one)
    • missing support for models without batching support
    • no verification of conversion results for conversions: TF -> ONNX, TorchScript -> ONNX
    • issues with TorchScript -> ONNX conversion due to an issue in PyTorch 1.8
      • affected NVIDIA PyTorch containers: 20.12, 21.03
      • workaround: use containers other than those listed above
    • Triton Inference Server stays in the background when the profile process is interrupted by the user