In recent years, YOLO (You Only Look Once) has become one of the most popular object detection and segmentation models. Thanks to its remarkable accuracy and speed, YOLO has attracted a great deal of attention since its release in 2015. You can readily find blogs that explain in detail how YOLO and other computer vision models work, as well as articles and tutorials on how to train them. Still, few discuss the practical applications of YOLO, its prerequisites, or the difficulties of implementing it in industrial settings.
In my role as an AI engineer, I have had the chance to build and deploy YOLOv8-based object detection systems. Below is an example of how I used a YOLOv8 model in a system that detects and assesses phones.
Problem definition
The system processes input images sequentially, performing preprocessing steps before YOLOv8 detects and classifies objects. If a phone is detected, the system crops the image to the identified phone and proceeds to the next steps.
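For illustration, here is a minimal sketch of that detect-then-crop step using the Ultralytics Python API. The file names and the "phone" label are placeholders, and the production pipeline includes additional preprocessing; this is only meant to show the flow.

# Minimal sketch of the detect-then-crop step (illustrative only).
import cv2
from ultralytics import YOLO

model = YOLO("best_model.pt")          # hypothetical path to the trained weights
image = cv2.imread("device.jpg")       # hypothetical input frame

results = model(image)[0]              # run detection on a single image
for box in results.boxes:
    if model.names[int(box.cls)] == "phone":
        x1, y1, x2, y2 = map(int, box.xyxy[0])   # bounding box in pixel coordinates
        phone_crop = image[y1:y2, x1:x2]          # crop the detected phone region
        cv2.imwrite("phone_crop.jpg", phone_crop)
        break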
The system runs on Linux and has already been deployed on several client machines. My goal is to add detection and classification for additional accessories such as watches, headphones, and Netgear devices so that supplementary analyses can be performed, without significantly increasing the processing time per device or interfering with the system's existing functions. The hardware configuration is fixed, with approximately 50 machines deployed at client sites, so upgrading would be costly and time-consuming.
The current processing time for each device is around 10-12 seconds. When adding new functionality, it is important that the processing time does not increase too much, otherwise the number of devices processed drops significantly. For example, if the system runs for 8 hours and takes 10 seconds per device, it can handle approximately 2,880 devices. If the average processing time rises to 15 seconds, the system can only process 1,920 devices over the same 8-hour period, a significant drop in throughput. Before optimizing the YOLOv8 inference model, we must first understand how the system calls these components.
Integrating AI Models
There are many ways to integrate Machine Learning and Deep Learning models into software. In this case, the system is built with two main languages: C++ and Python. C++ handles the core of the system, while the AI models are written in Python, using popular frameworks such as PyTorch or TensorFlow.
Typically, each AI model is kept in its own Python file (.py), referred to as the main executable file. When the system needs to use an AI model, the C++ core invokes this file. Calling the model is similar to running an inference Python script directly: all operations and conditions the model needs are prepared in advance, ensuring smooth and accurate communication between C++ and Python.
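As an illustration, one possible shape of such a main executable file is sketched below. The script name, arguments, and output format are hypothetical; the point is simply that C++ spawns the script as a separate process, the script loads the model, runs inference, and returns the result.

# detect_accessory.py -- a hypothetical "main executable file" sketch.
# The C++ core would spawn it as a separate process, e.g.
#   system("python3 detect_accessory.py --image /tmp/frame.jpg")
# and read the result from stdout or a result file.
import argparse
import json

from ultralytics import YOLO

parser = argparse.ArgumentParser(description="Run one inference pass.")
parser.add_argument("--image", required=True, help="Path to the input image.")
parser.add_argument("--weights", default="best_model.pt", help="Path to the model weights.")
args = parser.parse_args()

model = YOLO(args.weights)                    # note: the model is loaded on every call
results = model(args.image)[0]
detections = [
    {"class": model.names[int(b.cls)], "score": float(b.conf)}
    for b in results.boxes
]
print(json.dumps(detections))                 # the C++ side parses this output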
When an executable file is called to perform inference, the process involves several stages: loading the model, performing the prediction, and processing the output. Inference time is measured from the moment the executable file is called until all these steps are complete. This means the total time includes not just the time taken to load the model into memory, but also the time required to import all the libraries used by the inference file.
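One way to see where the time goes is to time each stage separately. The sketch below is a rough illustration with placeholder paths, not the production code:

# Rough timing sketch: measure library import, model load, and prediction
# separately so you know which stage dominates the total inference time.
import time

t0 = time.time()
import cv2                     # heavy imports are part of the total cost
from ultralytics import YOLO
print(f"Time load libraries: {time.time() - t0:.3f}s")

t1 = time.time()
model = YOLO("best_model.pt")  # hypothetical weights path
print(f"Time to load model: {time.time() - t1:.3f}s")

t2 = time.time()
results = model("device.jpg")  # hypothetical input image
print(f"Prediction time: {time.time() - t2:.3f}s")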
Therefore, to effectively optimize the model’s inference time, the first step is to focus on optimizing the model itself. This involves refining the model architecture, reducing its complexity, or using more efficient algorithms. Additionally, optimizing the libraries and frameworks used for inference can significantly reduce the time required, contributing to overall performance improvements.
Using OpenVINO
OpenVINO (Open Visual Inference and Neural Network Optimization) is a toolkit for deploying and optimizing AI inference models. It achieves high performance by exploiting the capabilities of Intel CPUs, integrated and discrete GPUs, and FPGAs. OpenVINO also provides a model optimizer that can import, convert, and optimize models from a variety of well-known deep learning frameworks, including PyTorch, TensorFlow, TensorFlow Lite, Keras, and ONNX.
To use a pretrained model with OpenVINO, we first need to convert it to the IR (Intermediate Representation) format. The IR format consists of two files:
- best_model.xml: an XML file describing the network architecture, defining the model's layers (also known as the network graph).
- best_model.bin: a binary file containing the model's weights and biases, which can be stored in precisions such as FP32, FP16, or INT8.
After training the model, we convert the PyTorch model to the IR format using the code snippet below. Ultralytics supports exporting directly to the OpenVINO format.
import argparse
from ultralytics import YOLO

parser = argparse.ArgumentParser(description='Export YOLO model to OpenVINO format.')
parser.add_argument('--model-path', type=str, required=True, help='Path to the YOLO model file.')
args = parser.parse_args()

model = YOLO(args.model_path)
model.export(format='openvino')
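Once exported, the IR model can be loaded and run with the OpenVINO Runtime. The sketch below is a minimal example; the output folder name, input size, and preprocessing are assumptions that depend on how the model was trained and exported.

# Minimal sketch of running the exported IR model with the OpenVINO Runtime.
# Paths, input size, and preprocessing are assumptions; adapt them to your export.
import cv2
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("best_model_openvino_model/best_model.xml")
compiled = core.compile_model(model, "CPU")        # or "GPU", "AUTO", ...

image = cv2.imread("device.jpg")
# real preprocessing also handles BGR-to-RGB conversion and letterboxing
blob = cv2.resize(image, (640, 640)).transpose(2, 0, 1)[None].astype(np.float32) / 255.0
outputs = compiled([blob])                         # returns one array per model output
boxes = outputs[compiled.output(0)]                # e.g. (1, 40, 8400), as in the logs below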
PyTorch to NumPy
While going through the inference components of Ultralytics, I found that YOLOv8 is built on the PyTorch library. If you are familiar with both libraries, you may have noticed that many PyTorch functions have direct NumPy equivalents. What's interesting is that NumPy is far lighter to install and import than PyTorch. I therefore decided to drop PyTorch at inference time and reimplement the functions I was using with NumPy.
""" Ultralytics YOLO: https://github.com/ultralytics/ultralytics """ import time import cv2 import torch import torch.nn.functional as F import torchvision from typing import Tuple import numpy as np
import numpy as np from typing import Tuple import cv2
#from numpy n, h, w = masks.shape x1, y1, x2, y2 = np.array_split(boxes[:,:, None], 4, axis=1) # x1 shape(n,1,1) r = np.arange(w, dtype=x1.dtype)[None, None, :] # rows shape(1,1,w) c = np.arange(h, dtype=x1.dtype)[None, :, None] # cols shape(1,h,1) #from torch n, h, w = masks.shape x1, y1, x2, y2 = torch.chunk(boxes[:, :, None], 4, 1) # x1 shape(n,1,1) r = torch.arange(w, device=masks.device, dtype=x1.dtype)[None, None, :] # rows shape(1,1,w) c = torch.arange(h, device=masks.device, dtype=x1.dtype)[None, :, None] # cols shape(1,h,1)
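Completing the snippet above, a NumPy-only version of the mask-cropping helper might look like the sketch below. It mirrors the logic of Ultralytics' crop_mask and assumes boxes are given as (x1, y1, x2, y2) in mask coordinates.

# NumPy version of the mask-cropping helper, mirroring Ultralytics' crop_mask.
# masks: (n, h, w) float array; boxes: (n, 4) array of (x1, y1, x2, y2).
import numpy as np

def crop_mask_np(masks: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    n, h, w = masks.shape
    x1, y1, x2, y2 = np.array_split(boxes[:, :, None], 4, axis=1)   # each shape (n, 1, 1)
    r = np.arange(w, dtype=x1.dtype)[None, None, :]                 # shape (1, 1, w)
    c = np.arange(h, dtype=x1.dtype)[None, :, None]                 # shape (1, h, 1)
    # zero out everything outside each box via broadcasting
    return masks * ((r >= x1) & (r < x2) & (c >= y1) & (c < y2))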
Result
OpenVINO + PyTorch:

[2024-05-10 21:23:56,214] [INFO] --------------- Start cropped image accessory -----------------
[2024-05-10 21:23:56,214] [INFO] Time load libraries: 7.4224889278411865
[2024-05-10 21:23:57,759] [INFO] Successfully loaded IR model
[2024-05-10 21:23:57,760] [INFO] [INFO] Time to load model: 1.5450589656829834
[2024-05-10 21:23:58,851] [INFO] [INFO] Model output shape: boxes: (1, 40, 8400), masks: (1, 32, 160, 160)
[2024-05-10 21:23:58,977] [INFO] [INFO] Class id: phone, Scores: 0.9205898642539978
[2024-05-10 21:23:58,984] [INFO] Processing mask 0
[2024-05-10 21:23:58,984] [INFO] Mask shape: (724, 2)
[2024-05-10 21:23:58,995] [INFO] Bounding box coordinates: x=186, y=149, w=769, h=1035
[2024-05-10 21:23:59,051] [INFO] [INFO] Saving image_mask_debug to ./debug/result\image_with_masks.jpg
[2024-05-10 21:23:59,134] [INFO] [INFO] Saving image_bbox_debug to ./debug/result\image_with_bbox.jpg
[2024-05-10 21:23:59,183] [INFO] [INFO] Phone detected in the image.
[2024-05-10 21:23:59,183] [INFO] [INFOR] Time taken: 10.392009973526001
[2024-05-10 21:23:59,184] [INFO] [INFOR] Time taken: 10.39300799369812
[2024-05-10 21:23:59,184] [INFO] --------------- Processing completed successfully.---------------
OpenVINO + NumPy:

[2024-05-10 21:31:11,367] [INFO] --------------- Start cropped image accessory -----------------
[2024-05-10 21:31:11,369] [INFO] Time load libraries: 1.1310088634490967
[2024-05-10 21:31:12,501] [INFO] Successfully loaded IR model
[2024-05-10 21:31:12,502] [INFO] [INFO] Time to load model: 1.1336066722869873
[2024-05-10 21:31:12,949] [INFO] [INFO] Model output shape: boxes: (1, 40, 8400), masks: (1, 32, 160, 160)
[2024-05-10 21:31:12,968] [INFO] [INFO] Class id: phone, Scores: 0.9205898642539978
[2024-05-10 21:31:12,973] [INFO] Processing mask 0
[2024-05-10 21:31:12,973] [INFO] Mask shape: (724, 2)
[2024-05-10 21:31:12,982] [INFO] Bounding box coordinates: x=186, y=149, w=769, h=1035
[2024-05-10 21:31:13,012] [INFO] [INFO] Saving image_mask_debug to ./debug/result_faster\image_with_masks.jpg
[2024-05-10 21:31:13,058] [INFO] [INFO] Saving image_bbox_debug to ./debug/result_faster\image_with_bbox.jpg
[2024-05-10 21:31:13,085] [INFO] [INFO] Phone detected in the image.
[2024-05-10 21:31:13,085] [INFO] [INFOR] Time taken: 2.847306966781616
[2024-05-10 21:31:13,085] [INFO] [INFOR] Time taken: 2.847306966781616
[2024-05-10 21:31:13,085] [INFO] --------------- Processing completed successfully.---------------
| Method | First device (s) | Second device (s) |
| --- | --- | --- |
| OpenVINO + PyTorch | 10.393 | 4.290 |
| OpenVINO + NumPy | 2.847 | 1.134 |
As the table above shows, the first device takes about 10 seconds when using OpenVINO combined with PyTorch. For the second device in the same run, however, the time drops to 4.290 seconds because the libraries have already been loaded for the first device. At that rate, processing 10 devices would take roughly 49 seconds (10.4 s for the first device plus about 4.3 s for each of the remaining nine).
Meanwhile, with OpenVINO and NumPy, the first device takes only about 2.847 seconds, and subsequent devices drop to 1.134 seconds each, so 10 devices take only around 13 seconds. This is a significant improvement over OpenVINO with PyTorch, and the log files show that accuracy is unaffected in either setup.
Considerations when using inference models in a production environment:
- Execution time in production can differ considerably from what you measure in development. In a development environment, you typically dedicate all resources to inference; on a personal computer, for instance, the inference process gets priority and runs faster. In production, resources are shared across multiple tasks or managed to optimize the whole system, which can lead to longer execution times.
- Logging and monitoring the model's performance is an extremely important step. Logging helps track the model's efficiency and assess how well it functions, while monitoring allows quick detection of emerging issues, such as new patterns or prediction errors, so adjustments can be made in time (a minimal logging sketch follows this list).
- Additionally, ensure that your code is scalable and maintainable so that future updates or upgrades can be done more easily.
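As a starting point, the sketch below shows one possible logging setup that produces timestamped entries similar to the logs shown earlier; the file name and messages are placeholders.

# Minimal logging sketch: timestamped entries for detection results and total time,
# in a format similar to the logs above.
import logging
import time

logging.basicConfig(
    filename="inference.log",
    level=logging.INFO,
    format="[%(asctime)s] [%(levelname)s] %(message)s",
)

start = time.time()
logging.info("--------------- Start cropped image accessory -----------------")
# ... run inference here ...
logging.info("Class id: %s, Scores: %s", "phone", 0.92)
logging.info("Time taken: %s", time.time() - start)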
There are several ways to optimize your YOLOv8 models:
- Convert to TensorRT: TensorRT provides highly optimized, low-latency inference for deep learning models on NVIDIA GPUs. It allows you to achieve faster inference times by optimizing the computation graph and reducing memory usage (see the export sketch after this list).
- Use Lower Precision: Running models at FP16 or even INT8 precision can dramatically reduce the computational load without significant accuracy loss. By lowering the precision, you reduce both memory bandwidth and computational requirements, which is especially helpful in environments with limited resources or where speed is critical.
- Use Alternative Inference Engines: While YOLOv8 is built for easy integration and deployment, Python can introduce some inefficiencies in production. By switching to inference engines that don't rely on Python—such as those based on C++, CUDA, OpenCV DNN, LibTorch, or even Rust—you can reduce the overhead associated with Python's runtime. These engines are designed for performance and low-level optimization, making them better suited for high-performance environments.
- C++ & CUDA: Leveraging these gives you the flexibility and power to customize inference pipelines for specific hardware, especially when using NVIDIA GPUs.
- OpenCV DNN: OpenCV's deep neural network module can handle real-time computer vision tasks effectively while being more lightweight than TensorFlow or PyTorch in certain situations.
- LibTorch: This is a C++ front-end for PyTorch and offers faster inference when avoiding Python's overhead.
- Rust: If you're concerned with memory safety and performance, Rust is a good choice for inference workloads, although it's still relatively new in this field.
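For the first two points, Ultralytics can export directly to a TensorRT engine, optionally at lower precision. The sketch below reflects the export options as I understand them; verify the arguments against the Ultralytics version you are using, and note that it requires an NVIDIA GPU with TensorRT installed.

# Sketch: exporting the same Ultralytics model to TensorRT at lower precision.
from ultralytics import YOLO

model = YOLO("best_model.pt")                     # hypothetical weights path
model.export(format="engine", half=True)          # TensorRT engine with FP16 weights
# model.export(format="engine", int8=True, data="data.yaml")  # INT8 needs calibration data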
Thanks to lqh for this information.
Conclusion
The process of deploying and optimizing an AI model in practice doesn't stop at completing the code or optimizing the model. There are still many challenges that need to be addressed to ensure the system runs smoothly and efficiently. This includes close collaboration with software engineers to integrate the model into larger systems, hardware engineers to optimize resources and performance, and testers to ensure the model performs as expected in all situations. Additionally, scalability, maintenance, and continuous monitoring are critical factors to consider so that the model can meet the demands of a production environment. Multidisciplinary collaboration and ongoing monitoring and improvement will be key to ensuring that your solution not only works but is highly effective in the long run.
Thank you for taking the time to read my blog. I hope the information and insights shared will be helpful to you. If you have any questions or feedback, feel free to reach out via email at tranminhhai1506@gmail.com.