YOLOv7 Architecture Explanation
YOLOv7 outperforms other object detectors in terms of speed and accuracy. The authors of YOLOv7 optimized the architecture and proposed strategies for efficient training and inference. The comparison of YOLOv7 with other state-of-the-art object detectors is shown in the following figure.
Online demo for YOLOv7
If you wish to try out YOLOv7, there is no need for an installation. Simply upload an image below to test it out. However, if you plan on using it in your own demo or in production, you will need an API key to access it. To obtain your API key, register on the platform.
In the above figure, we can see that at 13 ms YOLOv7 reaches approximately 55% AP, while YOLOv5 (r6.1) reaches the same AP at approximately 27 ms, which makes YOLOv7 120% faster than YOLOv5 (r6.1) on a V100 GPU with a batch size of 1. In addition, the figure shows that YOLOv7 has a higher AP than all the other state-of-the-art detectors shown.
All real-time object detectors aim to minimize inference time and maximize accuracy to achieve the optimal trade-off between speed and accuracy. YOLOv7 has the following features:
- Faster than other state-of-the-art object detectors, and 120% faster than YOLOv5 (r6.1)
- Higher AP on the COCO dataset than other object detectors
- Optimized architecture and loss function
- The YOLOv7 repository supports instance segmentation, classification, object detection, and pose estimation.
- Provides multiple variants of YOLOv7 models for different speed and accuracy requirements
Let's see how this was obtained with YOLOv7 by studying its architecture.
When designing an efficient network, designers often consider no more than the number of parameters, the amount of computation, and the computational density. The YOLOv7 architecture design is based on ELAN (efficient layer aggregation network). ELAN designs an efficient network by controlling the shortest and the longest gradient paths, so that deeper networks can converge and learn effectively. The module diagram of ELAN is shown below.
In the above diagram, the input is fed directly into the bottom block with a 1x1 convolution, while the two other connections to that block are obtained by passing the input through 2 and 4 blocks of 3x3 convolutions, respectively, with the same channel multiplier. At the bottom block, all the features are concatenated, and a 1x1 convolution is applied to learn rich information.
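The branch structure just described can be sketched in pure Python. This is a minimal illustration, not the official implementation: the helper names, the 0.5 channel multiplier, and the example branch depths are assumptions. It counts the channels that reach the final concatenation and the shortest/longest gradient paths that ELAN is designed to control:

```python
# Hypothetical helpers sketching the ELAN block's bookkeeping; feature
# maps are represented only by their channel counts.

def elan_out_channels(c_in, multiplier=0.5):
    """Channels arriving at the bottom block's concatenation."""
    c = int(c_in * multiplier)
    branches = [
        c_in,  # input fed directly into the bottom 1x1-conv block
        c,     # connection taken after 2 stacked 3x3 convolutions
        c,     # connection taken after 4 stacked 3x3 convolutions
    ]
    return sum(branches)

def gradient_path_lengths(branch_depths):
    """Shortest and longest gradient paths through the block; the +1 is
    the final 1x1 convolution applied after the concatenation."""
    return min(branch_depths) + 1, max(branch_depths) + 1

total = elan_out_channels(64)                     # 64 + 32 + 32 channels
shortest, longest = gradient_path_lengths([0, 2, 4])
```

Keeping the shortest path short preserves a direct route for gradients, while the longest path bounds how deep the stacked convolutions make the block.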
YOLOv7 proposes a modified version of ELAN, called E-ELAN (extended efficient layer aggregation network). E-ELAN uses expand, shuffle, and merge cardinality to improve the model's learning ability without destroying the original gradient paths. It only modifies the architecture of the computational block, while the architecture of the transition layer remains the same as in ELAN. E-ELAN uses group convolution to expand the channels and cardinality of the computational blocks, applying the same channel multiplier and group parameter to all the computational blocks in a computational layer. The feature map from each computational block is then shuffled into g groups and concatenated. Finally, the shuffled group feature maps are added together to perform merge cardinality.
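The shuffle and merge-cardinality steps can be sketched on flat lists of channel values. The helper names below are assumptions for illustration, not the official implementation:

```python
def shuffle_channels(channels, g):
    """Shuffle a flat list of channel values into group size g, the way a
    channel shuffle interleaves the outputs of a group convolution."""
    assert len(channels) % g == 0
    n = len(channels) // g
    groups = [channels[i * n:(i + 1) * n] for i in range(g)]
    # interleave: take one channel from each group in turn
    return [groups[i][j] for j in range(n) for i in range(g)]

def merge_cardinality(group_maps):
    """Merge cardinality: element-wise addition of the shuffled group
    feature maps (each represented here as a flat list of values)."""
    return [sum(values) for values in zip(*group_maps)]

shuffled = shuffle_channels([0, 1, 2, 3, 4, 5], g=2)
merged = merge_cardinality([[1, 2, 3], [10, 20, 30]])
```

The shuffle mixes information across groups so that the expanded cardinality does not leave each group isolated, and the element-wise addition keeps the output width unchanged.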
Model scaling is an important concept: it lets one adjust the depth and width of the model and the input image resolution. Scaling depth means increasing or decreasing the number of layers in the model, while scaling width means changing the number of channels in the model architecture. Both depth and width have scaling factors defined in the model architecture files. YOLOv7 has a concatenation-based architecture; therefore, when we scale the depth of a computational block, we must calculate the resulting change in its output channels, and then scale the width of the following transition layer by the same amount. This compound scaling approach preserves the properties that the model had at its initial design and maintains the optimal structure. The compound scaling method is illustrated in Figure 03.
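The depth-then-width rule can be made concrete with a small sketch. `compound_scale` and its arguments are hypothetical, chosen only to mirror the description above of a concatenation-based block whose output width grows with the number of stacked layers:

```python
def compound_scale(n_layers, branch_channels, depth_factor, width_factor=1.0):
    """Scale a concatenation-based block's depth, then derive the width
    scaling for the transition layer that follows it."""
    # 1. Depth scaling changes the number of stacked layers in the block.
    scaled_layers = max(1, round(n_layers * depth_factor))
    # 2. Every layer's output is concatenated, so the block's output
    #    channel count changes in proportion to the number of layers.
    out_before = n_layers * branch_channels
    out_after = scaled_layers * branch_channels
    # 3. Compound rule: scale the transition layer's width by the same
    #    ratio, combined with any user-supplied width factor.
    transition_width_scale = (out_after / out_before) * width_factor
    return scaled_layers, transition_width_scale

# Scaling depth by 1.5x on a 4-layer block also widens the transition
# layer by 1.5x, keeping depth and width consistent.
layers, width_scale = compound_scale(4, 32, depth_factor=1.5)
```

Scaling depth alone would leave the transition layer too narrow for the wider concatenated output, which is why the two factors are coupled.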
In the architecture diagram in figure 4, the depth of the model is scaled up, which increases the number of convolutions. In figure 5, when the depth of the model is scaled, the transition layer is then scaled with the corresponding width scaling factor.
Do you want to try it online? Here is the YOLOv7 family: