Transformers: An Ultimate Solution for All Your Needs?

In recent years, machine learning and image recognition have experienced growing demand and complexity. With continuous efforts from modern developers to refine methods and algorithms, the pursuit of better solutions remains constant. However, the question arises: does newer always mean better? To address this inquiry, we consulted our expert in machine learning models for computer vision. Not only did he outline contemporary approaches to standard tasks, but he also conducted research to gauge the actual effectiveness of the latest methodologies and support his conclusions. In this article, we will delve into the results of this study and assess the practicality of transformer-based models in image recognition tasks.


Today, a search for the term “transformer” is more likely to yield results related to a machine learning model that has significantly improved the performance of natural language processing (NLP) and computer vision (CV) rather than giant robots from cartoons. When building an image classification model, the recommended approach is to use a state-of-the-art transformer model in conjunction with popular training frameworks such as Keras, FastAI, or Lightning.

However, is it always the case that transformers are the optimal solution, and is it time to abandon all previous approaches?

This article compares the effectiveness of transformers with older computer vision techniques in various categories, aiming to determine whether there is a clear winner. As the results show, the outcomes are more varied than initially expected.


Computer vision, with its typical tasks, first emerged in the 1950s, long before one-gigabyte models running on high-performance GPUs came into use. The field has come a long way since then.

For my test, I began with an approach that has been used since the 1990s and is still relevant today. This approach, known as feature extraction, involves reducing image dimensionality (the number of pixel values) while retaining the most essential information. Numerous algorithms, including Haar Cascades, SIFT, SURF, and HOG, can be used for this purpose.
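To make the idea of dimensionality reduction concrete, here is a minimal pure-Python sketch of the core step behind HOG: a magnitude-weighted histogram of gradient orientations for a single cell. It is a simplification (no block normalization, no interpolation between bins), and the function name `orientation_histogram` and the toy 8×8 patch are illustrative assumptions, not the code used in the experiment.

```python
import math

def orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # the extra modulo guards against an angle rounding up to exactly 180
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

# Toy 8x8 patch with a vertical edge: dark left half, bright right half.
patch = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
features = orientation_histogram(patch)
```

The 64 pixel values of the patch are summarized by 9 numbers, and all of the gradient energy lands in the first bin because the only edge in the patch is vertical.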

The next approach I considered was a Convolutional Neural Network (CNN), which gained popularity in the 2010s and revolutionized the field of computer vision. CNNs can identify image patterns regardless of their position and are adaptable to different resolutions. Most importantly, they can be parallelized for use on GPUs.
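Position independence can be illustrated with a tiny pure-Python cross-correlation: the same kernel slides over the whole image, so a pattern produces the same peak response wherever it appears. The helper names and the toy images below are illustrative assumptions; real CNNs learn their kernels and stack many such layers.

```python
def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation (the operation CNN layers call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

def place_pattern(size, pattern, top, left):
    """Return a size x size zero image with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image

def peak(response):
    """Return (max value, (row, col)) of a 2D response map."""
    return max((v, (y, x)) for y, row in enumerate(response) for x, v in enumerate(row))

cross = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]

# The same pattern at two different positions in an 8x8 image.
peak_a = peak(correlate2d(place_pattern(8, cross, 1, 1), cross))
peak_b = peak(correlate2d(place_pattern(8, cross, 4, 3), cross))
```

Both images give the same peak value; only the location of the peak moves with the pattern.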

Finally, the last approach I experimented with was based on a transformer. Transformers use an attention mechanism to capture relationships between different parts of an image, which makes them very accurate and efficient to parallelize. In terms of accuracy, transformer models have yet to be surpassed.
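As a rough illustration of the attention mechanism, the sketch below computes scaled dot-product self-attention over a handful of patch embeddings in pure Python. To stay short, it assumes identity query/key/value projections (real transformers learn separate projection matrices and use multiple heads), and the toy embeddings are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(patches):
    """Scaled dot-product self-attention with identity Q/K/V projections (an assumption)."""
    dim = len(patches[0])
    weights = []
    for query in patches:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in patches
        ]
        weights.append(softmax(scores))
    # Each output vector is a weighted mix of all patch vectors.
    outputs = [
        [sum(w * value[d] for w, value in zip(row, patches)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights

# Three toy patch embeddings: the first two are similar, the third differs.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(patches)
```

Each row of `weights` sums to one, and the first patch attends more strongly to the similar second patch than to the dissimilar third one.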

Models and Pipelines

For simplicity, I used the HOG algorithm as a feature extractor. As this algorithm does not classify images on its own, I used a Support Vector Machine (SVM) to convert the extracted features into actual predictions. Furthermore, I inserted Principal Component Analysis (PCA) in the middle, which helped to reduce the number of features without impacting accuracy.

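from skimage.feature import hog  # assumption: HOG implementation from scikit-image
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
pca = PCA(n_components=75)
svc = SVC()
model = Pipeline(steps=[('pca', pca), ('svm', svc)])
X_train = [hog(image) for image in train_images]  # train_images: placeholder name for the grayscale training images
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # X_test holds HOG descriptors of the test images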

For the CNN approach, I chose the EfficientNet (EFN) model, which is still one of the top performers in its class.

As for the transformer models, selecting a single best option took more work. However, after weighing performance against accuracy, I opted for the SWIN model.

I used the FastAI framework for both the CNN and transformer models, as it is easy to use and powerful for training. The models themselves were initialized using the timm library.

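from fastai.vision.all import ImageDataLoaders, Learner, CrossEntropyLossFlat
from timm import create_model
dls = ImageDataLoaders.from_df(df_train, …)
model = create_model(model_name, …)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), …)
predictions = learn.get_preds(dl=test_dl)  # returns a (predictions, targets) tuple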


It’s worth noting that all computer vision algorithms depend heavily on the data they are trained on, especially in the case of machine learning approaches. Both naïve datasets, such as cats and dogs, and domain-specific ones, such as medical images, can skew general trends. To minimize these biases, I selected a practical yet universal dataset for my experiments: images of cars and buses.

Experiment Setup

It’s important to note that real-world computer vision projects often have many specific details, conditions, and limitations to be considered. However, many educational projects oversimplify the task and take many things for granted. To bias toward real-life scenarios, I designed my experiments focusing on practicality and realism. As a result, I incorporated the following limitations in my initial conditions:

  • Limited sample size: I used datasets with a limited number of samples (250, 500, 1000 and 2000 samples)
  • Pre-trained vs. untrained models: I experimented with both pre-trained and untrained (randomly initialized) versions of each model (not applicable for HOG+SVM)
  • Smaller model variants: I chose model variants with a smaller number of parameters (EfficientNet-b0, SWIN-tiny)

These restrictions on data, computing resources, and the production environment reflect the limitations engineers and data scientists face daily, whether because of budget, hardware, or time constraints. By working with the data at hand, available GPUs, and accessible frameworks and toolsets, engineers can still learn to develop efficient, robust, and optimized models for real-world scenarios. The hyper-parameters for training were determined through separate optimization tests. Because the pre-trained SWIN model accepts only a fixed input resolution, I used 224×224 image samples throughout. I don’t consider my experiment a benchmark, although it provides valuable insights; for that reason, the resulting tables contain rounded values without fractional parts.
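The limited-sample condition can be reproduced with a class-balanced subsample. The sketch below is one possible way to do it in plain Python; the function name `balanced_subsample`, the label names, the class counts, and the fixed seed are assumptions for illustration, not the code from the experiment.

```python
import random
from collections import defaultdict

def balanced_subsample(labels, n_samples, seed=0):
    """Return sorted indices of a subsample with equal counts per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    per_class = n_samples // len(by_class)
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], per_class))
    return sorted(chosen)

# Hypothetical label list standing in for the real dataset.
labels = ["car"] * 1500 + ["bus"] * 1500
subsets = {n: balanced_subsample(labels, n) for n in (250, 500, 1000, 2000)}
```

Each entry in `subsets` holds the row indices for one of the four dataset sizes, with half of the samples drawn from each class.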

The data used in this project were obtained from the Kaggle environment.


To facilitate understanding and interpretation of the results, I have provided three tables: Accuracies, Timings and Model Sizes.

[Table body not recovered; surviving header cells: Dataset Size, HOG+SVM]
Table 1. Accuracy of the models trained with datasets of different sizes
[Table body not recovered; surviving header cells: Platform and Phase, HOG+PCA+SVM, EFN]
Table 2. Time in seconds of training and inference of the models on CPU and GPU
Model          Size
HOG+PCA+SVM    10 MB
EFN            17 MB
SWIN           112 MB
Table 3. Model sizes

As expected, the pre-trained transformer outperforms the other models in terms of accuracy. However, some of the results are particularly noteworthy and warrant further analysis. The following insights can help to better understand the performance of different models in specific contexts and to identify areas for further improvement:

  1. Despite having the highest accuracy, the pre-trained transformer model has the largest model size and underperforms on the CPU. Additionally, it has a fixed input image resolution, unlike the other two models, which could be a limitation in specific contexts.
  2. Regarding accuracy, both untrained CNN and Transformer models consistently perform worse than the HOG+PCA+SVM method.
  3. The untrained CNN model outperforms the untrained Transformer model, indicating that the latter may require more training data to generalize well.
  4. The HOG+PCA+SVM method performs comparably with pre-trained CNN for the largest dataset of 2000 samples.
  5. The HOG+PCA+SVM method has the smallest model size and the shortest training time on both CPU and GPU, which could make it a preferable option in scenarios with limited computing resources.
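As a back-of-the-envelope aid for reasoning about model size, the size of the stored weights is roughly the parameter count multiplied by the bytes used per parameter. The sketch below is arithmetic only; the 28-million parameter count is an approximate, assumed figure for a small transformer, not a number taken from the experiment.

```python
def model_size_mb(n_params, bytes_per_param=4):
    """Approximate size of the stored weights in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bytes_per_param / 1e6

ASSUMED_PARAMS = 28_000_000  # assumption: rough parameter count, for illustration only

size_fp32 = model_size_mb(ASSUMED_PARAMS)                     # 32-bit floats
size_int8 = model_size_mb(ASSUMED_PARAMS, bytes_per_param=1)  # 8-bit integers
```

Under these assumptions, storing each weight in one byte instead of four cuts the estimate to a quarter, which is the basic reason lower-precision weights help in storage-constrained environments.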

Final thoughts

The results from the experiments demonstrate that older computer vision approaches may outperform newer ones in specific categories, such as training from scratch with a small dataset. They may be a good choice for environments with limited memory, storage or computing resources. As in other engineering fields, having a broad understanding of the available approaches in computer vision can help to build effective and performant solutions.

It is worth noting that these experiments were conducted primarily out of curiosity and with a limited number of hypothesis checks. Therefore, the conclusions may differ if different datasets, models or frameworks are used. In real-world scenarios, additional techniques are often applied to improve performance, such as dataset augmentation to increase accuracy or model quantization to reduce its size. Here is my notebook that you can use to conduct your own experiments and discover interesting findings.

If you have any questions for our experts or would like to discuss your computer vision project, please get in touch with us through the contact form on our website. We will gladly assist you in your work and help you find optimal solutions for your tasks. Thank you for your interest in our research, and we look forward to a fruitful collaboration!