Implementing the Multi-Layer Perceptron Algorithm on NVidia GPUs

Laszlo Marak

doi:10.14419/r2hvcq88

Authors

Laszlo Marak
J. Selye University, Bratislavská cesta 3322, 945 01, Komárno, Slovakia

Received date: November 11, 2024

Accepted date: November 28, 2024

Published date: December 26, 2024

Keywords:

MLP, Multi-Layer Perceptron, GPU Implementation, OpenCL, Parallel Implementation

Abstract

With the adoption of machine learning algorithms for image processing tasks and the ever-growing need for embedded device applications, developers use several methods to optimize the computational efficiency of their applications. Optimizing algorithms can be challenging, and developers must apply non-trivial strategies to exploit the computational resources of computer architectures more efficiently. In this article we describe an efficient GPU implementation of the Multi-Layer Perceptron (MLP) algorithm. The MLP is a basic algorithm for machine learning and artificial intelligence, and is an excellent example of the difficulties surrounding GPGPU programming and optimization. As an independent observation, we discuss the memory management of GPUs and methods to simplify the memory allocation process.
