Porting of Finite Element Integration Algorithm to Xeon Phi Coprocessor-based HPC Architectures


In this article, we describe the implementation of the finite element numerical integration algorithm on the Xeon Phi coprocessor. The coprocessor extended the idea of a many-core unit specialized for calculations, and its performance was comparable to that of corresponding GPUs. Its main advantages were the built-in 512-bit vector registers and the ease of porting existing codes from traditional x86 architectures. We port the code developed for a standard CPU to the coprocessor and compare its performance with our OpenCL implementation of the numerical integration algorithm, previously developed for GPUs. Our auto-tuning mechanism adapts the GPU code to the coprocessor. The tests cover two types of problems, two types of approximation, and two types of elements. The obtained timing results allow us to compare the performance of highly optimized CPU and GPU codes on the Xeon Phi coprocessor. The article answers whether such massively parallel architectures perform better with the CPU or the GPU programming method. Furthermore, we compare the Xeon Phi architecture with the Intel i9 13900K, the latest Intel CPU available at the time of writing. This comparison determines whether the older Xeon Phi architecture remains competitive in today’s computing landscape. Our findings provide valuable insights for selecting the most suitable hardware for numerical computations and the appropriate algorithmic design.


CPU, Optimization, GPGPU, Intel Xeon Phi, FEM, Numerical Integration


Aug 29, 2023
How to Cite
KRUŻEL, Filip; BANAŚ, Krzysztof; IACOMO, Mauro. Porting of Finite Element Integration Algorithm to Xeon Phi Coprocessor-based HPC Architectures. Computer Assisted Methods in Engineering and Science, [S.l.], v. 30, n. 4, p. 427–459, aug. 2023. ISSN 2956-5839. Available at: <https://cames.ippt.pan.pl/index.php/cames/article/view/578>. Date accessed: 21 june 2024. doi: http://dx.doi.org/10.24423/cames.578.