If you’re into computational finance, you might have heard of FinanceBench.
It’s a benchmark developed at the University of Delaware, aimed at people who work with financial code and want to see how certain code paths can be targeted for accelerators. It uses the original QuantLib software framework and samples to port four existing quantitative finance applications. It contains code for the Black-Scholes, Monte-Carlo, Bonds, and Repo financial applications, which can be run on the CPU and GPU.
The problem is that it has not been maintained for 5 years and there was clear room for improvement. Even though the paper was already written, we think it can still be of good use within computational finance. As we were looking for a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point. We emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward.
The first version of the code will be released on GitHub early next month. Below we discuss some design choices and preliminary results.
Work done before
Of course the first step was selecting good algorithms that both needed a lot of compute and were representative of the finance industry. We could not have done this as well as the research team did. This was the main reason to choose the project.
The work done for making the code, according to the team itself:
The original QuantLib samples were written in C++. QuantLib is a C++ library. Unfortunately, languages like OpenCL, CUDA, and OpenACC cannot directly operate on C++ data structures, and virtual function calls are not possible. Because of this problem, all of the existing code had to be “flattened” to C code. We used a debugger to step through the code paths of each application and see what lines of QuantLib code are executed for each application, and manually flattened all of the QuantLib code.
This is typical work when porting CPU code to the GPU. Complex C++ code can be very hard to bend in directions it was not designed for. Simplification is then the first thing to do, so the code can be split into manageable parts. As this can be time-consuming, we were happy it was already done, though it is often easier to also have the original simplified CPU code for reference.
We have a different focus than a research group, and logically this results in code changes. We’re now in phase 1.
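To make the idea of "flattening" concrete, here is a minimal sketch. It assumes a hypothetical payoff class hierarchy; the names below are made up for illustration and are not taken from QuantLib or FinanceBench. The virtual call is replaced by a plain struct with a type tag and a free function, which is the shape of code that maps easily onto a GPU kernel or an OpenMP loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Object-oriented style (hypothetical example, not actual QuantLib code):
// the payoff is selected at run time through a virtual call.
struct Payoff {
    virtual ~Payoff() = default;
    virtual double value(double spot) const = 0;
};

struct CallPayoff : Payoff {
    explicit CallPayoff(double k) : strike(k) {}
    double value(double spot) const override { return std::fmax(spot - strike, 0.0); }
    double strike;
};

struct PutPayoff : Payoff {
    explicit PutPayoff(double k) : strike(k) {}
    double value(double spot) const override { return std::fmax(strike - spot, 0.0); }
    double strike;
};

// Flattened style: plain data plus a free function, no virtual dispatch.
enum PayoffType { PAYOFF_CALL = 0, PAYOFF_PUT = 1 };

struct FlatPayoff {
    int type;       // one of PayoffType
    double strike;
};

inline double flat_payoff(const FlatPayoff* p, double spot) {
    assert(p->type == PAYOFF_CALL || p->type == PAYOFF_PUT);
    return p->type == PAYOFF_CALL ? std::fmax(spot - p->strike, 0.0)
                                  : std::fmax(p->strike - spot, 0.0);
}

// Each iteration is independent of the others, so this loop body is what
// would become the body of a kernel or of a parallel loop.
inline void flat_payoff_batch(const FlatPayoff* payoffs, const double* spots,
                              double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = flat_payoff(&payoffs[i], spots[i]);
    }
}
```

The flattened version gives the same numbers as the class-based one, but all data lives in simple arrays that can be copied to a device as-is.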
Phase 1: Focus on OpenMP and CUDA/HIP
As it’s difficult to focus on multiple languages while improving performance and project quality, we focused on OpenMP and CUDA first. Then we ported the project to HIP and made sure that the translation from CUDA to HIP could be done quickly. This way we could make sure CPUs and the fastest GPUs could be benchmarked, leaving OpenCL and OpenACC out for now. We have no intention of keeping HMPP in, and have chosen to introduce SYCL to prepare for Intel Xe. We are more interested in benchmarking different types of algorithms on all kinds of hardware than in comparing programming languages.
The project has also been cleaned up, CMake was introduced, and Google Benchmark was added, to make it easier for us to work on. We did not look into QuantLib for improvements or read papers on the latest advancements. So the goal was really to get it started.
We picked a few broadly available AMD and Nvidia GPUs, and chose a dual-socket Xeon (40 cores in total) for the CPU benchmarks. The times are INCLUDING transfer times for the GPUs. The original benchmark unfortunately showed compute times only, so we might get some Nvidia Kepler GPUs back in a server to re-benchmark these.
QMC (Sobol) Monte-Carlo method (Equity Option Example)
Monte Carlo is often seen when HPC is applied in the finance domain. Its appeal is the easy interpretation and straightforward implementation, which makes it easy to explain to HPC developers while showing the performance advantage to quants. It returns a distribution of future asset prices by running thousands to millions of simulations.
The results below are from the code as provided, with a direct port to HIP to include AMD GPUs. As you can see, the Titan V and GTX980 numbers don’t look good.
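To give an idea of how simple the core of such a simulation is, below is a minimal sketch of a Monte Carlo price for a European call option. Note the assumptions: it uses plain pseudo-random numbers from the C++ standard library instead of a Sobol sequence, it simulates a single time step under geometric Brownian motion, and the function name `mc_european_call` is hypothetical. It is not the FinanceBench kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>

// Hypothetical helper: Monte Carlo estimate of a European call price.
// Assumes geometric Brownian motion and one time step to maturity.
// Uses pseudo-random draws (std::mt19937_64), not a Sobol sequence.
double mc_european_call(double spot, double strike, double rate, double vol,
                        double maturity, std::size_t n_paths, unsigned seed) {
    assert(n_paths > 0 && vol >= 0.0 && maturity > 0.0);

    std::mt19937_64 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);

    // Terminal price: S_T = S_0 * exp((r - sigma^2 / 2) T + sigma sqrt(T) Z)
    const double drift = (rate - 0.5 * vol * vol) * maturity;
    const double diffusion = vol * std::sqrt(maturity);

    double payoff_sum = 0.0;
    for (std::size_t i = 0; i < n_paths; ++i) {
        const double z = gauss(rng);
        const double terminal = spot * std::exp(drift + diffusion * z);
        payoff_sum += std::fmax(terminal - strike, 0.0);
    }

    // Discount the average payoff back to today.
    return std::exp(-rate * maturity) * payoff_sum / static_cast<double>(n_paths);
}
```

Every path is independent of the others, which is why this type of workload spreads so well over many CPU cores or GPU threads. The only shared step is the final sum.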
|2x Intel Xeon E5-2650 v3 OpenMP||215.311||437.140||877.425|
|MI25 (Vega 10)||13.694||26.733||52.971|
Here are the results after fixing the obvious problems and picking the low-hanging fruit. This benefited the Nvidia GPUs a lot, and the AMD GPUs as well. There was no low-hanging fruit in the OpenMP code, so no speedup there.
|2x Intel Xeon E5-2650 v3 OpenMP||215.311||437.140||877.425|
|MI25 (Vega 10)||10.857||21.147||41.214|
Black-Scholes-Merton Process with Analytic European Option engine
Black-Scholes is used for estimating the variation over time of financial instruments such as stocks, and uses the implied volatility of the underlying asset to derive the price of a call option. Again, it is compute intensive.
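For reference, the closed-form Black-Scholes price of a European call fits in a few lines. The sketch below is a textbook version without dividends; the function names are hypothetical and it is not the FinanceBench implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Standard normal cumulative distribution function, via erfc.
inline double norm_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Textbook Black-Scholes price of a European call without dividends.
// Hypothetical helper for illustration only.
double bs_call_price(double spot, double strike, double rate, double vol,
                     double maturity) {
    assert(spot > 0.0 && strike > 0.0 && vol > 0.0 && maturity > 0.0);
    const double sqrt_t = std::sqrt(maturity);
    const double d1 = (std::log(spot / strike) +
                       (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t);
    const double d2 = d1 - vol * sqrt_t;
    return spot * norm_cdf(d1) -
           strike * std::exp(-rate * maturity) * norm_cdf(d2);
}

// Price a batch of options that share rate and volatility.
// Each option is priced independently of the others.
void bs_call_batch(const double* spots, const double* strikes,
                   const double* maturities, double rate, double vol,
                   double* prices, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        prices[i] = bs_call_price(spots[i], strikes[i], rate, vol, maturities[i]);
    }
}
```

As a sanity check, a spot and strike of 100, a rate of 5%, a volatility of 20%, and one year to maturity give a call price of about 10.45. Since each option needs only a handful of arithmetic operations, moving the inputs and outputs around can easily cost more than the computation itself.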
The performance of the original code looked good at first sight, but the transfers took 95% of the time. That’s for the next phase.
|2x Intel Xeon E5-2650 v3 OpenMP||5.005||22.194||43.994|
|MI25 (Vega 10)||7.827||36.947||71.499|
On some projects it’s better to focus on the largest bottleneck first. For this project we chose to go through the code in a structured way. It is sometimes difficult to explain that improvements will only pay off very late in the project. Luckily, “experience” is often accepted as an explanation.
So the applied fixes had a good effect, but are hardly noticeable right now.
|2x Intel Xeon E5-2650 v3 OpenMP||5.005||22.194||43.994|
|MI25 (Vega 10)||7.037||31.737||64.898|
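Why kernel fixes barely show up when transfers dominate can be illustrated with Amdahl's law. The sketch below is a rough illustration only, assuming the kernel takes about 5% of the total time (the remainder being transfers, as mentioned above) and that the transfer time itself does not change. The function name is hypothetical and this is not a model of the measured numbers.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speedup when only a fraction of the run time
// (here: the kernel) is accelerated and the rest (here: transfers) is not.
// Hypothetical helper for illustration.
double overall_speedup(double kernel_fraction, double kernel_speedup) {
    assert(kernel_fraction >= 0.0 && kernel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

With `overall_speedup(0.05, 2.0)` a kernel that becomes twice as fast improves the total time by only about 2.6%, and even an infinitely fast kernel cannot do better than 1 / 0.95, roughly 5%. That is why the transfers have to be tackled before the kernel improvements become visible.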
Fixed-rate bond valuation with flat forward curve
Only ported to HIP. We did not do any optimisations yet, as we first have to focus on stability problems with 2 AMD GPUs. With the current code Vega 20 is faster than Titan V.
|2x Intel Xeon E5-2650 v3 OpenMP||186.928||369.718||732.446|
Securities repurchase agreement
Only ported to HIP. The code for the Bonds benchmark still needs some attention. You can see that the GTX980 is too slow in comparison.
|2x Intel Xeon E5-2650 v3 OpenMP||241.248||482.187||952.058|
As you can see, this is really a work in progress. Why show it already? The reason is that you can see how a project goes. Cleaning up the code is done in every project, to avoid delays later on. Adding good tests and benchmarks is another foundational step. Most of the time has gone into these preparations, and only limited time into the improvements.
Milestones we have planned for now:
- Get it started + low-hanging fruit in the kernels (WIP)
- Looking for structural problems outside the kernels + fixes
- High-hanging fruit for both CPU and GPU
- OpenCL / SYCL port
- Optional: extending algorithms (by request?)
Feel free to contact us with any questions.