In its latest move to distinguish itself in a field that tightened up years ago, Oracle Cloud last week formally launched its first series of VM-based and bare-metal compute instances based on Arm processors — specifically, the 80-core Ampere Altra. Oracle had been hosting these new “A1” instances for select customers.
“Oracle is the only cloud provider offering Arm instances in what I’m calling a ‘penny-core,’” said Bev Crair, Oracle Cloud Infrastructure’s senior VP for compute, speaking with DCK. “It’s one cent per core-hour, with our flexible VM sizing from 1 to 80 OCPUs, or bare-metal servers, with 160 cores and a terabyte of memory.”
Oracle Cloud Infrastructure (OCI) compute commodities are based on the metric to which Crair referred as OCPU. Typically, an OCPU correlates to a single processor core, and OCI is capable of subdividing VMs along core boundaries, making available virtual CPUs with very flexible boundaries. Crair’s “penny-core” refers to one core of an Altra processor made available at $0.01 per hour. (There’s actually an additional charge of $0.0015 per hour for each gigabyte of memory consumed.) It is indeed possible to stand up a single-core job, although OCI is also making small A1 instances available on its free tier for no charge.
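To make the pricing concrete, here is a minimal sketch of the hourly arithmetic using the two rates quoted above. The function name and the example memory sizes are hypothetical, and the sketch assumes the memory charge applies to whatever memory is allocated to the VM, which is our reading rather than something Oracle spelled out.

```python
# Rates as quoted in the article (USD).
OCPU_RATE = 0.01       # per OCPU-hour on A1
MEMORY_RATE = 0.0015   # per gigabyte-hour of memory


def a1_hourly_cost(ocpus: int, memory_gb: float) -> float:
    """Estimated hourly price in USD for a flexible A1 VM (hypothetical helper)."""
    if not 1 <= ocpus <= 80:
        raise ValueError("flexible A1 VMs are sized from 1 to 80 OCPUs")
    if memory_gb < 0:
        raise ValueError("memory_gb must be non-negative")
    return ocpus * OCPU_RATE + memory_gb * MEMORY_RATE


if __name__ == "__main__":
    # Example sizes are illustrative only, not OCI defaults.
    for ocpus, memory_gb in [(1, 6), (4, 24), (80, 512)]:
        cost = a1_hourly_cost(ocpus, memory_gb)
        print(f"{ocpus:>2} OCPU, {memory_gb:>3} GB: ${cost:.4f}/hour")
```

Even in this rough form, the memory charge is visibly not negligible: at a few gigabytes per core, it adds close to another cent per core-hour.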
Why would customers of OCI, or any other cloud service provider, want to consider moving their workloads from x86 into an Arm environment? Crair pointed to what she described as increasingly predictable performance from a single-threaded core. That predictability lends itself to a more regular scaling factor, where four cores can be presumed to deliver pretty much four times the performance of one core rather than around three and a half. She also mentioned an absence of the security issues that have cropped up around hyperthreading, Intel’s implementation of simultaneous multithreading, which lets a single core run two threads at once.
Ampere v Intel and AMD
DCK sought to put these assertions to the test. We asked Oracle for its latest available performance data for a real-world job, pitting Ampere A1 instances against similarly equipped x86 VM instances. OCI obliged us by providing exclusive benchmark data for a video transcoding job, with speeds measured in frames per second.
The OCI team pitted its new A1 instances against VM instances hosted on AMD second-generation “Rome” processors (E3) and third-generation “Milan” processors (E4), plus Intel first-generation Xeon Scalable “Skylake” processors (X7). Thread counts from 1 to 8 were tested separately to demonstrate scalability.
A single A1 core transcodes video at a rate of 4.93 frames per second, according to OCI tests. That’s compared to 6.34 FPS for the E3, 6.8 FPS for the E4, and an eye-openingly low 3.55 FPS for the Intel X7.
Using Oracle’s data for thread counts from 1 to 8, we projected the transcoding times for a typical 30-minute digital video, which runs at 29.97 FPS and therefore has a total frame count of 53,946.
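The projection itself is simple division, and a rough sketch of it follows. It uses only the single-thread rates OCI reported; the function names are ours, and the frame count is left as an input rather than fixed. The second helper expresses each rate as a multiple of real time, which does not depend on the length of the clip.

```python
# Single-thread transcoding rates reported by OCI, in frames per second.
SINGLE_THREAD_FPS = {"A1": 4.93, "E3": 6.34, "E4": 6.8, "X7": 3.55}

SOURCE_FPS = 29.97  # playback rate of the source video


def transcode_seconds(total_frames: float, fps: float) -> float:
    """Seconds needed to transcode total_frames at a steady rate of fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return total_frames / fps


def slowdown_vs_realtime(fps: float, source_fps: float = SOURCE_FPS) -> float:
    """How many seconds of work each second of source video requires."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return source_fps / fps


if __name__ == "__main__":
    for shape, fps in SINGLE_THREAD_FPS.items():
        print(f"{shape}: {slowdown_vs_realtime(fps):.2f}x real time on one thread")
```

Multi-thread projections work the same way, substituting the measured rate at each thread count for the single-thread figure.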
At a single-core level, A1’s performance starts out between that of the Intel and AMD processors. But Skylake’s transcoding times appear to bottom out at 7 cores, while Ampere Altra’s times slide below those of the AMD instances, and Altra looks to be capable of performance improvements up to as many as 12 threads.
What we really wanted to know, though, was whether A1 instances were truly economical. It’s worth noting that a single x86-based OCPU counts for two threads, not just one (as Oracle confirmed to us), on account of simultaneous multithreading, which Intel markets as hyperthreading. So, a four-OCPU x86 instance would get you 8 threads, for which you would need an 8-OCPU A1 instance. That might affect the scalability factor somewhat, as Skylake, Rome, and Milan should theoretically double Altra’s scalability.
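The thread-to-OCPU mapping can be sketched in a few lines. The shape labels come from Oracle's benchmark; the helper name is hypothetical, and rounding an odd thread count up to a whole OCPU on x86 is our assumption about how such a request would be sized.

```python
import math

# Threads exposed per OCPU: one on Arm-based A1, two on the x86 shapes.
THREADS_PER_OCPU = {"A1": 1, "E3": 2, "E4": 2, "X7": 2}


def ocpus_for_threads(shape: str, threads: int) -> int:
    """Whole OCPUs needed to obtain at least `threads` hardware threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Assumption: OCPUs are allocated in whole units, so round up.
    return math.ceil(threads / THREADS_PER_OCPU[shape])


if __name__ == "__main__":
    print("threads  A1 OCPUs  x86 OCPUs")
    for threads in range(1, 9):
        a1 = ocpus_for_threads("A1", threads)
        x86 = ocpus_for_threads("E4", threads)
        print(f"{threads:>7}  {a1:>8}  {x86:>9}")
```

On the x86 shapes, each pair of consecutive thread counts lands on the same OCPU count, while on A1 every added thread is another OCPU.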
Oracle suggested we examine a metric it called frames-per-second per dollar (FPS/$). Using OCI’s frames-per-second measurements, it appears A1 yields a remarkably predictable cost as OCPUs scale up. The even- and odd-numbered thread counts are staggered for the other three instance types because of OCI’s two-threads-per-OCPU factor. Nevertheless, while AMD-based instances appear to offer the best performance for dollar at 2 threads, that peak declines steadily up to 8 threads. And Skylake costs — at least on Oracle Cloud — don’t appear justifiable.
The real tale of the tape is when we estimate the dollars-and-cents cost of transcoding that half-hour video. If you were to reserve 8 threads on OCI, you would pay as much as 81¢ on an X7 instance for a job that would cost you about a nickel on an A1 and 8¢ on a Rome- or Milan-based instance.
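For readers who want to run their own numbers, here is a hedged sketch of how FPS/$ and the cost of a job relate. Oracle did not tell us exactly how it normalizes FPS/$, so dividing throughput by the hourly instance price is an assumption, as are the function names. Memory charges and billing granularity are ignored, and the 8-thread throughput and frame count in the example are placeholders, not Oracle's measurements.

```python
def fps_per_dollar(fps: float, hourly_price_usd: float) -> float:
    """Throughput per dollar of hourly spend (assumed definition of FPS/$)."""
    if hourly_price_usd <= 0:
        raise ValueError("hourly_price_usd must be positive")
    return fps / hourly_price_usd


def job_cost_usd(total_frames: float, fps: float, hourly_price_usd: float) -> float:
    """Cost of transcoding total_frames at a steady fps on one instance."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    hours = total_frames / fps / 3600
    return hours * hourly_price_usd


if __name__ == "__main__":
    # One A1 core at the quoted $0.01 per OCPU-hour, using OCI's 4.93 FPS figure.
    print(f"A1, 1 thread: {fps_per_dollar(4.93, 0.01):.0f} FPS/$")

    # Placeholder inputs: 8 A1 OCPUs, a hypothetical 40 FPS, 100,000 frames.
    price = 8 * 0.01
    print(f"Hypothetical job: ${job_cost_usd(100_000, 40.0, price):.4f}")
```

Under this definition the two measures are reciprocal views of the same data: the cost of a job is the frame count divided by 3,600 times the FPS/$ figure.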
Oracle’s Crair touted the virtue of predictable single-threaded performance. We asked, what’s the value proposition for running single-threaded tasks on a VM hosted by an 80-core processor such as Ampere Altra A1, unless the customer runs a plurality of tasks in parallel?
“The value proposition is predictable performance and a huge reduction in the noisy neighbor effect of running a single-threaded core,” responded Matt Leonard, OCI’s VP of compute, in a note to DCK.
“In a multi-tenant environment,” Leonard continued, “you can have multiple tenants competing for the same resources via a multithread environment and this will result in unpredictable performance. To a certain extent, this applies to all multicore processors, [as] cloud usages are predominantly multi-threaded and are able to take advantage of multiple cores by default. Task parallelism is one way of taking advantage of lots of cores; thread-level parallelism is another. If someone wants to rent a VM to run a single-threaded task, they would not rent out an 80-core VM; they’d stick to one with 1 vCPU. In that case, the A1 provides predictable performance at a much lower cost than comparable x86 VMs.”
Leonard also confirmed for us that it is indeed possible for an OCI customer to spin up Kubernetes clusters for both A1 and x86-based OCPUs and manage both classes from a single hub. He noted that this is possible due to each cluster being capable of supporting what Oracle calls its own “shape,” which is the OCI term for a configuration whose template sets aside given amounts of OCPUs and memory.
In June 2020, as Ampere was premiering the 128-core edition of its Altra processor, its senior VP at the time, Jeff Wittich (since promoted to Chief Product Officer), presented the first test results that appeared to show Altra’s performance for any given number of cores was a nearly perfect multiple of its single-core performance.
“We’re using single-threaded cores,” explained Wittich. “That’s because we want to ensure we don’t end up with noisy neighbor problems [and] best possible performance without having a lot of resource contention. And we don’t want to open up the attack surface for things like side-channel attacks. So, for us, when you scale threads, you’re scaling physical cores. And you can see, we are almost at ideal scaling. As you scale up to 160 cores across two sockets, you’re getting 98 percent of the performance you’d expect by the time you get to 160 cores.”
OCI has also begun making available bare-metal Ampere instances on dual-processor systems (for 160 cores) with 1 TB of memory. Oracle’s move comes in the wake of Ampere’s surprise announcement that it will begin developing its own Arm-based cores for future models rather than rely on Arm Neoverse N1 designs.