Baseball data analysis website FanGraphs adopted the MariaDB SkySQL cloud database recently to work with fluctuating and ever-growing information coming out of the sport. FanGraphs, which gathers granular data including the velocity of pitches thrown during games, is using the cloud database to process statistics, complex queries, projections, and models of playoff odds.
“Anything that’s baseball, we’re taking a look at,” says David Appelman, CEO and founder of FanGraphs.
Now that the 2021 season of Major League Baseball is underway, he says there is new Statcast data introduced by the league that must be accommodated. “The data can be pretty wide,” Appelman says. “There’s a lot of records for each individual event that happens in baseball. On a season-level, there’s something in the realm of a million records a season for data for every individual pitch thrown.”
There is also data from minor league teams as well as baseball leagues overseas to be ingested by FanGraphs, he says. “It’s a fairly sizeable amount of data.” FanGraphs tends to run thousands of queries per second on its database to serve its audience, Appelman says. Adding more international data is a priority for FanGraphs, he says, along with more Statcast data from MLB.
Appelman says he personally managed the database of FanGraphs, which was founded in 2005, until 2019. Over the years, his company has tried different resources to improve its efficiency, with varied results. FanGraphs first migrated to MariaDB about seven years ago, Appelman says, then considered a migration to Linux, but that brought up several potential headaches. “I didn’t want to deal with migration,” he says. “Optimizing the database for Windows is one thing. Optimizing it on a Linux box is a completely different thing.”
Appelman says he did not have time to sort that out while other operations required attention. FanGraphs considered other options, such as moving the database to a turnkey solution. “I looked at Amazon Relational Database Service and Cloud SQL,” he says.
About the time FanGraphs was looking to move and offload all its database administration, Appelman got a tech briefing for MariaDB SkySQL that opened up new possibilities. “It was fast. It seemed it would handle all my needs,” he says.
FanGraphs entered into a contract with MariaDB to migrate first to Linux and then, in February of this year, to SkySQL. This also led to FanGraphs moving from dedicated servers to the Google Cloud Platform. “We just needed more flexibility,” Appelman says. The infrastructure migration to GCP included app servers and data loading servers.
This was not FanGraphs’ first attempt at taking advantage of the cloud. In 2017, the company tried to migrate to a smaller cloud provider, Appelman says, trying to match exact resources such as RAM and processing power. “We ran into big problems,” he says. “The next morning, I had to migrate back. What I didn’t quite realize was that with the service I moved to, the hypervisor was causing really bad I/O. The database became this huge bottleneck.”
Appelman says he was also reluctant to move his infrastructure to AWS because of the learning curve he faced with its resources. He needed another option. “GCP fit a nice middle ground,” Appelman says. “I found it a little bit easier to set up than AWS.”
There were still performance questions raised with the move. The migration of FanGraphs from a 4xSSD RAID 10 array in a dedicated machine to the cloud, Appelman says, seemed at first to be a downgrade in raw power. “That doesn’t seem to be the case anymore,” he says. “Things are running great. We had no problems migrating to SkySQL and GCP this time.”
FanGraphs is now considering additional SkySQL resources it might tap into, Appelman says, such as its data warehousing technology. “We need second or low-second or sub-second responses for a lot of our queries,” he says. “We want people to be able to do very fast, ad hoc data analysis. With certain types of MLB data, there’s now a lot more than it used to be -- we’re hoping to take advantage of that to bring our users a lot more granular and customizable analysis without having to wait a while to get the results.” Other resources from SkySQL might be leveraged in the future to run multithreaded, single queries for more efficient processing time, Appelman says.
There are a few wish-list items he wants to explore now that FanGraphs has committed to the cloud. Appelman says he has yet to scratch the surface of GCP’s resources that might be of interest, such as machine learning. For now, he is eager to see continued development of reporting tools on the SkySQL database. “Knowing exactly where the bottlenecks are in our application makes a big difference for me,” Appelman says. “I’ve used some third-party tools to figure out which queries I’ve botched. Having that available in the reporting section would be useful.”