The prevalence of open source software (OSS) offers organizations the ability to develop data science projects quickly and affordably, although a lack of talent and concerns over open source security are challenges.
These were among the findings from Anaconda's 2022 State of Data Science Report, which found that the most valued benefits of open source are speed of innovation and affordability.
However, open source security concerns resulted in 40% of respondents pulling back on usage, with 31% citing it as the top concern.
Another key survey finding was that the lack of skilled professionals is one of the biggest barriers to the successful enterprise adoption of data science, cited by more than half (56%) of respondents.
Anaconda CEO and co-founder Peter Wang said that, overall, the findings in this year's report match the broader conversations he's seeing in the data science community.
"It was interesting to learn that 65% of respondents cited insufficient investment in data engineering and tooling to enable the production of good models as the biggest barrier to successful enterprise adoption of data science," he said.
Although poor models and inputs are known to cause friction in data science and machine learning (ML), it was unexpected that this response ranked highest, given that obstacles such as insufficient data science skills are also at play, Wang said.
OSS, and particularly the Python programming language, plays an enormous role in data science today.
For decades, developers and data scientists were building and open sourcing the tools they used to analyze large sets of data — often in Python.
Once organizations discovered that they were sitting on a mountain of data that could trigger a new wave of growth, OSS became a mainstay thanks to an unmatched community of innovators and a lower lifetime cost to the business, Wang explained.
"Just look at Python," he said. "Over the last decade, Python has grown to become the most popular programming language used by data scientists, coders, and hobby developers alike and continues to translate into new use cases."
Wang added that he expects open source and data science to continue pushing each other further.
"I'm incredibly hopeful to see more open source involvement as it relates to bias in the data science field, as well as AI and ML," he said.
The survey also found that 32% of students said they were rarely or never taught about bias in their AI, ML, or data science classes.
"As we move forward, this should be a major focus for those shaping the future of data science," Wang said. "We'll begin to see priorities shift toward reinvesting in the open source community and its infrastructure, and I'm optimistic we'll see this from education institutions."
Open Source Security Concerns Grow
Security concerns are growing because incidents like the Log4j exploit have served as a major wake-up call across the board, according to Wang.
"We live in a world where open source is now embedded in nearly every piece of software and technology, and up until recently, there were those at a management level who didn't even realize they were using open source," he said.
Now that things have shifted to place more emphasis on securing software supply chains, open source security has become a top priority.
From Wang's perspective, the main challenge is that these conversations around open source security are still somewhat new within the data science community.
"Vetting secure code is something developers and IT are familiar with, but it's a newer territory for data science," Wang explained. "We're starting to see some IT teams building out strict open source posture, leaving data scientists, who aren't expert developers, to do their analysis with whatever they can download."
Organizations Can Help Build Data Science Skills
Wang pointed out that data science is a form of literacy, and it's something that can be taught to any employee.
While some professionals will specialize as data scientists, which requires strong skills and a grounding in math and statistics, others need to grasp only a few specific concepts to see how data can improve their work.
"Realizing that not everyone is a data scientist, but everyone can use data science is an important step for building data literacy skills," Wang said. "Today, non-programmers are increasingly picking up Python not just to analyze data, but also to build applications, games, and other projects."
The best way for organizations to start building these skills among their employees is to teach them the fundamentals of data science, he added.
About the author
Nathan Eddy is a freelance writer for ITPro Today. He has written for Popular Mechanics, Sales & Marketing Management Magazine, FierceMarkets, and CRN, among others. In 2012 he made his first documentary film, The Absent Column. He currently lives in Berlin.