Machines have long been more efficient and accurate at generating data than humans. It’s this basic observation that led to concepts such as the Internet of Things and smart connected devices. But generating mountains of data doesn’t automatically translate to increased intelligence for your organization. The term “big data” itself, likely coined in the 1990s, initially referred to the difficulty traditional relational databases had in efficiently processing and unifying growing and diversifying data streams.
The problem of making sense of vast quantities of data is driving skyrocketing salaries for data scientists. Some organizations are throwing up their hands in desperation and renting armies of khaki-clad, highly paid consultants to help them make sense of their data deluge. Other companies are seeking to hire their own data scientists, but in a white-hot market, such experts can be difficult to find, expensive to hire and tricky to retain.
There is a middle ground between those extremes. Organizations can enlist the help of a partner that “will teach them how to fish,” said Nisha Muktewar, a data scientist at Cloudera, referring to the proverb that treats learning how to fish as a metaphor for self-sufficiency.
“We are a big supporter of helping organizations build in-house talent,” Muktewar said. “Sometimes, that can involve initially some hand-holding, and stuff like that,” depending on the central problem the organization is working to solve and the experience of its team. But because the range of data science skills is vast and evolving, helping a company build a data science practice often starts with identifying the most logical core strategies for parsing data, whether it’s machine learning, deep learning, natural language processing (NLP) or another approach, so the company knows which branches of the discipline matter most. An organization’s specific needs will dictate the type of data science skills that will be useful for a given project. There can be a misconception that data scientists “know everything about everything, but there’s always a limit to the extent that any one person can master one domain, which is why I think you see a lot of organizations with Ph.D.s in a specific area like NLP,” Muktewar said.
“When we first come in, we try to understand what kind of data there is, what kind of solutions the organization has already built, and what’s working and not working for them, and why,” said Muktewar, who works within a Cloudera division known as Fast Forward Labs, which focuses on advising clients on data science and applied machine intelligence research. “We try to get a sense of what their limits are — not to judge them, but just to understand what they are capturing.”
When asked about the so-called 80/20 data science dilemma, which holds that roughly 80 percent of a data scientist’s time is spent finding, cleaning and preprocessing data before it can be used for modeling, Muktewar demurred. “If your data is really bad, you can spend maybe 50 percent of your time prepping the data.” It’s a given that certain data formats are more challenging to work with, she said. “Is the data stored in JSON files? Is it in EBCDIC format? Most of that time [spent with the data in the beginning] is not cleaning it, but actually getting to understand it.”