Dear Commons Community,
The New York Times has an article today entitled “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights” that describes the effort needed to clean up large amounts of raw digital data before any benefits can be realized. Specifically, it examines what data scientists call “data wrangling,” “data munging” and “data janitor work.” Here is an excerpt:
“The field known as “big data” offers a contemporary case study. The catchphrase stands for the modern abundance of digital data from many sources — the web, sensors, smartphones and corporate databases — that can be mined with clever software for discoveries and insights. Its promise is smarter, data-driven decision-making in every field. That is why data scientist is the economy’s hot new job.
Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets…
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.
Timothy Weaver, the chief information officer of Del Monte Foods, calls the predicament of data wrangling big data’s “iceberg” issue, meaning attention is focused on the result that is seen rather than all the unseen toil beneath.”
As someone who has followed and written about data-driven decision making and the use of big data in education, I find the comments in this article very much on target. They should be heeded by any organization planning to invest in big data applications.
The article goes on to describe several small start-up companies that are trying to develop software to ease some of the janitorial work involved in big data, but it acknowledges that much progress is still needed. “We really need better tools so we can spend less time on data wrangling and get to the sexy stuff,” said Michael Cavaretta, a data scientist at Ford Motor, which has used big data analysis to trim inventory levels and guide changes in car design.