Quora archive · Feb 14, 2011, 10:56 AM PST

Question

Why are spreadsheet champions considered to be expert data analysts while computer science majors take a back seat?

Answer

Let me illustrate what's happening with an example. I have no skin in either the CS or data science games (I am a control theory guy with amateur yellow-belt status in both fields). A former manager had been massaging some financial data and she had the intuition that a general LP model could be built. She'd built a very simple 2-variable example and hand-optimized the answer. She didn't know how to build the general LP. I didn't either, initially, when she brought me in to systematize what she was doing.

She dumped a big spreadsheet of data on me. I can't get into the details, but basically I spent about 3-4 days playing with the toy example at spreadsheet level, then went away and did a good deal of pen-and-paper math to cast the general data into the form of an LP, and then wrote Matlab code to pull the data from Excel and pop out the optimal results. It was a general solution that depended on some elegant linear algebra manipulations to get to an efficient representation. Not path-breaking, but I was nevertheless very proud of it because the math turned out to be very pretty. It was what my PhD advisor used to call "romantic math" (all sequences and summations and stuff). I wrote it up as an internal paper primarily because the 4-5 pages of LaTeX equations were so darn pretty.
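
To make that gear shift concrete without getting into the confidential details: the final shape of the thing was "read the tabular data, build the LP matrices, hand them to a solver." The real version was Matlab pulling from Excel; below is a hypothetical Python/scipy equivalent, with invented file, sheet, and column names, just to show the kind of artifact that sits on the far side of the jump.

    import pandas as pd
    from scipy.optimize import linprog

    # Hypothetical workbook layout (the real one is confidential):
    #   "items"  sheet: one row per decision variable, with a unit_cost column
    #            and one usage column per resource
    #   "limits" sheet: columns "resource" (name matching a usage column in
    #            "items") and "available" (how much of that resource exists)
    items = pd.read_excel("model.xlsx", sheet_name="items")
    limits = pd.read_excel("model.xlsx", sheet_name="limits")

    c = items["unit_cost"].to_numpy()              # objective: minimize total cost
    A_ub = items[limits["resource"]].to_numpy().T  # resource usage per unit of each variable
    b_ub = limits["available"].to_numpy()          # resource limits

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(c))
    print(res.x, res.fun)                          # optimal plan and its cost

The elegance, such as it was, is not in those dozen lines; it was in the pen-and-paper representation work that made c, A_ub and b_ub fall out cleanly from the raw spreadsheet in the first place.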

The key in this little vignette is the "jump" from spreadsheet to programmed models, and the fact that some "elegant" insights were needed to do the jump. I could not smoothly segue from one regime to another, or brute force my way to a programmed model. I had to make a discontinuous gear shift (supported by pen-paper math derivations in this case). The shift was enabled by a problem-specific insight about the right representation.

The heart of the answer to your question is: "Why was the jump necessary? Why couldn't I gradually refine the toy, hand-computed spreadsheet model to become the generic LP-solving model?"

The gap is the reason the two communities are separate. It is also the reason why the CS types take a backseat.

At the current state of the art, you cannot avoid the jump, and I don't think you ever will in the general case. The reason the CS types take a back seat is that the "elegant insight" needed to make the jump starts on the spreadsheet side of the gap, which too many CS types are contemptuous of. Here's why.

Spreadsheets and programming bring fundamentally different conceptual metaphors to thinking about numbers, and support different ways of wrangling them. The gap between the two "regimes" until recently had no technology tools in it. You can only leap across the gap with problem-specific elegant insights.

Spreadsheets are based on our direct intuitions about sorting and ordering. They help you detect basic patterns in raw data, and structure relationships efficiently for human presentation rather than db schemas. By playing with tabular data directly, you get the clues that allow you to do very basic slices and dices and graphs that help you develop an intuition for the dataset. It helps you improvise and start "jamming" in a jazz/raga sense with the numbers, directly.

At some point you get an Aha! experience ("hmm... ARIMA??") about the data, and the spreadsheet UX metaphor starts to fail. But the Aha! is also the strategic insight that gets the programming ball rolling.

But there is a dark gap between the intuition and the first line of code, which we'll get to. Beyond the gap is programming.

Think of the programming regime as the very opposite of jazz/raga-style music. It's like composing a symphony for an orchestra: lots of moving parts that have to be precisely synchronized. Get one piece in the wrong place and the whole thing collapses. But get it right and you get beauty.

Programming deals with numbers procedurally, like an assembly line process. If it follows an initial "jazz jamming with the data" period, you get elegant code symphonies. If you make up a data model in the abstract without looking for patterns and potential representational efficiencies first, you get cacophonies.

Many CS types jump straight to cacophonous designs without doing enough "data jazz" to create the elegant insights that can lead to symphonies instead. They'd rather crunch data in bureaucratic "one size fits all" ways than LOOK at it.

The exceptions tend to be people with a sense of aesthetics around NP-complete problems, who know about things like phase-transitions in solvability etc. They tend to hunt for the elegant insights at spreadsheet level before diving in.

I have a rule: I NEVER build mathematical models or programming models without getting a hands-on sense for the data first, through play/improvisation. And I don't build models at all if I can't get an "Aha" insight to attack the problem in an elegant way.

The gap is also the gap between data mining and analytics. It is why you cannot just stare at your Google Analytics data and keep using the automated filters/tools to get to truly deep insights. I've NEVER had a web traffic question that Google Analytics was able to answer out of the box.

The gap is NOT necessary in every domain. In Newtonian mechanics, for example, you can play with a problem and get a "creative insight" about the right set of coordinate frames that lets you solve the equations of motion elegantly (quite tough).

But you don't HAVE to. Newtonian mechanics is a domain where you can get to the general solution in efficient brute force ways. Starting with Lagrangian instead of Newtonian formulations (actually a vector-math descendant of Lagrange called Kane's equations), you can skip the insight part. Just describe your mechanical model and hit "go" and the thing will simulate. That's what CAD packages do. They don't try to get clever with coordinate frames. I used to love being clever till I learned Lagrange. My advanced dynamics professor explicitly told us: don't try to be clever, let the math do the work for you.
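
(For the uninitiated, the simple pendulum is the standard illustration of letting the math do the work: write down the Lagrangian, turn the Euler-Lagrange crank, and the equation of motion falls out with zero cleverness about forces or frames.)

    L = \tfrac{1}{2} m \ell^2 \dot{\theta}^2 + m g \ell \cos\theta,
    \qquad
    \frac{d}{dt}\!\left(\frac{\partial L}{\partial \dot{\theta}}\right) - \frac{\partial L}{\partial \theta}
    = m \ell^2 \ddot{\theta} + m g \ell \sin\theta = 0
    \;\Longrightarrow\;
    \ddot{\theta} = -\frac{g}{\ell}\,\sin\theta.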

But until Lagrange's innovations, you HAD to be pretty clever at "elegant math" to solve tricky mechanics problems. It was Kepler getting "clever" with orbital data (ellipses over epicycles) that allowed Newton to get to the BIG insight that solved all classical mechanics problems in one shot.

Now what about general data analysis?

There ARE attempts to create "gap" technology that could potentially let you slide smoothly from spreadsheet to custom-coded models. Microsoft's Pivot is a great example. I have a lot of hope that it will put a lot of Matlab- or R-based coders out of work. I don't believe in creative intuition in domains where it can be engineered away efficiently.

But in general, data wrangling is NOT Newtonian mechanics. Most problems end up being NP-complete, which means you are forced to rely on "local conditions" type insights to solve the problem elegantly for a particular regime of data.

To take a completely trivial example, the Hamilton circuit problem is NP-complete. But for both densely connected and sparsely connected graphs, the problem is trivial in practice: in the former case, pretty much any random graph traversal will give you a Hamilton circuit; in the latter, pretty much any path quickly hits a cul de sac, and the search bottoms out fast enough to show that no Hamilton circuit exists.

So long as you stay away from the middle "phase transition" part where the hard instances live (look up the papers by Cook, Cheeseman etc. if you don't know what I am talking about), you can finesse NP-completeness. But the key is that you need to look at the actual data to figure out which "elegant heuristic" to apply. There is no mechanical way to come up with the right heuristic for a given regime before/after a phase transition. In fact, there is no general way to parametrize a problem space to find the phase transition automatically.

In the Hamilton circuit case, one way to get to the insight that sparsity is the right variable to use to parametrize the space is to, well, look at your graphs in the form of adjacency matrices or lists in Excel. The sparseness/denseness of your graphs will leap out at you.
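
Here is a rough sketch of what that one glance buys you once you do move to code; the 0/1 adjacency-matrix representation and the density cutoffs are illustrative guesses, not the actual phase-transition boundary.

    import random

    def density(adj):
        """Fraction of possible edges present in a 0/1 adjacency matrix (undirected)."""
        n = len(adj)
        edges = sum(adj[i][j] for i in range(n) for j in range(i + 1, n))
        return edges / (n * (n - 1) / 2)

    def random_greedy_circuit(adj, tries=100):
        """Repeated random greedy walks. Cheap, and usually enough at either extreme:
        dense graphs close a circuit almost immediately; sparse ones dead-end fast."""
        n = len(adj)
        for _ in range(tries):
            start = random.randrange(n)
            path, unused = [start], set(range(n)) - {start}
            while unused:
                nxt = [v for v in unused if adj[path[-1]][v]]
                if not nxt:
                    break                          # cul de sac
                path.append(random.choice(nxt))
                unused.remove(path[-1])
            if len(path) == n and adj[path[-1]][path[0]]:
                return path + [path[0]]            # found a Hamilton circuit
        return None

    def pick_regime(adj):
        d = density(adj)
        if d > 0.5:
            return "dense: a random walk will almost surely find a circuit"
        if d < 0.1:
            return "sparse: walks dead-end immediately; exhaustive search is cheap"
        return "middle regime: the hard instances live here"

The point being that the code does not discover the right parametrization for you; the decision to slice the problem space by density came from eyeballing the data.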

Obviously, "real" data problems are rarely this trivial (though the toy examples I've seen for MapReduce/Hadoop type decompositions tend to be as trivial as the Hamilton circuit situation). But you still need to look for that insight.

So to summarize: if you are a CS guy looking to beat the Excel-wranglers at their own game, get your hands dirty playing with RAW data sets. Otherwise they'll have the monopoly on the "elegant insight" part of the game, and you'll simply be the outsourced contractors who implement the design.