Data Analytics Definition: Harder To Find Than The Treasure Of Oak Island
It's like Omertà for data scientists.
There is no shortage of discussions, papers, and products devoted to data analytics, including electric power grid analytics. Oddly, an actual definition of an analytic seems to be missing in action. So how should we define a data analytic? I don’t mean a list of popular ones or a description of what they are used for, but a statement that says precisely what an analytic is. Wikipedia defines analytics this way:
Analytics is the systematic computational analysis of data or statistics. It is used for the discovery, interpretation, and communication of meaningful patterns in data, which also falls under and directly relates to the umbrella term, data science. Analytics also entails applying data patterns toward effective decision-making.
OK, but still, what exactly is an analytic? The Wikipedia definition did not really get at that. It more or less says that an analytic could be a statistic, or an analysis of data. Or maybe it is a data mining algorithm. So, is anything that processes data an analytic? It all seems very metaphysical.
If you have been reading my Substack, you probably realize I place a lot of value on having solid definitions, because definitions are tools for both communication and reasoning. So, how to define what an analytic is? To answer this, let’s turn to information theory (don’t worry, we are not going to get too mathy here).
We want to quantify information content and changes in information content, so we turn to the work of the brilliant and shy Claude Shannon, who laid down the fundamental principles of information theory and gave us Shannon entropy, the Shannon Sampling Theorem, and many other powerful ideas.
Using the concept of Shannon entropy, we can quantify the information content of a data set or data stream. We won’t go into the math here (see? I promised) but this idea makes it possible to quantify information content in bits (or other units as well, like nats or hartleys, depending on the logarithmic base used). When it comes to data manipulation, reduction in entropy results in information gain. In fact, information gain is defined exactly as the reduction in entropy due to a data transformation. Hence our definition of an analytic:
An analytic is a data transformation that decreases Shannon entropy, thus increasing information.
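For readers who like to see numbers, here is a minimal sketch of the idea in Python. It estimates Shannon entropy from the observed symbol frequencies in a small data set, applies a simple transformation, and reports the reduction in entropy. The voltage readings, the 114 V to 125 V band, and the names shannon_entropy and classify are all invented for illustration; they are assumptions, not anything taken from a real system.

```python
import math
from collections import Counter

def shannon_entropy(symbols, base=2):
    """Empirical Shannon entropy of a sequence of hashable symbols.

    base=2 gives bits, base=math.e gives nats, base=10 gives hartleys.
    """
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log(n / total, base) for n in counts.values())

# Hypothetical feeder voltage readings in volts (made up for this example).
readings = [118, 119, 120, 121, 122, 123, 126, 127]

def classify(volts):
    # Hypothetical rule: 114 V to 125 V counts as "normal", anything above as "high".
    return "normal" if 114 <= volts <= 125 else "high"

# The candidate analytic: map each raw reading to a state.
states = [classify(v) for v in readings]

h_raw = shannon_entropy(readings)    # eight distinct, equally frequent values
h_states = shannon_entropy(states)   # six "normal", two "high"
reduction = h_raw - h_states         # entropy removed by the transformation

print(f"raw: {h_raw:.3f} bits, states: {h_states:.3f} bits, reduction: {reduction:.3f} bits")
```

The entropy here is estimated from a tiny sample, so treat the figures as an illustration of the bookkeeping rather than a statement about any real feeder.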
This definition applies to processing of data records or data sets. For data streams such as digital signals or telemetry, we can extend the definition by using information rate: the product of the entropy of the source (in bits per symbol) and the average number of symbols the source emits per second.
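The stream version can be sketched the same way. The snippet below multiplies an empirical entropy per symbol by a symbol rate to get bits per second, then compares a raw stream with a derived stream. It assumes a memoryless source whose symbol probabilities can be estimated from observed frequencies, and the sample values, the 60 and 1 symbols-per-second rates, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(symbols):
    """Empirical entropy in bits per symbol, assuming a memoryless source."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_rate(symbols, symbols_per_second):
    """Information rate in bits per second: entropy per symbol times symbol rate."""
    return entropy_bits_per_symbol(symbols) * symbols_per_second

# Hypothetical raw telemetry: quantized sensor samples arriving 60 times per second.
raw_stream = [3, 4, 4, 5, 3, 4, 5, 4]
raw_rate = information_rate(raw_stream, symbols_per_second=60)

# Hypothetical derived stream: one status flag per second.
flag_stream = ["ok", "ok", "ok", "alarm"]
flag_rate = information_rate(flag_stream, symbols_per_second=1)

print(f"raw: {raw_rate:.2f} bits/s, flags: {flag_rate:.2f} bits/s")
```

Real telemetry usually has memory (successive samples are correlated), so the true entropy rate would be lower than this frequency-based estimate; the sketch only shows how the two factors combine.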
The definition makes it clear that exactly reversible transformations are not analytics. Specifically, the Fast Fourier Transform (FFT) is often called an analytic, but the time domain and frequency domain representations of a signal are exactly equivalent, so no reduction of Shannon entropy occurs. Likewise, computing the symmetric components of unbalanced three-phase power system phasors is not an analytic: the phasors and their symmetric components are exactly equivalent representations connected by the Fortescue Transformation. Such transformations often make information extraction easier and so will often precede analytics, but they are not analytics in themselves.
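Reversibility is easy to check numerically. The sketch below uses only the Python standard library: it converts a set of unbalanced phasors to symmetric components and back, and runs a short signal through a naive discrete Fourier transform (a slow stand-in for the FFT, which computes the same result) and its inverse. The phasor values, the test signal, and the function names are invented for illustration, and the round trips match only to within floating-point rounding.

```python
import cmath
import math

# The Fortescue operator "a": a rotation of 120 degrees.
A = cmath.exp(2j * math.pi / 3)

def to_symmetric(va, vb, vc):
    """Phase phasors -> (zero, positive, negative) sequence components."""
    v0 = (va + vb + vc) / 3
    v1 = (va + A * vb + A**2 * vc) / 3
    v2 = (va + A**2 * vb + A * vc) / 3
    return v0, v1, v2

def to_phase(v0, v1, v2):
    """Sequence components -> phase phasors (the inverse transformation)."""
    va = v0 + v1 + v2
    vb = v0 + A**2 * v1 + A * v2
    vc = v0 + A * v1 + A**2 * v2
    return va, vb, vc

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def idft(spectrum):
    """Inverse of dft(), including the 1/n scaling."""
    n = len(spectrum)
    return [sum(spectrum[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]

# Hypothetical unbalanced phasors in per-unit (magnitude, angle in radians).
phasors = (cmath.rect(1.00, 0.0),
           cmath.rect(0.90, math.radians(-125)),
           cmath.rect(1.10, math.radians(118)))
recovered = to_phase(*to_symmetric(*phasors))
phasor_error = max(abs(p - q) for p, q in zip(phasors, recovered))

# Hypothetical short time-domain signal.
signal = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.1, -0.6]
signal_error = max(abs(s - r) for s, r in zip(signal, idft(dft(signal))))

print(f"phasor round-trip error: {phasor_error:.2e}")
print(f"signal round-trip error: {signal_error:.2e}")
```

Because each original can be rebuilt from its transform, nothing has been discarded along the way, which is the sense in which these transformations leave the information content unchanged.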
How about “visual analytics”? Qlik.com gives a description:
Visual analytics is the use of sophisticated tools and processes to analyze datasets using visual representations of the data. Visualizing the data in graphs, charts, and maps helps users identify patterns and thereby develop actionable insights.
Great, but do such tools fit our definition of an analytic? Is a bar chart an analytic? One might argue that a data set can be reduced to a bar chart and therefore Shannon entropy must have been reduced.
One might say that the preparation of the data to be charted is the analytic. One might also argue that the analytic is in the mind of the person who looks at the diagram. Either way, the graphic itself is not an analytic. And who is this “One” guy, anyway?
A measure of the value of a definition is whether it enables reasoning about relevant problems, so let’s try some. The definition in this posting leads to a view of a sequence, stack, or hierarchy of analytics as an information entropy funnel: as we move through or up the processing chain, entropy should decrease at each stage. One implication is that analytics may enable decreases in data volume or data rate, and so can also serve as data quantity reduction tools (and not by compression, another topic into which Shannon delved deeply). If data volume per unit entropy is constrained to decrease monotonically as we progress upward through an analytic hierarchy, then analytics can also help make the information management structure scalable.
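The funnel can be illustrated with a toy processing chain. In the sketch below, raw per-unit voltage samples are first mapped to a band ("low", "normal", "high") and then to a single alarm flag, and the empirical entropy is computed at each stage. The sample values, the 0.95 and 1.05 per-unit thresholds, and the stage names are assumptions made up for this example.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Empirical Shannon entropy in bits, from observed symbol frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical per-unit voltage samples (bottom of the funnel).
raw = [0.93, 0.97, 0.99, 1.00, 1.01, 1.02, 1.04, 1.08]

def band(v):
    # Stage 1 (hypothetical thresholds): collapse a reading into a voltage band.
    if v < 0.95:
        return "low"
    if v > 1.05:
        return "high"
    return "normal"

def alarm(b):
    # Stage 2: collapse a band into a single out-of-band flag.
    return b != "normal"

bands = [band(v) for v in raw]
alarms = [alarm(b) for b in bands]

stages = [("raw", raw), ("band", bands), ("alarm", alarms)]
entropies = [entropy_bits(data) for _, data in stages]

for (name, _), h in zip(stages, entropies):
    print(f"{name:>5}: {h:.3f} bits")
```

Each stage here is a deterministic many-to-one mapping, so it can merge symbols but never split them; that is why the empirical entropy cannot rise from one stage to the next in this toy chain.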
The foregoing has significant implications for the structure of analytics systems, especially distributed grid analytics, which are presently not common but should become so. The definition of an analytic offered here deepens the architectural principles underlying analytics networks and distributed analytics systems. More about that another time.