By Weiwei Cui, Shi Han, Qingwei Lin, Jian-Guang Lou, Yong Xu, Dongmei Zhang, Haidong Zhang, and Bin Zhu, Microsoft Research Asia
Data intelligence is a new interdisciplinary field that synthesizes areas such as big data management, data mining, machine learning, human-computer interaction, and data visualization. Research in data intelligence aims to provide theories, methodologies, technologies, and systems for obtaining insightful and actionable intelligence from data, to ultimately support data-driven decision making and task completion.
If data is a new “oil,” then data intelligence is the “oil refinery.” With the ability to understand data, extract hidden information and insights from data, and provide intelligence for data-driven decisions and actions, data intelligence is becoming increasingly essential in this era of digital transformation. As a result, the field has experienced rapid growth in recent years.
Data intelligence enables us to explore unknown areas in the data space and, in turn, creates enormous opportunities in different domains. Internet-based businesses, such as search engines, e-commerce applications, and social media applications, are intrinsically enabled by and built on data intelligence.
The traditional realm of business intelligence (BI) is being reshaped and disrupted by data intelligence. According to Gartner, augmented analytics—a new paradigm powered by data intelligence that combines natural language query and narration, augmented data preparation, automated advanced analytics, and visual-based data discovery capabilities—will be a dominant driver of new BI purchases in the near future.
Here, we present recent technical advances in the data intelligence area and look toward its future.
Technological progress of data intelligence
In general, the enabling technologies employed in data intelligence can be divided into the following pillars: data infrastructure, data preparation, data analytics, data interaction, and visualization. Compared with traditional data processing and data analysis, data intelligence has additional challenges in terms of data, analytics, and human aspects. To solve these challenges, various innovative technologies have recently been developed.
Big data systems and platforms
To support large-scale data processing and analysis tasks, new data storage systems are designed for high data throughput, high scalability, and fault tolerance. Traditional databases were designed for online transaction processing (OLTP) scenarios and cannot meet the query requirements of statistical analysis over big data. Current big data systems emphasize read-write efficiency, data capacity, and system scalability. Specifically, data is divided into blocks, and each block is copied and distributed to different physical machines for storage. Such redundancy reduces data loss caused by the failure of individual machines; it not only improves the reliability of the system but also improves concurrent read performance. In addition, modern big data systems often run on relatively cheap commodity servers to reduce cost, with the machines connected via a high-speed network for efficient data transmission.
Many distributed NoSQL data processing systems have emerged to meet the computational demands of big data processing. In terms of computational models, the introduction of MapReduce revolutionized the parallel processing of big data, with many real industry deployments. More recently, Spark has taken advantage of in-memory computing and greatly optimized the efficiency of the shuffle phase in MapReduce. Spark has now replaced Hadoop MapReduce as the most important big data processing framework in industry, complete with a rich application ecosystem.
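To make this concrete, the following is a minimal PySpark sketch of the in-memory style of computation described above: a dataset is cached once and then reused by several aggregations without re-reading from disk. The file path and column names are illustrative placeholders.

```python
# Minimal PySpark sketch: cache a dataset in memory and reuse it.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Load a dataset and cache it so that repeated queries avoid re-reading
# from disk (a key advantage over classic disk-based MapReduce).
events = spark.read.parquet("hdfs:///data/events")  # placeholder path
events.cache()

# Several aggregations over the same cached data reuse the in-memory copy.
daily_counts = events.groupBy("date").count()
top_users = (events.groupBy("user_id")
                   .agg(F.count("*").alias("n_events"))
                   .orderBy(F.desc("n_events"))
                   .limit(10))

daily_counts.show()
top_users.show()

spark.stop()
```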
In addition, a streaming computational model has been developed to support applications over continuously changing data. In the streaming model, each data event is processed as it arrives, enabling near-real-time updates. Spark Streaming, Storm, and Flink are popular streaming computing platforms.
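As a small illustration of the streaming model, the sketch below uses Spark Structured Streaming to maintain continuously updated counts over an unbounded stream of events; the socket source and port are placeholders for a real event source.

```python
# Minimal Spark Structured Streaming sketch: count events per key as
# records arrive, emitting continuously updated results.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read an unbounded stream of text lines (one event per line).
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Treat each line as an event key and maintain a running count per key.
counts = lines.groupBy("value").count()

# Continuously write the updated counts to the console.
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```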
To support online interactive query and analysis of big data, technologies from different fields are being rapidly integrated to build real-time, efficient big data interactive query platforms, as represented by Elasticsearch. Building on the index structures and techniques of search systems, large-scale unstructured or semi-structured data is partitioned and indexed to support fast queries. Another line of work, represented by Apache Kylin, extends traditional data cube technology to big data, significantly improving query speed at runtime by caching pre-computed data cubes.
With the development of these technologies, it becomes more and more critical to analyze data automatically at a higher semantic level. Automated analysis techniques often require intensive calculation of aggregated results under different query conditions; for example, an analysis query may involve hundreds or thousands of simple aggregation operations. This places much higher demands on query performance. Because the vast majority of big data analysis tasks are not very sensitive to data completeness, researchers have proposed technologies and systems, such as BlinkDB and BigIN4, that boost query performance through sampling and pre-materialized cubes. BlinkDB uses stratified sampling to reduce estimation error, while BigIN4 optimizes the estimation error of user queries through a Bayesian estimation method.
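The following toy sketch illustrates the idea behind such approximate query systems, contrasting uniform and stratified sampling for a group-wise average; it is a simplification in the spirit of BlinkDB, not an implementation of either system.

```python
# Toy sketch of approximate aggregation by sampling. Column names and
# data are illustrative; real systems build samples offline at scale.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["NA", "EU", "APAC"], size=1_000_000, p=[0.6, 0.3, 0.1]),
    "revenue": rng.exponential(scale=100.0, size=1_000_000),
})

# Exact answer (what a full scan would compute).
exact = df.groupby("region")["revenue"].mean()

# Uniform sampling: fast, but small groups may be poorly represented.
uniform = df.sample(frac=0.01, random_state=0).groupby("region")["revenue"].mean()

# Stratified sampling: take a fixed number of rows from every group so
# that rare groups are estimated as reliably as common ones.
stratified = (df.groupby("region", group_keys=False)
                .apply(lambda g: g.sample(n=2000, random_state=0))
                .groupby("region")["revenue"].mean())

print(pd.DataFrame({"exact": exact, "uniform": uniform, "stratified": stratified}))
```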
Natural language conversational data analysis
Natural language is an ideal way to democratize data analysis, enabling ordinary users to explore and analyze data quickly and effectively. In recent years, with rapid advances in natural language processing and artificial intelligence technologies, it has become possible to use natural language to query and analyze data.
Semantic parsing over relational databases and data tables is an important approach to answering interactive natural language queries. In the early days, most approaches were based on pattern matching. Later, a second generation of approaches based on syntactic and semantic analysis appeared.
In recent years, with the development of deep learning technology, a series of end-to-end semantic parsing models have emerged. An end-to-end model often uses a sequence-to-sequence method, which encodes the natural language query and then produces a SQL statement step by step. The downside of this approach is that it tends to produce invalid or inexecutable SQL statements. Therefore, on top of the end-to-end approach, various kinds of knowledge and constraints are incorporated to reduce the search space and guarantee the validity of the generated SQL, such as embedding SQL syntax knowledge, integrating tabular information, and even drawing on external knowledge bases such as WordNet.
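For a concrete sense of how natural language maps to SQL, the toy sketch below uses the early pattern-matching style mentioned above; the table schema and patterns are illustrative, and modern neural semantic parsers replace such hand-written rules with learned models.

```python
# Toy illustration of pattern-matching NL-to-SQL translation
# (not a neural semantic parser). Table and columns are illustrative.
import re

TABLE = "sales"

PATTERNS = [
    # "total <measure> by <dimension>" -> aggregate grouped by a column
    (re.compile(r"total (\w+) by (\w+)"),
     lambda m: f"SELECT {m.group(2)}, SUM({m.group(1)}) FROM {TABLE} GROUP BY {m.group(2)};"),
    # "average <measure> for <column> <value>" -> aggregate with a filter
    (re.compile(r"average (\w+) for (\w+) (\w+)"),
     lambda m: f"SELECT AVG({m.group(1)}) FROM {TABLE} WHERE {m.group(2)} = '{m.group(3)}';"),
]

def nl_to_sql(question: str) -> str:
    q = question.lower().strip()
    for pattern, build in PATTERNS:
        match = pattern.search(q)
        if match:
            return build(match)
    raise ValueError(f"No pattern matches: {question!r}")

print(nl_to_sql("Total revenue by region"))
print(nl_to_sql("Average price for category beverages"))
```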
Intelligent data analysis
Data analysis plays a key role in data intelligence. In general, it includes descriptive, explanative, predictive, and prescriptive analysis. These kinds of analyses provide different levels of insight, with increasing value and technical challenge. Looking back on the evolution of intelligent tools for data analysis, there have been four main stages.
Stage 1: Data intelligence experts conduct a deep-dive study of the target domain to understand existing practices, identify pain points, and acquire the knowledge essential to a solution. They then build domain- and task-specific data intelligence tools, which domain experts use to improve their productivity.
Stage 2: Drawing on the experience of building data intelligence tools for many different domains and tasks, data intelligence experts identify and generalize common technical building blocks, such as driving-factor analysis, clustering analysis, and time series forecasting. They then build modularized analysis platforms based on these building blocks, which practitioners across different domains can use to conduct their own customized analyses.
Stage 3: To further unlock the power of machine intelligence, automated recommendations of insights and analyses at each step of the analysis flow change the analysis paradigm from unidirectional to bidirectional: that is, from human-driven analysis to human-machine co-driven, collaborative analysis.
Stage 4: In the three stages above, the design and construction of key technical parts, such as data processing, feature engineering, and model selection and optimization, rely heavily on the machine learning (ML) expertise of data intelligence experts. With the rapid development of machine learning theory and applications, automated machine learning (AutoML) technology has emerged. Essentially, AutoML systematically abstracts the key parts of a machine learning process and tries to automate them, enabled by rapidly increasing computational power. AutoML lowers the barrier to entry for data intelligence, making it available to a broader group of non-expert users and to long-tail tasks, thus further advancing the fusion of human and machine intelligence.
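The sketch below illustrates the core idea of AutoML, automated model selection and hyperparameter search with cross-validation, using scikit-learn's grid search; real AutoML systems also automate feature engineering and use far more sophisticated search strategies.

```python
# Minimal sketch of the idea behind AutoML: automatically search over
# candidate models and hyperparameters with cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": (
        Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        Pipeline([("clf", RandomForestClassifier(random_state=0))]),
        {"clf__n_estimators": [100, 300], "clf__max_depth": [None, 10]},
    ),
}

best_name, best_score, best_model = None, -1.0, None
for name, (pipeline, grid) in candidates.items():
    search = GridSearchCV(pipeline, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    if search.best_score_ > best_score:
        best_name, best_score, best_model = name, search.best_score_, search.best_estimator_

print(f"Selected {best_name} with cross-validated accuracy {best_score:.3f}")
```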
Intelligent data integration
The integration of various data sets and types is critical to the success of data intelligence. Inherent in that effort are the following challenges.
First, before acquiring knowledge from data, a machine needs to interpret the data correctly. In general, the most machine-friendly data is structured data, such as relational databases. However, there exist large amounts of unstructured data, such as text documents, as well as semi-structured data, such as spreadsheets. It is still challenging for machines to interpret such unstructured or semi-structured data, and this is a necessary step toward data intelligence.
Second, data is not isolated and requires commonsense knowledge. Data intelligence needs to access external knowledge that is outside the given dataset and mimic humans’ associative thinking processes.
Third, data has defects, such as missing values and noise. It is critically important to discover and correct such defects to ensure accurate data intelligence.
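The following small sketch illustrates this kind of defect handling, detecting missing values and outliers and imputing them with simple rules; the column names and thresholds are illustrative, and production pipelines would apply domain-specific logic.

```python
# Small sketch of detecting and repairing common data defects
# (missing values and outliers) before analysis.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "B", "C", "D", "E"],
    "daily_sales": [1200.0, np.nan, 1350.0, 99999.0, 1180.0],  # one gap, one outlier
})

# Detect missing values.
print("Missing values per column:\n", df.isna().sum())

# Flag outliers with a simple robust rule (distance from the median
# measured in median absolute deviations).
median = df["daily_sales"].median()
mad = (df["daily_sales"] - median).abs().median()
outlier = (df["daily_sales"] - median).abs() > 5 * mad
df.loc[outlier, "daily_sales"] = np.nan

# Impute all gaps with the median of the remaining values.
df["daily_sales"] = df["daily_sales"].fillna(df["daily_sales"].median())
print(df)
```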
Data visualization
Data visualization is essential for perceiving and communicating data. It is a multidisciplinary field that involves human-computer interaction, graphics, perception, and more. In the era of big data, data visualization has grown in importance and is now an indispensable part of data intelligence.
Data visualization relies heavily on user interfaces that support various exploration operations, such as interactive search, selection, and filtering. Recently, modern visualization tools have become available to general users, and many advanced visualization techniques, such as word clouds, treemaps, parallel coordinates, flow maps, and ThemeRiver, are moving into the mainstream of visual data analysis.
Visualization also plays a vital role in decision-making processes. It can dramatically improve the efficiency of communication by providing more accurate, contextual, digestible, and memorable facts. This is generally called visual storytelling, which aims to extract and present the most concise and impactful analytical results so that they can be shared externally and efficiently. Modern BI platforms, such as Power BI, already provide a rich set of visual storytelling features. However, research into visual storytelling is still at an early stage, and scientists are exploring various aspects, such as visual forms, narrative flows, interactions, context, memorability, and evaluation.
Privacy-preserving data analytics
In recent years, data privacy has become a focus of attention and relevant data privacy protection legislation, such as the General Data Protection Regulation (GDPR), has been enacted. As a result, researchers have been actively exploring privacy-preserving data analysis techniques that allow data collection and processing, while also preserving data privacy.
Several main research directions are underway. One direction is to provide a trusted computing environment for sensitive data operations; in this setting, user data is always encrypted except when it is actively being processed.
Another direction is to allow data processing directly on encrypted data. Selective encryption, which allows certain operations after encryption, has been widely used for multimedia protection but has proven difficult to extend to other types of data. An alternative approach is homomorphic encryption, which allows certain operations, such as addition and/or multiplication, to be carried out on ciphertexts, such that the encrypted result, when decrypted, matches the result of the same operations performed on the corresponding plaintexts.
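As a toy illustration of additive homomorphic encryption, the sketch below assumes the open-source python-paillier package (phe): an untrusted party can sum encrypted values without ever seeing the plaintexts.

```python
# Toy sketch of additive homomorphic encryption, assuming the
# python-paillier ("phe") package is installed.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Data owner encrypts sensitive values before sharing them.
salaries = [52000, 61000, 58500]
encrypted = [public_key.encrypt(s) for s in salaries]

# Untrusted party adds ciphertexts directly; it never decrypts anything.
encrypted_total = sum(encrypted[1:], encrypted[0])

# Only the key holder can decrypt the aggregated result.
print(private_key.decrypt(encrypted_total))  # 171500
```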
Future trends in data intelligence technology
Data intelligence research is highly aligned with the urgent market demand for digital transformation and for finding additional value in data. As data intelligence is used in more domains, scenarios, and applications, and applied to new problems and challenges, research needs to move toward more automated, more intelligent, more reliable, more ubiquitous, and more effective technologies.
Trend 1: Analysis at a higher semantic level
To analyze data intelligently, we need a rich semantic understanding of the data. The most commonly used model in data analysis is the relational data model, which is optimized for query and storage performance rather than semantic information.
An important research direction is how to automatically obtain semantic information from table data and other readily available text data (such as web pages) to enhance and enrich the table data; for example, determining the entity types of rows or columns in a table, such as people’s names, or data types, such as currency. Tables often lack rich textual context, so entity recognition in tables is more challenging than entity recognition in other natural language processing tasks. In addition to entity recognition, mining and analyzing the relationships between entities in data tables is also crucial for answering data analysis questions at a high level of semantic understanding.
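The toy sketch below illustrates the simplest form of such semantic type detection, matching column values against hand-written patterns; the patterns and sample table are illustrative, and research systems combine such heuristics with learned models and external knowledge.

```python
# Toy sketch of inferring the semantic type of table columns with
# simple pattern heuristics. Patterns and columns are illustrative.
import re
import pandas as pd

PATTERNS = {
    "currency": re.compile(r"^[$€£]\s?\d[\d,]*(\.\d+)?$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "person_name": re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$"),
}

def infer_column_type(values, threshold=0.8):
    """Return the first semantic type matched by at least `threshold` of values."""
    values = [str(v) for v in values if pd.notna(v)]
    for type_name, pattern in PATTERNS.items():
        hits = sum(bool(pattern.match(v)) for v in values)
        if values and hits / len(values) >= threshold:
            return type_name
    return "unknown"

df = pd.DataFrame({
    "col1": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
    "col2": ["$1,200.50", "$980", "$2,310.75"],
    "col3": ["2023-01-15", "2023-02-02", "2023-03-20"],
})

for col in df.columns:
    print(col, "->", infer_column_type(df[col]))
```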
Trend 2: Framework for representing and reusing common knowledge and models
Human beings are able to draw inferences and transfer knowledge and methods across different tasks. In the data analysis domain specifically, the knowledge and models used in an analysis need to be transferred or migrated between different data objects and analysis tasks. In the field of machine learning, many efforts and methods have been proposed, such as transfer learning, multi-task learning, and pre-trained models. To achieve the goal of drawing inferences from one example and applying them to another, it is necessary to study a unified framework of models suitable for the field of data analysis that can support knowledge migration and sharing.
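As a small illustration of transfer learning in this spirit, the sketch below reuses an ImageNet pre-trained backbone, freezes its parameters, and retrains only a new task-specific head; it assumes a recent PyTorch/torchvision, and the target task is illustrative.

```python
# Minimal transfer learning sketch: reuse a pre-trained backbone and
# retrain only a new head for a (hypothetical) 10-class target task.
import torch
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet (requires a recent torchvision).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the shared knowledge captured in the pre-trained layers.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new head for the target task.
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new head's parameters are updated during fine-tuning.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```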
Trend 3: High-quality training and benchmark datasets
Further research progress in applying deep learning technologies to the data intelligence area is impeded by the lack of high-quality training datasets. Just as ImageNet played a significant role in the progress of computer vision research, data intelligence research needs a set of large-scale, high-quality, standard training and benchmark datasets. Once these are available, many data intelligence-related research topics, such as automated analysis, natural language interaction, and visualization recommendation, will achieve promising breakthroughs.
Trend 4: Explainable data intelligence
Users will no longer be satisfied with black-box intelligence or end-to-end automation. Instead, they demand finer-grained, more targeted, and more transparent intelligence. For example, a finance auditing system can leverage data intelligence to prioritize high-risk transactions for review, in order to minimize risk to the company and maximize auditing efficiency. In the design and development of such a system, models with better explainability are preferred: they explain what a high-risk conclusion is based on, according to specific criteria. Making existing black-box models more transparent through explanatory information will be a trend.
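One widely used technique for adding such explanations is permutation importance, sketched below with scikit-learn on synthetic data; the auditing feature names are hypothetical, and real systems would combine several explanation methods.

```python
# Small sketch of permutation importance: measure how much a risk
# model's accuracy drops when each feature is shuffled.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.exponential(1000, n),     # transaction_amount (hypothetical feature)
    rng.integers(0, 2, n),        # manual_override (hypothetical feature)
    rng.normal(0, 1, n),          # vendor_risk_score (hypothetical feature)
])
y = ((X[:, 0] > 2000) & (X[:, 1] == 1)).astype(int)  # toy high-risk label
feature_names = ["transaction_amount", "manual_override", "vendor_risk_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, importance in sorted(zip(feature_names, result.importances_mean),
                               key=lambda x: -x[1]):
    print(f"{name}: {importance:.3f}")
```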
Trend 5: More seamless integration of human-machine intelligence
In essence, existing machine intelligence is still grounded in human-programmed learning frameworks. Machines have not made a significant breakthrough in creativity, an important aspect of intelligence. We argue that this limitation will remain true for the foreseeable future. As a result, data intelligence will rely on human-machine collaboration, a topic that requires additional development work.
Trend 6: Prescriptive analysis for actionable intelligence
A key value of data analysis is to guide actions. Therefore, data intelligence should provide actionable recommendations via prescriptive analysis.
For example, intelligence might suggest that the sales of a particular brand are likely to drop by 10 percent in the next quarter. While this is useful information, it would be much more helpful with the addition of prescriptive recommendations, such as how the brand can maintain its current sales level. We believe that prescriptive analysis is a fruitful area of future research.
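The toy sketch below illustrates the prescriptive step for this example: given a forecasted drop and an assumed price-elasticity model, it searches over discount levels and recommends one that holds revenue at its current level. All numbers and the demand model are illustrative assumptions.

```python
# Toy sketch of prescriptive analysis: search over discount levels under
# an assumed demand model and recommend one that preserves revenue.
import numpy as np

current_price = 20.0          # current unit price (illustrative)
current_units = 1000          # units sold last quarter (illustrative)
predicted_drop = 0.10         # forecast: 10% fewer units next quarter
price_elasticity = -2.5       # assumed: % demand change per % price change

baseline_revenue = current_price * current_units
predicted_units = current_units * (1 - predicted_drop)

best = None
for discount in np.arange(0.0, 0.31, 0.01):
    price = current_price * (1 - discount)
    # Demand lift from the discount under the assumed elasticity model.
    units = predicted_units * (1 + price_elasticity * (-discount))
    revenue = price * units
    if revenue >= baseline_revenue:
        best = (discount, revenue)
        break

if best:
    print(f"Recommend a {best[0]:.0%} discount to hold revenue at ~{best[1]:,.0f}")
else:
    print("No discount up to 30% restores baseline revenue; consider other levers.")
```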
Trend 7: Better and more mature privacy-preserving data analysis
Through a full range of efforts, from legislation and technology to user participation, privacy protection will be integrated intrinsically into future data analysis. Technologies should enable people to control how their personal data is collected, managed, processed, and shared throughout the data lifecycle, and privacy-preserving data processing technologies should be developed and deployed to that end.
Trend 8: Prevalence of intelligent analysis assistants
The fusion of intelligent agents and data analysis technologies is an important direction. In the coming years, intelligent data analysis assistants will become a must-have feature of many data analysis tools, helping humans analyze and explore data more efficiently. These agents can communicate with people in natural language, understand the background and context of a task, and complete data analysis tasks at different semantic levels, from basic analysis commands to advanced data mining. They can also recommend useful facts to support business decisions through automatic insight mining and provide intelligent, appropriate responses to certain data events (such as automatic alerts for observed changes). Such agents will also be able to learn through dialogue and communication with human analysts, becoming more and more intelligent over time.
Trend 9: Collaborative visual analysis
As new communication devices have emerged and become widespread, collaborative visualization is becoming a significant trend. In contrast to traditional face-to-face collaboration in small groups, the new collaborative analysis paradigm is often asynchronous and at large scale. People may analyze the same dataset using different devices at different locations and times. This new paradigm raises various technical challenges, such as how to coordinate many operations, avoid duplicated work, and ensure that different people perceive the same visual information on different devices. The key is to build an infrastructure that addresses these challenges and supports such large-scale, asynchronous collaboration.
Trend 10: Ubiquitous visualization
In the future, visualization will become more natural and eventually become ubiquitous and transparent. Just like text and voice, it will be deeply integrated into our daily life. We see three critical prerequisites impacting visualization.
First, visualization needs to be generated efficiently and consumed quickly. Currently, visualizations still take a lot of time to prepare and generate. With the help of machine learning, we expect that professional-quality visualizations can be generated instantly, reducing the cost of visualization generation to a negligible level.
Second, we need a revolution in human-computer interaction, going beyond the traditional keyboard-mouse paradigm. More human-friendly interaction methods (such as gestures, styluses, haptics) will gradually mature and become dominant.
Third, display devices need to be universal and integrated into the environment around us, such as wearables, handhelds, or even the surfaces of everyday objects. Only when displays are always near at hand can visualization truly become a fundamental means of communication.