Microsoft Research Asia Symposium on Collaborative Research

Speaker: Hailong Cao, Harbin Institute of Technology

In this talk, I would like to introduce a distribution-based model for learning bilingual word embeddings from monolingual data. It is simple and effective, and does not require any parallel data or seed lexicon. We take advantage of the fact that word embeddings are usually dense, real-valued, low-dimensional vectors, so their distribution can be accurately estimated. We propose a novel cross-lingual learning objective that directly matches the distribution of word embeddings in one language with that in the other. During joint learning, we dynamically estimate the distributions of word embeddings in the two languages and minimize the dissimilarity between them through standard backpropagation. The learned bilingual word embeddings group each word and its translations together in a shared vector space. We demonstrate the utility of the learned embeddings on the task of finding word-to-word translations from monolingual corpora. Our model achieved encouraging performance on both related and substantially different language pairs.

Speaker: Liang Jeff Chen, Microsoft Research Asia

Graph data and applications are becoming ubiquitous. Instead of following the research and industry trend of building graph databases and systems from scratch, we asked a simple yet fundamental question: what is the fundamental gap between graphs and old data formats, and is it possible to bridge it? The answer to this question has since yielded a series of efforts and GraphView, a middleware that presents to applications a view of a graph database while internally compiling a graph program into data instructions that can be executed in existing database systems. As such, it reuses the state-of-the-art technologies in today’s database products and can co-evolve with them for many years to come. This design philosophy not only inspired SQL Server’s SQL Graph, but also convinced Azure DocumentDB to take GraphView as its core and launch Azure Cosmos DB Graph, a multi-model database that empowers not only new graph customers but also existing document-store and key-value-store customers. As of today, Azure Cosmos DB Graph is supporting a few key internal customers, as well as several external customers. We are expecting many more to come after its public announcement at //Build.

Speaker: Hong Cheng, University of Electronic Science and Technology of China (UESTC)

Recently, human-robot hybrid systems have been designed and developed to provide functional motion assistance to disabled and elderly people in daily activities. This talk will discuss key techniques of human-robot hybrid systems, including ergonomics, multisensor-based physical human-robot interaction (pHRI), wearable computing, human intention estimation, and multimodal interaction and cooperation. Related advances in UESTC exoskeletons will also be introduced, including reinforcement learning in pHRI and our exoskeleton systems: the AIDER system for walking assistance and the HUALEX system for human augmentation.

Speaker: Jun Du, University of Science and Technology of China (USTC)

We will report our recent progress in deep learning based speech separation and recognition. We examine several critical problems in deep learning based speech separation, including synthesizing noise data to improve noise generalization and designing the objective function for optimization under a probabilistic framework rather than the conventional MMSE criterion. For the recognition part, we will focus on the multi-channel case and share our thinking on how to fully utilize multi-channel information under the deep learning framework. Specifically, the iterative mask estimation approach will be introduced as the core technology of our winning system in the CHiME-4 challenge.

Speaker: Nan Duan, Microsoft Research Asia

In this talk, I will briefly introduce how to build informational bots (infobots) based on various genres of knowledge, using question generation (QG) and question answering (QA) technologies. I will also present how infobots are applied in important Microsoft products such as Bing, Xiaoice, and Customer Service Bot.

Speaker: Chengchen Hu, Xi’an Jiaotong University

Software Defined Networking (SDN) greatly simplifies network management and introduces unprecedented flexibility by decoupling control functions from the network data plane. However, such a decoupling also opens a box of various open questions, which are not well addressed. This talk will briefly describe the work in XJTU towards more flexible SDN with programmable data plane, less overhead, and easier interoperability.

Speaker: Yihua Huang, Nanjing University

A recommendation system, which recommends needed items to users, plays a vital role in modern intelligent web and mobile applications. On the one hand, good recommendation accuracy requires taking into account auxiliary information from multiple sources beyond conventional behavioral information. On the other hand, it also requires innovation in the recommendation model. We present RecNN, a pure deep neural network (DNN) based framework that seamlessly integrates collaborative filtering and content-based recommendation. Training data from multiple sources is treated as multiple views, and each view contains information about both users and items. RecNN first uses different DNN structures to model information such as item descriptions and user demographics as distributed embeddings. The inherent interactions are then modeled at two levels, intra-view and inter-view, with each interaction modeled as a DNN: the outputs of the intra-view DNNs are stacked together and followed by an inter-view DNN that makes the final recommendation. The intra-view interaction models local user-item interaction (for example, the collaborative filtering interaction is captured from the behavioral view), while the inter-view interaction models the global interaction among views. Evaluation on multiple recommendation tasks validates the effectiveness of RecNN.

Speaker: Qiang Huo, Microsoft Research Asia

Optical Character Recognition (OCR) is an important enabling technology to empower people to do more and achieve more. In Microsoft Research Asia (MSRA), we have been developing Microsoft’s next generation OCR engines which can detect both printed and handwritten text in an image captured by a camera phone or glass, and recognize the detected text for follow-up actions. In this talk, I will give you a glimpse of what we have achieved so far for handwriting OCR and the road map ahead.

Speaker: Yu-Gang Jiang, Fudan University

In this talk, I will introduce our latest work on dense video captioning: automatically generating a paragraph of textual description for a given video. This goes one step beyond recent work on generating a single sentence for a video clip.

Speaker: Haifeng Li, Harbin Institute of Technology

Speaker: Chin-Yew Lin, Microsoft Research Asia

Speaker: Weiyao Lin, Shanghai Jiao Tong University

Action recognition is an important yet challenging task in computer vision. In this talk, I will introduce a novel deep learning based framework for action recognition, which improves recognition accuracy by 1) deriving more precise features for representing actions, and 2) reducing the asynchrony between different information streams. We first introduce a coarse-to-fine network that extracts shared deep features at different action class granularities and progressively integrates them to obtain a more accurate feature representation for input actions. We further introduce an asynchronous fusion network. It fuses information from different streams by asynchronously integrating stream-wise features at different time points, hence better leveraging the complementary information in different streams. I will also introduce our work on summarizing long surveillance videos.

Speaker: Zhouchen Lin, Peking University

Mismatch removal is a key step in many computer vision problems. In this work, we handle the mismatch removal problem by adopting the shape interaction matrix (SIM). Given the homogeneous coordinates of two corresponding point sets, we first compute the SIMs of the two point sets. Then, we detect mismatches by picking out the entries that differ most between the two SIMs. Our method works well even under strong affine transformations, outliers, noise, and burstiness. In fact, this is the first non-iterative mismatch removal method that achieves affine invariance. Extensive results on synthetic 2D point matching data sets and real image matching data sets verify the effectiveness, efficiency, and robustness of our method in removing mismatches. Moreover, when applied to partial-duplicate image search, our method reaches higher retrieval precision at a lower time cost than state-of-the-art geometric verification methods.
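
As a rough illustration of the idea in this abstract, the Python sketch below computes a shape interaction matrix for each point set and scores every correspondence by how much its SIM row differs between the two sets. It assumes the SIM is the orthogonal projector onto the row space of the 3 x n homogeneous coordinate matrix, which does not change under an invertible affine map. The row-sum scoring rule, the function names, and the sample points are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch: SIM-based mismatch scoring for 2D point correspondences.
# Assumption: SIM = X^T (X X^T)^{-1} X for the 3 x n homogeneous matrix X.

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_mul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def inverse_3x3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("degenerate point set (for example, collinear points)")
    # Adjugate divided by the determinant.
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]

def shape_interaction_matrix(points):
    # Rows of x are the x coordinates, the y coordinates, and a row of ones.
    x = [[p[0] for p in points], [p[1] for p in points], [1.0] * len(points)]
    xt = transpose(x)
    gram_inv = inverse_3x3(mat_mul(x, xt))
    return mat_mul(mat_mul(xt, gram_inv), x)  # n x n projector

def mismatch_scores(src, dst):
    # Score each correspondence by the summed absolute difference of its SIM row.
    s1 = shape_interaction_matrix(src)
    s2 = shape_interaction_matrix(dst)
    return [sum(abs(u - v) for u, v in zip(r1, r2)) for r1, r2 in zip(s1, s2)]

src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0), (4.0, 1.0), (3.0, 5.0)]
# Hypothetical affine map, used only to build example data.
dst = [(2 * x + y + 3, -x + 0.5 * y - 2) for x, y in src]
clean_scores = mismatch_scores(src, dst)  # numerically zero: affine invariance

corrupted = list(dst)
corrupted[3] = (dst[3][0] + 10.0, dst[3][1] - 7.0)  # inject one mismatch
noisy_scores = mismatch_scores(src, corrupted)  # no longer all zero
```

In this sketch, larger scores suggest likelier mismatches; how to threshold or rank them is left open.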

Speaker: Xuanzhe Liu, Peking University

Web browsing is one of the most significant user requirements on mobile devices such as smartphones. However, the user experience of mobile web browsing is undesirably sluggish because of slow resource loading. We conducted a comprehensive measurement study to uncover the resource update history and cache configurations at the server side, and to analyze cache performance at various time granularities. We investigate three main root causes of imperfect cache performance: Same Content, Heuristic Expiration, and Conservative Expiration Time. Based on these findings, we have developed two solutions that mitigate the imperfect resource-loading performance from different aspects. At the programming abstraction level, we propose ReWAP, which is based on an efficient resource-packaging mechanism: stable resources are encapsulated and maintained in a package, which is always loaded from local storage and updated by explicit refreshing. Compared with the original mobile Web apps with cache enabled, ReWAP significantly reduces data traffic, with a median saving of up to 51%. In addition, ReWAP incurs only very minor runtime overhead in client-side browsers. At the system level, we propose SWAROVsky, a dual-proxy system that comprises a remote cloud-side proxy and a local proxy on mobile devices. Our system can be used with existing Web browsers and Web servers, and does not break the normal semantics of a webpage. Evaluations with 50 websites show that, on average, our system reduces page load time by 43.1% and network data transmission by 57.6%, while imposing marginal system overhead.
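
To make the "Heuristic Expiration" root cause more concrete, here is a minimal Python sketch of how an HTTP cache may guess a freshness lifetime when a response carries no explicit expiration. It assumes the commonly cited heuristic of taking a fraction (typically 10%) of the time since `Last-Modified`; whether the browsers and servers measured in this study behave exactly this way is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

def heuristic_freshness(date, last_modified, fraction=0.1):
    """Estimate a freshness lifetime from the Date and Last-Modified times.

    Assumption: the cache uses a fixed fraction of the resource's age at
    response time, which is a widely used heuristic rather than a fixed rule.
    """
    age_at_response = date - last_modified
    if age_at_response < timedelta(0):
        return timedelta(0)  # inconsistent headers: treat as immediately stale
    return age_at_response * fraction

# A resource last modified 10 days before the response was generated.
lifetime = heuristic_freshness(datetime(2017, 1, 11), datetime(2017, 1, 1))
print(lifetime)
```

A resource that has not changed for a long time gets a long guessed lifetime, so a later update can go unnoticed until that lifetime runs out; a recently changed but now stable resource gets a short one and is revalidated more often than needed.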

Speaker: Bao-Liang Lu, Shanghai Jiao Tong University

The field of affective computing aspires to narrow the communicative gap between the highly emotional human and the emotionally challenged computers by developing computational systems that recognize and respond to human emotions. The detection and modeling of human emotions are the primary studies of affective computing. Among various approaches to emotion recognition, the electroencephalography (EEG)-based model is more reliable because of its high accuracy and objective evaluation in comparison with other external appearance clues like facial expression and gesture. Various psychophysiology studies have demonstrated the correlations between human emotions and EEG signals. In this talk, we will present our recent work on investigating critical frequency bands, critical channels, and the stable patterns over time, and developing emotion models with transfer learning and deep learning.

Speaker: Jiwen Lu, Tsinghua University

In this talk, we will briefly introduce some deep learning methods which are developed in our research group, including deep metric learning, deep hashing, multi-modal deep learning, and deep sharable feature learning. We also show their effectiveness in some vision applications such as face and person recognition, image and video search, and object tracking and recognition.

Speaker: Qi Liu, University of Science and Technology of China (USTC)

Though personalized services (like recommender systems) are useful for handling information overload, it is still very challenging for customers to make the final choice, because the items in one consumption session are usually quite similar to each other. In this talk, we will briefly report our attempts to improve customer decisions by modeling the preferences and psychological traits of the customer in a particular session or sessions.

Speaker: Xueming Qian, Xi’an Jiaotong University

In this talk, we will introduce our recent work on attribute-based surveillance video retrieval systems. The talk covers the following parts: (1) Robust and efficient video object detection and feature representation. The video objects include cars, pedestrians, and faces. (2) Fast feature indexing and similarity measurement. (3) Demo.

Speaker: Tao Qin, Microsoft Research Asia

Deep learning has achieved tremendous successes in recent years. It also faces multiple challenges, such as big-data challenge, big-model challenge, big-computation challenge, and so on. In this talk, I will first introduce those challenges and then discuss possible solutions and opportunities to address them.

Speaker: Ruihua Song, Microsoft Research Asia

A World of Difference: Divergent Word Interpretations among People

Abstract: The divergent word usages reflect the differences among people. In this talk, we present a novel angle of studying word usage divergence – word interpretations. We propose an approach that quantifies semantic differences of interpretations among different groups of people. The effectiveness of our approach is validated by quantitative evaluations. The experiment results indicate that the divergences in word interpretations truly exist. We further apply the approach to two well studied types of differences between people – gender and region differences. The detected words with divergent interpretations reveal the unique features of specific groups of people. For the gender case, we discover that different interests, social attitudes, and characters between males and females result in their divergent interpretations of many words. For the region case, we find that specific interpretations of some words reveal the geographical and cultural features of different regions. Moreover, we further study the relation between word interpretation and frequency. Our results suggest that word interpretation and frequency, although both are effective indicators of word usages, are quite different.

Speaker: Frank Soong, Microsoft Research Asia

To match speech segments from different speakers in the same or different languages, we need to equalize the acoustic differences between speakers so as to measure phonetic similarity at a relatively short, sub-phonemic segment level. With a well-trained, speaker-independent, neural net (NN) based acoustic model, a speech segment is stochastically characterized by its sub-phonemic, “senone” posterior probability vector. An information-theoretic measure, Kullback-Leibler Divergence (KLD), is chosen to measure the phonetic “distance” between two such derived posterior vectors. The proposed approach has many possible applications, including: 1. Voice conversion, i.e., converting the voice timbre from a source speaker to a target speaker while keeping the same word content of the sentence; 2. Cross-lingual TTS training, i.e., training TTS for a different (target) language by using the source speaker’s monolingual speech data. We will present our NN-KLD based algorithms along with voice conversion and cross-language TTS demos.
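
The distance described in this abstract can be sketched in a few lines of Python. The example treats two senone posterior vectors as discrete probability distributions and computes the KL divergence between them. The symmetric form, the small smoothing constant, and the three-element toy vectors are assumptions made for illustration; the abstract does not say which variant the speakers use.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def symmetric_kld(p, q):
    """Symmetrized KLD, one common way to get an order-independent 'distance'."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Toy posterior vectors over three hypothetical senones.
frame_a = [0.7, 0.2, 0.1]
frame_b = [0.1, 0.2, 0.7]

print(symmetric_kld(frame_a, frame_a))  # 0.0
print(round(symmetric_kld(frame_a, frame_b), 4))  # 2.3351
```

Because the posteriors come from a speaker-independent model, two frames with similar phonetic content should yield a small value even when the speakers differ.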

Speaker: Shuai Ma, Beihang University

Graphs have more expressive power and are widely used today, and various applications of social computing trigger the pressing need of a new search paradigm. In this talk, we argue that graph search is the one filling this gap. We first introduce the application of graph search in various scenarios. We then formalize the graph search problem and briefly discuss its challenges. Finally, we introduce several useful query and data techniques towards efficient and effective big graph search.

Speaker: Tao Mei, Microsoft Research Asia

Visual recognition has been a fundamental challenge in computer vision for decades. Thanks to the recent development of deep learning techniques, researchers are striving to bridge vision (image and video) and natural language, which has become an emerging research area. We will present a few recent advances bridging vision and language with deep learning techniques, including image and video captioning, image and video chatting, storytelling, vision and language grounding, datasets, grand challenges, and open issues.

Speaker: Bin Shao, Microsoft Research Asia

Speaker: Guangyu Sun, Peking University

DNNs (Deep Neural Networks) have demonstrated great success in numerous applications such as image classification, speech recognition, and video analysis. However, DNNs are much more computation-intensive and memory-intensive than previous shallow models. Thus, it is challenging to deploy DNNs in both large-scale data centers and real-time embedded systems. Considering performance, flexibility, and energy efficiency, FPGA-based accelerators for DNNs are a promising solution. Unfortunately, conventional accelerator design flows make it difficult for FPGA developers to keep up with the fast pace of innovation in DNNs.

To overcome this problem, we propose FP-DNN (Field Programmable DNN), an end-to-end framework that takes TensorFlow-described DNNs as input and automatically generates hardware implementations on FPGA boards with RTL-HLS hybrid templates. FP-DNN performs model inference of DNNs with our high-performance computation engine and carefully designed communication optimization strategies. We implement CNNs, LSTM-RNNs, and Residual Nets with FP-DNN, and experimental results show the great performance and flexibility provided by our proposed FP-DNN framework.

Speaker: Jian Sun, Xi’an Jiaotong University

In this talk, I will show that several mathematical models in imaging sciences, such as sparsity-based models and statistical models, can be reformulated as deep learning models. We reformulate the Markov random field model in image prior modeling, iterative shrinkage in signal processing, and the compressive sensing model in MRI as deep learning problems. The induced deep architectures are non-conventional and task-specific, and achieve state-of-the-art results on image inverse problems such as image restoration and compressive sensing MRI.

Speaker: Haisheng Tan, University of Science and Technology of China

In edge-cloud computing, a set of edge servers are deployed near the mobile devices such that these devices can offload jobs to the servers with low latency. One fundamental and critical problem in edge-cloud systems is how to dispatch and schedule the jobs so that the job response time (defined as the interval between the release of a job and the arrival of the computation result at its device) is minimized. To study this problem, we propose a general model, where the jobs are generated in arbitrary order and times at the mobile devices and offloaded to servers with both upload and download delays. Our goal is to minimize the total weighted response time over all the jobs. The weight is set based on how latency sensitive the job is. We derive the first online job dispatching and scheduling algorithm in edge-clouds, called OnDisc, which is scalable in the speed augmentation model. Moreover, OnDisc can be easily implemented in distributed systems. Extensive simulations on a real-world data-trace from Google show that OnDisc can reduce the total weighted response time dramatically compared with heuristic algorithms.
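
The objective in this abstract can be illustrated with a small Python sketch. It evaluates the total weighted response time of jobs processed one at a time, in a fixed order, on a single edge server, where response time runs from a job's release to the arrival of its result at the device. This is not the OnDisc algorithm; the single-server, non-preemptive setting, the `Job` fields, and the sample numbers are simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    release: float     # time the job is generated at the device
    upload: float      # delay to send the job to the server
    processing: float  # processing time on the server
    download: float    # delay to return the result to the device
    weight: float      # latency sensitivity of the job

def total_weighted_response_time(jobs_in_order):
    """Sum of weight * (result arrival time - release time) on one server."""
    server_free_at = 0.0
    total = 0.0
    for job in jobs_in_order:
        arrives_at_server = job.release + job.upload
        start = max(arrives_at_server, server_free_at)
        finish = start + job.processing
        server_free_at = finish
        result_at_device = finish + job.download
        total += job.weight * (result_at_device - job.release)
    return total

long_job = Job(release=0, upload=1, processing=4, download=1, weight=1)
urgent_job = Job(release=0, upload=1, processing=1, download=1, weight=3)

print(total_weighted_response_time([long_job, urgent_job]))  # 27.0
print(total_weighted_response_time([urgent_job, long_job]))  # 16.0
```

Serving the short, latency-sensitive job first lowers the objective from 27 to 16 in this toy case, which is the kind of trade-off a dispatching and scheduling policy has to make online, without knowing future arrivals.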

Speaker: Xin Tong, Microsoft Research Asia

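In this talk, I will introduce some of our recent research in computer graphics, including the latest progress in 3D content generation, geometry processing, material modeling, augmented reality, and visualization, and will discuss and look ahead to future research directions.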

Speaker: Jie Tang, Tsinghua University

Jie Tang is a tenured associate professor with the Department of Computer Science and Technology at Tsinghua University, and was also a visiting scholar at Cornell University, Hong Kong University of Science and Technology, and Southampton University. His interests include social network analysis, data mining, and machine learning. He has published more than 200 journal/conference papers and holds 20 patents. His papers have been cited more than 8,400 times. He served as PC Co-Chair of CIKM’16, WSDM’15, ASONAM’15, and SocInfo’12, as KDD-CUP/Poster/Workshop/Local/Publication Co-Chair of KDD’11-15, as Associate Editor-in-Chief of ACM TKDD, and as an editor of IEEE TKDE/TBD and ACM TIST. He leads the project AMiner.org for academic social network analysis and mining, which has attracted more than 8 million independent IP accesses from 220 countries/regions in the world. He was honored with the UK Royal Society-Newton Advanced Fellowship Award, CCF Young Scientist Award, and NSFC Excellent Young Scholar.

Speaker: Feng Xiong, Harbin Institute of Technology

Big graph computation has become increasingly popular since the explosion of data. To accomplish traditional tasks on big graphs, researchers have started to design various parallel algorithms. However, scalability and usability are limited because algorithms vary with source schemas, processing techniques, tasks, and so on. To solve this problem, we designed a prototype system that combines a distributed SQL engine with Trinity. In our system, big graphs are handled in a database manner, and various commonly used tasks are supported. In this talk, I will introduce the system and its various applications such as community discovery, path matching, and frequent subgraph mining.

Speaker: Jiao Wang, Northeastern University

With the development of science and technology, multi-spectral cameras have found wide application in many fields, such as defense, medicine, aerospace, and aviation. However, a critical shortcoming of the technology is that reconstructing the spectral data cube on a CPU or GPU takes too much time. In our research, we designed an FPGA-based prototype camera that reaches up to 20 fps at 256*256*15 under 200 MHz. It is the first multi-spectral camera in the world that can provide real-time performance.
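
A quick back-of-the-envelope check in Python puts the quoted figures in perspective. It assumes that "256*256*15" means 256 x 256 pixels with 15 spectral bands and that 200 MHz is the FPGA clock frequency; both readings are interpretations of the abstract, not stated facts.

```python
# Assumed interpretation of the figures quoted in the abstract.
WIDTH, HEIGHT, BANDS = 256, 256, 15
FRAMES_PER_SECOND = 20
CLOCK_HZ = 200_000_000

samples_per_cube = WIDTH * HEIGHT * BANDS
samples_per_second = samples_per_cube * FRAMES_PER_SECOND
cycles_per_sample = CLOCK_HZ / samples_per_second

print(samples_per_cube)             # 983040
print(samples_per_second)           # 19660800
print(round(cycles_per_sample, 2))  # 10.17
```

Under these assumptions, the design has roughly ten clock cycles available per reconstructed sample, which suggests why a pipelined hardware implementation is attractive for real-time operation.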

Speaker: Yu Wang, Tsinghua University

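In recent years, CNN-based object detection and tracking have achieved major breakthroughs over traditional methods. Taking object detection as an example, DPM, the mainstream traditional detection algorithm, reaches an mAP of only 43% on VOC2007, whereas the CNN-based Faster R-CNN reaches an mAP of 73% on the same dataset, an improvement of a full 30 percentage points. However, high accuracy usually means more computation and more parameters. For example, the computation of Faster R-CNN reaches 100G and its parameters reach 600M (estimated). Limited by the scarce on-chip computing and storage resources, we cannot port Faster R-CNN to the chip directly. To resolve these two main conflicts, we developed the DPU to raise the on-chip computing rate and adopted fixed-point compression to reduce the amount of computation. In the end, we successfully deployed Faster R-CNN on chip at 3 fps, with accuracy at a world-leading level. For object tracking, we use the currently popular KCF algorithm and modified parts of it for efficient on-chip implementation, so that it makes full use of the on-chip computing and storage resources. In this way, we improved KCF from 5 fps for a single box on ARM to 100 fps for 5 boxes on chip. At the same time, to combine object detection and tracking more efficiently, we designed a strategy that lets detection and tracking run more efficiently and cooperatively, and finally realized an on-chip real-time object detection and tracking system.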

Speaker: Liwei Wang, Peking University

Early detection of pulmonary cancer is the most promising way to enhance a patient’s chance for survival. Accurate pulmonary nodule detection in computed tomography (CT) images is a crucial step in diagnosing pulmonary cancer. In this talk, I will show our approach for pulmonary nodule detection based on DCNNs, which achieves state-of-the-art performance.

Speaker: Xinbing Wang, Shanghai Jiao Tong University

In this talk, we conceptualize and design a novel academic system, paperbook or AceMap, to analyze big scholarly data and present the results through a “map” approach. AceMap integrates several algorithms in the field of network analysis and data mining, and then displays the information in a clear and intuitive way, aiming to help researchers in their work. After describing the big picture, we present the results achieved and our work in progress. So far, AceMap has implemented the following functions: dynamic citation network display, paper clustering, academic genealogy, author and conference homepages, etc. We have also designed and run distributed network analysis algorithms on a cutting-edge Spark system and used modern visualization tools to present the results. Finally, we conclude by discussing the future outlook.

Speaker: Jiaotao Wen, Tsinghua University

We introduce an optimized, camera-agnostic system for real-time, low-latency stereoscopic panoramic video communications. After intelligent camera calibration, the system is capable of stitching inputs from different cameras using a real-time, low-latency optical flow based algorithm that intelligently learns input video features over time to improve stitch quality. Depth information is also extracted in the process. The resulting stereoscopic panoramic video is then encoded with content-adaptive temporal and/or spatial resolution to achieve a low bitrate while maintaining good video quality. Various aspects of the system, including the optimized stitching algorithm, parallelization and task scheduling, and encoding, will be introduced, with demos using conventional (non-panoramic) professional and consumer-grade cameras as well as integrated panoramic cameras.

Speaker: Dongdong Weng, Beijing Institute of Technology

Existing optical tracking technology can be divided into active and passive systems. Active systems rely on expensive cameras and are not suitable for the consumer market. Passive systems are cheaper than active ones, but because of signal interference between base stations, their effective tracking area is small. We propose an extensible wide-area tracking system that uses optical synchronous laser coding technology to prevent interference between scanning stations. In our system, the number of scanning base stations can be expanded from the current 2 (HTC VIVE) to dozens, and the tracking area can reach 100 square meters.

Speaker: Yingcai Wu, Zhejiang University

The problem of formulating solutions immediately and comparing them rapidly for billboard placements has plagued advertising planners for a long time, owing to the lack of efficient tools for in-depth analyses to make informed decisions. In this talk, I will present our recent work that employs visual analytics, combining state-of-the-art mining and visualization techniques, to tackle this problem using large-scale GPS trajectory data. In particular, we present SmartAdp, an interactive visual analytics system that deals with two major challenges: finding good solutions in a huge solution space and comparing the solutions in a visual and intuitive manner. An interactive framework that integrates a novel visualization-driven data mining model enables advertising planners to effectively and efficiently formulate good candidate solutions. The presented approach can be adapted for other location selection problems, such as selecting locations of retail stores or restaurants using trajectory data. More information about this work can be found here: http://www.ycwu.org/projects/smartadp.html

Speaker: Yuanbin Wu, East China Normal University

Open-domain machine comprehension is one of the major tasks in natural language processing. Being of great practical use, it attracts long-lasting research interest. Both world knowledge and linguistic analysis are important for the task. In this talk, I will focus on answer reasoning with limited world knowledge, following the setting of the MCTest task. I will share our idea of using syntactic and semantic structures for answer inference, which helps us better utilize prior linguistic structures and achieve competitive performance against popular deep learning models.

Speaker: Yongqiang Xiong, Microsoft Research

Virtual cloud network services let users have their own private networks in the public cloud. IPsec gateways are growing in importance accordingly, since they provide VPN connections for customers to remotely access those private networks. Major cloud service providers (CSPs) offer IPsec gateway functions to tenants using virtual machines (VMs) running a software IPsec gateway inside. Those virtualized IPsec gateways enable CSPs to deploy a scalable and flexible VPN gateway service. However, dedicating individual IPsec gateway VMs to each tenant results in significant resource waste due to the strong isolation mechanism of VMs. We design Protego, a distributed IPsec gateway service designed for multitenancy. By separating the control plane and the data plane of an IPsec gateway, Protego achieves high availability with active redundancy. Furthermore, Protego can seamlessly migrate IPsec tunnels between data nodes without compromising their throughput. Hence, it elastically scales in and out to adapt to changes in service traffic while guaranteeing an expected maximum throughput. Our evaluation and our simulation based on production data show that Protego, together with a simple resource provisioning algorithm, can save 84% of resources compared with allocating independent VMs.

Speaker: Jun Yan, Microsoft Research Asia

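Academic research on artificial intelligence and its industrial applications are attracting ever wider attention. Like a person, a computer needs both a clever brain and broad knowledge in order to provide truly intelligent services to people. This talk focuses on how data mining and natural language processing techniques can let computers acquire knowledge and use it in application scenarios, and thereby solve problems in real applications. We will start with the definition, extraction methods, and representation methods of knowledge, then briefly introduce some basic ideas of knowledge reasoning and semantic computing, and finally extend to how knowledge computing can give computers a degree of associative and creative ability.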

Speakers: Yinghe Chen and Yongxia Shi, Beijing Normal University

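This study examines the developmental characteristics of parent-child verbal interaction patterns in children aged 4 to 7 and the factors that influence them. We invited 50 families from different regions of the country to record audio of parent-child verbal interactions. The conversations covered eight topics, including birthday parties, outings, going to school, and watching cartoons. In addition, we assessed parenting styles and children's behavioral characteristics online. In the talk, through coding analysis and an on-site presentation of the interaction audio and video, we will show how children's age, gender, family background, and other factors affect the topical focus of the communication, the emotional attributes of the words used, and related aspects, and we will discuss how parenting styles and children's behavioral characteristics influence parent-child verbal interaction patterns. We hope that our preliminary research can provide material for later digitization work, draw attention to differences among children of different ages in aspects such as parent-child interaction topics, and inform the categorization and diversity of digital products based on parenting styles and children's behavioral characteristics.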

Speaker: Yong Hu, Beihang University

We study the problem of semantic object segmentation for cluttered RGBD scenes with a fused deep CNN framework composed of two different neural networks, each extended and fused from RGB to RGBD. The first neural network solves the problem of category-level semantic segmentation. The second, a region proposal network (RPN), solves the problem of object-level semantic segmentation.

Speaker: Yang Yu, Nanjing University

Many machine learning tasks involve non-convex optimization problems, which can be non-differentiable and non-smooth and can have many local optima. Derivative-free methods are suitable for these difficult problems, but have been weak in theoretical foundation and practical scalability. This talk will introduce our recent progress in scaling theoretically grounded derivative-free methods to practical-size non-convex machine learning tasks.
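
For readers unfamiliar with the term, the sketch below shows what "derivative-free" means in its simplest form: a random local search in Python that only evaluates the objective and never uses gradients. It is a generic textbook-style baseline, not one of the theoretically grounded methods from this talk; the toy objective, step size, and iteration budget are arbitrary assumptions.

```python
import random

def objective(x):
    # Hypothetical non-smooth, non-convex toy objective:
    # absolute values plus a staircase-like term with many plateaus.
    return abs(x[0] - 1.0) + abs(x[1] + 2.0) + 0.5 * (int(abs(x[0]) * 3) % 2)

def random_search(f, start, steps=2000, scale=0.5, seed=0):
    """Keep the best point seen; propose Gaussian perturbations around it."""
    rng = random.Random(seed)
    best_x, best_val = list(start), f(start)
    for _ in range(steps):
        candidate = [xi + rng.gauss(0.0, scale) for xi in best_x]
        val = f(candidate)
        if val < best_val:  # accept only strict improvements
            best_x, best_val = candidate, val
    return best_x, best_val

start = [5.0, 5.0]
best_x, best_val = random_search(objective, start)
print(best_x, best_val)
```

The search needs only function values, so it applies to non-differentiable objectives, but a naive scheme like this one comes with no guarantees and scales poorly with dimension, which is the gap the talk addresses.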

Speaker: Zhiwen Yu, Northwestern Polytechnical University

Discovering the underlying motivation and regularity of job mobility is one of the primary goals of human resource research. Traditional research mainly relies on limited surveys to obtain resumes and personal data, which makes it difficult to expand the scale and time scope of a study. Recently, the widespread adoption of the Internet, especially online social and professional networks, has made that information digitized and publicly available. Online professional networks (OPNs) like LinkedIn maintain huge resume warehouses that dynamically span career records from hundreds of industries and companies. Meanwhile, location-based social networks (LBSNs) like Foursquare trace human trajectories all over the world, carrying rich sentiment information about daily human activities, including geographical, textual, and social interaction data. The growing clues carried by heterogeneous social media (like OPNs and LBSNs) provide unprecedented opportunities to achieve fine-grained spatial-temporal job mobility prediction.

In this talk, I’ll give an introduction of our research work on predicting spatial-temporal job mobility. Specifically, I will introduce the designed job change prediction framework for predicting the job change occasion. Besides, I’ll talk about the proposed talent circle detection method for mining group level job transition patterns.

Speaker: Wenjun Zeng, Microsoft Research Asia

Recently, computer vision and deep learning technologies have been widely used to turn raw video data into insights that facilitate various applications and services. Since humans are the main subject in many videos, understanding humans becomes a critical step in video understanding. In this talk, I will introduce Microsoft Research Asia’s recent efforts on skeleton-based human action recognition.

Speaker: Jidong Zhai, Tsinghua University

Speaker: Lan Zhang, University of Science and Technology of China

Recently, we have witnessed the rapid growth of personal data, which contains huge amounts of valuable information and also a lot of private information. Zhang’s talk focuses on deep understanding and privacy protection of multi-source, multi-modality personal data collected by mobile devices.

Speaker: Lintao Zhang, Microsoft Research Asia

Performance of in-memory key-value stores (KVS) continues to be of great importance as modern KVS go beyond the traditional object-caching workload and become key infrastructure to support distributed main-memory computation in data centers. In this talk, I’ll introduce KV-Direct, a high-performance KVS that leverages programmable NICs to extend RDMA primitives and enable remote direct key-value access to the main host memory. With a single NIC, KV-Direct achieves up to 180 M key-value operations per second, equivalent to the throughput of tens of CPU cores. Compared with CPU-based KVS implementations, KV-Direct improves power efficiency by 10 to 20x while keeping tail latency below 10 µs. Moreover, KV-Direct achieves near-linear scalability with multiple NICs. With 8 programmable NICs in a server, we achieve over one billion KV operations per second per server node, almost an order-of-magnitude improvement over existing systems, setting a new milestone for a general distributed key-value store.

Speaker: Liqing Zhang, Shanghai Jiao Tong University

A tensor is a generalization of vectors and matrices to higher dimensions, based on multilinear algebra. It enables one to effectively capture the multilinear structures of the data, which are usually available as a priori information about the data. We present a generative model for robust tensor factorization in the presence of both missing data and outliers. The objective is to explicitly infer the underlying low-CP-rank tensor capturing the global information and a sparse tensor capturing the outliers, thus providing a robust predictive distribution over missing entries. To identify the model, we develop an efficient variational inference method under a fully Bayesian treatment, which effectively prevents overfitting and scales linearly with data size. Extensive experiments and comparisons with many state-of-the-art algorithms on both synthetic and real-world datasets demonstrate the advantages of our method from several perspectives.

Speaker: Xin Zhang, South China University of Technology

Hand gesture recognition in videos has been an important research topic due to its wide potential applications in human-computer interaction. One of the challenges is to design and extract discriminative features that represent spatial position and temporal dynamics. The path signature is a compact representation of any open-loop trajectory, and it has been successfully applied to financial trend prediction, voice signal compression and handwritten character recognition. Here, we further introduce the path signature feature to encode the trajectory information of hand gestures and incorporate it into a deep learning framework. Experiments on various datasets demonstrate better performance in comparison with state-of-the-art methods.
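As a rough illustration of the path signature mentioned above, the sketch below computes the first two signature levels of a piecewise-linear trajectory, using the fact that the signature of a concatenated path follows from Chen's identity. The function name and the truncation at level 2 are assumptions for illustration. The features actually used in the talk may differ.

```python
def signature_level2(path):
    """Return (S1, S2): the level-1 and level-2 signature terms of a
    piecewise-linear path given as a list of d-dimensional points."""
    d = len(path[0])
    s1 = [0.0] * d                      # total increment so far
    s2 = [[0.0] * d for _ in range(d)]  # iterated integrals of order 2
    for p, q in zip(path, path[1:]):
        dx = [b - a for a, b in zip(p, q)]
        # Chen's identity for appending one straight segment:
        # S2 <- S2 + S1 (outer) dx + 0.5 * dx (outer) dx
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            s1[i] += dx[i]
    return s1, s2


if __name__ == "__main__":
    # An L-shaped 2-D trajectory: right one unit, then up one unit.
    s1, s2 = signature_level2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
    print(s1)  # total displacement
    print(s2)  # order-2 terms; 0.5 * (s2[0][1] - s2[1][0]) is the Levy area
```

Level 1 is simply the total displacement, while the antisymmetric part of level 2 gives the signed area swept by the trajectory, which is one reason these terms are useful as trajectory descriptors.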

Speaker: Zhaoxiang Zhang, Institute of Automation, Chinese Academy of Sciences

Compared with biological neural networks, current deep neural networks show significant limitations, including over-simplified neuron models, rigid structures and poor adaptability. Bio-inspired neural networks, which try to combine the advantages of biological evidence and machine learning, show promise for further improving effectiveness, robustness and autonomy in various vision applications.

Recently, we carried out a preliminary exploration of combining biological evidence with deep neural networks. In this talk, I will introduce some results on this topic, including “diverse neuron type selection for convolutional neural networks”, “dynamic multi-task learning” and “random-shifting for effective receptive field selection”.

Through the improvements achieved by these investigations, we aim to demonstrate the significant potential of developing visual computing models and methods by seeking inspiration from the human visual system.

Speaker: Zhihua Zhang, Peking University

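Many machine learning problems can be formulated as optimization problems, so solving large-scale optimization problems is a very challenging direction in machine learning. This talk discusses a class of approximate second-order algorithms, which includes subsampled Newton methods, sketched Newton methods and inexact Newton methods. These methods retain the superlinear convergence rate of the standard Newton method while having relatively low computational complexity.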
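A minimal sketch of the subsampled Newton idea named in this abstract: the gradient is computed on all data, but the Hessian is estimated from a small random subsample at each step. The problem (ridge-regularized least squares in two dimensions), the function names and all parameter values are assumptions for illustration, not the algorithms analyzed in the talk.

```python
import random


def make_data(n=500, seed=1):
    # Synthetic regression data (an assumption): b = 2 + 3x + noise.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append(((1.0, x), 2.0 + 3.0 * x + rng.gauss(0.0, 0.1)))
    return data


def gradient(w, data, lam):
    # Full gradient of (1/n) * sum 0.5*(a.w - b)^2 + 0.5*lam*|w|^2.
    n = len(data)
    g = [lam * w[0], lam * w[1]]
    for (a0, a1), b in data:
        r = a0 * w[0] + a1 * w[1] - b
        g[0] += r * a0 / n
        g[1] += r * a1 / n
    return g


def subsampled_hessian(data, lam, sample_size, rng):
    # Hessian estimated from a random subsample instead of all n points.
    h00 = h01 = h11 = 0.0
    for (a0, a1), _ in rng.sample(data, sample_size):
        h00 += a0 * a0
        h01 += a0 * a1
        h11 += a1 * a1
    m = sample_size
    return h00 / m + lam, h01 / m, h11 / m + lam


def subsampled_newton(data, lam=0.1, sample_size=50, iters=50, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(iters):
        g = gradient(w, data, lam)
        h00, h01, h11 = subsampled_hessian(data, lam, sample_size, rng)
        det = h00 * h11 - h01 * h01  # positive because lam > 0
        # Newton step: solve the 2x2 system H_s * step = g in closed form.
        step0 = (h11 * g[0] - h01 * g[1]) / det
        step1 = (h00 * g[1] - h01 * g[0]) / det
        w = [w[0] - step0, w[1] - step1]
    return w


if __name__ == "__main__":
    data = make_data()
    w = subsampled_newton(data)
    print("w =", w, "gradient =", gradient(w, data, 0.1))
```

Each step costs one pass over the data for the gradient but only `sample_size` points for the Hessian, which is where the savings come from when the dimension and data size are large.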

Speaker: Zhou Zhao, Zhejiang University

Question answering is a challenging task in natural language processing, computer vision and machine learning, which aims to provide an accurate answer from the reference content for a given question. In this talk, I will briefly introduce recent progress in textual question answering, visual question answering and dialogue learning.

Speaker: Yu Zheng, Microsoft Research Asia

In this talk I will introduce an urban big data platform that can empower people to manage and mine knowledge from big data using AI technology. Two recent examples will be presented. One is about using Mobike’s trajectory data to plan bike lanes in a city more effectively. The other is to predict the flow of crowds based on deep learning techniques.

Speaker: Jun Zhu, Tsinghua University

Deep generative models are effective for unsupervised and semi-supervised learning. In this talk, I will present some recent work on max-margin learning of deep generative models and triple generative adversarial networks, which provide a game-theoretical framework for semi-supervised learning, with state-of-the-art performance.

Speaker: Fuzhen Zhuang, Institute of Computing Technology, Chinese Academy of Sciences

Recommendation has attracted a vast amount of attention and research in recent decades. Most previous works employ matrix factorization techniques to learn the latent factors of users and items. However, matrix factorization methods may not make full use of the limited information in rating or check-in matrices, and may therefore achieve unsatisfactory results. Recently, deep learning has proven able to learn good representations, and thus we try to exploit deep learning for recommendation. Along this line, we propose a new representation learning framework called Recommendation via Dual-Autoencoder (ReDa), in which we simultaneously learn new hidden representations of users and items using autoencoders and minimize the deviations of the training data from the learnt representations of users and items. Furthermore, we propose a collaborative ranking framework via REpresentAtion learning with Pair-wise constraints (REAP for short), in which an autoencoder is used to simultaneously learn the latent factors of both users and items, and a pair-wise ranking loss defined over (user, item) pairs is considered. Extensive experiments demonstrate the effectiveness of the proposed models.

Cross-lingual Word Representation Learning Based on Distribution

GraphView and Azure Cosmos DB Graph

Multisensor based Interaction and Learning for Human-Robot Hybrid Systems

The Deep Learning Based Speech Separation and Recognition

Building Informational Bot (InfoBot) with Question Generation & Answering

Rethink SDN: abstraction, overhead and development

Deep Recommendation with Collaborative Filtering and Content Filtering

Advances in Microsoft’s Handwriting OCR Technology

Dense Video Captioning

Cognitive Principles for Audio Emotion Perception and Speech Emotion Recognition

From QA to Problem Solving

Action Recognition and Summarization for Videos

The Shape Interaction Matrix-Based Affine Invariant Mismatch Removal for Partial-Duplicate Image Search

Towards User-Friendly Mobile Web Computing

Emotion Recognition Using EEG and Eye Tracking Data

Deep Learning for Visual Analysis

Enhancing Customer decisions: A Choice Modeling Perspective

Object detection and retrieval in surveillance video

Deep learning: challenges and opportunities

A World of Difference: Divergent Word Interpretations among People

Matching Speech in the Phonetic Space - Application to Voice Conversion and X-lingual TTS

Towards Big Graph Search: Challenges and Techniques

Bridging Vision and Language with Deep Learning

Microsoft Graph Engine and Its Applications

FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates

Deep Learning Approach for Model Learning in Image Processing and Analysis

Online Job Dispatching and Scheduling in Edge-Clouds

New Advances in Graphics Research at MSRA

AMiner: Mining Scientific Networks with AI

Big Graph Computation=Trinity+SQL

Real-time multi-spectral camera using FPGA

Realtime detection and tracking on embedded system

Machine Learning for Healthcare: Lung Cancer Detection via Deep Neural Networks

Paperbook: Design and Implementation

A universal low-latency real time optical flow based stereoscopic panoramic video communication system for AR/VR

Extensible Wide Area Tracking System

SmartAdp: Finding Optimal Billboard Locations from Large-Scale Taxi Trajectories

Realtime detection and tracking on embedded system

Inference on Syntactic and Semantic Structures for Machine Comprehension

Cloud-Scale Multitenant IPsec Gateway

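Knowledge Mining and Intelligent Applications

Developmental Characteristics and Influencing Factors of Parent-Child Verbal Interaction Patterns in Children Aged 4-7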

A RGBD Fusion Deep Neural Network Framework for Object Segmentation

Derivative-free Algorithms for Non-Convex Machine Learning

Predicting Spatial-temporal Job Mobility Leveraging Heterogeneous Social Media Data

High Performance Human Action Recognition from Pose

A Lightweight Performance Anomaly Online Detection Tool

Understanding and Protecting Big Personal Data

KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC

Bayesian Tensor Decomposition for multiway data completion

Hand gesture recognition with path signature feature

Brain Inspired Convolutional Neural Network

Approximate Newton Methods for Large-Scale Optimization Problems

Intelligent Question Answering

Urban Computing: Enabling Intelligent Cities with Big Data and AI Technology

Learning with Deep Generative Models

The Research on Recommender System based on Deep Learning