Microsoft Academic increases power of semantic search by adding more fields of study

February 15, 2018

In the video below, our colleague Darrin tells a personal story of unleashing the power of semantic search:

Let’s look at a few more examples of powerful search experiences that are unique to Microsoft Academic.

Imagine you met a scholar at a conference but can’t quite remember the person’s name, except that it was short and started with K. You do remember that the person’s area of work was technology acceptance, and that they worked at a university in Colorado. So, you type whatever you can remember in Microsoft Academic:

search box screenshot

As you type, you notice that University of Colorado Boulder appears in the query suggestions. This means it is one of the top-ranked entities containing the letters “col” in the “technology acceptance model” area. You click that query suggestion and see the following search results:

Search results for query

The search engine results page shows you all the papers about “technology acceptance model” authored by individuals affiliated with the University of Colorado Boulder. You recognize the name Kai Larsen among the filters on the left side and realize this is the name of the person you were looking for.
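
The suggestion step in this scenario can be pictured with a small sketch. It is only an illustration, not how Microsoft Academic is implemented: the `Entity` record, the `suggest` function, the scores, and every entity other than University of Colorado Boulder are hypothetical. It ranks the entities associated with a topic that contain the typed letters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    topic: str    # topic the entity is associated with (toy simplification)
    score: float  # hypothetical rank score within that topic


# Toy data: scores are made up, and the two "Example" entities are invented.
ENTITIES = [
    Entity("Example Collegiate Institute", "technology acceptance model", 0.4),
    Entity("University of Colorado Boulder", "technology acceptance model", 0.9),
    Entity("Example College of Design", "seven stages of action", 0.8),
]


def suggest(topic, fragment, entities=ENTITIES, limit=5):
    """Return names of entities in `topic` that contain `fragment`, best score first."""
    frag = fragment.lower()
    hits = [e for e in entities if e.topic == topic and frag in e.name.lower()]
    hits.sort(key=lambda e: e.score, reverse=True)
    return [e.name for e in hits[:limit]]


print(suggest("technology acceptance model", "col"))
```

With this toy data, University of Colorado Boulder is listed first because it has the highest made-up score among the entities in the topic that contain “col”.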

Or, assume you are a graduate student and, in one of your human-computer interaction (HCI) courses, you heard the phrase “seven stages of action.” As you type the words into Microsoft Academic, you notice that the phrase is recognized as a research topic, because the beaker icon appears next to it in the query suggestion drop-down. You accept the suggestion by clicking it and find papers that refer specifically to Don Norman’s seven stages of action.

search results for 7 stages of action

To learn more about the topic, you click the topic’s card on the right-hand rail. On the topic detail page, you see a list of related topics that help you understand other concepts relevant to Norman’s seven stages of action.

7 stages of action topic detail page

So, how does Microsoft Academic make it possible to discover knowledge in such a powerful way? Three aspects contribute to the power of our semantic search:

1. Author entity disambiguation, addressed in a previous post;
2. The recent increase in the number of fields of study in our graph; and
3. The accuracy of tagging fields of study onto papers.

The second and third aspects are explained below.

Field of study increase

During the past few weeks, we have increased the number of topics, or fields of study (FoS), in our graph from about 50K to almost 200K. We leveraged Wikipedia content and used graph link analysis to expand the coverage of FoS. We started with a few thousand high-quality seed FoS and iterated a few rounds between graph link analysis and entity filtering to help us identify more FoS. Then, we scanned through the metadata of our 170 million academic publications, such as titles, keywords, and abstracts, to confirm the existence of the new FoS.
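
To make the loop between link analysis, entity filtering, and confirmation more concrete, here is a minimal sketch. It is not the pipeline we ran: the function `expand_fields`, the vote threshold `min_links`, the `is_concept` filter, and the toy link graph are all assumptions made for illustration, and "link analysis" is reduced to counting how many current fields link to a candidate.

```python
def expand_fields(seeds, links, is_concept, paper_texts, rounds=2, min_links=2):
    """Grow a set of fields of study from seed fields over a link graph.

    links: dict mapping a page title to the set of titles it links to.
    is_concept: hypothetical entity filter (True if a title looks like a topic).
    paper_texts: iterable of lowercase title/keyword/abstract strings.
    """
    fields = set(seeds)
    for _ in range(rounds):
        # Simplified link analysis: count how many current fields link to each candidate.
        votes = {}
        for field in fields:
            for target in links.get(field, ()):
                if target not in fields:
                    votes[target] = votes.get(target, 0) + 1
        # Entity filtering: keep well-linked candidates that pass the filter.
        added = {t for t, n in votes.items() if n >= min_links and is_concept(t)}
        if not added:
            break
        fields |= added
    # Confirmation: a new field must be mentioned somewhere in the paper metadata.
    texts = list(paper_texts)
    return {f for f in fields if f in seeds or any(f.lower() in t for t in texts)}


# Toy inputs, invented for this example.
links = {
    "machine learning": {"neural network", "statistics", "Alan Turing"},
    "artificial intelligence": {"neural network", "Alan Turing", "machine learning"},
    "neural network": {"backpropagation"},
}
seeds = {"machine learning", "artificial intelligence"}
people = {"Alan Turing"}  # stand-in for a filter that rejects non-topic entities
texts = ["a neural network approach to image labeling", "training with backpropagation"]

print(sorted(expand_fields(seeds, links, lambda t: t not in people, texts)))
```

In this toy run, “neural network” is added because both seeds link to it and it appears in the paper text, while “Alan Turing” is rejected by the entity filter and “statistics” and “backpropagation” do not collect enough links.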

The table below compares the number of fields of study under each of our 19 top-level topics before and after the expansion. The Difference column gives the increase as a percentage of the Before count.

Top level field of study	Before	After	Difference
Biology	4173	70019	1578%
Medicine	1675	27022	1513%
Geology	2120	15117	613%
Chemistry	3522	24333	591%
Psychology	2430	13291	447%
Philosophy	1665	9066	445%
Sociology	2047	9623	370%
Engineering	2689	12100	350%
Economics	2347	10439	345%
Computer Science	5180	21328	312%
Art	422	1703	304%
Physics	6618	24075	264%
History	656	2245	242%
Political Science	250	667	167%
Materials Science	945	2404	154%
Mathematics	8022	19540	144%
Geography	502	929	85%
Business	536	917	71%
Environmental Science	178	262	47%
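
The Difference column is consistent with a relative increase, (After - Before) / Before, rounded to a whole percent. A short check on three rows of the table (the function name `percent_increase` is ours, chosen for this example):

```python
def percent_increase(before, after):
    """Relative growth from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (Before, After) pairs copied from the table above.
rows = {
    "Biology": (4173, 70019),
    "Mathematics": (8022, 19540),
    "Environmental Science": (178, 262),
}

for name, (before, after) in rows.items():
    print(f"{name}: {percent_increase(before, after)}%")
```

This prints 1578%, 144%, and 47%, matching the corresponding table rows.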

Tagging fields of study on papers

Once the fields of study were extracted from papers, the next step was to stamp them appropriately onto the 170+ million papers in the Microsoft Academic Graph, the largest knowledge graph of scholarly publications in existence.

Our system tagged the papers in our graph with the appropriate topics, chosen from the almost 200K fields of study, with minimal human intervention. We began the tagging process by considering the metadata associated with each publication. However, publication metadata is neither complete nor accurate. Common pitfalls include, but are not limited to:

Most papers about a topic like “artificial intelligence” do not actually mention these words explicitly in the paper (incompleteness);
A large number of raw keywords from various data sources are noisy and irrelevant to the paper (inaccuracy, e.g., some websites assign the same set of keywords to every paper they publish);
The same words refer to different concepts in different disciplines (ambiguity, e.g., “entropy”).

We applied several state-of-the-art natural language processing techniques to tackle these challenges. For example, we extended convolutional neural networks for short text classification and made them highly scalable for our 140M English papers, so that high-level disciplines such as computer science, mathematics, and artificial intelligence would be properly tagged. We also pre-trained word embedding vectors on text from more than 80M abstracts. Used together with bag-of-words for text similarity calculation, these vectors helped eliminate noisy tags effectively.
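
As a rough illustration of the text similarity idea, the sketch below filters candidate tags with bag-of-words cosine similarity alone. It leaves out the word embedding part entirely, and the function names, the tag descriptions, and the 0.2 threshold are assumptions made for this example rather than details of our system.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_tags(abstract, candidate_tags, threshold=0.2):
    """Keep candidate tags whose description is similar enough to the abstract.

    candidate_tags: dict mapping a tag name to a short description (hypothetical).
    """
    doc = bow(abstract)
    return [tag for tag, desc in candidate_tags.items() if cosine(doc, bow(desc)) >= threshold]


# Toy inputs: the descriptions are invented for this example.
abstract = "this paper re-examines the concept of meme in digital culture"
candidates = {
    "Internet meme": "meme spread through digital culture online",
    "entropy": "thermodynamic disorder measure",
}

print(filter_tags(abstract, candidates))
```

Here “Internet meme” is kept because its description shares three words with the abstract, while the noisy “entropy” tag shares none and is dropped. Pure word overlap cannot match synonyms or related terms, which is the gap that embedding vectors help to close.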

The results exhibit high accuracy, according to our observations. Take, for example, the paper below, which, according to its abstract, “re-examines the concept of ‘meme’ in the context of digital culture.” Our machines have appropriately tagged it with fields of study that include “user-generated content,” and even “Internet meme.”

Screenshot of paper detail page showing title Memes in a Digital World: Reconciling with a Conceptual Troublemaker and abstract

As a result of tagging papers with so many fields of study, you can now explore and locate scholarship much more easily, in ways that are unique to Microsoft Academic.

For example, “microblogging” is now a field of study, which has been stamped onto 6977 publications as of the time of this writing.

topic detail page for microblogging

You can explore all publications tagged with “microblogging” by clicking the “See all publications” link on the topic’s detail page (shown above) and then sorting them and applying filters. Because Microsoft Academic uses semantic search, when you explore papers tagged “microblogging,” you will find papers about this topic that might not even include the word microblogging but refer, for example, to Twitter.
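
The difference between keyword matching and tag-based retrieval can be shown with a toy example. The paper titles, tags, and function names below are invented for illustration and do not come from the Microsoft Academic Graph.

```python
# Hypothetical papers with the fields of study assumed to be tagged onto them.
papers = [
    {"title": "Information diffusion on Twitter", "tags": {"microblogging", "social media"}},
    {"title": "Microblogging in the classroom", "tags": {"microblogging", "education"}},
    {"title": "Protein folding dynamics", "tags": {"biology"}},
]


def keyword_search(papers, word):
    """Titles that literally contain `word` (case-insensitive)."""
    return [p["title"] for p in papers if word.lower() in p["title"].lower()]


def tag_search(papers, tag):
    """Titles of papers tagged with the field of study `tag`."""
    return [p["title"] for p in papers if tag in p["tags"]]


print(keyword_search(papers, "microblogging"))
print(tag_search(papers, "microblogging"))
```

The keyword search finds only the paper whose title contains the word, while the tag search also returns the Twitter paper, because retrieval goes through the tagged field of study rather than the literal text.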

Researchers in biology and medicine will notice that the fields of study in those domains can be as specific as the names of individual genes, as illustrated in the screenshot below:

search box showing query suggestions for several interleukin genes

We hope that our recent efforts to increase the number of fields of study and tag them onto papers help you explore and discover knowledge in more powerful ways than ever before.

How do you unleash the power of semantic search? As always, we would like to hear from you either through the feedback link at the bottom right of the website, or on Twitter. You can also find our project home page with this blog on the Microsoft Research site at aka.ms/msracad.

Happy researching!