Open Academic Graph

Open Academic Graph (OAG) is a large knowledge graph unifying two billion-scale academic graphs: Microsoft Academic Graph (opens in new tab) (MAG) and AMiner (opens in new tab). In mid 2017, we published OAG v1, which contains 166,192,182 papers from MAG and 154,771,162 papers from AMiner (see below) and generated 64,639,608 linking (matching) relations between the two graphs. This time, in OAG v2, author, venue and newer publication data and the corresponding matchings are available.

Overview of OAG v2

The statistics of OAG v2 is listed as the three tables below. The two large graphs are both evolving and we take MAG November 2018 snapshot and AMiner July 2018 or January 2019 snapshot for this version.

Table 1: Statistics of OAG venue data

Data Set
#Pairs/Venues
Date
Linking relations
29,841
2018.12
AMiner venues
69,397
2018.07
MAG venues
52,678
2018.11

 

Table 2: Statistics of OAG paper data

Data Set
#Pairs/Papers
Date
Linking relations
91,137,597
2018.12
AMiner papers
172,209,563
2019.01
MAG papers
208,915,369
2018.11

 

Table 3: Statistics of OAG author data

Data Set
#Pairs/Authors
Date
Linking relations
1,717,680
2019.01
AMiner authors
113,171,945
2018.07
MAG authors
253,144,301
2018.11

 

Downloads

Please note that for author matching, we only consider authors whose paper count is not less than 5. After filtering those authors with small paper count, there are 6,855,193 authors in AMiner and 13,173,936 authors in MAG.

Data description

For linking relations, each pair is an “ID to ID” pair. More specifically, its JSON schema is:

{
“mid”: “xxxx”,
“aid”: “yyyy”
}

where “mid” is MAG entity ID and “aid” is AMiner entity ID.

Other entity attributes are also provided, which can be used to do different types of research. The data schemas of venues, papers and authors are described below.

Table 4: Venue schema

Field Name
Field Type
Description
Example
id
string
venue id
5bf574641c5a1dcdd96f817b
JournalId
string
journal id
137773608
ConferenceId
string
conference id
DisplayName
string
venue name
Nature
NormalizedName
string
normalized venue name
nature

 

Table 5: Paper schema

Field Name
Field Type
Description
Example
id
string
paper ID
53e9ab9eb7602d970354a97e
title
string
paper title
Data mining: concepts and techniques
authors.name
string
author name
Jiawei Han
author.org
string
author affiliation
Department of Computer Science, University of Illinois at Urbana-Champaign
author.id
string
author ID
53f42f36dabfaedce54dcd0c
venue.id
string
paper venue ID
53e17f5b20f7dfbc07e8ac6e
venue.raw
string
paper venue name
Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial
year
int
published year
2000
keywords
list of strings
keywords
[“data mining”, “structured data”, “world wide web”, “social network”, “relational data”]
n_citation
int
citation number
40829
page_start
string
page start
11
page_end
string
page end
18
doc_type
string
paper type: journal, book title…
book
lang
string
detected language
en
publisher
string
publisher
Elsevier
volume
string
volume
10
issue
string
issue
29
issn
string
issn
0020-7136
isbn
string
isbn
1-55860-489-8
doi
string
doi
10.4114/ia.v10i29.873
pdf
string
pdf URL
//static.aminer.org/upload/pdf/1254/ 370/239/53e9ab9eb7602d970354a97e.pdf
url
list
external links
[“http://dx.doi.org/10.4114/ia.v10i29.873”, “http://polar.lsi.uned.es/revista/index.php/ia/ article/view/479”]
abstract
string
abstract
Our ability to generate…

(Note: MAG paper attributes are incomplete. If you would like more attributes, please refer to this link (opens in new tab) to get MAG on Azure.)

Table 6: Author schema

Field Name
Field Type
Description
Example
id
string
author id
53f42f36dabfaedce54dcd0c
name
string
author name
Jiawei Han
normalized_name
string
normalized author name
jiawei han
orgs
list of strings
author affiliations
[“Department of Computer Science, University of Illinois at Urbana-Champaign”]
org
string
last known affiliation
Department of Computer Science, University of Illinois at Urbana-Champaign
position
string
author position
professor
n_pubs
int
the number of author publications
1217
n_citation
int
author citation count
191526
h_index
int
author h-index
175
tags.t
string
research interests
“data mining”
tags.w
int
weight of interests
243
pubs.i
string
author paper id
53e9b9fbb7602d97045f7bb8
pubs.r
int
author order in the paper
0

Evaluation

We evaluated a small subset of matchings (around one thousand venue/paper/author pairs). The estimated accuracy is shown in Table 7.

Table 7: Accuracy of entity matching

Entity type
Venue
Paper (newly matched)
Author
accuracy
99.26%
99.10%
97.41%

 

Please continue to use the OAG group (opens in new tab) to discuss OAG related problems. If you have any issue or want to contribute to Open Academic Graph, feel free to share your ideas and thoughts at OAG Group.

Overview of OAG v1

This data set is generated by linking two large academic graphs: Microsoft Academic Graph (opens in new tab) (MAG) and AMiner (opens in new tab), and it is used for research purpose only. This version includes 166,192,182 papers from MAG and 154,771,162 papers from AMiner. We generated 64,639,608 linking (matching) relations between the two graphs. In the future, more linking results, like authors, will be published. It can be used as a unified large academic graph for studying citation network, paper content, and others, and can be also used to study integration of multiple academic graphs.

The overall data set includes three parts, which are described in the table below:

 Data Set
 #Paper
 #File
Total Size
 Date
64,639,608
 1
1.6GB
2017-06-22
MAG papers
166,192,182
9
104GB
2017-06-09
AMiner papers
 154,771,162
3
39GB
2017-03-22

 

Downloads

AMiner Papers:

MAG Papers:

Data Description

The detailed description of data is presented in this section.
For Linking relations, each linking pair is an “ID to ID” pair. More specifically, its JSON schema is:

{
"mid": "xxxx",
"aid": "yyyy"
}

where “mid” is MAG paper ID and “aid” is AMiner paper ID.

For data set MAG papers and AMiner papers, each paper is a JSON object. Its data schema is:

Field Name
Field Type
Description
Example
id
string
MAG or AMiner ID
53e9ab9eb7602d970354a97e
title
string
paper title
Data mining: concepts and techniques
authors.name
string
author name
Jiawei Han
author.org
string
author affiliation
department of computer science university of illinois at urbana champaign
venue
string
paper venue
Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial
year
int
published year
2000
keywords
list of strings
keywords
[“data mining”, “structured data”, “world wide web”, “social network”, “relational data”]
fos
list of strings
fields of study
[“relational database”, “data model”, “social network”]
n_citation
int
number of citation
29790
references
list of strings
citing papers’ ID
[“53e99ef4b7602d97027c2346”, “53e9aa23b7602d970338fb5e”, “53e99cf5b7602d97025aac75”]
page_stat
string
start of page
11
page_end
string
end of page
18
doc_type
string
paper type: journal, book title…
book
lang
string
detected language
en
publisher
string
publisher
Elsevier
volume
string
volume
10
issue
string
issue
29
issn
string
issn
0020-7136
isbn
string
isbn
1-55860-489-8
doi
string
doi
10.4114/ia.v10i29.873
pdf
string
pdf URL
//static.aminer.org/upload/pdf/1254/ 370/239/53e9ab9eb7602d970354a97e.pdf
url
list
external links
[“http://dx.doi.org/10.4114/ia.v10i29.873”, “http://polar.lsi.uned.es/revista/index.php/ia/ article/view/479”]
abstract
string
abstract
Our ability to generate…

For example:

{
"id": "53e9ab9eb7602d970354a97e",
"title": "Data mining: concepts and techniques",
"authors": [
{
"name": "jiawei han",
"org": "department of computer science university of illinois at urbana champaign"
},
{
"name": "micheline kamber",
"org": "department of computer science university of illinois at urbana champaign"
},
{
"name": "jian pei",
"org": "department of computer science university of illinois at urbana champaign"
}
],
"year": 2000,
"keywords": [
"data mining",
"structured data",
"world wide web",
"social network",
"relational data"
],
"fos": [
"relational database",
"data model",
"social network"
],
"n_citation": 29790,
"references": [
"53e99ef4b7602d97027c2346",
"53e9aa23b7602d970338fb5e",
"53e99cf5b7602d97025aac75"
],
"doc_type": "book",
"lang": "en",
"publisher": "Elsevier",
"isbn": "1-55860-489-8",
"doi": "10.4114/ia.v10i29.873",
"pdf": "//static.aminer.org/upload/pdf/1254/370/239/53e9ab9eb7602d970354a97e.pdf",
"url": [
"http://dx.doi.org/10.4114/ia.v10i29.873",
"http://polar.lsi.uned.es/revista/index.php/ia/article/view/479"
],
"abstract": "Our ability to generate and collect data has been increasing rapidly. Not only are all of our business, scientific, and government transactions now computerized, but the widespread use of digital cameras, publication tools, and bar codes also generate data. On the collection side, scanned text and image platforms, satellite remote sensing systems, and the World Wide Web have flooded us with a tremendous amount of data. This explosive growth has generated an even more urgent need for new techniques and automated tools that can help us transform this data into useful information and knowledge. Like the first edition, voted the most popular data mining book by KD Nuggets readers, this book explores concepts and techniques for the discovery of patterns hidden in large data sets, focusing on issues relating to their feasibility, usefulness, effectiveness, and scalability. However, since the publication of the first edition, great progress has been made in the development of new data mining methods, systems, and applications. This new edition substantially enhances the first edition, and new chapters have been added to address recent developments on mining complex types of data? including stream data, sequence data, graph structured data, social network data, and multi-relational data."
}

Method and Evaluation

Method

We obtain linking relations of two publication graphs by two steps:

  1. Use Microsoft Graph Search API to query each AMiner paper’s title and obtain candidate matching papers for each AMiner paper.
  2. We match two papers if they have
    • very similar titles
    • similar author names and
    • same published year

Evaluation

We random sampled 100,000 linking pairs and evaluated the matching accuracy. The number of truly matching pairs is 99,699 and the matching accuracy can achieve 99.70%.

Reference

We kindly request that any published research that makes use of this data cites the following papers.