Discussion Graph Tool

Established: April 25, 2014

Common things you’ll want to do

Load data in different formats

DGT can load social media data from delimiter-separated TSV and CSV files, line-based JSON files (including the output of common Twitter downloaders), and multi-line record files.

TSV and CSV data

To load a TSV or CSV, use the following LOAD command. The path to a file is required. Also, either the hasHeader flag must be set to true (indicating the first row of the file is a header line) or the schema argument must be set.

LOAD TSV(path:"filename.txt",
         fieldSeparator:",", // optional: default is tab character
         ignoreErrors:"true", // optional: default is false
         hasHeader:"false", // optional: default is false
         schema:"col1,col2,..." // either hasHeader:"true" or a schema is required
);
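
The rule that column names come either from a header row or from an explicit schema can be sketched in Python. This only illustrates the rule described above and is not DGT's implementation; the helper name `resolve_columns` and its error handling are assumptions.

```python
import csv
import io


def resolve_columns(lines, has_header=False, schema=None, field_separator="\t"):
    """Return (column_names, data_rows) for delimiter-separated text.

    Hypothetical helper: mirrors the documented rule that column names
    come from the header row or from an explicit schema string.
    """
    rows = list(csv.reader(lines, delimiter=field_separator))
    if has_header:
        if not rows:
            raise ValueError("hasHeader is true but the input is empty")
        return rows[0], rows[1:]
    if schema:
        return schema.split(","), rows
    raise ValueError("either hasHeader must be true or a schema must be given")


sample = io.StringIO("name,text\nBob,hello world!\nAlice,hello back!\n")
columns, rows = resolve_columns(sample, has_header=True, field_separator=",")
print(columns)  # ['name', 'text']
print(rows[0])  # ['Bob', 'hello world!']
```

Calling the helper with neither a header nor a schema raises an error, which matches the requirement that one of the two must be supplied.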

Multi-line record data

A multi-line record file contains one record field per line, with a blank line separating records. For example:

name: Bob
text: hello world!
messagetime:5/4/2013

name: Alice
text: hello back!
messagetime:5/5/2013

…

To load a multi-line record file, use the following LOAD command. Only the path argument is required. The schema is implicit in the file itself.

LOAD Multiline(path:"filename.txt",
               fieldSeparator:":", // optional: default is : character
               ignoreErrors:"true" // optional: default is false
);
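
To make the record layout concrete, here is a minimal Python sketch of how such a file could be parsed. It assumes that each field line is split on the first occurrence of the separator and that surrounding whitespace is trimmed; neither detail is confirmed by DGT, and the function name `parse_multiline_records` is hypothetical.

```python
def parse_multiline_records(text, field_separator=":"):
    """Split text into records (dicts); a blank line ends a record."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the current record, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first separator only, so values may contain it.
        name, _, value = line.partition(field_separator)
        current[name.strip()] = value.strip()
    if current:
        records.append(current)
    return records


sample = """name: Bob
text: hello world!
messagetime:5/4/2013

name: Alice
text: hello back!
messagetime:5/5/2013
"""
records = parse_multiline_records(sample)
print(len(records))        # 2
print(records[1]["text"])  # hello back!
```

Because the field names are read from the file, no schema argument is needed, which is consistent with the schema being implicit in this format.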

JSON file

DGT can read JSON line formatted files (where each line of a text file is a JSON object).

LOAD JSON(path:"filename.txt",
          ignoreErrors:"true",
          schema:"field1:jsonpath1,field2:jsonpath2,...");

The schema must specify both the fields to be extracted as well as their JSON paths. If multiple values in the JSON object match a given path, the field will be assigned multiple values.
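
The multiple-value behavior can be illustrated with a short Python sketch. It assumes slash-separated paths like those in the Twitter field table and treats arrays as "apply the rest of the path to every element"; the helper `extract_path` is hypothetical and is not DGT's path resolver.

```python
import json


def extract_path(node, path):
    """Collect every value reachable by a slash-separated path.

    Lists are traversed element by element, so one path may yield
    several values. Hypothetical helper, not DGT's path resolver.
    """
    nodes = [node]
    for key in path.split("/"):
        next_nodes = []
        for item in nodes:
            # Look the key up in each element when the item is a list.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_nodes.append(candidate[key])
        nodes = next_nodes
    # Flatten a final list value into individual values.
    values = []
    for item in nodes:
        values.extend(item if isinstance(item, list) else [item])
    return values


line = (
    '{"user": {"screen_name": "bob"}, '
    '"entities": {"urls": [{"url": "http://example.com/a"}, '
    '{"url": "http://example.com/b"}]}}'
)
record = json.loads(line)
print(extract_path(record, "user/screen_name"))   # ['bob']
print(extract_path(record, "entities/urls/url"))  # two URLs
```

A path that matches nothing yields an empty list in this sketch, so a field with no match simply has no values.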

Twitter data

DGT also includes a pre-defined data source for the output of the twitter-tools utilities, which is a JSON-line formatted file. To load this output, use the following LOAD command. Only the path argument is required.

LOAD Twitter(path:"filename.txt");

This data source includes schema definitions for most of the common Twitter fields:

Field	JSON path
contextid	id_str
createdat	created_at
text	text
inreplytostatusid	in_reply_to_status_id
inreplytoscreenname	in_reply_to_screen_name
userid	user/id_str
username	user/name
userscreenname	user/screen_name
userlocation	user/location
lang	lang
userdescription	user/description
userfollowerscount	user/followers_count
userfriendscount	user/friends_count
userlistedcount	user/listed_count
usercreatedat	user/created_at
userfavouritescount	user/favourites_count
userutcoffset	user/utc_offset
usertimezone	user/time_zone
userverified	user/verified
userstatusescount	user/statuses_count
retweetcreatedat	retweeted_status/created_at
retweetid	retweeted_status/id_str
retweettext	retweeted_status/text
retweetuser	retweeted_status/user/id_str
retweetusername	retweeted_status/user/name
retweetuserscreenname	retweeted_status/user/screen_name
hashtag	entities/hashtags/text
symbol	entities/symbols/text
url	entities/urls/url
urlexpanded	entities/urls/expanded_url
mentionuserid	entities/user_mentions/id_str
mentionusername	entities/user_mentions/name
mentionuserscreenname	entities/user_mentions/screen_name
geopoint	geo/coordinates/$$

Filter out irrelevant messages

Sometimes a social media message is simply irrelevant to a specific analysis. For example, in a study about hashtag usage on Twitter, we might want to ignore messages that do not have hashtags. To do this, we can use the PRIMARY keyword of the EXTRACT command.
```
EXTRACT PRIMARY hashtag, PRIMARY mention, AffectDetector();
```
In this example, we have marked the hashtag and mention fields as PRIMARY fields (any field or feature extractor may be marked as PRIMARY). This PRIMARY flag tells the EXTRACT command that it must find either a hashtag or a mention value in a message in order to continue processing it. If a message has either a hashtag or a mention, EXTRACT will also run the AffectDetector() and pass the values along to the rest of the script. If a message does not have any hashtag value and does not have any mention value, then that message will be ignored.
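
The filtering rule can be summarized in a few lines of Python. This sketch models only the "at least one PRIMARY field has a value" check described above; the message dictionaries and the function name `passes_primary` are illustrative assumptions, not DGT data structures.

```python
def passes_primary(features, primary_fields):
    """Return True if at least one PRIMARY field has a value."""
    return any(features.get(field) for field in primary_fields)


messages = [
    {"text": "good morning #coffee", "hashtag": ["coffee"], "mention": []},
    {"text": "thanks @alice", "hashtag": [], "mention": ["alice"]},
    {"text": "nothing to see here", "hashtag": [], "mention": []},
]
kept = [m for m in messages if passes_primary(m, ["hashtag", "mention"])]
print([m["text"] for m in kept])
# ['good morning #coffee', 'thanks @alice']
```

The third message has neither a hashtag nor a mention, so it is dropped before any further feature extraction would run.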

The PRIMARY flag can be combined with the acceptfilter and rejectfilter arguments accepted by most feature extractors. For example, to analyze only social media messages written by women, use the acceptfilter argument:
```
EXTRACT PRIMARY Gender(accept:"f"), hashtag, mention;
```
The Gender feature extractor understands the acceptfilter argument, and will only output feature values that match the list. The result in this case is that only messages where the author’s gender is identifiably female will be processed. Note that the hashtag and mention fields are no longer marked as PRIMARY fields.

If you have a long list of values you want to accept, you can use the acceptfilterfilename argument. The syntax and behavior of acceptfilter and acceptfilterfilename are the same as for the Phrases() feature extractor.

Detect phrases and words in tweet text

You can use the Phrases() feature extractor with a different set of arguments to detect different phrases. Here is an adaptation of the “politician detector” from our simple example, this time modified to detect mentions of parents. By default, phrase detection is case-insensitive.
```
EXTRACT ExactPhrases(domain:"parent",accept:"dad,mom,father,mother");
```
If you have a long list of phrases you want to detect, you can put them in a file and reference them in your processor. Note that you also have to specify the datafile as a resource, so that the framework knows to include that file as part of the job.
```
EXTRACT ExactPhrases(domain:"parent",acceptfile:"parentphrases.txt");
```
In its simplest form, this file is a list of phrases to detect, one per line. You can also use this file to group or canonicalize detected phrases by adding a second, tab-separated column that contains the canonical form. For example, the following file detects nicknames for parents and maps them to a canonical name. That is, whenever the phrase extractor finds “mommy” or “mom”, the extracted feature is emitted as “mother”.

mom	mother
mother	mother
mommy	mother
dad	father
…
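
The canonicalization step can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it handles single-word phrases only, lowercases text to mimic case-insensitive matching, and maps a one-column line to itself. The helper names are hypothetical and do not reflect DGT internals.

```python
import re


def load_phrase_map(lines):
    """Map each lowercased phrase to its canonical form.

    A line with a single column maps the phrase to itself.
    """
    phrase_map = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        phrase, _, canonical = line.partition("\t")
        phrase = phrase.strip().lower()
        phrase_map[phrase] = canonical.strip() or phrase
    return phrase_map


def detect_phrases(text, phrase_map):
    """Return canonical forms of single-word phrases found in text."""
    words = re.findall(r"\w+", text.lower())
    return [phrase_map[w] for w in words if w in phrase_map]


phrase_map = load_phrase_map(
    ["mom\tmother", "mother\tmother", "mommy\tmother", "dad\tfather"]
)
print(detect_phrases("My Mommy and my dad called", phrase_map))
# ['mother', 'father']
```

Grouping several surface forms under one canonical value keeps downstream counts from being split across spelling variants.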

Import results into R, Excel, Gephi or other tools

Often, you will want to perform further higher-level analyses (machine learning, visualization, or statistical analysis) on the output of DGT. To support this, we provide utilities that convert DGT’s native output format to TSV and GEXF files, which you can load into R, Excel, Gephi and other tools.

To convert to TSV, use the dgt2tsv.exe command:
```
dgt2tsv.exe input.graph [outputfields] outputfilename.tsv
```
The list of outputfields may include “count”, any of the domains output by a feature extractor, a domain name followed by “.count”, or a domain name followed by a specific feature value.

For example, the following command will output a count of the number of messages seen for each edge in a discussion graph; the gender of the author; and the weight of the “fatigue” value in the Mood domain.
```
dgt2tsv.exe input.graph count,gender,Mood.fatigue output.tsv
```
To output a .gexf file that can be read by Gephi for graph analyses and visualizations, use the dgt2gexf.exe command:
```
dgt2gexf.exe input.graph [outputfields] outputfilename.gexf
```
