Discussion Graph Tool

Established: April 25, 2014

Common things you’ll want to do

  • DGT can load social media data from delimiter-separated TSV and CSV files, line-based JSON files (including the output of common Twitter downloaders), and multi-line record files.

    TSV and CSV data

    To load a TSV or CSV, use the following LOAD command. The path to a file is required. Also, either the hasHeader flag must be set to true (indicating the first row of the file is a header line) or the schema argument must be set.

    LOAD TSV(path:"filename.txt",
             fieldSeparator:",", // optional: default is tab character
             ignoreErrors:"true", // optional: default is false
             hasHeader:"false", // optional: default is false
             schema:"col1,col2,..." // either hasHeader:"true" or a schema is required
             );
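    The header-or-schema rule above can be illustrated outside DGT. The following Python sketch is not DGT's implementation; read_delimited and its parameter names are hypothetical, and it only mirrors the stated rule that column names come either from the first row or from a comma-separated schema.

```python
import csv
import io


def read_delimited(text, field_separator="\t", has_header=False, schema=None):
    """Parse delimited text into dicts (hypothetical helper, not part of DGT).

    Column names come from the first row when has_header is True,
    otherwise from the comma-separated schema string.
    """
    if not has_header and not schema:
        raise ValueError("either has_header=True or a schema is required")
    rows = csv.reader(io.StringIO(text), delimiter=field_separator)
    names = next(rows) if has_header else schema.split(",")
    # Skip empty rows; pair each remaining row with the column names.
    return [dict(zip(names, row)) for row in rows if row]


sample = "name,text\nBob,hello world!\nAlice,hello back!\n"
records = read_delimited(sample, field_separator=",", has_header=True)
```

    Calling the helper with neither a header nor a schema raises an error, which corresponds to the requirement that one of the two must be supplied.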

    Multi-line record data

    A multi-line record file contains one record field per line, with a blank line separating records. For example:

    name: Bob
    text: hello world!
    messagetime:5/4/2013

    name: Alice
    text: hello back!
    messagetime:5/5/2013

    To load a multi-line record file, use the following LOAD command. Only the path argument is required. The schema is implicit in the file itself.

    LOAD Multiline(path:"filename.txt",
                   fieldSeparator:":", // optional: default is : character
                   ignoreErrors:"true" // optional: default is false
                   );
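    To make the format concrete, here is a minimal Python sketch of how such a file could be parsed. It is an illustration only, not DGT's parser: parse_multiline is a hypothetical name, and the sketch assumes that the fields of one record sit on consecutive lines, that a blank line ends the record, and that each line is split at the first occurrence of the separator.

```python
def parse_multiline(text, field_separator=":"):
    """Split blank-line-separated records into dicts (hypothetical helper)."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the record collected so far.
            if current:
                records.append(current)
                current = {}
            continue
        # Split only at the first separator so values may contain it.
        name, _, value = line.partition(field_separator)
        current[name.strip()] = value.strip()
    if current:
        records.append(current)
    return records


sample = (
    "name: Bob\ntext: hello world!\nmessagetime:5/4/2013\n"
    "\n"
    "name: Alice\ntext: hello back!\nmessagetime:5/5/2013\n"
)
records = parse_multiline(sample)
```

    Because the field names are read from each line, no schema has to be declared up front, which matches the note that the schema is implicit in the file.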

    JSON file

    DGT can read JSON-line formatted files (where each line of a text file is a JSON object). To load one, use the following LOAD command:

    LOAD JSON(path:"filename.txt",
              ignoreErrors:"true",
              schema:"field1:jsonpath1,field2:jsonpath2,...");

    The schema must specify both the fields to be extracted as well as their JSON paths. If multiple values in the JSON object match a given path, the field will be assigned multiple values.
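    The multi-value behavior can be sketched in Python. This is not DGT's code: extract_path, parse_schema and load_json_lines are hypothetical names, and the sketch assumes slash-separated paths in which every element of a JSON array is followed, so that one path can yield several values.

```python
import json


def extract_path(node, path):
    """Return every value reachable via a slash-separated path (assumed syntax)."""
    nodes = [node]
    for key in [k for k in path.split("/") if k]:
        next_nodes = []
        for current in nodes:
            # Arrays fan out: each element is searched for the key.
            candidates = current if isinstance(current, list) else [current]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_nodes.append(candidate[key])
        nodes = next_nodes
    values = []
    for current in nodes:
        values.extend(current if isinstance(current, list) else [current])
    return values


def parse_schema(schema):
    """Turn 'field1:path1,field2:path2' into a dict of field -> path."""
    return dict(item.split(":", 1) for item in schema.split(","))


def load_json_lines(lines, schema):
    """Yield one dict of field -> list of values per non-empty JSON line."""
    fields = parse_schema(schema)
    for line in lines:
        if line.strip():
            obj = json.loads(line)
            yield {name: extract_path(obj, path) for name, path in fields.items()}


line = '{"id_str": "1", "entities": {"hashtags": [{"text": "a"}, {"text": "b"}]}}'
rows = list(load_json_lines([line], "id:id_str,hashtag:entities/hashtags/text"))
```

    In this sketch every field is a list, so a field that matches two hashtag objects simply carries two values, while a path that matches nothing yields an empty list.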

    Twitter data

    DGT also includes a predefined data source for the output of the twitter-tools utilities, which is a JSON-line formatted file. To load this output, use the following LOAD command. Only the path argument is required.

    LOAD Twitter(path:"filename.txt");

    This data source includes schema definitions for most of the common Twitter fields:

    Field JSON path
    contextid id_str
    createdat created_at
    text text
    inreplytostatusid in_reply_to_status_id
    inreplytoscreenname in_reply_to_screen_name
    userid user/id_str
    username user/name
    userscreenname user/screen_name
    userlocation user/location
    lang lang
    userdescription user/description
    userfollowerscount user/followers_count
    userfriendscount user/friends_count
    userlistedcount user/listed_count
    usercreatedat user/created_at
    userfavouritescount user/favourites_count
    userutcoffset user/utc_offset
    usertimezone user/time_zone
    userverified user/verified
    userstatusescount user/statuses_count
    retweetcreatedat retweeted_status/created_at
    retweetid retweeted_status/id_str
    retweettext retweeted_status/text
    retweetuser retweeted_status/user/id_str
    retweetusername retweeted_status/user/name
    retweetuserscreenname retweeted_status/user/screen_name
    hashtag entities/hashtags/text
    symbol entities/symbols/text
    url entities/urls/url
    urlexpanded entities/urls/expanded_url
    mentionuserid entities/user_mentions/id_str
    mentionusername entities/user_mentions/name
    mentionuserscreenname entities/user_mentions/screen_name
    geopoint geo/coordinates/$$
  • Sometimes a specific social media message is simply irrelevant to a specific analysis. For example, in a study about hashtag usage on Twitter, we might want to ignore messages that do not have hashtags. To do this, we can use the PRIMARY keyword of the EXTRACT command.

    EXTRACT PRIMARY hashtag, PRIMARY mention, AffectDetector();

    In this example, we have marked the hashtag and mention fields as PRIMARY fields (any field or feature extractor may be marked as PRIMARY). This PRIMARY flag tells the EXTRACT command that it must find either a hashtag or a mention value in a message in order to continue processing it. If a message has either a hashtag or a mention, EXTRACT will also run the AffectDetector() and pass the values along to the rest of the script. If a message does not have any hashtag value and does not have any mention value, then that message will be ignored.
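    The filtering rule described here can be written out as a short Python sketch. It is a conceptual model under stated assumptions, not DGT's implementation: run_extract and the three stand-in extractors are hypothetical, and the "affect" function is a toy placeholder rather than the real AffectDetector().

```python
import re


def run_extract(messages, primary, others):
    """Keep a message only if at least one PRIMARY extractor returns a value."""
    results = []
    for message in messages:
        primary_values = {name: fn(message) for name, fn in primary.items()}
        if not any(primary_values.values()):
            continue  # no PRIMARY value at all: the message is ignored
        features = dict(primary_values)
        for name, fn in others.items():
            features[name] = fn(message)
        results.append(features)
    return results


# Hypothetical stand-ins for fields and feature extractors.
primary = {
    "hashtag": lambda m: re.findall(r"#(\w+)", m["text"]),
    "mention": lambda m: re.findall(r"@(\w+)", m["text"]),
}
others = {"affect": lambda m: ["joy"] if "happy" in m["text"] else []}

messages = [
    {"text": "so happy #sunny"},
    {"text": "hi @bob"},
    {"text": "nothing here"},
]
kept = run_extract(messages, primary, others)
```

    Only the first two messages survive, because the third has neither a hashtag nor a mention; the non-PRIMARY extractor runs only on messages that pass the check.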

    The PRIMARY flag can be combined with the acceptfilter and rejectfilter arguments accepted by most feature extractors. If you want to only analyze social media messages by women, for example, you can use the acceptfilter argument to achieve this:

    EXTRACT PRIMARY Gender(accept:"f"), hashtag, mention;

    The Gender feature extractor understands the acceptfilter argument, and will only output feature values that match the list. The result in this case is that only messages where the author’s gender is identifiably female will be processed. (Note that the hashtag and mention fields are no longer marked as PRIMARY fields.)

    If you have a long list of values you want to accept, you can use the acceptfilterfilename argument. The syntax and behavior for the acceptfilter and acceptfilterfilename is the same as for the Phrases() feature extractor.
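    As a rough model of how an accept list interacts with PRIMARY, consider the following Python sketch. Everything in it is hypothetical (with_accept, keep_if_primary, and the toy gender extractor that reads a precomputed "gender" key); it assumes exact matching against a comma-separated list and does not reflect how DGT infers gender.

```python
def with_accept(extractor, accept):
    """Wrap an extractor so it only emits values on the accept list."""
    allowed = {value.strip() for value in accept.split(",")}

    def filtered(message):
        return [value for value in extractor(message) if value in allowed]

    return filtered


def keep_if_primary(messages, primary_extractor):
    """Drop messages for which the PRIMARY extractor emits nothing."""
    return [m for m in messages if primary_extractor(m)]


# Toy extractor: reads a gender label assumed to be already attached.
def gender(message):
    return [message["gender"]] if message.get("gender") else []


messages = [
    {"text": "first", "gender": "f"},
    {"text": "second", "gender": "m"},
    {"text": "third"},
]
kept = keep_if_primary(messages, with_accept(gender, "f"))
```

    Because the filtered extractor emits nothing for values outside the list, marking it PRIMARY has the side effect of discarding those messages entirely.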

  • You can extend the Phrases() feature extractor with a different set of arguments to detect different phrases. Here is an adaptation of our “politician detector” from our simple example, but this time modified to detect references to parents. By default, phrase detection is case-insensitive.

    EXTRACT ExactPhrases(domain:"parent",accept:"dad,mom,father,mother");

    If you have a long list of phrases you want to detect, you can put them in a file and reference them in your processor. Note that you also have to specify the datafile as a resource, so that the framework knows to include that file as part of the job.

    EXTRACT ExactPhrases(domain:"parent",acceptfile:"parentphrases.txt");

    In its simplest form, this file is a list of phrases to detect. You can also use this file to group or canonicalize detected phrases by adding a second tab-separated column that contains the canonical form. For example, if you used the following file, it would detect the nicknames for parents and map them to their canonical name. That is, whenever the phrase extractor finds “mommy” or “mom”, the extracted feature will be emitted as “mother”.

    mom mother
    mother mother
    mommy mother
    dad father
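    The canonicalization behavior can be sketched in Python as follows. This is an illustration, not the Phrases() implementation: load_phrase_map and detect_phrases are hypothetical names, and matching on whole words is an assumption made for the sketch. Case-insensitive matching follows the default described above.

```python
import re


def load_phrase_map(lines):
    """Map each lowercased phrase to its canonical form (or to itself)."""
    mapping = {}
    for line in lines:
        if not line.strip():
            continue
        parts = line.rstrip("\n").split("\t")
        phrase = parts[0].strip()
        has_canonical = len(parts) > 1 and parts[1].strip()
        mapping[phrase.lower()] = parts[1].strip() if has_canonical else phrase
    return mapping


def detect_phrases(text, mapping):
    """Return the sorted canonical forms of all phrases found in the text."""
    lowered = text.lower()
    found = set()
    for phrase, canonical in mapping.items():
        # Whole-word match (an assumption of this sketch).
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(canonical)
    return sorted(found)


phrase_lines = ["mom\tmother", "mother\tmother", "mommy\tmother", "dad\tfather"]
mapping = load_phrase_map(phrase_lines)
features = detect_phrases("My Mommy and DAD are here", mapping)
```

    A file with a single column works the same way in this sketch, since a phrase without a second column is mapped to itself.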
  • Often, you will want to perform further higher-level analyses (machine learning analyses, visualizations and/or statistical analyses) on the output of DGT. To do so, we provide utilities to convert from DGT’s native output format to TSV and GEXF files that will let you load the data in R, Excel, Gephi and other tools.

    To convert to TSV, use the dgt2tsv.exe command:

    dgt2tsv.exe input.graph [outputfields] outputfilename.tsv

    The list of outputfields may include “count”, any of the domains output by a feature extractor, a domain name followed by “.count”, or a domain name followed by a specific feature value.

    For example, the following command will output a count of the number of messages seen for each edge in a discussion graph; the gender of the author; and the weight of the “fatigue” value in the Mood domain.

    dgt2tsv.exe input.graph count,gender,Mood.fatigue output.tsv

    To output a .gexf file that can be read by Gephi for graph analyses and visualizations, use the dgt2gexf.exe command:

    dgt2gexf.exe input.graph [outputfields] outputfilename.gexf
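    When the conversion is part of a larger pipeline, the command line can be assembled and the resulting TSV read back programmatically. The Python sketch below assumes only the argument order shown above; build_convert_command and read_tsv_rows are hypothetical helpers, and nothing is assumed about the columns or about whether the output contains a header row.

```python
import csv


def build_convert_command(tool, graph_path, output_fields, output_path):
    """Build the argument list: tool, input graph, comma-joined fields, output."""
    return [tool, graph_path, ",".join(output_fields), output_path]


def read_tsv_rows(path):
    """Read a tab-separated file into a list of rows (lists of strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))


command = build_convert_command(
    "dgt2tsv.exe", "input.graph", ["count", "gender", "Mood.fatigue"], "output.tsv"
)
```

    On a machine where the converter is installed, the list can be passed to subprocess.run(command, check=True), after which read_tsv_rows("output.tsv") returns the rows for further analysis.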