Discussion Graph Tool

Established: April 25, 2014

Discussion Graph Tool Reference Guides

  • In the discussion graph tool framework, a co-occurrence analysis consists of the following key steps:

    Step 1. Read from a social media data source. (LOAD)

    Step 2. Extract low-level features from individual messages. (EXTRACT)

    Step 3 (optional). Declare the feature that defines a co-occurrence: what defines the fact that two or more features have co-occurred? By default, two features are considered to co-occur if they both occur in the same social media message. (RELATE BY)

    Steps 2 and 3 implicitly define an initial discussion graph. All feature values that were seen to co-occur in the raw social media data will be connected by hyper-edges to form a large, multi-dimensional hyper-graph.

    Step 4 (optional). Reweight the data. By default, each social media message is weighted equally. We can change this so that the data is weighted by user, location, or another feature. For example, we might want data from every user to count equally, regardless of how many social media messages each user sent. This prevents our analyses from being dominated by users who post very frequently. (WEIGHT BY)

    Step 5. Project the initial discussion graph to focus on the relationships we care about for our analysis. For this step, the task must specify the domains we care about. (PROJECT)

    Step 6. Output the results. (OUTPUT)

    Step 7 (optional). Often, we’ll want to further analyze our results with higher-level machine learning, network analysis, and visualization techniques. This step is outside the scope of DGT.

    A complete script putting these steps together is sketched below.
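    The following minimal sketch covers steps 1 through 6. The input and output file names are hypothetical, and RELATE BY is omitted, so the default per-message co-occurrence applies:

    LOAD Multiline(path:"messages.txt");
    EXTRACT PRIMARY hashtag, userid, AffectDetector();
    WEIGHT BY userid;
    PROJECT TO Mood, hashtag;
    OUTPUT TO "mood_hashtags.graph";

    Here every user counts equally (step 4), and the final graph relates moods to hashtags (step 5).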

    For more details on the core concepts behind discussion graphs, we recommend reading our ICWSM 2014 paper.

    A note on projecting weighted data

    Often, feature values are weighted. For example, the affect classifier produces a weighted feature value indicating how likely a message is to be expressing joviality, sadness, etc. (In other cases, the use of the WEIGHT BY command implicitly creates a weighted value).

    When it encounters a weighted feature value in its target domains, the PROJECT TO command treats the weights as probabilities of a feature value having occurred. For example, let’s continue our analysis of activity and location mentions such as in the following message:

    "I'm having fun hiking tiger mountain" tweeted by Alice on a Saturday at 10am

    Let’s say our mood analysis indicates that, in addition to the other discrete features, this message has joviality with a weight of 0.8 and serenity with a weight of 0.4:

    Domain    Feature         Weighted value
    Mood      Joviality       0.8
    Mood      Serenity        0.4
    Activity  hiking          1.0
    Location  tiger mountain  1.0
    Author    Alice           1.0

    The two weighted features are interpreted as independent probabilities. That is, there is an 80% likelihood of this message being jovial and a 20% likelihood of not being jovial. Independently, there is a 40% likelihood of the message being serene, and 60% chance of not being serene.

    If we project this single message to the relationship between location and mood (PROJECT TO Mood, Location;), this message will expand to the following 4 projected edges:

    Edge                                       Weight  Metadata
    Joviality and Tiger Mountain               0.48    hiking, Alice
    Serenity and Tiger Mountain                0.08    hiking, Alice
    Joviality and Serenity and Tiger Mountain  0.32    hiking, Alice
    (No mood) and Tiger Mountain               0.12    hiking, Alice
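    Concretely, each projected edge’s weight is the product of the independent probabilities of each mood value occurring or not occurring:

    weight(Joviality only)          = 0.8 × (1 − 0.4) = 0.48
    weight(Serenity only)           = (1 − 0.8) × 0.4 = 0.08
    weight(Joviality and Serenity)  = 0.8 × 0.4 = 0.32
    weight(No mood)                 = (1 − 0.8) × (1 − 0.4) = 0.12

    The four weights sum to 1.0, so the message still contributes a single observation in total.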

    Of course, when analyzing a larger corpus of social messages, each message will be expanded individually and the results aggregated.

  • The discussion graph tool’s scripting language currently supports the following commands.

    Note that square brackets [ ] indicate optional elements of a command. Italicized terms indicate user-specified arguments, variable names, and other values.

    LOAD

    Syntax: LOAD Datasource([arguments]);

    Example: LOAD MultiLine(path:"productreviews.txt");

    The LOAD command loads social media data from some datasource. The required arguments are datasource-specific. Generally, datasources require a path to the input file as well as schema information to interpret the file. See the Common things you’ll want to do section below for examples of loading TSV, Multiline record, JSON and Twitter files.

    EXTRACT

    Syntax: EXTRACT [PRIMARY] field|FeatureExtractor([arguments]),… [FROM varname];

    Example: EXTRACT PRIMARY hashtag, Gender(), AffectDetector();

    The EXTRACT command runs a series of feature extractors against the raw social media messages loaded from a data source via the LOAD command.

    Extracting a field will pass through a field from the raw data unmodified.

    Extracting a feature using a FeatureExtractor() will run the specified feature extractor against the social media message. Feature extractors may generate zero, one, or more feature values for each message they process, and the domain of the feature need not match the name of the feature extractor. For example, the AffectDetector() generates features in several domains (Subjective, Mood and PosNegAffect), and other feature extractors, such as Phrases(), can generate features in custom domains.

    The PRIMARY flag acts as a kind of filter on the raw social media data. EXTRACT must find at least one PRIMARY field or feature in a message, otherwise the message will be ignored. If no fields or features are marked as PRIMARY, then EXTRACT will not filter messages.

    FROM varname tells the EXTRACT command where to get its input data. If not specified, EXTRACT will read from the output of the previous command.
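    For instance, the following sketch (with a hypothetical input file) keeps only messages containing at least one hashtag, and extracts the author’s inferred gender alongside the hashtag and userid fields:

    var rawTweets = LOAD Multiline(path:"tweets.txt");
    var tweetFeatures = EXTRACT PRIMARY hashtag, userid, Gender() FROM rawTweets;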

    WEIGHT BY

    Syntax: WEIGHT BY featureDomain[, …] [FROM varname];

    Example: WEIGHT BY userid;

    The WEIGHT BY command reweights the data from social media messages. By default, every social media message counts as a single observation.  If we see a co-occurrence relationship occurring in 2 social media messages, then the co-occurrence relationship will have a weight of 2.  We can change this using the WEIGHT BY command so that every unique user (or location or other feature value) counts as a single observation.  So, for example, if a co-occurrence relationship is expressed by 2 unique users, then it will have a weight of 2.  Conversely, if a single user expresses 2 distinct co-occurrence relationships, each relationship will have a weight of only 0.5.

    Note that we can WEIGHT BY one feature but RELATE BY another feature.

    RELATE BY

    Syntax: RELATE BY featureDomain [FROM varname];

    Example: RELATE BY userid;

    The RELATE BY command declares the domain that defines a co-occurrence relationship. All features that co-occur with the same feature value in this domain are considered to have co-occurred.

    FROM varname tells the RELATE BY command where to get its input data. If not specified, RELATE BY will read from the output of the previous command.

    Note that we can WEIGHT BY one feature but RELATE BY another feature.
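    For instance, the following sketch groups features by the location they co-occur with, while counting each user (rather than each message) as a single observation. It assumes location and userid features have already been extracted:

    RELATE BY location;
    WEIGHT BY userid;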

    PROJECT

    Syntax: PROJECT TO featureDomain[, …] [FROM varname];

    Variant: PLANAR PROJECT TO featureDomain[, …] [FROM varname];

    Variant: PLANAR BIPARTITE PROJECT TO featureDomain[, …] [FROM varname];

    Example: PROJECT TO hashtag;

    The PROJECT TO command will project an initial hyper-graph to focus only on relationships among the specified feature domains. That is, only edges which connect one or more nodes in the specified domains will be kept, and any nodes in other feature domains will be removed from the structure of the graph. By default, the PROJECT TO command generates a hyper-graph. This means that nodes that do not co-occur with other nodes will still be described by a degenerate 1-edge. Also, if many nodes simultaneously co-occur together, their relationship will be described by a k-edge, where k is the number of co-occurring nodes.

    Often, especially for ease of visualization, it is useful to restrict the discussion graph to be a planar graph (where every edge in the graph connects exactly 2 nodes). The PLANAR PROJECT TO command achieves this. All hyper-edges will be decomposed and re-aggregated into their corresponding 2-edges.

    Furthermore, it can be useful to restrict the graph to be bipartite, where only edges that cross domains are kept. For example, we may only care about the relationship between users and the hashtags they use, and not about the relationships among hashtags themselves. The PLANAR BIPARTITE PROJECT TO command achieves this. Semantically, this is the equivalent of doing a planar projection and then dropping all edges that connect nodes in the same domain.
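    For the user-hashtag example just described, the projection might be written as:

    PLANAR BIPARTITE PROJECT TO userid, hashtag;

    Only user-hashtag edges survive; hashtag-hashtag and user-user edges are dropped.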

    MERGE

    Syntax: MERGE varname1,varname2[,…];

    Example: MERGE MentionAndUserGraph,HashTagAndUserGraph;

    The MERGE command overlays two discussion graphs atop each other. Nodes with the same feature domain and values will be merged.
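    For example, the two graphs named above could be built from a common set of extracted features and then overlaid. (The mention domain and the features variable here are hypothetical.)

    var MentionAndUserGraph = PLANAR PROJECT TO mention, userid FROM features;
    var HashTagAndUserGraph = PLANAR PROJECT TO hashtag, userid FROM features;
    MERGE MentionAndUserGraph, HashTagAndUserGraph;

    Any userid nodes shared by the two graphs will be merged into single nodes.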

    OUTPUT

    Syntax: OUTPUT TO "filename.graph" [FROM varname];

    Example: OUTPUT TO "mentions.graph";

    The OUTPUT TO command saves a discussion graph to the specified file.

    Files are saved in DGT’s native format. This format consists of 3 tab-separated columns. The first column is the edge identifier: the comma-separated list of nodes connected by this edge. The second column is the number of times this co-occurrence relationship was observed. The third column is a JSON-formatted representation of the context of the relationship or, in other words, the distribution of feature values conditioned on the co-occurrence relationship.
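    As a rough illustration (the exact JSON layout of the context column is a guess, not the documented schema), the Joviality edge from the earlier weighted example might be stored as a record like this, with tabs separating the three columns:

    Joviality,Tiger Mountain    0.48    {"Activity": {"hiking": 1.0}, "Author": {"Alice": 1.0}}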

    Naming variables

    We can assign the result of commands to variables, and use these variables in later commands:

    Syntax:

    var x = COMMAND1;
    COMMAND2 FROM x;

    Example:

    var reviewData = LOAD Multiline(path:"finefoods.tar.gz");
    var reviewFeatures = EXTRACT AffectDetector(),reviewscore FROM reviewData;
  • Here’s a current list of feature extractors included in the discussion graph tool release.

    AffectDetector()
    Infers mood from text.
    Arguments:
      field: input field to analyze (default='text')
    Output domains:
      Mood: weights for 7 moods (joviality, sadness, guilt, fatigue, hostility, serenity, fear)
      PosNeg: aggregation of positive/negative affects

    Gender()
    Infers gender from user names.
    Arguments:
      field: input field to analyze (default='username')
      discrete: whether to output discrete or weighted gender values (default='true')
    Output domains:
      gender: m=male, f=female, u=unknown

    GeoPoint()
    Extracts explicit lat-lon coordinates.
    Arguments:
      field: input field to analyze (default='geopoint')
      rounding: number of decimal places to include
    Output domains:
      geopoint: lat-lon value

    GeoShapeMapping()
    Maps lat-lon points to feature values via a user-specified GeoJSON-formatted shapefile.
    Arguments:
      field: input field to analyze (default='geopoint'). This field should contain both lat and lon coordinates, separated by a space or comma.
      latfield: input field containing the latitude value.
      lonfield: input field containing the longitude value.
      shapefile: GeoJSON-formatted shapefile
      propertynames: comma-separated list of property:domain pairs. The property names a property within the shapefile, and the domain specifies a custom domain name for that property. If a lat-lon point falls within a shape specified in the shapefile, the feature extractor will output all the specified properties in the propertynames list.
      unknownvalue: value to assign to a lat-lon point outside of the given shapes
    Note: Specify either the field argument or both the latfield and lonfield arguments.
    Output domains:
      [custom domain names, as specified in propertynames]

    Country()
    An instance of GeoShapeMapping that maps lat-lon to country/region two-letter codes and country/region names.
    Arguments:
      field: input field to analyze (default='geopoint'). This field should contain both lat and lon coordinates, separated by a space or comma.
      latfield: input field containing the latitude value.
      lonfield: input field containing the longitude value.
      unknownvalue: value to assign to a lat-lon point outside of countries/regions
    Note: Specify either the field argument or both the latfield and lonfield arguments.
    Output domains:
      fips_country, country

    USAState()
    An instance of GeoShapeMapping that maps lat-lon to USA subregions and states.
    Arguments:
      field: input field to analyze (default='geopoint'). This field should contain both lat and lon coordinates, separated by a space or comma.
      latfield: input field containing the latitude value.
      lonfield: input field containing the longitude value.
      unknownvalue: value to assign to a lat-lon point outside of US states
    Note: Specify either the field argument or both the latfield and lonfield arguments.
    Output domains:
      USA_subregion, USA_state, USA_fips

    CountyFIPS()
    An instance of GeoShapeMapping that maps lat-lon to US county names and FIPS codes.
    Arguments:
      field: input field to analyze (default='geopoint'). This field should contain both lat and lon coordinates, separated by a space or comma.
      latfield: input field containing the latitude value.
      lonfield: input field containing the longitude value.
      unknownvalue: value to assign to a lat-lon point outside of US counties
    Note: Specify either the field argument or both the latfield and lonfield arguments.
    Output domains:
      countygeoid, countyname

    Time()
    Extracts various temporal features.
    Arguments:
      field: input field to analyze (default='creationdate')
      options: list of time features to extract: absoluteminute, absolutehour, absoluteday, absoluteweek, monthofyear, dayofweek, hourofday (default is to output all)
      format: 'unix' or 'ticks' (default='unix')
    Output domains:
      absoluteminute, absolutehour, absoluteday, absoluteweek, monthofyear, dayofweek, hourofday

    ProfileLocation()
    Maps geographic regions from user profile locations with a user-specified mapping file.
    Arguments:
      field: input field to analyze (default='userlocation')
      domain: set custom output domain
      mappingfile: model for mapping from user location names to geographic locations. DGT comes with a mapping file for major international metropolitan areas and United States country regions and divisions.
      unknownvalue: value to assign to unrecognized profile locations
    Output domains:
      [custom domain name]

    ProfileLocationToCountry()
    Maps user profile locations to 2-letter country/region FIPS codes.
    Arguments:
      field: input field to analyze (default='userlocation')
      unknownvalue: value to assign to unrecognized profile locations
    Output domains:
      country

    ProfileLocationToCountryName()
    Maps user profile locations to country/region names.
    Arguments:
      field: input field to analyze (default='userlocation')
      unknownvalue: value to assign to unrecognized profile locations
    Output domains:
      countryname

    ProfileLocationToUSASubregion()
    Maps user profile locations to subregions of the USA (e.g., Pacific, Mid-Atlantic).
    Arguments:
      field: input field to analyze (default='userlocation')
      unknownvalue: value to assign to unrecognized profile locations
    Output domains:
      usa_subregion

    ProfileLocationToUSAState()
    Maps user profile locations to US states.
    Arguments:
      field: input field to analyze (default='userlocation')
      unknownvalue: value to assign to unrecognized profile locations
    Output domains:
      usa_state

    ProfileLocationToUSACounty()
    Maps user profile locations to US county FIPS codes.
    Arguments:
      field: input field to analyze (default='userlocation')
      unknownvalue: value to assign to unrecognized profile locations
    Output domains:
      usa_county

    ProfileLocationToUSACountyName()
    Maps user profile locations to US county names.
    Arguments:
      field: input field to analyze (default='userlocation')
      unknownvalue: value to assign to unrecognized profile locations
    Output domains:
      usa_countyname

    ProfileLocationToMetroArea()
    Maps user profile locations to major metropolitan areas.
    Arguments:
      field: input field to analyze (default='userlocation')
      unknownvalue: value to assign to unrecognized profile locations
    Output domains:
      metroarea

    ExactPhrases()
    Matches specific phrases in a given list or mapping file.
    Arguments:
      field: input field to analyze (default='text')
      domain: set custom output domain
      accept: a comma-separated list of phrases to match
      acceptfile: a text file listing phrases. Use a tab-separated two-column file to specify canonical forms for matched phrases.
    Output domains:
      [custom domain name]

    Regex()
    Matches regular expressions.
    Arguments:
      field: input field to analyze
      domain: set custom output domain
      regex: the regular expression to match against the text
    Output domains:
      [custom domain name]

    Tokens()
    Extracts unigram tokens.
    Arguments:
      field: input field to analyze
      domain: set custom output domain
      stopwordsfile: file of tokens to ignore (default=none)
      porter: use Porter stemming (default='false')
    Output domains:
      [custom domain name]
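    As a closing sketch, several of these extractors can be combined in a single pipeline. The file names and the custom activity domain below are hypothetical:

    LOAD Multiline(path:"messages.txt");
    EXTRACT ExactPhrases(field:"text", domain:"activity", acceptfile:"activities.txt"), ProfileLocationToUSAState();
    PROJECT TO activity, usa_state;
    OUTPUT TO "activity_by_state.graph";

    The resulting graph relates each matched activity phrase to the US states inferred from the profile locations of the users who mention it.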