PROSE – Text Splitting

The Split.Text APIs are accessed through the SplitSession class. The user can create a new SplitSession object, add input data and various constraints to the session, and then call the Learn() method to obtain a SplitProgram. This is the program that is learnt from the given input data and constraints. The SplitProgram’s key method is the Run() method which executes the program to perform a split on any given text input.

To use Split.Text, reference the following assemblies: Microsoft.ProgramSynthesis.Split.Text.dll, Microsoft.ProgramSynthesis.Split.Text.Semantics.dll, Microsoft.ProgramSynthesis.Split.Text.Learning.dll, Microsoft.ProgramSynthesis.Extraction.Text.Semantics.dll and Microsoft.ProgramSynthesis.Extraction.Text.Learning.dll.

The complete code for the scenarios described in this walk-through is available in the Sample Project, which illustrates our API usage.

Initializing the session

The user can create a new Split session and add the input data as follows:

// create a new ProseSplit session
var splitSession = new SplitSession();

// add the input rows to the session
// each input is a StringRegion object containing the text to be split
var inputs = new List<StringRegion> {
       SplitSession.CreateStringRegion("PE5 Leonard Robledo (Australia)"),
       SplitSession.CreateStringRegion("U109 Adam Jay Lucas (New Zealand)"),
       SplitSession.CreateStringRegion("R342 Carrie Dodson (United States)")
};
splitSession.Inputs.Add(inputs);

Each row of text in the input data is added as a StringRegion object created from the text content of that row. Optionally, we can also add constraints to the session to specify basic properties of the desired splitting, such as whether the delimiters should be included in the resulting split. If we do not want delimiters in the output, we can say so with a constraint as follows:

splitSession.Constraints.Add(new IncludeDelimitersInOutput(false));

We can clear any constraints provided in the session at any time by calling the splitSession.RemoveAllConstraints() method.

Learning a new split program

Split.Text can learn a program using only the provided input data in a purely predictive fashion, without any examples or other output constraints. This can be done by simply calling the Learn() function after adding the inputs.

// call the learn function to learn a splitting program from the given input examples
SplitProgram program = splitSession.Learn();

// check if the program is null (no program could be learnt from the given inputs)
if (program == null)
{
    Console.WriteLine("No program learned.");
    return;
}

Serializing/Deserializing a program

The SplitProgram.Serialize() method serializes the learned program to a string. The SplitProgramLoader.Instance.Load() method deserializes the program text to a program.

// serialize the learnt program and then deserialize
string progText = program.Serialize();
program = SplitProgramLoader.Instance.Load(progText);
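
Because the serialized form is a plain string, it can be stored and reloaded later. The following is a minimal sketch that continues from the snippets above and uses only Serialize() and SplitProgramLoader.Instance.Load() from this walk-through plus the standard System.IO file helpers. The file name is a hypothetical choice made for this example.

```csharp
// Requires: using System.IO; (plus the references listed at the top of this walk-through)

// Hypothetical file name, chosen only for this illustration.
string path = "split-program.txt";

// Persist the learnt program as text.
File.WriteAllText(path, program.Serialize());

// Later (possibly in a different run of the application), read the text back
// and turn it into an executable program again.
SplitProgram loadedProgram = SplitProgramLoader.Instance.Load(File.ReadAllText(path));
```

Storing the program this way means learning has to happen only once; later runs can load the saved text instead of calling Learn() again.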

Executing the learnt program

The learnt split program can be executed on any input StringRegion to produce an array of SplitCells. For example, we can execute the learnt program on each of the inputs as follows:

SplitCell[][] splitResult =
    inputs.Select(input => program.Run(input)).ToArray();

Each SplitCell object represents information about a single split cell. Its CellValue field is the sub-region of the input that this split cell represents, and the IsDelimiter flag indicates whether this split cell is a field or a delimiter value. The learnt program can be executed independently of the session object on any new input text, not just on the inputs that have been entered into the session.
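
To inspect a result, the cells of each row can be printed together with their delimiter flag. This sketch continues from the snippets above (it reuses splitResult) and relies only on the CellValue and IsDelimiter members described here. The null check on CellValue is a defensive assumption, not documented behaviour.

```csharp
// Requires: using System; using System.Linq;

foreach (SplitCell[] row in splitResult)
{
    // Render each cell as its text, marking delimiter cells with a trailing "*".
    // CellValue is a StringRegion, so .Value yields the underlying string;
    // the ?. and ?? guards are defensive (assumption: CellValue may be null).
    var rendered = row.Select(cell =>
        (cell.CellValue?.Value ?? "") + (cell.IsDelimiter ? "*" : ""));

    // Separate cells with " | " so that cell boundaries are visible.
    Console.WriteLine(string.Join(" | ", rendered));
}
```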

Executing the predictively learnt program on the three inputs given above, with the constraint that delimiters are excluded from the output, we get the following splitting:

PE5  | Leonard Robledo | Australia
U109 | Adam Jay Lucas  | New Zealand
R342 | Carrie Dodson   | United States
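
As noted above, the learnt program is not tied to the inputs held by the session. A minimal sketch of running it on a row that was never added to the session is shown below. The sample text is made up for this illustration, and no particular output is claimed for it.

```csharp
// Requires: using System;

// A row that was never added to splitSession.Inputs (made-up sample text).
StringRegion newInput = SplitSession.CreateStringRegion("K77 Maria Silva (Brazil)");

// Run the already-learnt program directly; the session is not involved here.
SplitCell[] newCells = program.Run(newInput);

Console.WriteLine($"The new row was split into {newCells.Length} cells.");
```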

Providing example constraints

If the user wants a different split, they can provide example constraints that specify the kind of split they would like. For instance, if the user wants the first name to go into a split cell of its own, they can provide examples on some of the input rows as follows:

splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 0, "PE5"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 1, "Leonard"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 2, "Robledo"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 3, "Australia"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 0, "U109"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 1, "Adam"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 2, "Jay Lucas"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 3, "New Zealand"));

Each NthExampleConstraint takes three parameters: the input text on which the program will execute (the entire string), the index of the output split cell for which the example is given, and the text value desired in that split cell. The example constraints given above describe each of the four split cells desired for the first two inputs in this session. After calling Learn() with these constraints, we obtain a program that produces the following splitting on the three inputs in this session:

PE5  | Leonard | Robledo   | Australia
U109 | Adam    | Jay Lucas | New Zealand
R342 | Carrie  | Dodson    | United States
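
The step of learning again after adding the example constraints can be sketched as follows. It continues from the snippets above and uses only the Learn() and Run() calls already shown. The variable names are chosen for this illustration.

```csharp
// Requires: using System; using System.Linq;

// Learn again; the session now holds the example constraints added above.
SplitProgram programWithExamples = splitSession.Learn();

if (programWithExamples == null)
{
    // No program consistent with the inputs and constraints could be learnt.
    Console.WriteLine("No program learned.");
}
else
{
    // Run the new program on all three inputs, including the third row,
    // for which no examples were given.
    SplitCell[][] resultWithExamples =
        inputs.Select(input => programWithExamples.Run(input)).ToArray();

    Console.WriteLine($"Split {resultWithExamples.Length} rows.");
}
```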