Lexical Tools

  • Lvg (Lexical Variants Generation)
  • Java


Introduction

Lexical Variant Generation (lvg) is a suite of utilities that can generate, transform, and filter lexical variants from the given input. Lvg is intended to be used to create robust indexes and to transform user queries into retrievable entries from those indexes.

Since 2002, lvg has been developed and released in pure Java.

In the 2004 release, lvg used UTF-8 as the default format of input and output.

In the 2008 release, lvg contained 62 flow components and 37 command options.

In the 2013 release, lvg enhanced derivations with 2 more options and increased the number of command options to 39.

The design features of lvg are described below.

Setup

Follow the installation instructions to install the Lexical Tools and run the lvg program. Check the following items only if you did not use the provided script to install the Lexical Tools.

  • CLASSPATH:
    1. Include the Lexical Tools distribution jar file, ${LVG_DIR}/lib/lvg${YEAR}dist.jar, in your CLASSPATH.
    2. Include the lvg top directory, ${LVG_DIR}, in your CLASSPATH.

  • Database: use the default DB (HSqlDb) or your own DB (which requires reloading the tables).

  • Configuration file: assign the full path of the lvg${YEAR} top directory to the variable LVG_DIR in the configuration file, ${LVG_DIR}/data/config/lvg.properties.
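The configuration item above can be sanity-checked with a few lines of Java. This is only a sketch: it assumes that lvg.properties uses standard Java properties syntax (key=value), and the class name LvgConfigCheck and the sample path are hypothetical. It reads the properties from a string so that it runs without an lvg installation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of the lvg distribution.
class LvgConfigCheck {
    /** Returns the trimmed LVG_DIR value from properties text, or null if it is missing or blank. */
    static String readLvgDir(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String dir = props.getProperty("LVG_DIR");
        return (dir == null || dir.trim().isEmpty()) ? null : dir.trim();
    }

    /** True if the value is an absolute (full) path on the current platform. */
    static boolean isFullPath(String dir) {
        return dir != null && Paths.get(dir).isAbsolute();
    }

    public static void main(String[] args) {
        // Sample content; the path is an assumption for illustration only.
        String sample = "LVG_DIR=/hypothetical/path/lvg2013/\n";
        String dir = readLvgDir(sample);
        System.out.println("LVG_DIR = " + dir + ", full path: " + isFullPath(dir));
    }
}
```

To check a real installation, the same Properties.load call can be given a FileReader opened on the configuration file instead of the StringReader.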

Test Run

  • Run the Java program

    Enter the command:

    
    shell> lvg -f:n -f:i -p
    Please input a term (type "Ctl-d" to quit) >
    sleep
    sleep|sleep|2047|16777215|n|1|
    sleep|sleep|128|1|i|2|
    sleep|sleep|128|512|i|2|
    sleep|sleep|1024|1|i|2|
    sleep|sleep|1024|262144|i|2|
    sleep|sleep|1024|1024|i|2|
    sleep|slept|1024|32|i|2|
    sleep|slept|1024|64|i|2|
    sleep|sleeps|1024|128|i|2|
    sleep|sleeping|1024|16|i|2|
    

    where:

    • lvg: the lvg script that runs the Java class.
    • -f:n: set a flow component to no operation (try the -f:h option!).
    • -f:i: set a flow component to get inflectional variants.
    • -p: set the lvg system option to show the prompt (try the -h option!).

Output Format

Lvg copies its input from standard input to standard output and appends 6 or more fields. In general the output consists of:

Field       Content
Field 1     Input
Field 2     Output Term
Field 3     Categories
Field 4     Inflections
Field 5     Flow History
Field 6     Flow Number
Field 7+    Additional Information

Field 1: Input Line
The input may have one or more fields.

Field 2: Output Term
The output term field contains the transformed term. Since the input may be fielded, this output term will be a transformation of only one of the input fields. The default field for transformation is the first field. This behavior may be changed with the -t:INT input filter option.

Field 3: Category
The category field contains the decimal representation of a bit vector representing all the possible categories that this output term may have. The bit vector is a compact way of representing multiple categories with one number. This data format is intended to be used by a program or parser. The -SC filter shows the category information in human-readable form.

Field 4: Inflection
The inflection field is the decimal representation of a bit vector representing all the possible inflection types the output term may have. As with the category field, this compact format is intended to be used by a program or parser. The -SI filter shows the inflection information in human-readable form.
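A program that consumes these two fields has to split each decimal value back into its individual bits. The sketch below shows one way to do that in Java. The class name LvgBitVector is hypothetical, and the category name table is an assumption: only adj = 1, noun = 128, and verb = 1024 are corroborated by the sample output on this page, so verify the remaining names against the lvg category documentation before relying on them. Inflection bits are returned as positions only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the lvg distribution.
class LvgBitVector {
    // ASSUMED category names, ordered from bit 0 (value 1) to bit 10 (value 1024).
    static final String[] CATEGORY_NAMES = {
        "adj", "adv", "aux", "compl", "conj", "det",
        "modal", "noun", "prep", "pron", "verb"
    };

    /** Returns the positions of the set bits (0 = least significant bit). */
    static List<Integer> setBits(long value) {
        List<Integer> bits = new ArrayList<>();
        for (int i = 0; i < Long.SIZE - 1; i++) {
            if ((value & (1L << i)) != 0) {
                bits.add(i);
            }
        }
        return bits;
    }

    /** Maps each set bit to its assumed category name, or "bitN" if it is outside the table. */
    static List<String> categoryNames(long value) {
        List<String> names = new ArrayList<>();
        for (int bit : setBits(value)) {
            names.add(bit < CATEGORY_NAMES.length ? CATEGORY_NAMES[bit] : "bit" + bit);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(categoryNames(128));        // prints [noun]
        System.out.println(categoryNames(1024));       // prints [verb]
        System.out.println(setBits(16777215).size());  // prints 24
    }
}
```

Under these assumptions, the category value 2047 from the test run decodes to all eleven names, and the inflection value 16777215 has its lowest 24 bits set.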

Field 5: Flow History
The flow history lists the mnemonics of the flow components that were applied to produce the output. In general, these mnemonics mirror the flow options specified on the command line.

Field 6: Flow Number
The flow number field contains a number indicating which flow produced the output. Flows are composed of command line options starting with -f:. Lvg can transform terms in parallel flows. For instance, one may want to generate both synonyms and derivations for any given input. One would do this via two parallel flows, -f:y -f:d. This differs from -f:y:d, which would produce the derivations of the synonyms for any given input. In the above example, the synonyms generated would be produced by the first flow and the derivations would be generated in the second flow. The flow number field would indicate this.
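When several parallel flows are run, a consumer can use the flow number to separate their results. The following Java sketch groups output terms by flow number, using lines from the test run (lvg -f:n -f:i) as sample data. It assumes a single-field input, so that the flow number is the sixth pipe-delimited field; the class name LvgFlowGrouping is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the lvg distribution.
class LvgFlowGrouping {
    /** Groups output terms (field 2) by flow number (field 6), assuming a single-field input. */
    static Map<Integer, List<String>> termsByFlow(List<String> lines) {
        Map<Integer, List<String>> byFlow = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split("\\|", -1);
            if (f.length < 6) {
                continue; // skip lines that are not complete lvg records
            }
            int flow = Integer.parseInt(f[5]);
            byFlow.computeIfAbsent(flow, k -> new ArrayList<>()).add(f[1]);
        }
        return byFlow;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "sleep|sleep|2047|16777215|n|1|",
            "sleep|slept|1024|32|i|2|",
            "sleep|sleeps|1024|128|i|2|",
            "sleep|sleeping|1024|16|i|2|");
        // prints {1=[sleep], 2=[slept, sleeps, sleeping]}
        System.out.println(termsByFlow(lines));
    }
}
```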

Field 7+: Additional Information
The additional information field(s) contain information that is specific to the flow option applied. The contents of these fields are, in general, governed by the -m global option.
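The field layout described above can be turned into a small record parser. The Java sketch below is an illustration under stated assumptions, not an lvg API: the class LvgRecord and its parse method are hypothetical, the caller must supply the number of fields in the input portion (1 by default), and a trailing "|" is treated as a terminator rather than as an empty additional field.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value class, not part of the lvg distribution.
class LvgRecord {
    final String input;            // field 1 (may itself contain several fields)
    final String outputTerm;       // field 2
    final long category;           // field 3, bit vector as decimal
    final long inflection;         // field 4, bit vector as decimal
    final String flowHistory;      // field 5
    final int flowNumber;          // field 6
    final List<String> additional; // field 7+

    private LvgRecord(String input, String outputTerm, long category, long inflection,
                      String flowHistory, int flowNumber, List<String> additional) {
        this.input = input;
        this.outputTerm = outputTerm;
        this.category = category;
        this.inflection = inflection;
        this.flowHistory = flowHistory;
        this.flowNumber = flowNumber;
        this.additional = additional;
    }

    /** Parses one output line; inputFieldCount is the number of fields in the input portion. */
    static LvgRecord parse(String line, int inputFieldCount) {
        String[] f = line.split("\\|", -1);
        int n = inputFieldCount;
        if (n < 1 || f.length < n + 5) {
            throw new IllegalArgumentException("Too few fields: " + line);
        }
        int end = f.length;
        // Drop the empty element that split() produces for a trailing '|'.
        if (end > n + 5 && f[end - 1].isEmpty()) {
            end--;
        }
        return new LvgRecord(
            String.join("|", Arrays.copyOfRange(f, 0, n)),
            f[n],
            Long.parseLong(f[n + 1]),
            Long.parseLong(f[n + 2]),
            f[n + 3],
            Integer.parseInt(f[n + 4]),
            Arrays.asList(Arrays.copyOfRange(f, n + 5, end)));
    }

    public static void main(String[] args) {
        LvgRecord r = parse("sleep|sleep|1024|16|i|2|", 1);
        System.out.println(r.outputTerm + " " + r.category + " " + r.inflection
            + " " + r.flowHistory + " " + r.flowNumber + " " + r.additional);
    }
}
```

Keeping the category and inflection values as long leaves room for bit vectors wider than 31 bits; the input field count is passed in explicitly because the output line alone does not say where the input portion ends.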

The following example shows the command, input, and outputs for lvg:

  • Command:
    shell> lvg -t:2 -f:y -ti -m

  • Input:
    C0037313|sleep

  • Output:

    
    sleep|hypnic|1|1|y|1|FACT|sleep|sleep|noun|hypnic|adj|NLP_LVG|
    sleep|sleep|128|1|y|1|FACT|sleep|sleep|verb|sleep|noun|C0037313|
    sleep|sleep|1024|1|y|1|FACT|sleep|sleep|noun|sleep|verb|C0037313|
    

    Field Num  Field 1  Field 2  Field 3   Field 4     Field 5       Field 6      Field 7+
    Field      Input    Output   Category  Inflection  Flow History  Flow Number  Additional Information
    Result-1   sleep    hypnic   1         1           y             1            FACT|sleep|sleep|noun|hypnic|adj|NLP_LVG|
    Result-2   sleep    sleep    128       1           y             1            FACT|sleep|sleep|verb|sleep|noun|C0037313|
    Result-3   sleep    sleep    1024      1           y             1            FACT|sleep|sleep|noun|sleep|verb|C0037313|

Flow Components

Please refer to the design document.

System Options

Please refer to the design document.