org.annolab.tt4j
Class TreeTaggerWrapper<O>

java.lang.Object
  extended by org.annolab.tt4j.TreeTaggerWrapper<O>
Type Parameters:
O - the token type.

public class TreeTaggerWrapper<O>
extends Object

Main TreeTagger wrapper class. One TreeTagger process is created and maintained for each instance of this class. The process is terminated and restarted automatically if the model is changed (setModel(String)). Otherwise, once started, the process keeps running in the background, which saves a lot of time. While idle, the process consumes some memory but no CPU.

During analysis, two threads are used to communicate with the TreeTagger process. One thread writes tokens to the process, while the other receives the analyzed tokens.
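
The writer/reader split described above can be illustrated without TreeTagger itself. The following is only a sketch: the class TwoThreadSketch and its method roundTrip are hypothetical names, and a JDK pipe stands in for the external process, so the lines come back unchanged. It is not TT4J's actual implementation.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of TT4J.
final class TwoThreadSketch {

    // Writes the tokens on a separate thread, one per line, and reads the
    // lines back on the calling thread. The pipe stands in for the external
    // process, so every line is returned exactly as it was written.
    static List<String> roundTrip(List<String> tokens) {
        List<String> received = new ArrayList<>();
        try {
            PipedOutputStream sink = new PipedOutputStream();
            PipedInputStream source = new PipedInputStream(sink);

            Thread writer = new Thread(() -> {
                // Closing the writer signals end of input to the reader.
                try (BufferedWriter out = new BufferedWriter(
                        new OutputStreamWriter(sink, StandardCharsets.UTF_8))) {
                    for (String token : tokens) {
                        out.write(token);
                        out.newLine();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            writer.start();

            // The calling thread plays the role of the receiving thread.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(source, StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    received.add(line);
                }
            }
            writer.join();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writer", e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("This", "is", "a", "test", ".")));
    }
}
```

Writing and reading on separate threads means neither side has to wait for the other to finish before it can make progress, which is the usual reason for this split when talking to a child process over pipes.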

For easy integration into applications, this class accepts any object containing token information and extracts the actual token using either the object's Object.toString() method or a TokenAdapter set via setAdapter(TokenAdapter). To receive the analyzed tokens, set a custom TokenHandler using setHandler(TokenHandler).

By default, the TreeTagger executable is searched for in the directories indicated by the system property treetagger.home and the environment variables TREETAGGER_HOME and TAGDIR, in this order. The setModel(String) method expects the full path to a model file, optionally followed by a : and the model encoding.
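
The lookup order described above can be sketched with plain JDK calls. The class TreeTaggerHomeSketch and the method candidateHomes are hypothetical and only illustrate the ordering; skipping unset or empty values is an assumption, and how TT4J locates the executable inside each directory is not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical demo class, not part of TT4J.
final class TreeTaggerHomeSketch {

    // Returns the configured locations in the documented order:
    // system property treetagger.home, then TREETAGGER_HOME, then TAGDIR.
    // Unset or empty values are skipped (an assumption of this sketch).
    static List<String> candidateHomes(Properties systemProperties,
                                       Map<String, String> environment) {
        String[] inOrder = {
            systemProperties.getProperty("treetagger.home"),
            environment.get("TREETAGGER_HOME"),
            environment.get("TAGDIR")
        };
        List<String> homes = new ArrayList<>();
        for (String home : inOrder) {
            if (home != null && !home.isEmpty()) {
                homes.add(home);
            }
        }
        return homes;
    }

    public static void main(String[] args) {
        // Prints whatever is configured for the current JVM and environment.
        System.out.println(candidateHomes(System.getProperties(), System.getenv()));
    }
}
```

Passing the properties and environment in as parameters keeps the ordering logic testable without changing the real process environment.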

For additional flexibility, register a custom ExecutableResolver using setExecutableProvider(ExecutableResolver) or a custom ModelResolver using setModelProvider(ModelResolver). Custom providers may extract models and executables from archives or download them from some location and temporarily or permanently install them in the file system. A custom model resolver may also be used to resolve a language code (e.g. en) to a particular model.
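
The idea of resolving a language code to a model can be sketched as a simple lookup table. This is a minimal illustration only: LanguageModelTable is a hypothetical helper, it does not implement the ModelResolver interface (whose methods are not listed on this page), and the registered path is just the one from the usage example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of TT4J and not a ModelResolver.
final class LanguageModelTable {

    private final Map<String, String> specsByLanguage = new HashMap<>();

    // Associates a language code such as "en" with a model specification
    // of the form accepted by setModel(String).
    void register(String languageCode, String modelSpec) {
        specsByLanguage.put(languageCode, modelSpec);
    }

    // Returns the registered specification for a language code. Names that
    // are not registered are returned unchanged, so full paths still work.
    String resolve(String name) {
        return specsByLanguage.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        LanguageModelTable table = new LanguageModelTable();
        table.register("en", "/treetagger/models/english.par:iso8859-1");
        System.out.println(table.resolve("en"));
    }
}
```

The resolved string could then be passed to setModel(String); a real ModelResolver would perform this mapping inside the wrapper instead.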

A simple illustration of how to use this class:

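 // Requires java.util.Arrays in addition to the org.annolab.tt4j classes.
 TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<String>();
 try {
     // Model path, followed by ':' and the model encoding.
     tt.setModel("/treetagger/models/english.par:iso8859-1");
     tt.setHandler(new TokenHandler<String>() {
         // Interface methods are public, so the implementation must be too.
         public void token(String token, String pos, String lemma) {
             System.out.println(token + "\t" + pos + "\t" + lemma);
         }
     });
     tt.process(Arrays.asList("This", "is", "a", "test", "."));
 }
 finally {
     // Always stop the TreeTagger process, even if tagging failed.
     tt.destroy();
 }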
 

Author:
Richard Eckart de Castilho

Field Summary
static boolean TRACE
           
 
Constructor Summary
TreeTaggerWrapper()
           
 
Method Summary
 void destroy()
          Stop the TreeTagger process and clean up the model and executable.
protected  void finalize()
           
 String[] getArguments()
           
 Model getModel()
          Get the currently set model.
 PlatformDetector getPlatformDetector()
          Get platform information.
 int getRestartCount()
          Get the number of times a TreeTagger process was started.
 String getStatus()
           
 void process(Collection<O> aTokenList)
          Process the given list of token objects.
protected  Collection<O> removeProblematicTokens(Collection<O> aTokenList)
           
 void setAdapter(TokenAdapter<O> aAdapter)
          Set a TokenAdapter used to extract the token string from the token objects passed to process(Collection).
 void setArguments(String[] aArgs)
          Set the arguments that are passed to the TreeTagger executable.
 void setEpsilon(Double aEpsilon)
          Set the minimal tag frequency to epsilon.
 void setExecutableProvider(ExecutableResolver aExeProvider)
          Set a custom executable resolver.
 void setHandler(TokenHandler<O> aHandler)
          Set a TokenHandler to receive the analyzed tokens.
 void setHyphenHeuristics(boolean hyphenHeuristics)
          Turn on the heuristics for guessing the parts of speech of unknown hyphenated words.
 void setModel(String modelName)
          Load the model with the given name.
 void setModelProvider(ModelResolver aModelProvider)
          Set a custom model resolver.
 void setPerformanceMode(boolean performanceMode)
          Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed).
 void setPlatformDetector(PlatformDetector aPlatform)
          Set platform information.
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TRACE

public static boolean TRACE
Constructor Detail

TreeTaggerWrapper

public TreeTaggerWrapper()
Method Detail

setPerformanceMode

public void setPerformanceMode(boolean performanceMode)
Disable some sanity checks, e.g. whether tokens contain line breaks (which is not allowed). Turning this on will increase your performance, but the wrapper may throw exceptions if illegal data is provided.

Parameters:
performanceMode - true to disable the sanity checks.
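
The kind of check that performance mode skips can be sketched as follows. TokenSanityCheck is a hypothetical helper, and line breaks are the only rule shown because they are the only one named above; which other checks TT4J performs, and which exception it throws, is not stated here.

```java
// Hypothetical demo class, not part of TT4J.
final class TokenSanityCheck {

    // A token is treated as legal when it contains no line break.
    static boolean isLegal(String token) {
        return token.indexOf('\n') < 0 && token.indexOf('\r') < 0;
    }

    // Rejects illegal tokens before they would be sent to the tagger.
    static void requireLegal(String token) {
        if (!isLegal(token)) {
            throw new IllegalArgumentException("token contains a line break");
        }
    }

    public static void main(String[] args) {
        System.out.println(isLegal("test"));
        System.out.println(isLegal("te\nst"));
    }
}
```

Since each token is written on its own line, a token with an embedded line break would look like two tokens to the tagger, which is why such input is not allowed.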

setArguments

public void setArguments(String[] aArgs)
Set the arguments that are passed to the TreeTagger executable. Calling this method causes a running TreeTagger process to be shut down and restarted with the new arguments. Changing the arguments can break TT4J, which expects TreeTagger to print a set of lines, each containing three tokens separated by spaces.

Parameters:
aArgs - the arguments.
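
To make the expected output format concrete, here is a sketch of parsing one three-field line. TaggedLine is a hypothetical class; mapping the fields to token, part of speech and lemma follows the handler in the usage example, and accepting any run of whitespace as the separator is an assumption of this sketch rather than documented TT4J behavior.

```java
// Hypothetical demo class, not part of TT4J.
final class TaggedLine {

    final String token;
    final String pos;
    final String lemma;

    private TaggedLine(String token, String pos, String lemma) {
        this.token = token;
        this.pos = pos;
        this.lemma = lemma;
    }

    // Splits one output line into exactly three whitespace-separated fields
    // and rejects anything else.
    static TaggedLine parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected 3 fields: " + line);
        }
        return new TaggedLine(fields[0], fields[1], fields[2]);
    }

    public static void main(String[] args) {
        TaggedLine parsed = parse("is VBZ be");
        System.out.println(parsed.token + "\t" + parsed.pos + "\t" + parsed.lemma);
    }
}
```

Arguments that add or remove output columns would make such a parser fail, which illustrates why changing them can stop TT4J from working.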

getArguments

public String[] getArguments()

setEpsilon

public void setEpsilon(Double aEpsilon)
Set the minimal tag frequency to epsilon.

Parameters:
aEpsilon - epsilon

setHyphenHeuristics

public void setHyphenHeuristics(boolean hyphenHeuristics)
Turn on the heuristics for guessing the parts of speech of unknown hyphenated words.

Parameters:
hyphenHeuristics - use hyphen heuristics.

setModelProvider

public void setModelProvider(ModelResolver aModelProvider)
Set a custom model resolver.

Parameters:
aModelProvider - a model resolver.

setExecutableProvider

public void setExecutableProvider(ExecutableResolver aExeProvider)
Set a custom executable resolver.

Parameters:
aExeProvider - an executable resolver.

setHandler

public void setHandler(TokenHandler<O> aHandler)
Set a TokenHandler to receive the analyzed tokens.

Parameters:
aHandler - a token handler.

setAdapter

public void setAdapter(TokenAdapter<O> aAdapter)
Set a TokenAdapter used to extract the token string from the token objects passed to process(Collection). If no adapter is set, the Object.toString() method is used.

Parameters:
aAdapter - the adapter.
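
The fallback rule described above (use the adapter if one is set, otherwise Object.toString()) can be sketched with a plain java.util.function.Function standing in for the adapter. TokenText and textOf are hypothetical names; the real TokenAdapter interface and its method are not shown on this page.

```java
import java.util.function.Function;

// Hypothetical demo class, not part of TT4J.
final class TokenText {

    // Returns the text for a token object: the adapter's result when an
    // adapter is given, otherwise the object's toString() value.
    static <O> String textOf(O token, Function<O, String> adapter) {
        return adapter != null ? adapter.apply(token) : token.toString();
    }

    public static void main(String[] args) {
        // Without an adapter, toString() is used.
        System.out.println(textOf("test", null));
        // With an adapter, the extracted text can differ from toString().
        System.out.println(textOf(new StringBuilder("Test"), sb -> sb.toString().toLowerCase()));
    }
}
```

An adapter is mainly useful when the token objects carry extra data, such as offsets, and their toString() output is not the bare token text.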

setPlatformDetector

public void setPlatformDetector(PlatformDetector aPlatform)
Set platform information. Also sets the platform information in the model resolver and the executable resolver.

Parameters:
aPlatform - the platform information.

getPlatformDetector

public PlatformDetector getPlatformDetector()
Get platform information.

Returns:
the platform information.

setModel

public void setModel(String modelName)
              throws IOException
Load the model with the given name.

Parameters:
modelName - the name of the model.
Throws:
IOException - if the model cannot be found.
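
As described in the class overview, a model name may be a full path optionally followed by a : and the model encoding. A minimal sketch of splitting such a string is shown below. ModelSpec is a hypothetical class; treating the last colon as the separator, and ignoring it when the suffix looks like part of a path (for example after a Windows drive letter), are assumptions of this sketch and not necessarily how TT4J parses the name.

```java
// Hypothetical demo class, not part of TT4J.
final class ModelSpec {

    final String path;
    final String encoding; // null when no encoding was given

    private ModelSpec(String path, String encoding) {
        this.path = path;
        this.encoding = encoding;
    }

    // Splits "path:encoding" at the last colon. If the text after the colon
    // contains a path separator, the colon is assumed to belong to the path.
    static ModelSpec parse(String spec) {
        int colon = spec.lastIndexOf(':');
        if (colon < 0) {
            return new ModelSpec(spec, null);
        }
        String suffix = spec.substring(colon + 1);
        if (suffix.indexOf('/') >= 0 || suffix.indexOf('\\') >= 0) {
            return new ModelSpec(spec, null);
        }
        if (suffix.isEmpty()) {
            // A trailing colon without an encoding is dropped.
            return new ModelSpec(spec.substring(0, colon), null);
        }
        return new ModelSpec(spec.substring(0, colon), suffix);
    }

    public static void main(String[] args) {
        ModelSpec parsed = parse("/treetagger/models/english.par:iso8859-1");
        System.out.println(parsed.path + " (" + parsed.encoding + ")");
    }
}
```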

getModel

public Model getModel()
Get the currently set model.

Returns:
the current model.

destroy

public void destroy()
Stop the TreeTagger process and clean up the model and executable.


finalize

protected void finalize()
                 throws Throwable
Overrides:
finalize in class Object
Throws:
Throwable

process

public void process(Collection<O> aTokenList)
             throws IOException,
                    TreeTaggerException
Process the given list of token objects.

Parameters:
aTokenList - the token objects.
Throws:
IOException - if there is a problem providing the model or executable.
TreeTaggerException - if there is a problem communicating with TreeTagger.

removeProblematicTokens

protected Collection<O> removeProblematicTokens(Collection<O> aTokenList)
Parameters:
aTokenList -
Returns:

getStatus

public String getStatus()

getRestartCount

public int getRestartCount()
Get the number of times a TreeTagger process was started.

Returns:
the number of times a TreeTagger process was started.


Copyright © 2010. All Rights Reserved.