
Brill-NL : Brill's part-of-speech tagger for Dutch




  




Part-of-speech tagging

Part-of-speech (POS) tagging is the process of marking up the words of a sentence with their word class. For example, in the sentence "The man walks.", the word 'man' would be tagged as a noun and the word 'walks' as a verb. POS tags have turned out to be very useful in natural language processing. Nowadays, there are various freely available POS taggers for English with reasonably high accuracy (from 90% to almost 98%), for example [1,2]. This webpage features online and offline POS tagging for Dutch.
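The accuracy figures quoted above are conventionally per-token scores: the share of words that receive the correct tag. The following is a minimal sketch of that calculation, assuming this convention; the function name and the coarse tags are made up for illustration and are not part of any tagger mentioned here.

```python
def tagging_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical coarse tags for "The man walks ." (not a real tagset).
gold = ["DET", "NOUN", "VERB", "PUNCT"]
predicted = ["DET", "NOUN", "NOUN", "PUNCT"]  # 'walks' mistagged as a noun

print(tagging_accuracy(predicted, gold))  # 3 of 4 tokens are correct
```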

The Brill tagger trained for Dutch: Brill-NL

The rule-based POS tagger developed by Eric Brill was slightly modified and trained on a subset of the Eindhoven corpus [3] using the WOTAN tagset [4]. The tagger is based on transformation-based error-driven learning [5], a technique that has also been effective in a number of other natural language applications that require tagging, such as word sense tagging, prepositional phrase attachment, or parsing. The Brill-NL version presented here has an average accuracy of about 92%.
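To give an impression of how a transformation-based tagger works at tagging time, here is a minimal sketch: every word first gets its most frequent tag from a lexicon, and an ordered list of context rules then rewrites tags depending on the tag of the preceding word. The lexicon, tags, and rule below are hypothetical toy data, not the Brill-NL model. The sketch covers only rule application, not the learning step that selects the rules, and it applies each rule left to right with immediate effect, which is a simplification.

```python
from typing import NamedTuple


class ContextRule(NamedTuple):
    from_tag: str  # tag to be replaced
    to_tag: str    # replacement tag
    prev_tag: str  # rule fires only if the preceding token has this tag


# Hypothetical toy lexicon: each word mapped to its most frequent tag.
LEXICON = {"the": "DET", "man": "NOUN", "wants": "VERB", "to": "TO", "walk": "NOUN"}
DEFAULT_TAG = "NOUN"  # assumption: fallback tag for words missing from the lexicon

# Hypothetical ordered rule list: change NOUN to VERB after the tag TO.
RULES = [ContextRule("NOUN", "VERB", "TO")]


def initial_tags(words):
    """Start state: assign every word its most frequent tag."""
    return [LEXICON.get(word.lower(), DEFAULT_TAG) for word in words]


def apply_rules(tags, rules):
    """Final state: apply each context rule in order over the whole sentence."""
    tags = list(tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags


words = "the man wants to walk".split()
start = initial_tags(words)
final = apply_rules(start, RULES)
print(list(zip(words, final)))
```

In this toy run 'walk' starts out as a noun, its most frequent tag in the toy lexicon, and the single context rule corrects it to a verb because the preceding word is tagged TO.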

Offline version

After receiving several requests to use the Brill-NL tagger, I decided to make it available for offline non-commercial use. The system can be used on most platforms (Unix, GNU/Linux, MacOS, Windows using Cygwin) and comprises two components:
  • The original Brill tagger: RBT1_14.tgz (1.3 MB)
    (contains source code and English language models)
  • The Dutch language model: BRILL_NL.tgz (312 KB)

The language model consists of a lexicon, bigrams, lexical rules and context rules.

To compile the tagger, make sure that a C compiler (GCC) is available and follow the instructions below. As compilation for Linux and especially Windows may be a little tricky, I also distribute the compiled Brill tagger together with the Dutch language model.

Compilation

  1. Download the original Brill tagger implementation (1.3 MB) and unpack the source code:
    wget http://cosmion.net/jeroen/software/brill_pos/RBT1_14.tgz
    tar xzf RBT1_14.tgz
    This creates a directory RULE_BASED_TAGGER_V1.14/

  2. Compile the tagger as follows:
    cd RULE_BASED_TAGGER_V1.14/
    make
    
    Compilation may produce some warnings. When it completes, the necessary executables (tagger, start-state-tagger, and final-state-tagger) can be found in the directory Bin_and_Data/

    Note: when compiling under Cygwin, make sure that the packages 'gcc', 'make', and 'tcsh' are installed. Change 'SHELL = /bin/csh' to 'SHELL = /bin/tcsh' in the file Makefile.

  3. Download and extract the Dutch language model (308 KB) and place it in the Bin_and_Data/ directory:
    cd Bin_and_Data/
    wget http://cosmion.net/jeroen/software/brill_pos/BRILL_NL.tgz
    tar xzf BRILL_NL.tgz
  4. Put the text to be tagged, e.g. "de man loopt", in the file input.txt and run the Brill-NL tagger as follows:
    ./tagger brill_LEXICON.jg input.txt brill_BIGBIGRAMS.jg \
    brill_LEXRULES.jg brill_CONTEXTRULES.jg 2>/dev/null
    which generates the following output:
    de/Art(bep,zijd_of_mv,neut) man/N(soort,ev,neut) loopt/V(intrans,ott,3,ev)

    Note: If ./tagger complains about not being able to find start-state-tagger and final-state-tagger, add the full path of the Bin_and_Data/ directory to the PATH environment variable. To add the full path in Bash, execute e.g.: export PATH=$PATH:~/RULE_BASED_TAGGER_V1.14/Bin_and_Data/
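For further processing, the tagged output shown in step 4 can be split into words and tags. The following is a minimal sketch, assuming tokens are separated by whitespace, the tag follows the last '/' of each token, and tags contain neither whitespace nor '/'. The function names are hypothetical helpers and not part of the tagger distribution.

```python
def parse_tagged_line(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last '/', so a word that itself contains '/' stays intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"token without a tag: {token!r}")
        pairs.append((word, tag))
    return pairs


def split_tag(tag):
    """Split a tag such as 'N(soort,ev,neut)' into its category and feature list."""
    category, _, rest = tag.partition("(")
    features = [f for f in rest.rstrip(")").split(",") if f]
    return category, features


sample = "de/Art(bep,zijd_of_mv,neut) man/N(soort,ev,neut) loopt/V(intrans,ott,3,ev)"
for word, tag in parse_tagged_line(sample):
    category, features = split_tag(tag)
    print(word, category, features)
```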

If the Brill-NL tagger is used in academic work, an acknowledgement would be appreciated. For any questions or comments, please do not hesitate to contact me.

References

[1] Thorsten Brants' TnT POS-tagger.
[2] Kristina Toutanova's Log-linear POS tagger.
[3] Uit den Boogaart (1975) Woordfrequenties in geschreven en gesproken Nederlands. Oosthoek, Scheltema & Holkema, Utrecht.
[4] Berghmans, J. (1994) WOTAN, een automatische grammatikale tagger voor het Nederlands. Dept. of Language and Speech, University of Nijmegen.
[5] Brill, E. (1995) Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565.