Corbit is a text analyzer for Chinese that provides word segmentation (chunking), part-of-speech (POS) tagging, and dependency parsing. It is built on an extension of incremental (transition-based) parsing algorithms, which make it possible to process each of these tasks in time linear in the sentence length while retaining state-of-the-art performance. The system can either handle two of these tasks jointly or handle all three at once using a fully joint model. A joint model usually yields higher accuracy, particularly for the upper-level tasks (i.e. POS tagging and dependency parsing), at the cost of some additional computation time. See the references below for a detailed analysis of speed and accuracy for the various joint combinations.
Corbit has several processing modes based on the degree of integration. You can freely combine different processing modes as building blocks into a pipeline or joint model. For example, by combining a joint word segmentation and POS tagging model (“sp” model) with a dependency parser (“d” model), you can process text with a two-layer partially-joint pipeline model.
The accuracy and speed of the various combinations are compared in detail in Hatori et al. (2012). The algorithms used in this system are described in full in the references.
1. “sp” mode: Joint word segmentation and POS tagging
This mode is almost equivalent to the joint word segmentation and POS tagging model described in Zhang and Clark (2010). In addition, it offers several optional features, such as an external lexicon and character-type features, which are considered effective for processing noisy real-world text accurately.
2. “d” mode: Dependency parsing
This mode follows the dependency parser described in Huang and Sagae (2010), and its extension with the additional features from Zhang and Nivre (2011). The feature set to use can be specified with the argument of the “--feature-type” option.
3. “pd” mode: Joint POS tagging and dependency parsing
This mode performs joint POS tagging and dependency parsing using the algorithm described in Hatori et al. (2011). Since some minor bugs have been fixed, accuracies are slightly improved over the original implementation (i.e. what is described in the paper). Please contact us if you need the source code of the original implementation for an exact comparison.
4. “spd” mode: Joint word segmentation, POS tagging, and dependency parsing
This mode performs joint word segmentation, POS tagging, and dependency parsing using the algorithm described in Hatori et al. (2012). Delayed features are now supported via the “--delay” option, which can further improve the accuracy of dependency parsing.
- Yue Zhang and Stephen Clark. A Fast Decoder for Joint Word Segmentation and POS-tagging Using a Single Discriminative Model. In Proceedings of 2010 Conference of Empirical Methods on Natural Language Processing (EMNLP-2010).
- Liang Huang and Kenji Sagae. Dynamic Programming for Linear-Time Incremental Parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-2010).
- Yue Zhang and Joakim Nivre. Transition-Based Dependency Parsing with Rich Non-Local Features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT-2011).
- Jun Hatori, Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. Incremental Joint POS Tagging and Dependency Parsing in Chinese. In the Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP-2011).
- Jun Hatori, Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL-2012).
Prerequisite: JRE 1.6 or higher is required. The Java bin directory must be included in PATH.
Corbit, a Chinese text analyzer
- Version 0.01 (2012/7/1): [binary & source]
Off-the-shelf models trained with CTB-7, HowNet word list, and Wikipedia
1. sp: joint word segmentation and POS tagging model by Zhang & Clark (2010)
2. spd: joint word segmentation, POS tagging, and dependency parsing model by Hatori et al. (2012)
3. d: dependency parsing model by Huang & Sagae (2010) and Zhang & Nivre (2011)
4. pd: joint POS tagging and dependency parsing model by Hatori et al. (2011)
./corbit.sh (mode) Run (model-file) [options..] < input > output
Example: word segmentation & POS tagging
./corbit.sh sp Run models/chtb5d.sp.model < sample.plain_text > sample.tagged
Example: dependency parsing
./corbit.sh d Run models/chtb5d.d.model < sample.tagged > sample.parsed
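The two examples above can be chained into the two-layer pipeline described earlier ("sp" followed by "d"). The following is a minimal Python sketch of that chaining, not part of Corbit itself: the helper names `run_command` and `analyze` are hypothetical, and it assumes that `./corbit.sh` is in the current directory and that the output of the "sp" stage can be fed unchanged to the "d" stage, as the examples suggest.

```python
import subprocess
from typing import List


def run_command(mode: str, model_path: str) -> List[str]:
    """Build the argument list for './corbit.sh (mode) Run (model-file)'."""
    return ["./corbit.sh", mode, "Run", model_path]


def analyze(text: str, sp_model: str, d_model: str) -> str:
    """Segment and tag `text` with an "sp" model, then parse it with a "d" model.

    Input and output are passed through stdin/stdout as UTF-8, mirroring
    the '< input > output' redirections of the command-line examples.
    """
    tagged = subprocess.run(
        run_command("sp", sp_model),
        input=text, capture_output=True, encoding="utf-8", check=True,
    ).stdout
    parsed = subprocess.run(
        run_command("d", d_model),
        input=tagged, capture_output=True, encoding="utf-8", check=True,
    ).stdout
    return parsed


# Only builds the command; calling analyze() requires a Corbit installation.
sp_cmd = run_command("sp", "models/chtb5d.sp.model")
```

Keeping the command construction separate from the execution makes it easy to log or inspect the exact invocation before running it.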
./corbit.sh (mode) Train (train-file) (dev-file) (#iteration) (model-file-to-save) --dict (dict-file) (threshold) [options..]
Example: training a joint segmentation & POS tagging model on CTB-5
./corbit.sh sp CreateDict chtb5.train.malt chtb5.train.malt.dict
./corbit.sh sp Train chtb5.train.malt chtb5.dev.malt 30 chtb5.sp.model --tagset ctb5 --dict chtb5.train.malt.dict 20 --char-type
Example: training a dependency parsing model on CTB-5
./corbit.sh d CreateDict chtb5.train.malt chtb5.train.malt.dict
./corbit.sh d Train chtb5.train.malt chtb5.dev.malt 50 chtb5.d.model --dict chtb5.train.malt.dict 20
./corbit.sh (mode) Test (model-file) (test-file) [options..]
5. I/O Format
Corbit currently supports the following file formats. The Malt or CTB format is used for “Training” and “Evaluation”, while the plain format is used for the inputs and outputs of “Analysis.” The input format for “Training” and “Evaluation” can be switched using the “--input-format” option. All files must be encoded in UTF-8.
5.1 Malt Format (default)
Each line represents a word. Sentences are separated by a single blank line.
[index] .... [word form] .... [POS] .... [head index]
[index] .... [word form] .... [POS] .... [head index]
[index] .... [word form] .... [POS] .... [head index]
[index] .... [word form] .... [POS] .... [head index]
[index] .... [word form] .... [POS] .... [head index]
...
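As an illustration of this layout, here is a minimal Python sketch that groups Malt-style lines into sentences. It is not Corbit's own reader. It assumes the four columns are separated by whitespace (the exact separator is not specified above), and the sample sentences, their tags, and the use of 0-based indices with -1 for the root are made up for the example.

```python
from typing import Iterable, List, NamedTuple


class MaltToken(NamedTuple):
    index: int
    form: str
    pos: str
    head: int


def read_malt(lines: Iterable[str]) -> List[List[MaltToken]]:
    """Group Malt-style lines into sentences, one token list per sentence."""
    sentences: List[List[MaltToken]] = []
    current: List[MaltToken] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        index, form, pos, head = fields
        current.append(MaltToken(int(index), form, pos, int(head)))
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Made-up sample data: two sentences separated by a single blank line.
sample = "0\t我\tPN\t1\n1\t爱\tVV\t-1\n2\t北京\tNR\t1\n\n0\t他\tPN\t1\n1\t来\tVV\t-1\n"
sentences = read_malt(sample.splitlines())
```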
5.2 CTB Format
Each line represents one sentence. Word indices start from 0, and a head index of -1 indicates a dependency on the root. Note that word forms, POS tags, and head indices must be enclosed in parentheses.
[index]:([word form])_([POS])_([head index]) [index]:([word form])_([POS])_([head index]) ...
[index]:([word form])_([POS])_([head index]) [index]:([word form])_([POS])_([head index]) ...
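To make the token layout concrete, the following Python sketch parses one CTB-format line with a regular expression. It is an illustration rather than Corbit's actual reader: the names `TOKEN_RE`, `parse_ctb_line`, and `root_indices` are hypothetical, tokens are assumed to be separated by spaces, and the sample sentence is made up.

```python
import re
from typing import List, Tuple

# One token looks like: index:(word form)_(POS)_(head index)
# The lookahead requires each token to end at whitespace or end of line.
TOKEN_RE = re.compile(r"(\d+):\((.+?)\)_\((.+?)\)_\((-?\d+)\)(?=\s|$)")


def parse_ctb_line(line: str) -> List[Tuple[int, str, str, int]]:
    """Return (index, word form, POS, head index) tuples for one sentence."""
    tokens = []
    for match in TOKEN_RE.finditer(line.strip()):
        index, form, pos, head = match.groups()
        tokens.append((int(index), form, pos, int(head)))
    return tokens


def root_indices(tokens: List[Tuple[int, str, str, int]]) -> List[int]:
    """Indices of the words whose head index is -1, i.e. attached to the root."""
    return [index for index, _form, _pos, head in tokens if head == -1]


# Made-up sample line: indices start from 0 and -1 marks the root.
line = "0:(我)_(PN)_(1) 1:(爱)_(VV)_(-1) 2:(北京)_(NR)_(1)"
tokens = parse_ctb_line(line)
```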
5.3 Plain Format (output of “sp” mode; input of “d” mode)
Each line corresponds to one sentence.
[word form]/[POS] [word form]/[POS] ...
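A small Python sketch of reading and writing this format is shown below. The function names are hypothetical, tokens are assumed to be separated by spaces, and the split is made on the last slash so that a word form that itself contains "/" stays intact (an assumption about how such cases should be handled, not documented behavior).

```python
from typing import List, Tuple


def parse_plain_line(line: str) -> List[Tuple[str, str]]:
    """Split one plain-format sentence into (word form, POS) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so a form such as "1/2" is preserved.
        form, sep, pos = token.rpartition("/")
        if not sep or not form or not pos:
            raise ValueError(f"not a word/POS token: {token!r}")
        pairs.append((form, pos))
    return pairs


def format_plain_line(pairs: List[Tuple[str, str]]) -> str:
    """Inverse of parse_plain_line: join the pairs back into one line."""
    return " ".join(f"{form}/{pos}" for form, pos in pairs)


# Made-up sample sentence.
pairs = parse_plain_line("我/PN 爱/VV 北京/NR")
```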
6. Copyright and Licensing Information
This software is released under the Modified BSD License.
Copyright (c) 2010-2012, Jun Hatori
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the names of the authors nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Portions of this software (the Patricia trie implementation) are licensed under the Apache License.
Copyright 2005-2009 Roger Kapsi, Sam Berlin

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.