Dictionaries are required by several components in ClearNLP. The general dictionary contains general morphology information and the global lexica contains knowledge-base as well as distributional semantics information.
export CLASSPATH=clearnlp-dictionary-3.2.jar:\\
clearnlp-global-lexica-3.1.jar:.
- Add the following lines to your
pom.xml
.
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-dictionary</artifactId>
<version>3.2</version>
</dependency>
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-global-lexica</artifactId>
<version>3.1</version>
</dependency>
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-general-en-ner-gazetteer</artifactId>
<version>3.0</version>
</dependency>
The general models are trained on OntoNotes 5.0, English Web Treebank, and QuestionBank.
OntoNotes 5.0 |
Sentence Counts |
Token Counts |
Broadcasting conversations |
10,822 |
171,101 |
Broadcasting news |
10,344 |
206,020 |
News magazines |
6,672 |
163,627 |
Newswires |
34,434 |
875,800 |
Religious texts |
21,418 |
296,432 |
Telephone conversations |
8,963 |
85,444 |
Web texts |
12,447 |
284,951 |
Engilsh Web Treebank |
Sentence Counts |
Token Counts |
Answers |
2,699 |
43,916 |
Email |
2,983 |
44,168 |
Newsgroup |
1,995 |
37,714 |
Reviews |
2,915 |
44,337 |
Weblog |
1,753 |
38,770 |
QuestionBank |
Sentence Counts |
Token Counts |
Questions |
3,199 |
29,715 |
export CLASSPATH=clearnlp-general-en-pos-3.2.jar:\\
clearnlp-general-en-dep-3.2.jar:\\
clearnlp-general-en-ner-3.1.jar:\\
clearnlp-general-en-ner-gazetteer-3.0:\\
- Add the following lines to your
pom.xml
.
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-general-en-pos</artifactId>
<version>3.2</version>
</dependency>
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-general-en-dep</artifactId>
<version>3.2</version>
</dependency>
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-general-en-ner</artifactId>
<version>3.1</version>
</dependency>
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-general-en-ner-gazetteer</artifactId>
<version>3.0</version>
</dependency>
The medical models are trained on MiPACQ, SHARP, and THYME corpora.
MiPACQ |
Sentence Counts |
Token Counts |
Clinical questions |
1,600 |
30,138 |
Medpedia articles |
2,796 |
49,922 |
Clinical notes |
8,383 |
113,164 |
Pathological notes |
1,205 |
21,353 |
SHARP |
Sentence Counts |
Token Counts |
Seattle group health notes |
7,205 |
94,474 |
Clinical notes |
6,807 |
93,914 |
Stratified |
4,320 |
43,536 |
Stratified SGH |
13,668 |
139,424 |
THYME |
Sentence Counts |
Token Counts |
Clinical & patheological notes |
26,734 |
388,371 |
Braincancer |
18,700 |
225,486 |
export CLASSPATH=clearnlp-medical-en-pos-3.1.jar:\\
clearnlp-medical-en-dep-3.1.jar:.
- Add the following lines to your
pom.xml
.
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-medical-en-pos</artifactId>
<version>3.1</version>
</dependency>
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-medical-en-dep</artifactId>
<version>3.1</version>
</dependency>
The bioinformaitcs models are trained on CRAFT Treebank.
CRAFT |
Sentence Counts |
Token Counts |
Training data |
16,297 |
452,769 |
- Download the following models and add them to your Java classpath.
export CLASSPATH=clearnlp-bioinformatics-en-pos-3.1.jar:\\
clearnlp-bioinformatics-en-dep-3.1.jar:.
- Add the following lines to your
pom.xml
.
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-bioinformatics-en-pos</artifactId>
<version>3.1</version>
</dependency>
<dependency>
<groupId>edu.emory.clir</groupId>
<artifactId>clearnlp-bioinformatics-en-dep</artifactId>
<version>3.1</version>
</dependency>