Models

Dictionary

Several components in ClearNLP require dictionaries. The general dictionary contains morphological information, and the global lexica contain knowledge-base and distributional-semantics information.

Without Maven

export CLASSPATH=clearnlp-dictionary-3.2.jar:\
clearnlp-global-lexica-3.1.jar:.
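
A multi-line classpath is easy to break, since stray whitespace or a missing separator changes its value. As a minimal sketch, the helper below joins jar names with `:` and appends `.` for the current directory. The function name `build_classpath` and the `lib` directory are hypothetical and not part of ClearNLP; adjust them to wherever the jars were downloaded.

```shell
# Hypothetical helper: join jar names into a colon-separated classpath.
# $1 is the directory holding the jars (assumed here to be "lib");
# the remaining arguments are jar file names.
build_classpath() {
  local dir="$1"; shift
  local cp="" jar
  for jar in "$@"; do
    cp="${cp}${dir}/${jar}:"
  done
  # Append "." so the current directory stays on the classpath.
  printf '%s.\n' "$cp"
}

CLASSPATH="$(build_classpath lib \
  clearnlp-dictionary-3.2.jar \
  clearnlp-global-lexica-3.1.jar)"
export CLASSPATH
echo "$CLASSPATH"
```

The same helper can take the model jars listed in the sections below as additional arguments.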

With Maven

  • Add the following lines to your pom.xml.
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-dictionary</artifactId>
     <version>3.2</version>
</dependency>
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-global-lexica</artifactId>
     <version>3.1</version>
</dependency>
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-general-en-ner-gazetteer</artifactId>
     <version>3.0</version>
</dependency>

General Domain

The general models are trained on OntoNotes 5.0, English Web Treebank, and QuestionBank.

| OntoNotes 5.0 | Sentence Counts | Token Counts |
|---|---:|---:|
| Broadcast conversations | 10,822 | 171,101 |
| Broadcast news | 10,344 | 206,020 |
| News magazines | 6,672 | 163,627 |
| Newswires | 34,434 | 875,800 |
| Religious texts | 21,418 | 296,432 |
| Telephone conversations | 8,963 | 85,444 |
| Web texts | 12,447 | 284,951 |

| English Web Treebank | Sentence Counts | Token Counts |
|---|---:|---:|
| Answers | 2,699 | 43,916 |
| Email | 2,983 | 44,168 |
| Newsgroup | 1,995 | 37,714 |
| Reviews | 2,915 | 44,337 |
| Weblog | 1,753 | 38,770 |

| QuestionBank | Sentence Counts | Token Counts |
|---|---:|---:|
| Questions | 3,199 | 29,715 |

Without Maven

export CLASSPATH=clearnlp-general-en-pos-3.2.jar:\
clearnlp-general-en-dep-3.2.jar:\
clearnlp-general-en-ner-3.1.jar:\
clearnlp-general-en-ner-gazetteer-3.0.jar:.
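
A missing or misspelled jar on the classpath usually shows up only at run time, when a model fails to load. The sketch below lists every classpath entry that does not exist on disk, so the setup can be checked first. `missing_classpath_entries` is a hypothetical helper written for this illustration, not a ClearNLP or JDK tool.

```shell
# Hypothetical helper: print each entry of a colon-separated classpath
# that is neither an existing file nor an existing directory.
missing_classpath_entries() {
  local old_ifs="$IFS" entry
  IFS=':'
  set -f                              # do not expand wildcards in entries
  for entry in $1; do
    [ -n "$entry" ] || continue       # skip empty entries
    [ -e "$entry" ] || printf '%s\n' "$entry"
  done
  set +f
  IFS="$old_ifs"
}

# Check the current classpath; prints nothing when every entry exists.
missing_classpath_entries "${CLASSPATH:-.}"
```

Any line it prints is an entry to download or correct before running the decoder.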

With Maven

  • Add the following lines to your pom.xml.
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-general-en-pos</artifactId>
     <version>3.2</version>
</dependency>
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-general-en-dep</artifactId>
     <version>3.2</version>
</dependency>
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-general-en-ner</artifactId>
     <version>3.1</version>
</dependency>
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-general-en-ner-gazetteer</artifactId>
     <version>3.0</version>
</dependency>

Medical Domain

The medical models are trained on the MiPACQ, SHARP, and THYME corpora.

| MiPACQ | Sentence Counts | Token Counts |
|---|---:|---:|
| Clinical questions | 1,600 | 30,138 |
| Medpedia articles | 2,796 | 49,922 |
| Clinical notes | 8,383 | 113,164 |
| Pathological notes | 1,205 | 21,353 |

| SHARP | Sentence Counts | Token Counts |
|---|---:|---:|
| Seattle group health notes | 7,205 | 94,474 |
| Clinical notes | 6,807 | 93,914 |
| Stratified | 4,320 | 43,536 |
| Stratified SGH | 13,668 | 139,424 |

| THYME | Sentence Counts | Token Counts |
|---|---:|---:|
| Clinical & pathological notes | 26,734 | 388,371 |
| Brain cancer | 18,700 | 225,486 |

Without Maven

export CLASSPATH=clearnlp-medical-en-pos-3.1.jar:\
clearnlp-medical-en-dep-3.1.jar:.

With Maven

  • Add the following lines to your pom.xml.
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-medical-en-pos</artifactId>
     <version>3.1</version>
</dependency>
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-medical-en-dep</artifactId>
     <version>3.1</version>
</dependency>

Bioinformatics Domain

The bioinformatics models are trained on the CRAFT Treebank.

| CRAFT | Sentence Counts | Token Counts |
|---|---:|---:|
| Training data | 16,297 | 452,769 |

Without Maven

  1. Download the following models and add them to your Java classpath.
export CLASSPATH=clearnlp-bioinformatics-en-pos-3.1.jar:\
clearnlp-bioinformatics-en-dep-3.1.jar:.

With Maven

  • Add the following lines to your pom.xml.
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-bioinformatics-en-pos</artifactId>
     <version>3.1</version>
</dependency>
<dependency>
     <groupId>edu.emory.clir</groupId>
     <artifactId>clearnlp-bioinformatics-en-dep</artifactId>
     <version>3.1</version>
</dependency>