-
Notifications
You must be signed in to change notification settings - Fork 15
turian/crfchunking-with-wordrepresentations
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
CRF chunking with word representations -------------------------------------- scripts and steps by Joseph Turian A standard baseline in NLP is chunking (shallow parsing), which was the CoNLL 2000 shared task. Training a CRF is a standard approach to this task (Sha + Pereira 2003). In fact, many CRF implementations include instructions for reimplementing the Sha+Pereira chunker with identical features: crfsgd: http://leon.bottou.org/projects/sgd crf++: http://crfpp.sourceforge.net/ CRFsuite: http://www.chokkan.org/software/crfsuite/ We use CRFsuite because it makes it simple to modify the feature generation code, so one can easily add new features. We have instructions and scripts for how we add word representations (Brown clusters and/or word embeddings) to the training. INSTALLATION: ------------- Download and install CRFsuite: http://www.chokkan.org/software/crfsuite/ You will need my common Python library: http://github.com/turian/common Go into data/ and download the CoNLL train and test files: cd data/ wget http://www.cnts.ua.ac.be/conll2000/chunking/train.txt.gz wget http://www.cnts.ua.ac.be/conll2000/chunking/test.txt.gz gunzip *.gz Download word representations: cd representations/ wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c100-freq1.txt wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c320-freq1.txt wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.txt wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c3200-freq1.txt wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-1750000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d200.txt.gz wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2030000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d100.txt.gz wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2270000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.txt.gz wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2280000000.LEARNING_RATE%3d1e-08.EMBEDDING_LEARNING_RATE%3d1e-07.EMBEDDING_SIZE%3d25.txt.gz ln -s model-1750000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=200.txt.gz cw-embeddings-200dim.txt.gz ln -s model-2030000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.EMBEDDING_SIZE\=100.txt.gz cw-embeddings-100dim.txt.gz ln -s model-2270000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.txt.gz cw-embeddings-50dim.txt.gz ln -s model-2280000000.LEARNING_RATE\=1e-08.EMBEDDING_LEARNING_RATE\=1e-07.EMBEDDING_SIZE\=25.txt.gz cw-embeddings-25dim.txt.gz wget http://pylearn.org/turian/hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz wget http://pylearn.org/turian/hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz ln -s hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-100dim.txt.gz ln -s hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-50dim.txt.gz BATCH EVALUATIONS ----------------- ./scripts/train-and-evaluate.py -name baseline --dev --l2 2 WARNING: Everytime you change the --features parameter, you should also change the --name. NOTE: ----- CRFsuite has benchmark results on the CoNLL shared task: http://www.chokkan.org/software/crfsuite/benchmark.html However, I did not achievable achieve comparable F1 score on the CoNLL test set until I used the following parameters: Dev F1 Test F1 params 94.04 93.63 l2=2 94.03 93.65 l2=3.2, possible_transitions=1 94.15 93.73 l2=3.2, possible_transitions=1, possible_states=1 94.16 93.79 SGD, l2=3.2, possible_transitions=1, possible_states=1 I chose the l2 penalty on the dev set, which was a subset of the training data. I then used this l2 penalty and trained over the entire training set.
About
Train a CRF for syntactic chunking (CoNLL2000), and use word representations
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published