woffle
is a project template which aims to allows you to compose various NLP
tasks via a common interface using each of the most popular currently available
tools. This includes
- spaCy
- fastText
- flair
- others coming soon
The project was borne out of frustrations in trying to tie together all of the methods and attributes of each of popular NLP programs. I intend to have the program broken down into composable tasks where each task takes some 'sensible' default operations and when you import a specific part of the tool then it only exposes that one thing and these tasks will be separated by whether they are deterministic processing or whether the output is probabilistically generated (it makes it easier to control what models are left hanging around).
Currently the tasks we aim to perform are
-
parsing
including replacement of a list of regex strings defined in a configuration file
-
embedding
not only generating numeric vectors from your text using fasttext, spacy's gloVe implementation and similar but I envision that this should also include tasks such as topic modelling and semantic analysis purely because they are mappings from your text into some kind of representation space
-
clustering
deterministic (e.g. Ward linkage) clustering and proabilitistic clustering will be included
-
selection
the ability to replace the content of a cluster with a representative 'label', in optimus this is based on functions of the cluster based on decisions on the content of the cluster but this could be as simple as replacing the the cluster with its sentiment score
These functions will be called the same thing regardless of which back end you use and most importantly they will be composable so that you can chain deterministic and probabilistic functions together, where it makes sense.
woffle
is intended for use on modern linux and macOS operating systems. This
is due to the dependency on GNU make
, curl
et al. (see the contents of
Makefile
for more details. If you are comfortable setting up the dependencies
in Windows I don't believe that there is a reason it should not work.
The standard installation, including an installation of fasttext (but without a model) and spacy looks like:
git clone https://github.com/datasciencecampus/woffle
cd woffle
make
If you also wish to download the wiki.en.zip
vectors for fasttext then add:
make ftmodel
and if you do not have a CUDA enabled GPU then you may wish to use the flair-fast models which are optimised for running on CPUs instead:
make flair-fast
else just install
make flair
If you'd like to use the pytorch-pretrained-BERT embeddings, run:
make bert
Please note the implementation of these embeddings is experimental and fairly simplistic however it conforms to the overall woffle standard of use and is accessible in the same way as other embeddings.
Currently the repository is in WIP. There are rudimentary tests set up for a variety of components and CI is set up on Travis. The status of the build can be seen on the badge at the top of this README.
If you would like to run tests yourself, you can use any of the following commands:
make
make test
# ---- OR ----
make ci
The intention of this repo is to provide a working example from which to base your own processing. Below is the minimum code required in order to:
- perform regex based cleaning of the text
- clean text using spacy to identify root nouns
- select the first noun as the target noun
- embed the strings using fasttext
- perform a hierarchical clustering of the vectors
- perform a cutoff at depth 3 to generate clusters
- print all of the generated information
This accepts the default actions and structure of the 'hcluster' (hierarchical clustering) theme. Should you want to 'roll your own' processing please see the manual on the website.
# import the required parts of the toolkit
from woffle.hcluster import parse, embed, cluster
with open('data/test.txt') as handle:
text = handle.read().splitlines()
target = parse(text) # note, generator, not yet evaluated
embed = [i for i in embed(target)] # clusters cannot yet use generators
clusters = [i for i in cluster(embed, text, 3)]
target = parse(text) # generator has been consumed at this point in the above!
pairs = ((i,j) for i,j in zip(text, target))
for o, t in pairs:
entrant = [cluster.tolist() for cluster in clusters if o in cluster]
print(f"{o:>30s}: {t:15s} -> {entrant[0]}")
For more on the included themes please see the documentation. If you wish to build your own back end then please see the instructions on the website.