Skip to content

Latest commit

 

History

History
406 lines (318 loc) · 18.7 KB

README.md

File metadata and controls

406 lines (318 loc) · 18.7 KB

Ruby GitHub version Gem Version Documentation Report Issues License

LittleWeasel

Table of Contents

About LittleWeasel

LittleWeasel is more than just a spell checker for words (and word blocks, i.e. groups of words); LittleWeasel provides information about a particular word(s) through its API. LittleWeasel allows you to apply preprocessing to words through any number of word preprocessors before they are checked against the dictionary(ies) you provide. In addition to this, you may provide any number of word filters that allow you to consider the validity of each word being checked, regardless of whether or not it's literally found in the dictionary. LittleWeasel will tell you exactly what word preprocessors were applied to a given word, even showing you the transformation of the original word as it passes through each preprocessor; it will also inform you of each matching word filters along the way, so you can make a decision about every word being validated.

LittleWeasel provides other features as well:

  • LittleWeasel allows you to provide any number of "dictionaries" which may be in the form of a collecton of words in a file on disk or an Array of words you provide, so that dictionaries may be created dynamically.
  • Dictionaries are identified by a unique "dictionary key"; that is, a key based on locale (-, e.g. en-US) and/or optional "tag" (en-US-, e.g. en-US-slang).
  • Dictionaries created from files on disk are cached; their words and metadata are shared across dictionary instances that share the same dictionary key.
  • Dictionaries can have observable, metadata objects attached to them which are notified when a word or word block is being evaluated; therefore, metadata about the dictionary, words, etc. can be gathered and used. For example, LittleWeasel uses a LittleWeasel::Metadata::InvalidWordsMetadata metadata object that caches and keeps track of the total bytes of invalid words searched against the dictionary. If the total bytes of invalid words exceeds what is set in the configuration, caching of invalid words ceases. You can create your own metadata objects to gather and use your own metadata.

Usage

At its most basic level, there are three (3) steps to using LittleWeasel:

  1. Create a LittleWeasel::Dictionary object.
  2. Consume the LittleWeasel::Dictionary#word_results and/or LittleWeasel::Dictionary#block_results APIs to obtain a LittleWeasel::WordResults 1 object for a particular word or word block.
  3. Interrogate the LittleWeasel::WordResults 1 object returned from either of the aforementioned APIs.

Some of the more advanced LittleWeasel features include the use of word preprocessors, word filters and dictionary metadata modules; for these, read on.

Creating Dictionaries

Creating a Dictionary from Memory

# Create a Dictionary Manager.
dictionary_manager = LittleWeasel::DictionaryManager.new

# Create our unique key for the dictionary.
en_us_names_key = LittleWeasel::DictionaryKey.new(language: :en, region: :us, tag: :names)

# Create a dictionary of names from memory.
en_us_names_dictionary = dictionary_manager.create_dictionary_from_memory(
  dictionary_key: en_us_names_key, dictionary_words: %w(Abel Bartholomew Cain Deborah Elijah))

Creating a Dictionary from a File on Disk

# Create a Dictionary Manager.
dictionary_manager = LittleWeasel::DictionaryManager.new

# Create our unique key for the dictionary.
en_us_key = LittleWeasel::DictionaryKey.new(language: :en, region: :us)

# Create a dictionary from a file on disk. The below assumes the
# dictionary file name matches the dictionary key (e.g. en-US).
en_us_dictionary = dictionary_manager.create_dictionary_from_file(
  dictionary_key: en_us_key, file: "dictionaries/#{en_us_key}.txt")

Basic Word Search Example

Using the Dictionary#word_results API

Continued from Creating a Dictionary from Memory example.

# Get word results for 'Abel'. true is returned because the 'Abel' is found in the dictionary.
en_us_names_dictionary.word_results('Abel').word_valid?
#=> true

# Get word results for 'elijah'. false is returned because while 'Elijah' is found in the dictionary, 'elijah' is NOT (case sensitive).
en_us_names_dictionary.word_results('elijah').word_valid?
#=> false

Using the Dictionary#block_results API

Continued from Creating a Dictionary from a File on Disk example.

word_block = "This is a word-block of 8 words and 2 numbers."

# Add a word filter so that numbers are considered valid.
en_us_dictionary.add_filters word_filters: [
  LittleWeasel::Filters::EnUs::NumericFilter.new
]

block_results = en_us_dictionary.block_results word_block
# Returns a LittleWeasel::BlockResults object.
# The below is formatted for readability...
# Results of calling #word_block with:
#   "This is a word-block of 8 words and 2 numbers."...
block_results #=>
    preprocessed_words_or_original_words #=>
        ["This", "is", "a", "word-block", "of", "8", "words", "and", "2", "numbers"]

    word_results[0] #=>
      # The word before any word preprocessors have been applied.
      original_word #=> "This"

      # The word after all word preprocessors have been applied against
      # the word.
      preprocessed_word #=> nil

      # Indicates whether or not the word was found in the literal
      # dictionary (#word_valid?) OR if the word (after word preprocessing)
      # was matched against a word filter (#filter_match?).
      success?  #=> false

      # Indicates whether or not word (after word preprocessing) was found
      # in the literal dictionary.
      word_valid? #=> false

      # Indicates whether or not the word is cached, either as a word found
      # in the literal dictionary OR as an invalid word. The latter will
      # only take place if LittleWeasel::Configuration#max_invalid_words_bytesize
      # is greater than 0.
      word_cached? #=> false

      # Indicates whether or not #preprocessed_word is present due to
      # word having passed through one or more word preprocessors. This
      # will only return true if word preprocessors are available to the
      # dictionary, turned on
      # (LittleWeasel::Preprocessors::WordPreprocessor#preprocessor_on?)
      # AND the word meets the criteria for word preprocessing for one or
      # more word preprocessors (LittleWeasel::Preprocessors::WordPreprocessor#preprocess?).
      preprocessed_word? #=> false

      # Returns #preprocessed_word if word preprocessing has been applied
      # or original_word if word preprocessing has NOT been applied.
      preprocessed_word_or_original_word #=> "This"

      # Indicates whether or not word has been matched by at least 1
      # word filter.
      filter_match? #=> false

      # Indicates the word filters that were matched against
      # word (LittleWeasel::Filters::WordFilter#filter_match?). If
      # word did not match any word filters, an empty Array is returned.
      filters_matched #=> []

      # Indicates the word preprocessors that were applied against
      # word. If no word preprocessors were applied to word, an empty
      # Array is returned.
      preprocessed_words #=> []

    word_results[1] #=>
        original_word #=> "is"
        preprocessed_word #=> nil
        success? #=> true
        word_valid? #=> true
        word_cached? #=> true
        preprocessed_word? #=> false
        preprocessed_word_or_original_word #=> "is"
        filter_match? #=> false
        filters_matched: #=> []
        preprocessed_words #=> []

    word_results[2] #=>
        original_word #=> "a"
        preprocessed_word #=> nil
        success? #=> true
        word_valid? #=> true
        word_cached? #=> true
        preprocessed_word? #=> false
        preprocessed_word_or_original_word #=> "a"
        filter_match? #=> false
        filters_matched: #=> []
        preprocessed_words #=> []

    word_results[3] #=>
        original_word #=> "word-block"
        preprocessed_word #=> nil
        success? #=> false
        word_valid? #=> false
        word_cached? #=> false
        preprocessed_word? #=> false
        preprocessed_word_or_original_word #=> "word-block"
        filter_match? #=> false
        filters_matched: #=> []
        preprocessed_words #=> []

    word_results[4] #=>
        original_word #=> "of"
        preprocessed_word #=> nil
        success? #=> false
        word_valid? #=> false
        word_cached? #=> false
        preprocessed_word? #=> false
        preprocessed_word_or_original_word #=> "of"
        filter_match? #=> false
        filters_matched: #=> []
        preprocessed_words #=> []

    word_results[5] #=>
        original_word #=> "8"
        preprocessed_word #=> nil
        success? #=> true
        word_valid? #=> false
        word_cached? #=> false
        preprocessed_word? #=> false
        preprocessed_word_or_original_word #=> "8"
        filter_match? #=> true
        filters_matched: #=> [:numeric_filter]
        preprocessed_words #=> []

    word_results[6] #=>
        original_word #=> "words"
        preprocessed_word #=> nil
        success? #=> true
        word_valid? #=> true
        word_cached? #=> true
        preprocessed_word? #=> false
        preprocessed_word_or_original_word #=> "words"
        filter_match? #=> false
        filters_matched: #=> []
        preprocessed_words #=> []

    word_results[7] #=>
        original_word #=> "and"
        preprocessed_word #=> nil
        success? #=> true
        word_valid? #=> true
        word_cached? #=> true
        preprocessed_word? #=> false
        preprocessed_word_or_original_word #=> "and"
        filter_match? #=> false
        filters_matched: #=> []
        preprocessed_words #=> []

    word_results[8] #=>
        original_word #=> "2"
        preprocessed_word #=> nil
        success? #=> true
        word_valid? #=> true
        word_cached? #=> true
        preprocessed_word? #=> false
        preprocessed_word_or_original_word #=> "2"
        filter_match? #=> true
        filters_matched: #=> [:numeric_filter]
        preprocessed_words #=> []

    word_results[9] #=>
        original_word #=> "numbers"
        preprocessed_word #=> nil
        success? #=> false
        word_valid? #=> false
        word_cached? #=> false
        preprocessed_word? #=> false
        preprocessed_word_or_original_word #=> "numbers"
        filter_match? #=> false
        filters_matched: #=> []
        preprocessed_words #=> []

Word Search using Word Filters and Word Preprocessors Example

Continued from Creating a Dictionary from Memory example.

Note: The below use of word filters and word preprocessors apply equally to both Dictionary#word_results and Dictionary#block_results APIs.

# Set up any word preprocessors and/or word filters we want to use...

# Word preprocessors perform preprocessing on words prior to being passed through any word filters
# and prior to being checked against the literal dictionary words.
en_us_names_dictionary.add_preprocessors word_preprocessors: [LittleWeasel::Preprocessors::EnUs::CapitalizePreprocessor.new]

# Word filters check words for validity prior to being checked against the literal dictionary.
# In other words, word filters allow you to consider words valid (if they match the filter) despite not being found in the literal dictionary.
en_us_names_dictionary.add_filters word_filters: [LittleWeasel::Filters::EnUs::SingleCharacterWordFilter.new]

# Check some words against the dictionary; a LittleWeasel::WordResults object is returned...

# Try to find a name...
word_results = en_us_names_dictionary.word_results 'elijah'

# Returns true if the word is found in the literal dictionary (#word_valid?) or
# if the word matched any word filters (#filter_match?). true is returned because
# 'elijah' ('Elijah' after preprocessing) was found in the literal dictionary,
# despite not having matched any word filters.
word_results.success?
#=> true

# Returns true because 'elijah' ('Elijah' after preprocessing) was found in the
# literal dictionary.
word_results.word_valid?
#=> true

# Returns true because the word had word preprocessing applied to it.
word_results.preprocessed_word?
#=> true

# Returns false because the word (after preprocessing) did not match any word filters.
word_results.filter_match?
#=> false

# Returns the original word, before any word preprocessors were applied.
word_results.original_word
#=> 'elijah'

# The resulting word after all word preprocessors have been applied.
word_results.preprocessed_word
#=> 'Elijah'

Word filters and word preprocessors working together example...

# This builds on the preceeding example...

# Search for a word that does not exist in the literal dictionary, but matches a filter...

# Because of the LittleWeasel::Filters::EnUs::SingleCharacterWordFilter word filter,
# "i" ("I" after word preprocessing) will be considered valid, even though it's not
# literally found in the dictionary.
word_results = en_us_names_dictionary.word_results 'i'

# true is returned because  'i' ('I' after preprocessing) was matched against the
# LittleWeasel::Filters::EnUs::SingleCharacterWordFilter word filter, despite not
# having been found in the literal dictionary.
word_results.success?
#=> true

# Returns false because 'i' ('I' after preprocessing) was not found in the
# literal dictionary.
word_results.word_valid?
#=> false

# Returns true because 'i' ('I' after preprocessing) had word preprocessing applied to it.
word_results.preprocessed_word?
#=> true

# Returns true because the 'i' ('I' after preprocessing) matched the LittleWeasel::Filters::EnUs::SingleCharacterWordFilter word filter.
word_results.filter_match?
#=> true

# Returns the original word, before any word preprocessors were applied.
word_results.original_word
#=> 'i'

# The resulting word after all word preprocessors have been applied.
word_results.preprocessed_word
#=> 'I'

Rake Tasks

Below are some rake tasks that can be used as examples:

Rake Task Description
word_results:basic Creates a LittleWeasel::Dictionary from a file source (file on disk) and calls LittleWeasel::Dictionary#word_results API.
word_results:from_memory Creates a LittleWeasel::Dictionary from a memory source (Array of words) and calls LittleWeasel::Dictionary#word_results API.
word_results:advanced Creates a LittleWeasel::Dictionary from a memory source (Array of words) and calls LittleWeasel::Dictionary#word_results API. Demonstrates the use of word preprocessors and word filters.
word_results:word_filters Creates a LittleWeasel::Dictionary from a memory source (Array of words) and calls LittleWeasel::Dictionary#word_results API. Demonstrates some of the word filters that come prepackaged with LittleWeasel (LittleWeasel::Filters::EnUs::NumericFilter, LittleWeasel::Filters::EnUs::CurrencyFilter and LittleWeasel::Filters::EnUs::SingleCharacterWordFilter).
block_results:basic Creates a LittleWeasel::Dictionary from a file source (file on disk) and calls LittleWeasel::Dictionary#block_results API. Demonstrates how to use the #block_results API.

Installation

Add this line to your application's Gemfile:

gem 'LittleWeasel'

And then execute:

$ bundle

Or install it yourself as:

$ gem install LittleWeasel

Contributing

Not taking contributions just yet.

License

(The MIT License)

Copyright © 2013-2021 Gene M. Angelo, Jr.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ‘Software’), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED ‘AS IS’, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Footnotes

  1. The LittleWeasel::WordResults object returned from these APIs provides information related to the given word or words that have passed through their respective processes (i.e. word preprocessing, word filtering and dictionary checks). 2