Words

Background

A basic and standard need for this site is to have some spell checking (and maybe some more enhanced grammatical analysis later). This is commodity functionality that should almost certainly be provided by a tool such as ispell, but a quick glance showed that ispell has enough going on to not be quickly pulled in as part of my software reboot and therefore I’ll start off with something homegrown.

Usage

The specific way I’m using this will also inform how it is being built. Beyond just spell checking I’m hoping to leverage this to improve my vocabulary, which I’m planning on doing by increasing the world list each time I update the site with a word which is in use on the site. If I manage to achieve the ambituous goal of submitting at least once daily then this is roughly equivalent to receiving and using a word of the day.

Functionality

The desired functionality is fairly simple: I want to report any words which are not present in the defined set of words. Reporting the first such word is sufficient. Here “word” will be defined as a case-insenstive contiguous sequence of word characters.

The input text will be expected to be passed on stdin, and the file containing the word set will be expected as a positional argument to the command.

Implementation

Let’s get this thing started. Some of the later details impact includes and since my tangle program is presently very dumb my exposition will be compromised here in a way that is an affront to Mathematical Writing. I may also be deviating from C convention in that I’ll define functions prior to main rather than using forward declarations.

I’ll be writing this in C since it seems a relatively good fit overall and the strongest contender out of the options I’m immediately considering.

To start my word list is small enough that virtually any implementation would work so I’ll start fairly basic and then likely add a more efficient solution later. My first pass will therefore make use of the binary search call present in stdlib.h.

The initial code will presume single byte characters as unicode support is non-trivial in C. Such support will be added when needed (or hopefully the implementation will be replaced by then).

Includes

I’ll be doing some I/O, string manipulation, making use of library provided character conversions for normalization, and will be using at least the aforementioned stlib.h bsearch.


#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

Globals

The crux of this utility will be checking set membership so it makes sense to start with defining that set and the membership test.

Defining the set as a global seems reasonable to keep this focused and also keeps any type information out of the membership function which will be nice if the current implementation is replaced by something more interesting like a trie.

The number of words in the set and the allocated capacity will also be useful.

The buffer for the current line will be stored in a global to facilitate structured freeing of memory (out of laziness). TODO: This should be renamed or things should be tidied.


char **word_set;
size_t word_set_length = 0;
size_t word_set_capacity = 0;
char *line = NULL;

die

The global data should be cleaned up on exit which will be handled in a function which is designated for exiting.

This should be noreturn but I’m not sure how to reliably define that.


static void
die(const int status) {
  for (register int i=0; i<word_set_length; i++) free(word_set[i]);
  free(word_set);
  free(line);
  exit(status);
}

is_valid_word

The membership test will return a boolean (int) indicating whether the requested word is present. As mentioned earlier this will use bsearch. The comparison function is ultimately just strcmp, but that is wrapped to take care of coercing the argument types.


static const int
word_compare(const void *left, const void *right) {
  const char* const *l = left;
  const char* const *r = right;
  return strcmp(*l, *r);
}

static const int
is_valid_word(const char *word) {
  const char *found
    = bsearch(&word, word_set, word_set_length, sizeof(*word_set), word_compare);
  return found != NULL;
}

locate_word

An additional constraint to call out is that a single word should never cross over a newline. This is equivalent with the prior definition of word and the distinction that a newline is not a word character, but seems worth additionally highlighting in light of parsing impacts. Each word within a line can then be identified by the starting and exclusive ending offsets of the next word after an offset within the line. To facilitate iteration the offset at which the subsequent word may start will be returned, or an indicator that the end of the line has been reached.


#define AT_END -1

static const int
locate_word(const char *line, int offset, int *start, int *end) {

Reset Outputs

If no word is found we should be sure not to return stale data, so let’s reset the output parameters to negative values;


  *start = *end = -1;

Locate Start

I can’t think of a current need for a word that doesn’t start with a letter, so the next word would start at the next letter. If we don’t find a letter prior to end of line then return, otherwise update start.


  for (; !isalpha(line[offset]) && line[offset] != '\n'; offset++) ;
  if (line[offset] == '\n') return AT_END;
  *start = offset;

Locate End

Now that we are within a word, the word will continue until the first non-word character. Some non-alphabetic characters can be used within words. For the first iteration there will be no distinction between inner word characters and those with which may end a word. This is expected to need to change but is deferred until needed.

This will advance offset past the end of the word which establishes the exclusive end; the exclusive end allows length to be equal to \(end-start\). If we’re at the end of line that should be indicated, otherwise we’ll return forward one since we’re on a non-word character so the earliest start for the next word would be after this point.


  for (;
    (
      isalpha(line[offset])
      || line[offset] == '\''
      || line[offset] == '-'
    ) && line[offset] != '\n'
    ; offset++) ;
  *end = offset;
  return line[offset] == '\n' ? AT_END : offset+1;
}

Normalizing Words

To efficiently test for membership while ignoring aspects such as case, the word set should be normalized and each input word should also be normalized such that comparisons can be appples-to-apples.

To start this will just involve case folding. This will modify the provided argument to offload memory management concerns.


static void normalize(char *word) {
  for (int i=0; word[i] != '\0'; i++) word[i] = tolower(word[i]);
}

Loading the Word Set

For now the word set will be loaded from a file which should contain an alphabetized and normalized list of words with one word per line.

Similar to the implementation comments above this isn’t likely to be particularly efficient for large data but that won’t be a problem for quite a while.

The input will just be expected to be alphabetized since that’s easy enough to make an external concern. If it is not then some words may be missed when searching (which should then lead to the recognition of the missort when attempting to add an already present word).

Beyond aspects like compression it would also be convenient for the file to indicate size, but for now we’ll just resize as needed using the defined chunk size.

This should populate the global variable and return an error status. On an error status the program should abort, so this code will be written optimistically with the assumption that broken state won’t be used.

Each line will end in the newline character which will be omitted from the copied string.


#define CHUNK_SIZE 1000

static int
load_word_set(const char *file_name) {
  int status = 0;

  FILE *fp = fopen(file_name, "r");
  if (!fp) return -1;
  size_t len = 0;
  ssize_t line_length = 0;
  for (; (line_length = getline(&line, &len, fp)) != -1;) {
    if (++word_set_length > word_set_capacity) {
      word_set_capacity += CHUNK_SIZE;
      word_set = reallocarray(word_set, word_set_capacity, sizeof(*word_set));
      if (word_set == NULL) {
        status = -1;
        break;
      }
    }
    word_set[word_set_length-1] = strndup(line, line_length-1);
  }

  fclose(fp);
  return status;
}

main

The main function will load the word set and die on any issue, and then loop over the provided input locating and normalizing words and finally outputting and exiting on the first word in the input not present in the set.

To contain memory used the same buffer will be used for each word that is tested. Since memory is fairly cheap and words have a fairly certain upper bound on length I’ll just allocate a fixed sized buffer and error on issue.


#define MAX_WORD_LENGTH 50

int
main(int argc, char **argv) {
  if (load_word_set(argv[1])) {
    fprintf(stderr, "Error loading word set, aborting!\n");
    die(-1);
  }

  char word_buffer[MAX_WORD_LENGTH];
  int offset, start, end;
  size_t len = 0;
  for (; getline(&line, &len, stdin) != -1;) {
    for (
      offset = locate_word(line, 0, &start, &end)
      ; offset != AT_END
      ; offset = locate_word(line, offset, &start, &end)
    ) {
      if (start == -1) break;
      strncpy(word_buffer, line+start, end-start);
      word_buffer[end-start] = 0;
      normalize(word_buffer);
      if (is_valid_word(word_buffer)) continue;
      printf("Unknown word: %s\n", word_buffer);
      die(1);
    }
  }
    
  die(0);
}