weave

Overview

This is the initial version of my weave program which is used to convert source code to pandoc(1) compatible markdown. The original source is available at ./weave.c.

This utility supports weave-only literate programming: it produces documentation from the source but never generates (tangles) source from a document.

The logic provided herein is expected to be very simple. The input sources are expected to include comments consisting of the desired markdown and the program simply uses the delimiters to flip between outputting the raw markdown or code fences, with the primary differences across patterns being the defined delimiter and some filtering for the markdown (such as removing comment markers). The end result is effectively flipping the source inside out where the comments become the primary content within which the code is embedded.

In the interest of minimizing logic, the program is unlikely to be overly adaptive: the input should be well-formed and have tokens where they are expected, such as at the beginning of lines, rather than relying on extra logic to accommodate more variable input.
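
The inside-out flip described above can be illustrated with a small, self-contained sketch. It is not the program itself: the names append_text, starts_with_any and weave_lines are hypothetical, the input is an in-memory array of lines rather than stdin, and comment prefix removal is left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Append `text` to the NUL-terminated buffer `out` (capacity `cap`),
 * truncating silently if it does not fit. */
void
append_text(char *out, size_t cap, const char *text) {
  size_t used = strlen(out);
  if (used < cap) snprintf(out + used, cap - used, "%s", text);
}

/* True if `line` starts with any entry of the NULL-terminated `inflectors`. */
bool
starts_with_any(const char *const *inflectors, const char *line) {
  for (size_t i = 0; inflectors[i] != NULL; i++) {
    if (strncmp(line, inflectors[i], strlen(inflectors[i])) == 0) return true;
  }
  return false;
}

/* Hypothetical helper: weave the NULL-terminated `lines` into `out`. */
void
weave_lines(const char *const *lines, const char *const *inflectors,
            const char *open_attr, char *out, size_t cap) {
  assert(lines != NULL && inflectors != NULL && out != NULL && cap > 0);
  bool is_doc = false;
  out[0] = '\0';
  for (size_t i = 0; lines[i] != NULL; i++) {
    if (starts_with_any(inflectors, lines[i])) {
      if (i > 0) {
        /* Mode switch: emit a fence, with attributes only when code begins. */
        append_text(out, cap, "\n~~~~");
        append_text(out, cap, is_doc ? open_attr : "");
        append_text(out, cap, "\n");
      }
      is_doc = !is_doc;
    } else {
      if (i == 0) {
        /* Input that starts with code needs an opening fence first. */
        append_text(out, cap, "~~~~");
        append_text(out, cap, open_attr);
        append_text(out, cap, "\n");
      }
      append_text(out, cap, lines[i]);
    }
  }
  /* Close a code block that is still open at the end of the input. */
  if (!is_doc) append_text(out, cap, "~~~~\n");
}
```

Given the four lines "/**", "Title", "**/" and "int x;" (each ending in a newline), the C inflectors, and an opening attribute of c, the sketch yields the text Title followed by a blank line, a ~~~~c fence, the line int x; and a closing ~~~~ fence.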

Argument Behavior

The utility itself will provide behavior based on provided options which will be described as they are defined. Arbitrary syntaxes can therefore be supported so long as the structure lends itself to the defined concepts. Composed syntaxes can be defined as desired externally such that they invoke this command with the appropriate arguments.

Currently for simplicity only short-form options are supported but long-form will be preferred once I get around to stealing some mature command parsing logic. Each concept will be defined along with the appropriate flag with which it should be passed. Specifically each option should be passed as -<token><value> with no spaces; the entire argument may need quoting to bypass shell expansion. Arguments for which multiple values are desired and supported should be passed multiple times.
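
As a rough illustration of the -<token><value> shape, the following sketch splits one argument into its flag character and its value. The function name split_short_option is hypothetical and is not part of the program, which compares the two leading characters directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: split "-<token><value>" into the flag character and
 * a pointer to the value. Returns false if `arg` does not have that shape. */
bool
split_short_option(const char *arg, char *flag, const char **value) {
  assert(arg != NULL && flag != NULL && value != NULL);
  if (strlen(arg) < 2 || arg[0] != '-') return false;
  *flag = arg[1];
  *value = arg + 2;  /* points into arg; may be the empty string */
  return true;
}
```

An invocation for C sources might then look like weave '-i/**' '-i**/' '-c * ' -oc < weave.c, assuming the binary is installed as weave and that documentation lines carry a " * " prefix. The quotes keep the shell from expanding the asterisks.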

Code

Imports

The imports are fairly standard and draw from the stdlib.

The _GNU_SOURCE feature macro is defined to allow for use of getline(2) and strdup(3), and _FORTIFY_SOURCE is defined for some extra checks(4).

stdbool.h(5) is used to provide a more expressive boolean type. stdlib.h(6) is used for EXIT constants and dynamic memory management, and stdio.h(7) is used for all of its typical I/O goodies. errno.h provides errno, which is returned when an allocation fails.

This code makes use of a fair amount of string comparison, which relies on memcmp(8) and related functions from string.h(9).


#define _GNU_SOURCE
#define _FORTIFY_SOURCE 2

#include <errno.h>

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

Macros

Some macros will be defined to make the code a bit more expressive, providing some functionality that messes around with scoping slightly.

FOREACH

Iterate over a NULL-terminated array. Provide the element type and the array reference and the code will loop over the elements, assigning each one in turn to the variable it. The macro declares both it and an index counter i in the enclosing scope, so neither name should otherwise be used there.


#define FOREACH(type, array) unsigned int i = 0; type it; while ((it = (array)[i++]) != NULL)
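
A short usage sketch may make the scoping caveat concrete. The macro is repeated here, with extra parentheses, so that the block stands alone, and total_length is a hypothetical function used only for illustration. Because the macro declares i and it in the enclosing scope, each use within one function gets its own braces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FOREACH(type, array) unsigned int i = 0; type it; while ((it = (array)[i++]) != NULL)

/* Hypothetical example: sum the lengths of two NULL-terminated string arrays. */
size_t
total_length(const char **first, const char **second) {
  assert(first != NULL && second != NULL);
  size_t total = 0;
  /* The extra braces give each FOREACH its own i and it. */
  { FOREACH(const char*, first) { total += strlen(it); } }
  { FOREACH(const char*, second) { total += strlen(it); } }
  return total;
}
```

Without the extra braces the second use would redeclare i and it in the same scope and fail to compile.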

ASSIGN_ALLOC

Dealing with failed allocations is good practice but a hassle. This macro returns errno from the enclosing function if the attempted allocation failed.

It is passed the lvalue and rvalue, assigns the rvalue to the lvalue, and returns if the lvalue ends up as NULL.


#define ASSIGN_ALLOC(l, r) do { (l) = (r); if ((l) == NULL) return errno; } while (0)
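
The following sketch shows how such a macro reads at a call site. The macro is repeated here in a do/while form so that the block stands alone, and copy_string is a hypothetical function that is not part of the program.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define ASSIGN_ALLOC(l, r) do { (l) = (r); if ((l) == NULL) return errno; } while (0)

/* Hypothetical example: copy `src` into freshly allocated memory.
 * Returns 0 on success, or errno if the allocation failed. */
int
copy_string(char **out, const char *src) {
  assert(out != NULL && src != NULL);
  size_t size = strlen(src) + 1;
  char *copy;
  ASSIGN_ALLOC(copy, malloc(size));  /* returns early on failure */
  memcpy(copy, src, size);
  *out = copy;
  return 0;
}
```

Since the macro expands to a return of errno, it only fits inside functions that return an int status, and it does not release anything that was allocated earlier in the same function.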

Options Definitions

The options are collected into a struct to be passed around to functions.


typedef struct {

(-i) inflectors

The general logic behind this program involves defining tokens that signal a mode switch (between documentation and code); these tokens will be called inflectors.

My standard practice (likely inherited and tweaked from doxygen) is to use extended comment markers to demarcate documentation blocks. For example in C this translates to the sequence / * * and the symmetrical complement * * / (without interleaved spaces).

Some languages have distinct opening and closing comment delimiters whereas others do not. The logic could make use of the difference but that will be avoided since it cannot be utilized across source formats without additional special treatment and it invites the need to track additional state within the logic.


  char **inflectors;

(-c) Comment Prefix

Within comment blocks, each line may also have some prefix which should be removed. In languages with block comments (for example C) such a prefix is optional, but in languages that only support line comments it is likely to be required.

Currently the matching will be simple iteration and should likely be replaced by the use of regular expressions. As a consequence any patterns should be passed after any others for which they may be a prefix to ensure that the longest match wins.


  char **comment_prefixes;

(-o, -e) Opening and Ending Fence Attributes

The output will make use of fence attributes to enable syntax highlighting. This is a pair of strings which will be appended to opening and ending fences respectively (where the ending fences are likely to get nothing appended).

Only a single value is supported for each of these; if an option is passed more than once, the last value wins. Multiple attributes should simply be passed as the single value.


  char *open_attributes;
  char *end_attributes;

} Options;

initializer

Largely for the sake of symmetry, a function is provided to initialize an (already allocated) Options object. This will make use of static memory only (with some defined flyweight empty arrays) and so should be totally safe.

The arrays will reference an empty array and the scalars will contain empty strings.


static char *empty_char_array[] = {NULL};

static void
options_init(Options *o) {
  o->inflectors = empty_char_array;
  o->comment_prefixes = empty_char_array;
  o->open_attributes = "";
  o->end_attributes = "";
}

cleaner upper

If the Options no longer references the empty arrays it is dynamic memory that should be freed.


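static Options*
options_cleanup(Options *o) {
  if (o->inflectors != empty_char_array) {
    FOREACH(char*, o->inflectors) { free(it); }
    free(o->inflectors);  /* release the array itself as well as its elements */
    o->inflectors = empty_char_array;
  }
  if (o->comment_prefixes != empty_char_array) {
    FOREACH(char*, o->comment_prefixes) { free(it); }
    free(o->comment_prefixes);
    o->comment_prefixes = empty_char_array;
  }
  return NULL;
}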

is_inflector

Detecting an inflector is simply a matter of checking whether the current line starts with any of the inflectors defined in the Options.


static bool
is_inflector(const Options o, const char* line) {
  FOREACH(char*, o.inflectors) {
    /* strncmp stops at the end of a short line, unlike memcmp */
    if (strncmp(line, it, strlen(it)) == 0) return true;
  }
  return false;
}

without_comment_prefix

Return the current line with any optional comment prefix removed.


static const char*
without_comment_prefix(const Options o, const char* line) {
  FOREACH(char*, o.comment_prefixes) {
    const size_t cp_len = strlen(it);
    /* strncmp stops at the end of a short line, unlike memcmp */
    if (strncmp(line, it, cp_len) == 0) return line + cp_len;
  }
  return line;
}
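
Because the first matching prefix wins, the order in which prefixes are supplied matters. The sketch below isolates that behavior; strip_first_prefix is a hypothetical stand-in that takes the prefix array directly rather than an Options struct.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in: return `line` with the first matching entry of the
 * NULL-terminated `prefixes` removed, or `line` unchanged if none match. */
const char *
strip_first_prefix(const char **prefixes, const char *line) {
  assert(prefixes != NULL && line != NULL);
  for (size_t i = 0; prefixes[i] != NULL; i++) {
    size_t len = strlen(prefixes[i]);
    if (strncmp(line, prefixes[i], len) == 0) return line + len;
  }
  return line;
}
```

With the prefixes " * " and " *" given in that order, the line " * text" becomes "text" while a bare " *" line becomes empty. With the order reversed, the shorter prefix matches first and " * text" keeps a stray leading space.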

Argument Parsing

Options will be passed from command line arguments.

For now this will only support short options for simpler logic, but it should borrow logic from elsewhere and prefer longer forms.


static int
parse_args(Options *o, int argc, char **argv) {
  

Array Allocation

The options make use of several arrays which need to be allocated. These use dynamic memory given their variable size.

The arrays can be allocated based on the total number of arguments. This wastes a bit of space but it should be a negligible amount and avoids two passes or repeated allocation calls.

These use the value of argc which will be one more than the arguments, allowing for final sentinel values (though in practice that value will be far earlier in the array). calloc will zero out the values such that the NULL tests done elsewhere will work for the resulting values.

The index/offset for each array will be tracked in an integer (equal to the number of collected values), and the Options struct is made to reference the allocated arrays.

If any allocations fail, then return a failure status.


  char **ifs;
  ASSIGN_ALLOC(ifs, calloc(argc, sizeof(char*)));
  o->inflectors = ifs;
  unsigned int if_ix = 0;

  char **cps;
  ASSIGN_ALLOC(cps, calloc(argc, sizeof(char*)));
  o->comment_prefixes = cps;
  unsigned int cp_ix = 0;

Parsing Argument Strings

The current logic relies on the short-form syntax described earlier, which requires that the first two characters of the argument indicate the option and all subsequent characters within the argument provide the value. The logic applies this through basic string comparison, copying the offset-adjusted string into the allocated arrays or the Options struct. Any unknown argument produces an error.

If no issues arise, consider parsing a success.


  for (int i=1; i<argc; i++) {
    /* strncmp stops at the end of an argument shorter than two characters */
    if (!strncmp(argv[i], "-i", 2)) {
      ASSIGN_ALLOC(ifs[if_ix], strdup(argv[i]+2));
      if_ix++;
    }
    else if (!strncmp(argv[i], "-c", 2)) {
      ASSIGN_ALLOC(cps[cp_ix], strdup(argv[i]+2));
      cp_ix++;
    }
    else if (!strncmp(argv[i], "-o", 2)) {
      ASSIGN_ALLOC(o->open_attributes, strdup(argv[i]+2));
    }
    else if (!strncmp(argv[i], "-e", 2)) {
      ASSIGN_ALLOC(o->end_attributes, strdup(argv[i]+2));
    }
    else {
      fprintf(stderr, "Could not parse argument: %s!\n", argv[i]);
      return EXIT_FAILURE;
    }
  }

  return EXIT_SUCCESS;
}
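
The sizing argument can be seen in isolation in the sketch below. The function collect_values is hypothetical; unlike parse_args it stores pointers into argv instead of strdup copies and handles a single flag, so that only the array itself needs freeing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical example: gather the values of every argument that starts with
 * `flag` into a NULL-terminated array. Returns NULL if allocation fails. */
const char **
collect_values(int argc, char **argv, const char *flag) {
  assert(argc > 0 && argv != NULL && flag != NULL);
  /* argc slots hold at most argc - 1 values plus the terminating NULL,
   * which calloc has already written by zeroing the memory. */
  const char **values = calloc((size_t)argc, sizeof *values);
  if (values == NULL) return NULL;
  size_t count = 0;
  size_t flag_len = strlen(flag);
  for (int i = 1; i < argc; i++) {
    if (strncmp(argv[i], flag, flag_len) == 0) {
      values[count++] = argv[i] + flag_len;
    }
  }
  return values;
}
```

Even when every argument matches, the last slot is left untouched, so the array can always be walked until the first NULL.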

The Entrypoint

The main body of the program is a straightforward filter of lines collected from stdin. If lines start with one of the inflectors, then a code fence will be output instead.

Argument Parsing and Initialization

The arguments are parsed into an Options struct and some automatic variables are defined for reuse.

The is_doc state is tracked to determine whether to output the fence attributes, whether to strip comment prefixes, and to make sure that the output is closed appropriately (not leaving any dangling code fences). is_doc starts as false to reflect the assumption that most input formats do not start in a comment mode. This could easily be exposed as a parameter if required.


int
main(int argc, char **argv) {

  Options options = {0};
  options_init(&options);
  int parse_status = parse_args(&options, argc, argv);
  if (parse_status) return parse_status;

  bool is_doc=false;

  char *line = NULL;
  size_t len = 0;
  ssize_t nread;

Handle the First Line

The very first line is treated as a special case since it is always expected that a source that is passed through this filter will start with a documentation block. Advancing past the first inflector therefore makes sure the modes align with what pandoc will be expecting without introducing additional noise around mode switching. This is done outside of the subsequent loop to reduce the tests within that loop.


  /* getline returns -1 at end of input (or on error), never 0 */
  if ((nread = getline(&line, &len, stdin)) == -1) return EXIT_SUCCESS;

  if (is_inflector(options, line)) is_doc = !is_doc;
  else {
    printf("~~~~%s\n%s", options.open_attributes, line);
  }

Loop Over Remaining Lines

The rest of the content is handled similarly to the first line, but will consistently toggle with fences as necessary.

Any attributes for the fence are derived from is_doc and therefore the ordering of the logic is significant. Currently the decision is based on the state which is being exited since that seems to read more naturally to me (and therefore the state should be changed after the attributes are output).


  while ((nread = getline(&line, &len, stdin)) != -1) {
    if (is_inflector(options, line)) {
      printf("\n~~~~%s\n", is_doc ? options.open_attributes : options.end_attributes);
      is_doc = !is_doc;
    } else printf("%s", (is_doc ? without_comment_prefix(options, line) : line));
  }

Cleanup

If the final state for the input source is outside of a comment then a final closing fence should be output to keep with treating documentation as the primary output.

The small amount of dynamic memory is cleaned up to be courteous.


  if (!is_doc) printf("~~~~\n");

  options_cleanup(&options);
  free(line);
  return EXIT_SUCCESS;
}

1. Pandoc - About pandoc [online]. 7 May 2022. Available from: https://pandoc.org
2. GETLINE(3) - Linux Programmer’s Manual. 22 March 2021.
3. STRDUP(3) - Linux Programmer’s Manual. 22 March 2021.
4. FEATURE_TEST_MACROS(7) - Linux Programmer’s Manual. 22 March 2021.
5. IEEE/The Open Group. stdbool.h(0P) - POSIX Programmer’s Manual. 2017.
6. IEEE/The Open Group. stdlib.h(0P) - POSIX Programmer’s Manual. 2017.
7. IEEE/The Open Group. stdio.h(0P) - POSIX Programmer’s Manual. 2017.
8. MEMCMP(3) - Linux Programmer’s Manual. 22 March 2021.
9. IEEE/The Open Group. string.h(0P) - POSIX Programmer’s Manual. 2017.