weave

Overview

This is the initial version of my weave program, which converts source code to pandoc-compatible markdown. The original source is available at ./weave.cpp.

Ultimately this will likely be replaced by defining appropriate pandoc input sources.

This utility supports weave-only literate programming.

The logic provided herein is expected to be very simple. The input sources are expected to include comments consisting of the desired markdown and the program simply uses the delimiters to flip between outputting the raw markdown or code fences, with the primary differences across patterns being the defined delimiter and some filtering for the markdown (such as removing comment markers). The end result is effectively flipping the source inside out where the comments become the primary content within which the code is embedded.

In the interest of minimizing logic the program is unlikely to be overly adaptive: the input should be well-formed and have tokens where they are expected such as at the beginning of lines.

This is currently being ported back from C++ to C, since I had been dusting off C++ but that effort is now paused.

Supported Input Formats

Source formats will be implemented as needed as will any supporting parameterization. The first supported format will be C code so that the program can be applied to itself.

Code

Imports

The imports are fairly standard and draw from the stdlib.


#include <cstdio>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using namespace std;

Syntax Definitions

Each supported syntax can be defined in a struct and the relevant struct will be returned based on a provided argument.

struct Syntax {

Mode Toggles

The general syntax for this program will involve defining some form of token that signals a mode switch (between documentation and code), and an optional comment prefix which will be removed if present.

My standard practice (likely inherited and tweaked from doxygen) is to use extended comment markers to demarcate documentation blocks. For example, in C this translates to the sequence /** and its symmetrical complement **/.

Some languages have distinct opening and closing comment delimiters whereas others do not. The logic could make use of the difference but that will be avoided since it cannot be utilized across source formats without additional special treatment.

Detecting a mode toggle is then simply a matter of checking whether the current line starts with any of the defined toggle tokens.

  const vector<string> mode_toggles;

  // True when the line begins with any of the toggle tokens.
  bool
  is_mode_toggle(const string& line) const {
    for (const auto& mt: mode_toggles) {
      if (line.substr(0, mt.size()) == mt) return true;
    }
    return false;
  }

Fence Attributes

The output will make use of fence_attributes to enable syntax highlighting. This is a pair of strings which will be appended to opening and closing fences respectively (where the closing fences are likely to get nothing appended).


  const vector<string> fence_attributes;

Comment Prefix

Within comment blocks, each line may also carry a prefix which should be removed. In languages with block comments (for example C) such a prefix is optional, but in languages that only support line comments it is likely to be required.

Currently the matching is a simple iteration that stops at the first match; it should likely be replaced by regular expressions. As a consequence, any pattern that is a prefix of another pattern must be listed after it so that the longest match wins (for example "# " before "#").


  const vector<string> comment_prefixes;

  // Returns the line with the first matching comment prefix removed.
  string
  without_comment_prefix(const string& line) const {
    for (const auto& cp: comment_prefixes) {
      if (line.substr(0, cp.size()) == cp) return line.substr(cp.size());
    }
    return line;
  }

};
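The ordering rule for comment_prefixes can be tried out in isolation. The following is a minimal standalone sketch, not part of weave itself: the free function strip_first_prefix is a hypothetical name that mirrors what without_comment_prefix does, taking the prefix list as a parameter so different orderings can be compared.

```cpp
#include <string>
#include <vector>

// Hypothetical standalone helper (not part of weave): removes the first
// listed prefix that the line starts with, like without_comment_prefix.
std::string strip_first_prefix(const std::vector<std::string>& prefixes,
                               const std::string& line) {
  for (const auto& p : prefixes) {
    // compare(0, n, p) == 0 holds only when the line begins with p.
    if (line.compare(0, p.size(), p) == 0) return line.substr(p.size());
  }
  return line;
}
```

With the prefixes ordered {"# ", "#"} the line "# Title" becomes "Title"; with the order reversed, "#" matches first and the result keeps a leading space (" Title"), which is why the longer pattern should come first.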

Supported Source Syntaxes

Zero Value

To avoid null pointers, a zero value will be defined. This should be used in places where a reference is required but should not be accessed.


static const Syntax syntax_0 = Syntax{};

C

C uses /** and **/ to indicate comments to weave and enables an optional * prefix.

Spaces are included in some of the values to enable the left column of *s to align neatly.


static const Syntax syntax_c = Syntax{
  .mode_toggles = {"/**","**/"," **/"},
  .fence_attributes = {"{.c}", ""},
  .comment_prefixes = {" * "},
};

CPP

C++ is mostly the same as C but uses a format-appropriate fence attribute. There is likely to be a lot of such overlap, which will probably be reorganized using shared factory methods or similar once there is enough mass to warrant the attention.


static const Syntax syntax_cpp = Syntax{
  .mode_toggles = {"/**","**/"," **/"},
  .fence_attributes = {"{.cpp}", ""},
  .comment_prefixes = {" * "},
};

Make

Make follows the common pattern of commenting with hashes (pound signs, octothorpes), and the Bash definition that follows it does the same. The prefix on each line is expected to have a subsequent space for legibility.


static const Syntax syntax_make = Syntax{
  .mode_toggles = {"##"},
  .fence_attributes = {"{.Makefile}",""},
  .comment_prefixes = {"# ", "#"},  
};

static const Syntax syntax_bash = Syntax{
  .mode_toggles = {"##"},
  .fence_attributes = {"{.bash}",""},
  .comment_prefixes = {"# ", "#"},  
};
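The overlap noted in the CPP section (and visible again between Make and Bash) could eventually be factored out with small factory functions. The following is only a sketch of that idea under stated assumptions: SyntaxSketch is a simplified stand-in for Syntax with non-const data members, and make_c_family and make_hash_family are hypothetical names that do not exist in weave.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for Syntax (data members only) so the sketch
// compiles on its own; this is an assumption, not the real struct.
struct SyntaxSketch {
  std::vector<std::string> mode_toggles;
  std::vector<std::string> fence_attributes;
  std::vector<std::string> comment_prefixes;
};

// Hypothetical factory for formats that use C-style block comments.
SyntaxSketch make_c_family(const std::string& fence_attribute) {
  return SyntaxSketch{{"/**", "**/", " **/"}, {fence_attribute, ""}, {" * "}};
}

// Hypothetical factory for hash-commented formats such as Make and Bash.
SyntaxSketch make_hash_family(const std::string& fence_attribute) {
  return SyntaxSketch{{"##"}, {fence_attribute, ""}, {"# ", "#"}};
}
```

Under this arrangement each definition would shrink to a single call such as make_c_family("{.cpp}"), with only the fence attribute varying between members of a family.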

Parsing Syntax

The source syntax will be selected by a command line argument. Currently the parsing is simple and won’t provide much additional help or feedback. A library will likely be pulled in later to provide things like validation and informing usage.

To start we’ll expect a single flag of the form -f<format>, where format is one of the defined syntaxes (in all lower case), for example -fcpp.

This function will make use of pair which acts as an Either type. There is likely some more idiomatic solution to this somewhere or other, but otherwise I may at least look at adding some helper functions.


static pair<int, Syntax>
parse_syntax(string arg) {
  if (arg.substr(0, 2) != "-f") {
    fprintf(stderr, "Unknown argument, bailing!\n");
    return pair{-1, syntax_0};
  }
  string arg_val = arg.substr(2);

  if (arg_val == "c") return pair{0, syntax_c};
  else if (arg_val == "bash") return pair{0, syntax_bash};
  else if (arg_val == "cpp") return pair{0, syntax_cpp};
  else if (arg_val == "make") return pair{0, syntax_make};
  else {
    fprintf(stderr, "Unknown syntax, bailing!\n");
    return pair{-1, syntax_0};
  }
}
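As for a more idiomatic alternative to using pair as an Either type, one candidate (assuming C++17 or later) is std::optional, where an empty value takes the place of the -1 status. The sketch below covers only the flag-splitting step; parse_format_flag is a hypothetical name, and unlike parse_syntax it also rejects a bare -f with no format name.

```cpp
#include <optional>
#include <string>

// Hypothetical helper (not part of weave): extracts the format name from
// an argument of the form -f<format>. An empty optional signals failure.
std::optional<std::string> parse_format_flag(const std::string& arg) {
  // compare(0, 2, "-f") is non-zero unless arg begins with "-f".
  if (arg.compare(0, 2, "-f") != 0) return std::nullopt;
  std::string format = arg.substr(2);
  if (format.empty()) return std::nullopt;  // -f given without a format
  return format;
}
```

A caller could then test the result with has_value() before looking up the matching Syntax, instead of unpacking a status code.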

The Main Loop

The main body of the program is a straightforward filter of lines collected from stdin. If lines start with one of the mode toggles, then a code fence will be output instead.

The very first line is treated as a special case since it is always expected that a source that is passed through this filter will start with a documentation block. Advancing past the first marker therefore makes sure the modes align with what pandoc will be expecting and this is done outside of the subsequent loop to reduce the tests within that loop.

The state is tracked to determine whether to output the fence attributes, to strip comment prefixes, and to make sure that the output is closed appropriately (not leaving any dangling code fences).


int
main(int argc, char **argv) {

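  // Guard against a missing argument before touching argv[1].
  if (argc < 2) {
    fprintf(stderr, "Usage: weave -f<format>\n");
    return -1;
  }

  auto [status, syntax] = parse_syntax(string{argv[1]});
  if (status) return status;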

  string line;
  bool is_doc=false;

  if (!getline(cin, line)) return 0;

  if (syntax.is_mode_toggle(line)) is_doc = !is_doc;
  else {
    printf("~~~~%s\n%s\n", syntax.fence_attributes[is_doc].c_str(), line.c_str());
  }

  while (getline(cin, line)) {
    if (syntax.is_mode_toggle(line)) {
      is_doc = !is_doc;
      printf("\n~~~~%s\n", syntax.fence_attributes[is_doc].c_str());
      continue;
    }
    printf("%s\n", (is_doc ? syntax.without_comment_prefix(line) : line).c_str());
  }

  if (!is_doc) printf("~~~~\n");
}
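To experiment with the loop without going through stdin, the same state machine can be sketched as a function over streams. This is an illustrative variant, not the program's actual structure: weave_cpp and weave_cpp_string are hypothetical names, and the C++ toggles, prefix, and fence attribute are hard-coded rather than read from a Syntax.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// True when line begins with prefix.
static bool starts_with(const std::string& line, const std::string& prefix) {
  return line.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical stream-based variant of the main loop, hard-coded to the
// C++ settings: toggles "/**", "**/", " **/"; prefix " * "; fence "{.cpp}".
void weave_cpp(std::istream& in, std::ostream& out) {
  std::string line;
  bool is_doc = false;

  // First line: either enter documentation mode or open a code fence.
  if (!std::getline(in, line)) return;
  if (starts_with(line, "/**")) is_doc = true;
  else out << "~~~~{.cpp}\n" << line << "\n";

  while (std::getline(in, line)) {
    if (starts_with(line, "/**") || starts_with(line, "**/") ||
        starts_with(line, " **/")) {
      is_doc = !is_doc;
      // Entering code opens a fence with the attribute; leaving closes it.
      out << "\n~~~~" << (is_doc ? "" : "{.cpp}") << "\n";
      continue;
    }
    if (is_doc && starts_with(line, " * ")) line = line.substr(3);
    out << line << "\n";
  }

  // Close a fence left open by trailing code.
  if (!is_doc) out << "~~~~\n";
}

// Convenience wrapper for trying the sketch on an in-memory string.
std::string weave_cpp_string(const std::string& source) {
  std::istringstream in(source);
  std::ostringstream out;
  weave_cpp(in, out);
  return out.str();
}
```

Given the four input lines "/**", " * Title", " **/" and "int x;", this sketch emits "Title", a blank line, "~~~~{.cpp}", "int x;" and a closing "~~~~", which matches the inside-out flip described in the overview.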