Types in C

I’ve dabbled with C periodically but have historically dedicated most of my attention to other languages. While I strive to recognize and contain inherited biases, the low-level nature of C invites strategies for implementing higher-level organizational approaches, one of the most primitive of which is how to represent types.

While many of the choices in C seem like a fairly evident consequence of the language resembling portable assembly, the distinction between how types are defined in C and in most other languages seemed subtle enough that I blindly sought to “improve” upon conventions while writing C. I have only recently acquired a healthier perspective that allows the multiple styles to cohabit for reasons more substantial than entrenchment.

Typing Types

Identifier Typing

The phrase I’ll use to describe common typing practices outside of C (very much subject to change since there’s likely an existing phrase) is Identifier Typing. In this model a type is associated with an identifier/lvalue. The type imputed to such an identifier can inform tooling such as type checkers, and further indicates the shape of the memory which the identifier references. This creates a clear association between identifiers and their types, which paves the way for composing levels of abstraction out of types upon types, and for the pursuit of abstract data types which break down the barriers between provided primitives and any custom constructs.

Data Typing

Data Typing is the phrase I’ll use for the more idiomatic C style of typing that is used in K&R and in most C projects (subject to the same disclaimers as the Identifier Typing phrase). While the construction of abstract types can be very powerful, ultimately all of the data therein needs to be represented somewhere as 1s and 0s. Those bits are inherently meaningless, and so somewhere something needs to assign meaning to them and allow for encoding and decoding. This lends itself to the notion that the conversion of bits into anything useful requires the mapping of types to those bits. The need to coordinate this level of typing is a natural consequence of the low level of C.
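As a minimal sketch of bits having no meaning until a type is mapped onto them, the snippet below reads the same 32 bits once as an integer and once as a float. The function names are hypothetical, and it assumes a platform where float is a 32-bit IEEE 754 value (the common case, though not something C itself promises).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode 32 bits as a float: the bytes are copied unchanged and only
 * the type used to read them differs. Assumes float is 32 bits wide. */
float bits_as_float(uint32_t bits) {
    float f;
    assert(sizeof f == sizeof bits);  /* guard the width assumption */
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Contrast: a cast converts the numeric value, producing new bits. */
float value_as_float(uint32_t bits) {
    return (float)bits;
}
```

Under the IEEE 754 assumption, bits_as_float(0x3F800000u) yields 1.0f, while value_as_float(0x3F800000u) yields 1065353216.0f: the same bits, given two different meanings.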

Practical Differences

Simple Pointers

The above may seem entirely academic but it manifests very directly in C style. The simplest, and largely cosmetic, example is the placement of the * operator. Idiomatic C tends to define pointers using a syntax such as:

char *foo = "bar";

The above fits naturally with data typing: there is some memory of type char, and dereferencing the pointer foo references that value.

Preferring identifier typing, the above becomes:

char* foo = "bar";

This shifts the focus such that foo is of the type pointer to char. The focus of the type shifts from the underlying bits to the identifier which will act as a handle for those bits. This matches the style used in C++ and Go.
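One place where the choice stops being purely cosmetic is a declaration with more than one name. In C the * belongs to each declarator rather than to the type on the left, which the following sketch illustrates (the function names are hypothetical and exist only to make the two cases checkable):

```c
#include <assert.h>
#include <stddef.h>

/* Star written next to the type: it still binds only to `a`. */
size_t size_of_second_star_on_type(void) {
    char* a = NULL, b = 'x';   /* a is char*, b is a plain char */
    (void)a;
    (void)b;
    return sizeof b;
}

/* Star written next to each name: both are pointers. */
size_t size_of_second_star_on_name(void) {
    char *a = NULL, *b = NULL;
    (void)a;
    (void)b;
    return sizeof b;
}

/* Self-check of the two claims above. */
void check_declarators(void) {
    assert(size_of_second_star_on_type() == sizeof(char));
    assert(size_of_second_star_on_name() == sizeof(char *));
}
```

So the char* spelling reads as identifier typing, but the grammar underneath remains data typing; declaring one name per line sidesteps the mismatch.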

The pointer/dereference example is a simple one and, due to the simplicity of C itself, the only flexible one that occurs to me in basic usage of the language. The usage of compound types with typedefs is where the conceptual differences become prominent.

Disclaimer

For the purposes of this topic, these styles will be treated in isolation. Some of the differences may introduce additional considerations such as signals or impacts to memory management and performance. Such concerns may be discussed later separately.

Pointer Typedefs

A natural extension of the simple pointer convention is whether to wrap the pointer operator within a typedef. Definitions and sample usage for data typing resemble:

typedef char phone;
int call(phone *number);
...
phone *mobile;
...
*(mobile + len) = '\0';

whereas identifier typing resembles:

typedef char* phone;
int call(phone number);
...
phone mobile;
...
*(mobile + len) = '\0';

As with the simple pointer section, this reflects a difference in whether the type itself contains information from which the code is potentially insulated or whether the types primarily act as mappings to the underlying data. As implied by the examples, manipulation of the underlying data remains unchanged, but those areas of code which don’t need to open up the values can work with a bit more ignorance; the specifics of the contents of such typedefs can be contained within those typedefs and a more focused subset of the code through which the values travel.
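The two fragments above can be placed side by side in one compilable sketch. Since a single file cannot define phone both ways, the typedef names phone_data and phone_handle and the function names are hypothetical stand-ins:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Data typing: the typedef names the data, the pointer stays visible. */
typedef char phone_data;

void truncate_data(phone_data *number, size_t len) {
    assert(number != NULL);
    *(number + len) = '\0';
}

/* Identifier typing: the typedef carries the pointer with it. */
typedef char *phone_handle;

void truncate_handle(phone_handle number, size_t len) {
    assert(number != NULL);
    *(number + len) = '\0';
}

/* Code that only measures or forwards the value never opens it up. */
size_t digits_data(const phone_data *number) { return strlen(number); }
size_t digits_handle(phone_handle number) { return strlen(number); }
```

The bodies are identical; only the signatures differ in how much of the representation they announce.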

I’ve heard the view that the fact that a value is a pointer should always be exposed. That aligns very obviously with the Data Typing approach, while likely also helping to establish ownership of objects and inviting more liberal use of keeping objects on the stack. Neither of those benefits seems guaranteed, however, whereas the potential for leaking knowledge seems very real, along with an invitation for less consistent interactions with types.

Struct Typedefs

Complementary to wrapping pointers in typedefs, the same can be done with structs. This is done automatically in C++, which reflects that language’s preference for abstract data types as opposed to data typing. Examples for this pattern do not seem particularly valuable, as the only difference is that without the typedef a named struct is defined only within the tag namespace, and therefore each usage of the type must be prefaced by the struct keyword.

As with pointers, I’ve heard the perspective that structs should always carry the struct keyword. As with pointers, the fact that something is a struct informs how it should be acted upon, and so ultimately much of the practical impact only applies to that code which does not need to peer inside of the data within an identifier. This is also very likely to combine with concerns around pointers, as many identifiers may be references to structs and structs themselves are likely to contain fields which are themselves a compound type.

From a data typing perspective, keeping all of this information is highly valuable as it more directly tracks the memory usage of your code and can help drive interactions with that memory. From the perspective of sweeping certain knowledge under appropriate rugs, however, it can lead to arguably clumsy code.

Given a very contrived snippet such as:

int configureIt(struct myObj *o, struct myConfig *c) {
  struct myConfig sampleConfig = tryConfig(o);
  int status = isValid(sampleConfig);
  if (status) *c = sampleConfig;
  return status;
}

None of the logic present actually cares what is behind those identifiers, yet it still needs to declare some aspects of the underlying data. If for any reason the types used were shifted such that myObj were accessed directly to make use of pass-by-value semantics, or if myConfig were shifted to exclusively use dynamic memory to provide polymorphism, then this code would need to be updated even though the logic would remain unchanged. Across a larger project the overhead of needing to care about this information, along with the potential for needing to make updates (though tooling can absorb most of that cost), can add up.
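For comparison, here is a sketch of the same function under identifier typing. The struct fields, the Obj and Config typedef names, and the bodies of tryConfig and isValid are all hypothetical, since the snippet above does not define them; they exist only so the example compiles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical definitions; the original snippet does not show them. */
struct myObj { int requested; };
struct myConfig { int level; };

/* The representation choices live here and nowhere else. */
typedef struct myObj *Obj;        /* currently a pointer to a struct */
typedef struct myConfig Config;   /* currently a struct held by value */

/* Stand-ins for the helpers the snippet calls. */
Config tryConfig(Obj o) {
    assert(o != NULL);
    Config c = { o->requested };
    return c;
}

int isValid(Config c) { return c.level >= 0; }

/* Same logic as before, with no struct keyword and no star on Obj. */
int configureIt(Obj o, Config *c) {
    Config sampleConfig = tryConfig(o);
    int status = isValid(sampleConfig);
    if (status) *c = sampleConfig;
    return status;
}
```

If Obj later became a by-value struct, only the typedef and the helpers that look inside it would change. The out-parameter c remains a visible pointer, so the insulation is partial rather than complete.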

Takeaways

While the tone of most of this may reflect my experience and bias, my fully realized attitude is not one of preference but rather recognition of the alternatives. Perhaps most notably the above consciously avoids the syntax for C arrays which is immovably locked on to the data typing model and is a wart elsewhere unless arrays are simply avoided (which I tend to do anyway). In recognition of that constraint along with the widespread conventions the intent here is to establish a mental model which allows me to embrace typical typing in C without needing to silence an inner critic suggesting that it is in some way inferior.
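To make the array point concrete: the array declarator wraps the identifier, so the type cannot be written as a single unit to its left. A typedef can give the array type one name, but its own definition still uses the wrapped form. The names below are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* The element type sits on the left and the length on the right of
 * the name. Writing `char[80] line;` is not valid C. */
size_t size_declared_directly(void) {
    char line[80];
    return sizeof line;
}

/* A typedef names the whole array type, yet is itself declared
 * with the length hanging off the new name. */
typedef char line_buf[80];

size_t size_declared_by_typedef(void) {
    line_buf line;
    return sizeof line;
}

/* Both spellings describe the same 80-byte object. */
void check_array_sizes(void) {
    assert(size_declared_directly() == 80);
    assert(size_declared_by_typedef() == 80);
}
```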

In addition to the evolution of type systems in languages, another likely contributing factor which has been alluded to throughout is the level of abstraction at which programming is being performed. Having spent most of my time several layers above systems-level programming, I’d been able to lean on those layers such that many of the values provided by data typing seem unnecessary. More time spent writing such code and considering more of the specific interactions with hardware would likely instill in me a desire to not hide some of the key parameters for such interactions, which certainly fits in with the portable assembly language perspective. A slightly more conceptual dressing for that perspective is a clearer distinction between data and the logical text which is acting upon it. Conferring non-essential semantics to the data can quickly blur such a boundary.

In practical terms I’ll be preferring data typing for most C utilities that I write, as that matches community practices and avoids the inconsistency with array syntax. In larger projects I will use identifier typing along with many other non-idiomatic C practices, and will typically also port such projects to another language. In the long term it is unlikely that I will write much C code, so the distinction is likely to be moot. Any C projects I do write are likely to be relatively small and low level and therefore would use data typing. I’m somewhat more likely to contribute to existing C projects, at which point any standard within those projects would be followed.