nullptr has been such a Godsend for C++. Good to see it coming to C.
If you ever see the macro NULL in code, be afraid. There are two valid ways of defining the macro, and they cause weird issues when porting code. For example, in the statement printf("%p %s\n", NULL, "Hello world!"), one of the definitions leads to NULL being interpreted as a null pointer, and the other leads to NULL being interpreted as an integer. The latter may crash if integers and pointers are different sizes.
It also causes problems with C++ overloading if one overload takes a pointer and another takes an integer.
> If you ever see the macro NULL in code, be afraid. There are two valid ways of defining the macro
Not on a Posix system, where the only valid definition of it is `(void*)0`. C could have adopted this definition.
Nullptr is needed in C++ because `0` is the only definition of `NULL` that works with the type system, due to the lack of implicit `void*` conversions.
C doesn't have this problem.
Adopting the Posix definition of NULL in the standard would have been sufficient -- and unlike `nullptr`, would have solved bugs in existing programs.
> Nullptr is needed in C++ because `0` is the only definition of `NULL` that works with the type system, due to the lack of implicit `void*` conversions.
> C doesn't have this problem.
Except for conversions between data pointers and function pointers ;)
Initialization seems to be a special case, but with `-pedantic` the following code will show a warning on the initialization of `fp2`:
That's a fine general sentiment. However, in this context it's a problem if you want to assign NULL to a pointer without a cast, which is why C++ added the magically convertible nullptr in addition to the magically convertible `0` constant.
char *x = 0; // ok in C and C++
char *y = (void*)0; // ok in C, error in C++
char *z = nullptr; // ok in C++
therefore:
#define NULL ((void*)0) // Required by Posix C, invalid C++
#define NULL 0 // Pre-nullptr, the only valid C++ definition
C++ can't define NULL the safe way that Posix C does.
I don't understand why it's more acceptable to allow magic `0` conversions than magic `(void*)0` conversions, given that the latter is far less likely to happen by accident -- but here we are.
- NULL is idiomatic: using NULL is entrenched in C programming and it is not going away.
- In spite of nullptr existing now, NULL is still (quite stupidly) not required to just expand to nullptr, but to an implementation-defined null pointer constant, rather than #define NULL nullptr. (According to the N2596 draft).
- They had over 30 years to tighten the requirements on how NULL can be defined; what's the matter? C99 could already have required NULL to be ((void *) X) where X is an integer-typed constant expression evaluating to zero.
I'm not going to start using nullptr. It's not idiomatic C. I'm going to hold out hope that NULL will be fixed so that it expands to nullptr.
--
Also, it's possible for a compiler to diagnose when a constant, zero-valued expression is used as the argument of a variadic function. The diagnostic can be confined to cases when such a constant expression is the result of macro expansion:
> In spite of nullptr existing now, NULL is still (quite stupidly) not required to just expand to nullptr, but to an implementation-defined null pointer constant, rather than #define NULL nullptr. (According to the N2596 draft).
This is so silly. I sort of get why not
(can't break the dork that decided to do
int i = NULL;
i++;
)
But, at the same time... I almost feel like this is a "you are being a dork, go fix your code." moment. This isn't the sort of break where someone would see it and go "Oh yeah, assuming NULL is anything other than nullptr is dumb!"
Why not? We've broken the dork who used undeclared functions, void main, gets ...
(It's the same funking dork anyway. You know who you are, I'm looking at you!)
Note that
int x = ((void *) 0);
will actually work in GCC and get you a zero into x, just with a conversion warning. The dork is unaffected; their code works and they don't read warnings.
See https://gcc.gnu.org/pipermail/gcc/2023-May/241264.html for why this might not be the case anymore soon - although I suppose that adding in -fpermissive or -Wno-error=conversion or something like that isn't too much effort.
But does C need a nullptr keyword? If you're programming in C, you usually define 0 as an invalid value, or a null value. C doesn't have the insane type system C++ has and doesn't have a very strong need to make a distinction between a pointer or an integer, since they're all in the end numbers.
The printf example you gave is an example of garbage in, garbage out. If NULL is a macro not defined as a pointer sized integer, then you're at fault here.
> If you're programming in C, you usually define 0 as an invalid value, or a null value.
That was also the usual pattern in C++ when there was no alternative. Once nullptr was introduced in C++, NULL or 0 quickly became a code smell.
> C doesn't have the insane type system C++ has and doesn't have a very strong need to make a distinction between a pointer or an integer, since they're all in the end numbers.
C++'s type system is far from insane. It's actually one of its killer features.
You're both entirely oblivious to the need not to conflate pointers with integers, and you fail to present any case in favour of the legacy and broken use of NULL, in the process leaving a whole family of known error patterns unaddressed.
> The printf example you gave is an example of garbage in, garbage out. If NULL is a macro not defined as a pointer sized integer, then you're at fault here.
Again, you seem to be completely oblivious to the problem domain. NULL is not a macro as far as C or C++ compilers are concerned. NULL is a magic constant that's resolved at preprocessing time. Replacing NULL with nullptr means a magic constant is replaced by a concrete type, and thus a whole family of errors can be avoided with compile-time checks. Claiming that the developers who wrote in bugs are at fault for inadvertently adding bugs makes no sense, because it does not solve any problem at all and is just cynical finger pointing. I take compile-time checks over unhelpful finger pointing all day every day.
The original mistake by the standards committee was allowing implicit conversions from integer to pointer. I.e. allowing NULL to be defined as simply 0.
If NULL had been defined always as ((void *) 0) then I don't see that we would have had a problem.
But that's all history now and in this situation I can see that adding nullptr becomes a reasonable way out.
It's ironic though that the fix for the different ways to write null is to add yet another way.
As per the C standard, NULL is an implementation-defined null pointer constant.
Macros are resolved in the preprocessing step. The compiler does not know what a macro is. What the compiler knows is whatever the preprocessor passes off in place of the macro. This means the compiler only sees a constant, and has no way to tell what that constant means.
If instead of passing random pointer constants you pass an actual type, now the compiler can tell more things.
> If NULL had been (...)
Irrelevant. The whole point is that it wasn't. The committee looked at the problem and determined that using a dedicated type is safer, more powerful, and more elegant than passing magic numbers around.
> If NULL is a macro not defined as a pointer sized integer, then you're at fault here.
If it was you who wrote stdlib.h, sure; otherwise, if you’re on a platform where NULL is traditionally defined as 0 and not (void *)0, you’re stuck. A conformant implementation is free to use either definition.
If you want to language-lawyer more heavily, C does not require there to be pointer-sized integers (uintptr_t is optional), does not require that all zero bytes represent a null pointer in memory (unlike for integers), does not require that the implementation choose to store an integer with value zero as all zero bytes (there may be other valid representations), and in any case does not require an implementation to do anything reasonable at all if the caller passes an integer but a vararg callee looks for a pointer (think separate integer and pointer registers).
[I’m not entirely sure if (void *)(void *)0 is a null pointer constant (though it’s certainly an expression that evaluates to a null pointer)—does it count as a zero-valued integer constant expression cast to a pointer to void? So you might not even be able to use (void *)NULL as a hedge against bad platform headers.]
I don’t think you are? Redefining a reserved identifier is UB per ISO C (any version) 7.1.3p2, and per 7.1.3p1,
> Each macro name in [the standard library] is reserved for use as specified if any of its associated headers is included; unless [you’re #undef’ing a function also provided as a macro].
The general idea seems to be that standard headers are allowed to use macros they define, even in other macros they define, and because macro names are late-bound (ugh), even if the user only redefines the name afterwards, every macro that uses it will then be affected.
As a silly example, a valid part of stdlib.h could be
Do it where it works (hopefully some platform for which you build all the time), and get your enhanced diagnostics there; avoid it where there are problems like this.
Assuming 0 is an invalid value is not always correct. 0 is a perfectly valid pointer, and making it impossible to refer to that location is bad. Of course, if you are not writing an OS or embedded system you won't ever have a pointer value of 0 anyway, as the OS can put things elsewhere with no problem (if you are, check your CPU docs: on some CPUs 0 is invalid, on some it is not).
Umm, no. 0 is the null pointer constant, same as nullptr. It is not a location, but an abstraction. If a platform's null pointer happens to be the address 0xFFFFFFFF, then 0 will produce that.
There is no difference between
char *p = nullptr;
and
char *q = 0;
other than the variable name; the two have to compare equal: (p == q).
What's wrong with 0 is that when it's not in a context where it's being converted to a pointer type, it's just an integer.
The problem is if 0 is a valid pointer and I write
volatile int *x = 0;
*x = 0x1234;
Did I just dereference the null pointer or make a valid write to that memory location? There is no way to know for sure; you can only apply heuristics to make a guess.
Of course if the lines are that closely spaced you can guess, but in real code they can be in different translation units.
ISO C says that x isn't a valid pointer for dereferencing, so *x = 0x1234 is undefined behavior. If in your environment that pointer refers to some location where you can put a value, and this is documented, then you're using a documented extension.
If you're using some environment in which x isn't the zero address, but you do want the actual zero address, you need some other way to obtain it, like converting a non-constant expression:
const int x = 0;
char *zeroaddr = (char *) x;
C implementations are not required to diagnose null dereferences. In many familiar environments, it's arranged by the way the program is loaded: an unmapped page of virtual memory is put at that address. That scheme can be defeated if an offset is involved, as in ptr[large_offset], or ptr->member where member is at a large offset into the struct type.
It's possible to have run-time checks for a null pointer as a compiler option, with a run-time penalty. Other than that, you can use assertions to defend against them.
0 never should have been overloaded in C to refer to the NULL pointer. With pointer assignment and comparison it transforms to the platform's encoding for NULL which isn't necessarily all zeros. No other literal has this sort of magic.
This has nothing to do with the preprocessor. The concept of NULL existed before the macro was standardized. Literal zeros were the way to refer to it which was a design mistake.
I think it's a fine design. Obviously, the C++ people who invented
virtual void fun() = 0;
were blind to whatever mistake it embodies.
When I gained the proper understanding that 0 is a null pointer constant, I immediately stopped using NULL. It's so much nicer.
- It works fine in a typeless language in which every value is a machine word, and integers and the null pointer share exactly the same representation.
- It works fine in a strongly typed language, where context indicates what zero means, even in the context of different types of different sizes and representations.
Unfortunately C (and also C++) contains some untyped areas, like variadic argument lists.
I would have designed it differently: only the token 0 would have the special overloaded meaning, and not all constant zero-valued expressions of any integer type. So (0 + 0) would not be a null pointer.
In unsafe contexts, the use of a 0 token would be diagnosed.
printf("...", ... 0, ...); // diagnosed
printf("...", ... 0L, ...); // not diagnosed: argument of long type
printf("...", ... 0 + 0, ...); // not diagnosed: argument of int type
printf("...", ... (void *) 0, ...); // not diagnosed: null pointer of (void *) type.
I think I would also not have the hexadecimal or octal zero tokens be the null pointer constant:
char *p = 0; // null pointer, no diagnostic
char *q = 0x0; // error, integer to pointer with no cast
The downside is that the AST would have to retain that representation detail somewhere (possibly just in a single Boolean flag).
This isn't just a C++-ism. The null pointer constants are only more prominent in C++ because of the rejection of void * even though it isn't any "safer" to have a special integer literal with the same semantics. Ultimately, it all comes from K&R C.
There are legitimate cases where you need a function pointer assigned to address zero (reset vectors commonly). The correct behavior in C is ambiguous if the null address isn't also 0 since the standard doesn't call out special behavior for function pointers. That wouldn't be the case if nullptr had been standardized earlier and there was no need for the magic 0 as a null pointer constant.
That's a compiler extension. In C17, 6.5.16.1 (Simple assignment) implies that the RHS of an assignment to a pointer must either have pointer type or be a null pointer constant (i.e., an integer constant equal to 0, or such a constant casted to pointer type), and 6.7.9 (Initialization) states that "the same type constraints and conversions as for simple assignment apply" to expressions used as initializers.
This particular rule is essentially unchanged from C89 (3.5.7 for initialization, 3.3.16.1 for simple assignment) and from C99 (6.7.8 for initialization, 6.5.16.1 for simple assignment). Generally, most existing rules don't change much at all across the C Standard versions.
The address is still 3 which has valid applications. C is permissive enough to run on platforms that don't use address 0 for NULL. With pointer operations the compiler will change the encoding from 0 to that platform's NULL address.
int *p = 0;
intptr_t i = (intptr_t)p;
if(i == 0) ... // Isn't always true
Well, the fault depends on who "you" are: the NULL macro generally comes from one's libc, and allegedly some libc maintainers have been very obstinately against changing their NULL macros to have pointer type.
Aren't there platforms where pointers have additional type or space information encoded that is orthogonal to the numeric address? It's only by convention that NULL == 0 because on platforms like Intel & ARM you would typically not use the first page. But that's only a convention, and you could just as easily put a null page at the top of your address space, especially in systems with an MMU where mappings can be added, removed, or remapped as-needed.
> It's only by convention that NULL == 0 [...] and you could just as easily put a null page at the top of your address space [...].
Technically NULL == 0 always because the standard special-cases zero-valued integer constant expressions; (uintptr_t)NULL == 0 or NULL == *(void **)calloc(1, sizeof(void *)) is another matter :)
Language lawyering aside, a non-all-zeroes representation of NULL will probably blow up most C programs [e.g. static-storage-duration initialization is now not the same as calloc or memset(,0,) and is even type-specific]. Like CHAR_BIT, that’s a joint that technically exists but has been rusted for decades (pun not intended).
There is no problem with static initializations with a null pointer that is not all zero bits, or a floating-point 0.0 that is not all zero bits.
Those values just cannot participate in the "BSS" trick, whereby everything that is zero-initialized is put into a special section that doesn't actually exist in the program image, and is only provided on startup.
Those values would go into the initialized data section.
The problem with 0.0 or null pointers not being all zero bits is all the code that uses calloc or memset zero.
If this is on some specialized platform (e.g. DSP chip), it might not matter that vast quantities of C code are not portable.
In general, compiler (and to a great extent instruction set architecture!) designers are quite hamstrung by the expectations of C programmers and programs; that has been the situation for some thirty years now.
Today, you could not successfully introduce a system in which pointers to bytes (void *, char *) have a different representation from other pointers (let alone a different size, lord forbid).
Hardware memory tagging schemes carefully avoid breaking common C idioms.
What breaks are
- truly dubious programs which use pointers in ways they really shouldn't, and are almost certainly just bugs; like use-after-free. When a pointer with an out-of-date tag is passed to the library, that is pretty much a confirmed use-after-free, which is not a legitimate idiom of non-maximally-portable C, like assuming that all pointers are the same size.
- C programs which manipulate the pointer representation: e.g. run-times for dynamic languages that put their own tags in pointers. These can easily work around tagging. (I've dealt with this on Android recently, quite easily. For the affected objects, I strip the tag away, and work with the untagged pointers. When it's time to pass the pointer back to Android's library, I put the tag back. All other code works as before.)
Incidentally, AFAICT C23 now allows stashing log2 alignof(max_align_t) bits in pointers in a portable (if awkward) manner: for example, for a char pointer p, the lowest such bit can be retrieved with memalignment(p)<2 and masked off with p-(memalignment(p)<2).
(C++’s std::aligned would be less awkward, but, well, memalignment() is what we get.)
You still can’t portably store pointers and integers (or doubles) in the same place and distinguish them, though.
In C, the integer 0 is explicitly defined to convert to a null pointer for all assignments, casts, comparisons, etc., regardless of what the pointer's "actual" value is. The only time where you can see that a null pointer doesn't have numeric value 0 is when you manipulate its object representation with memset, memcpy, etc. The compiler is also at liberty to return whatever it wants when you convert a null pointer to an integer, except that converting it back must produce a null pointer (if it's at least as wide as intptr_t).
Particularly when using structs, this removes a lot of ambiguity without having to chase indirection to find the enum's underlying type (or encode it in the name, Hungarian style).
enum D : uint8_t {
A = 0,
B = 1,
C = 2
};
typedef struct {
enum D f;
} __attribute__((packed)) E;
static_assert(sizeof(E) == 1);
etc. could make grokking protocol declarations with enums less onerous, requiring one less level of indirection.
As a sneering C++ programmer, why are you even reading / commenting on a new C standard? This is basically a "if you don't have anything nice to say, don't say it" situation.
Honestly, because there's very little c++ content here on HN and a relatively large amount of C content. Most of the C content is full of people saying "we don't need X from C++" but the reality is most of these things have significant uses
Neither of those statements really matches my experience with HN (high C to C++ content ratio, lots of comments rejecting advances first added to C++). I totally agree some of these things are very useful and I'm glad to see them formalized in C (even years later than C++).
They're too busy looking for their vowels to reply here.
Joking aside, I've a healthy amount of respect for rust, and I hope that many of the ideas make their way into other languages. The terseness, heavy use of macros, _insane_ compile times (and that's coming from someone who writes templates in c++), the general assumption by third-party crates that you're on Linux, and the IDE support combine into something that just isn't usable for me just yet. Maybe in a few years!
Who is the audience for new features in C? And who is driving stuff through the standardisation process? Is this stuff likely to make its way through to embedded toolchains? Or is this for people who are maintaining existing codebases?
Changes to the Standard usually happen as a result of defect reports (confusing details that implementation writers want clarity on) or vast enough general adoption (unifying how implementations were differently achieving the same thing).
As for the audience, it's all the C developers, the open-source and commercial compiler implementations, vendors of libraries, tooling, services, learning material and everything else built in C; which is just innumerable.
Each Standard version released supersedes and obsoletes the previous versions. Intentionally, the versions are meant to be as backwards compatible as possible so that one can mix and match C89/C99/C11 codebases with minimum effort.
C has gained only a handful of features in the last 40 years, compared to the great many things that have been improved w.r.t. undefined/implementation-specific/unspecified behaviors, or removed to keep up with modern times (e.g. trigraphs, and integer representations other than Two's Complement).
I'd say: (1) upgrading is not the spooky thing people make it out to be. Go, Rust, they all move much faster than this and have very ambitious big design ideas on their mind. (2) It's necessary to take good care of C as it, and the things built in it, will realistically outlive many of us.
The early adopters are usually transpilers (or code generators) which can quickly take advantage of new features without the effort of rewriting an entire codebase.
In the same way that Rust used underlying `const` attributes in LLVM (and found all the weird edge cases), and Nim used C as an intermediate as have many other lisp or object-ish languages.
Yes, I'd expect they will. Most embedded toolchains these days are built around GCC. So as GCC grows new features, embedded toolchains will get them too.
Officially adopting __auto_type as auto is good. Unfortunate that N2953 dropped function return type and parameters.
This feature composes with _Generic macros quite well:
#define div(X,Y) _Generic((X)+(Y), int: div, long: ldiv, long long: lldiv) ((X), (Y))
auto res = div(38484848448, 448484844);
auto a = b * res.quot + res.rem;
It also lets you get rid of all the "typeof(X)" foo in macro definitions.
> This proposal also recommends adoption of Unicode normalization form C (NFC) for identifiers to ensure that when compared, identifiers intended to be the same will compare as equal. Legacy encodings are generally naturally in NFC when converted to Unicode. Most tools will, by default, produce NFC text.
Er, a much better approach is to allow unnormalized Unicode in source code and use form-insensitive matching of symbol names so that all forms of a symbol are equivalent. This can be done by normalizing during the parse, or by implementing form-insensitive string comparison and hashing functions that normalize glyph by glyph as needed -- the latter can be very fast for all-ASCII and mostly-ASCII symbols!
The reason this is a better way is that there's too many places that don't produce NFC. For example, HFS+ uses NFD, so if you cut-n-paste a file name from HFS+ into other contexts, you'll be pasting NFD unless the cut-n-paste system normalizes to NFC. Also, while it's true that input modes typically produce NFC, it's more that they produce NFC for a small subset of Unicode, not that they will normalize other forms seen on input. Using form-insensitive string comparison/hashing/matching yields a better user experience at not that much implementation cost: you're gonna need a Unicode library, and that library will need to have normalization support, so you can implement form-insensitivity.
> Er, a much better approach is to allow unnormalized Unicode in source code and use form-insensitive matching of symbol names so that all forms of a symbol are equivalent.
Linkers will often be blissfully unaware of Unicode or any form of localization. This was the impetus for UTF-8, so that the bulk of software which is 8-bit clean or which operates on opaque, NUL-terminated strings can continue working as-is. This can't be changed without breaking backwards ABI compatibility; therefore, it's very unlikely to change.
There are countless half-measures that could be taken, but few if any are suitable for standardization. If the history of software localization is any guide, in the face of strict, forward-looking specifications various vendors and ecosystems will likely go their own way, with the one sure thing being a failure to fully adopt or properly implement the specification.
Yes, the compiler should normalize symbols before writing object files, no doubt. I'm talking about the inputs though (the source files), which should not have to be normalized.
Only a tiny part of the Unicode database: the normalization tables. Problem is, these tables have to be updated every year, and they don't keep in sync with similar tables elsewhere, such as glibc's.
And Unicode identifiers are entirely insecure, because they are not identifiable. My proposal was postponed to C26.
> Problem is, these tables have to be updated every year, and they don't keep in sync with similar tables elsewhere, such as glibc's.
The language standard can avoid this by committing to a single Unicode version for each language version.
> And Unicode identifiers are entirely insecure, ...
Eh, it depends on what we're talking about using them for. If it's symbols in object files, it's not really a problem. More importantly confusables are unavoidable -- even ASCII all by itself has confusable characters (1 and l for example). The thing to do is to forbid arbitrary mixing of scripts, allowing only those scripts that are relevant in the relevant contexts. For DNS this means that DNS TLD registries should come up with per-registry rules, for example.
Can you expand on what the security concerns are as to confusables in symbol names in C source code? Clearly there's a security concern with cut-n-paste, but that's true regardless of what the rules might be for C identifiers.
It's not like it's obvious that UTR#39 applies literally everywhere that there are "identifiers".
Also, can you speak to what is the security concern with form-insensitivity (rather than confusables) as to symbols in input source files? I just don't see a concern at all there, but maybe I'm missing something.
Lastly, I think `#include` is the most important place to get this right since that does interface with the world outside the compiler (specifically: the filesystem), but as you note the filesystems mostly are just-use-8bit -- very few filesystems normalize on create (HFS+) or are form-insensitive (ZFS). The other place to get this right is on the object file output side, where symbols definitely must be normalized.
Oh, one more thing: the platform might impose some rules regarding symbols in ELF and any other object file formats. Are they known to? I suppose C can't necessarily cater to all platform-imposed limitations on symbol naming, but it'd be useful to know about them.
The compiler is going to need a Unicode library anyways.
It's true that checking that some string is in some canonical form (say, NFC) is easier than normalizing it. In fact, it's even easier to check that some string is in NFD than it is to check that it's in NFC because NFC is defined in terms of NFD. That's not enough to justify not doing form-insensitive string matching in the compiler and forcing a pre-compilation normalization step.
Also, you'd only want symbols normalized, not string literals (since one might need a string literal that is not in a canonical form, or not in NFC). Thus the pre-compilation normalization step would have to be language-aware, so it might as well be part of the compiler and not a separate step.
But I'd like to convince you and everyone that in general we want to accept inputs in any form and be form-insensitive because that's much more user-friendly.
We don't in fact have universal NFC-only input modes. We have accidentally NFC-mostly input modes -- accidental because typically we have transcoding from legacy codesets, which yields NFC, but no normalization is actually done so that copy-paste operations can cause non-canonical input to be provided to applications. We don't have universal agreement on NFC. This makes for a mess with occasional user-visible problems.
Form-insensitivity is essentially the same as normalizing character-by-character when hashing or comparing strings, but this can be very fast for mostly-ASCII text, so form-insensitivity can be very fast.
For string comparison, form-insensitivity is faster than normalizing first: either the strings are equal in content and probably equal in form, or they differ early, possibly at some character where normalization is not even necessary to determine the result. So less work need be done to compare strings form-insensitively than to first normalize all inputs (unless so many comparisons will be done that normalizing first might be a win). For string hashing, form-insensitivity is not faster, because one has to normalize every input, whereas if all inputs are normalized once then one need never normalize to hash; but form-insensitivity still yields a better user experience.