Hash: 5656e32e
Date: 2025-05-10T03:16:52
string-desc: Distinguish writable strings from read-only strings.

* lib/string-desc.h (HAVE_STATEMENT_EXPRESSIONS): New macro.
(rw_string_desc_t): New type.
(string_desc_t) [HAVE_STATEMENT_EXPRESSIONS]: Change field _data from 'char *' to 'const char *'.
(sd_readonly, sd_readwrite): New inline functions.
(sd_length): Define through a macro with _Generic.
(sd_char_at): Define through a macro and an inline function.
(sd_data, sd_is_empty): Define through a macro with _Generic.
(sd_equals, sd_startswith, sd_endswith, sd_cmp, sd_c_casecmp, sd_index, sd_last_index, sd_contains): Define through a macro.
(sd_new_addr): Define through a macro with _Generic.
(sd_substring, sd_write, sd_fwrite): Define through a macro.
(sd_new, sd_new_filled): Change parameter type.
(sd_copy): Define through a macro.
(sd_concat): Change parameter type.
(sd_c): Define through a macro.
(sd_set_char_at, sd_fill): Change parameter type.
(sd_overwrite): Define through a macro.
(sd_free): Change parameter type.
* lib/string-desc.c (_sd_equals): Renamed from sd_equals. Take scalar parameters.
(_sd_startswith): Renamed from sd_startswith. Take scalar parameters.
(_sd_endswith): Renamed from sd_endswith. Take scalar parameters.
(_sd_cmp): Renamed from sd_cmp. Take scalar parameters.
(_sd_c_casecmp): Renamed from sd_c_casecmp. Take scalar parameters.
(_sd_index): Renamed from sd_index. Take scalar parameters.
(_sd_last_index): Renamed from sd_last_index. Take scalar parameters.
(_sd_new_addr, _rwsd_new_addr): Renamed from sd_new_addr.
(sd_substring): Remove function.
(_sd_write): Renamed from sd_write. Take scalar parameters.
(_sd_fwrite): Renamed from sd_fwrite. Take scalar parameters.
(sd_new, sd_new_filled): Change parameter type.
(_sd_copy): Renamed from sd_copy. Change parameter type. Take scalar parameters.
(sd_concat): Change parameter type.
(_sd_c): Renamed from sd_c. Take scalar parameters.
(sd_set_char_at, sd_fill): Change parameter type.
(_sd_overwrite): Renamed from sd_overwrite. Change parameter type. Take scalar parameters.
(sd_free): Change parameter type.
* lib/string-desc-contains.c (_sd_contains): Renamed from sd_contains. Take scalar parameters.
* lib/xstring-desc.h (xsd_new, xsd_new_filled, xsd_copy, xsd_concat): Change return type to rw_string_desc_t.
(xsd_c): Define through a macro.
* lib/xstring-desc.c (xsd_concat): Change return type to rw_string_desc_t.
* doc/string-desc.texi (Handling strings with NUL characters): Mention rw_string_desc_t and the sd_readonly() function.
* lib/string-buffer.h (sb_dupfree, sb_xdupfree): Change return type to rw_string_desc_t.
* lib/string-buffer.c (sb_contents): Add a cast to 'const char *'.
(sb_dupfree): Change return type to rw_string_desc_t.
* lib/xstring-buffer.c (sb_xdupfree): Change return type to rw_string_desc_t.
* lib/string-buffer-reversed.h (sbr_dupfree, sbr_xdupfree): Change return type to rw_string_desc_t.
* lib/string-buffer-reversed.c (sbr_contents): Add a cast to 'const char *'.
(sbr_dupfree): Change return type to rw_string_desc_t.
* lib/xstring-buffer-reversed.c (sbr_xdupfree): Change return type to rw_string_desc_t.
* tests/test-string-desc.c (main): Use type rw_string_desc_t as appropriate.
* tests/test-xstring-desc.c (main): Likewise.
* tests/test-sf-istream.c (main): Remove cast in sd_new_addr argument.
* tests/test-sfl-istream.c (main): Likewise.
* NEWS: Mention the change.
@node Handling strings with NUL characters
@section Handling strings with NUL characters
@c Copyright (C) 2023--2025 Free Software Foundation, Inc.
@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3 or
@c any later version published by the Free Software Foundation; with no
@c Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
@c copy of the license is at <https://www.gnu.org/licenses/fdl-1.3.en.html>.
@c Written by Bruno Haible.

Strings in C are usually represented by a character sequence with a
terminating NUL character. A @samp{char *}, a pointer to the first byte
of this character sequence, is what gets passed around as a function
argument or return value.

The major restriction of this string representation is that it cannot
handle strings that contain NUL characters: such strings will appear
shorter than they were meant to be. In most application areas, this is
not a problem, and the @code{char *} type works well.
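
As a minimal sketch of this restriction, consider an array whose five
payload bytes contain a NUL in the middle. The names @code{payload},
@code{payload_size}, and @code{visible_length} are made up for this
illustration.

```c
#include <stddef.h>
#include <string.h>

/* Five bytes of payload with a NUL in the middle.  The compiler appends
   one more NUL, so the array occupies six bytes.  */
static const char payload[] = "ab\0cd";

/* Number of payload bytes, not counting the NUL added by the compiler.  */
static size_t
payload_size (void)
{
  return sizeof payload - 1;
}

/* Length as seen by functions that rely on the terminating NUL.  */
static size_t
visible_length (void)
{
  return strlen (payload);
}
```

Here @code{payload_size ()} yields 5, whereas @code{visible_length ()}
yields 2, because @code{strlen} stops at the first NUL byte.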

A second problem with this string representation is that
taking a substring is not cheap:
it requires either a memory allocation
or a destructive modification of the string.
The former has a runtime cost;
the latter complicates the logic of the program.
This matters for application areas that analyze text, such as parsers.
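
The two alternatives can be sketched as follows. Both function names are
hypothetical, and the sketch assumes that the caller passes offsets with
@code{start <= end} that lie within the string.

```c
#include <stdlib.h>
#include <string.h>

/* Alternative 1: allocate a copy of bytes START..END-1 of S and
   terminate it.  Returns NULL if the allocation fails; the caller
   frees the result.  */
static char *
substring_copy (const char *s, size_t start, size_t end)
{
  size_t n = end - start;
  char *result = malloc (n + 1);
  if (result != NULL)
    {
      memcpy (result, s + start, n);
      result[n] = '\0';
    }
  return result;
}

/* Alternative 2: overwrite S[END] with NUL and return a pointer into S.
   No allocation, but S itself is truncated as a side effect.  */
static char *
substring_destructive (char *s, size_t start, size_t end)
{
  s[end] = '\0';
  return s + start;
}
```

After @code{substring_destructive} has run, the caller can no longer see
the bytes of the original string beyond @code{end} through @code{strlen}
and friends, unless it saved and later restores the overwritten byte.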

In areas where strings with embedded NUL characters need to be handled
or where taking substrings is a recurrent operation,
the common approach is to use a @code{char *ptr} pointer variable
together with a @code{size_t nbytes} variable (or an @code{idx_t nbytes}
variable, if you want to avoid problems due to integer overflow). This
works fine in code that constructs or manipulates strings with embedded
NUL characters. But when it comes to @emph{storing} them, for example
in an array or as a key or value in a hash table, one needs a type that
combines these two fields.
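
To illustrate the storage case, here is a sketch of a lookup in an array
of such pairs. The type @code{struct byte_span} and the function
@code{span_find} are hypothetical and are not part of Gnulib.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pair type; not the Gnulib definition.  */
struct byte_span
{
  const char *ptr;
  size_t nbytes;
};

/* Returns the index of the first element of TABLE that is equal to KEY,
   or N if there is none.  Embedded NUL bytes take part in the
   comparison.  */
static size_t
span_find (const struct byte_span *table, size_t n, struct byte_span key)
{
  for (size_t i = 0; i < n; i++)
    if (table[i].nbytes == key.nbytes
        && memcmp (table[i].ptr, key.ptr, key.nbytes) == 0)
      return i;
  return n;
}
```

With plain @code{char *} elements and @code{strcmp}, two keys that differ
only after an embedded NUL byte would be indistinguishable.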

@mindex string-desc
@mindex xstring-desc
@mindex string-desc-quotearg
The Gnulib modules @code{string-desc}, @code{xstring-desc}, and
@code{string-desc-quotearg} provide such a type. We call it a
``string descriptor'' and name it @code{string_desc_t}.

The type @code{string_desc_t} is a struct that contains a pointer to the
first byte and the number of bytes of the memory region that makes up the
string. A terminating NUL byte that may additionally be present in
memory is not included in this byte count. This type implements the
same concept as @code{std::string_view} in C++, or the @code{String}
type in Java.
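
The following sketch shows why such a struct makes substrings cheap. It
uses a hypothetical type @code{view_t} with illustrative field names; the
actual definition of @code{string_desc_t} is in @code{"string-desc.h"}
and may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read-only view; field names are illustrative only.  */
typedef struct
{
  const char *data;   /* first byte */
  size_t nbytes;      /* number of bytes, excluding any trailing NUL */
} view_t;

/* Returns the view of bytes START..END-1 of S.  No bytes are copied and
   the memory that S refers to is left unchanged.  */
static view_t
view_substring (view_t s, size_t start, size_t end)
{
  assert (start <= end && end <= s.nbytes);
  view_t result = { s.data + start, end - start };
  return result;
}
```

Taking a substring here is pointer arithmetic only: it needs neither a
memory allocation nor a modification of the underlying bytes.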

A @code{string_desc_t} describes a string that cannot be written to.
There is also a type @code{rw_string_desc_t}, which is
a descriptor for a writable string.
@code{rw_string_desc_t} relates to @code{string_desc_t} as the
pointer type @samp{char *} relates to the pointer type
@samp{const char *}.

A @code{string_desc_t} or @code{rw_string_desc_t}
can be passed to a function as an argument, or
can be the return value of a function. This is type-safe: if, by
mistake, a programmer passes a @code{string_desc_t} to a function that
expects a @code{char *} argument, or vice versa, or assigns a
@code{string_desc_t} value to a variable of type @code{char *}, or
vice versa, the compiler will report an error.

Because @code{string_desc_t} and @code{rw_string_desc_t} are
different types, there is unfortunately no implicit conversion from
@code{rw_string_desc_t} to @code{string_desc_t}. In places
where such a conversion is desired, the (inline) function
@code{sd_readonly} needs to be called.
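
The need for an explicit conversion can be sketched with two hypothetical
struct types, @code{ro_view_t} and @code{rw_view_t}, and a hypothetical
function @code{to_readonly} that plays the role which @code{sd_readonly}
plays for the Gnulib types. None of these names are Gnulib declarations.

```c
#include <stddef.h>

/* Hypothetical descriptor types; the real definitions are in
   "string-desc.h" and may differ.  */
typedef struct { const char *data; size_t nbytes; } ro_view_t;
typedef struct { char *data; size_t nbytes; } rw_view_t;

/* Explicit conversion from a writable to a read-only descriptor.  */
static inline ro_view_t
to_readonly (rw_view_t s)
{
  ro_view_t result = { s.data, s.nbytes };
  return result;
}

/* A consumer that only reads the bytes: counts occurrences of C in S.  */
static size_t
count_byte (ro_view_t s, char c)
{
  size_t count = 0;
  for (size_t i = 0; i < s.nbytes; i++)
    if (s.data[i] == c)
      count++;
  return count;
}
```

Given an @code{rw_view_t w}, the call @code{count_byte (w, 'a')} is
rejected by the compiler, because the two structs are unrelated types,
whereas @code{count_byte (to_readonly (w), 'a')} compiles.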

Functions related to string descriptors are provided:
@itemize
@item
Side-effect-free operations in @code{"string-desc.h"},
@item
Memory-allocating operations in @code{"string-desc.h"},
@item
Memory-allocating operations with out-of-memory checking in
@code{"xstring-desc.h"},
@item
Operations with side effects in @code{"string-desc.h"}.
@end itemize

For outputting a string descriptor, the @code{*printf} family of
functions cannot be used directly. A format string directive such as
@code{"%.*s"} would not work:
@itemize
@item
it would stop the output at the first encountered NUL character,
@item
it would require casting the number of bytes to @code{int}, and thus
would not work for strings longer than @code{INT_MAX} bytes.
@end itemize
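
Both limitations are visible in a small sketch that uses the standard
@code{snprintf} function. The wrapper name @code{format_with_precision}
is hypothetical.

```c
#include <stdio.h>

/* Formats the first NBYTES bytes of DATA with "%.*s" into BUF and
   returns the number of bytes that the directive produced.  The
   precision argument must be an int, so NBYTES cannot exceed INT_MAX.  */
static int
format_with_precision (char *buf, size_t bufsize, const char *data,
                       int nbytes)
{
  return snprintf (buf, bufsize, "%.*s", nbytes, data);
}
```

For the five bytes @code{"ab\0cd"} and a precision of 5, the function
returns 2: the directive stops at the NUL byte even though the precision
would allow more bytes.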
@c @noindent Other format string directives don't work either, because
@c the only way to produce a NUL character in @code{*printf}'s output
@c is through a dedicated @code{%c} or @code{%lc} directive.

Therefore Gnulib offers
@itemize
@item
a function @code{sd_fwrite} that outputs a string descriptor to
a @code{FILE} stream,
@item
a function @code{sd_write} that outputs a string descriptor to
a file descriptor,
@item
and for those applications where the NUL characters should become
visible as @samp{\0}, a family of @code{quotearg}-based functions that
allow the escaping rules to be specified in detail.
@end itemize
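
To show the idea behind length-based output, here is a sketch built on
the standard @code{fwrite} function. The helper @code{write_all_bytes}
is hypothetical; it is not the implementation of @code{sd_fwrite}, whose
declaration is in @code{"string-desc.h"}.

```c
#include <stdio.h>

/* Writes NBYTES bytes starting at DATA to FP, including any NUL bytes.
   Returns 0 on success, -1 on failure.  */
static int
write_all_bytes (FILE *fp, const char *data, size_t nbytes)
{
  if (nbytes > 0 && fwrite (data, 1, nbytes, fp) != nbytes)
    return -1;
  return 0;
}
```

Because the byte count is passed as a @code{size_t} and not derived from
a terminating NUL, neither of the two @code{"%.*s"} limitations applies.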

The functionality is thus split across three modules as follows:
@itemize
@item
The module @code{string-desc}, under LGPL, defines the type and
elementary functions.
@item
The module @code{xstring-desc}, under GPL, defines the memory-allocating
functions with out-of-memory checking.
@item
The module @code{string-desc-quotearg}, under GPL, defines the
@code{quotearg} based functions.
@end itemize