Difference between revisions of "Element type of string ranges"


Latest revision as of 22:41, 9 March 2014

This article attempts to summarize the arguments in the forum thread "Major performance problem with std.array.front()".

Comparison

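One of the proposals in the thread is to switch the iteration type of string ranges from dchar (one decoded code point per element) to the string's own character type (one code unit per element, e.g. char for string and wchar for wstring).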
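The difference between the two iteration types can be sketched in D as follows. This is a minimal illustration, not code from the thread; the helper names codeUnitCount and codePointCount are hypothetical, while ElementType, ElementEncodingType, front and walkLength are the std.range.primitives symbols.

```d
import std.range.primitives : ElementEncodingType, ElementType, front, walkLength;

// A narrow string stores char code units, but as a range its element type is dchar.
static assert(is(ElementEncodingType!(char[]) == char));
static assert(is(ElementType!(char[]) == dchar));

/// Hypothetical helper: number of UTF-8 code units (the array view).
size_t codeUnitCount(string s) { return s.length; }

/// Hypothetical helper: number of code points (the range view, after decoding).
size_t codePointCount(string s) { return s.walkLength; }

void main()
{
    string s = "\u00E9t\u00E9"; // "été"

    // The array interface and the range interface disagree on the length.
    assert(codeUnitCount(s) == 5);
    assert(codePointCount(s) == 3);

    // Indexing yields the first code unit; front decodes the first code point.
    assert(s[0] == 0xC3);
    dchar first = s.front;
    assert(first == 0x00E9);

    // foreach with an explicit char element type iterates code units without decoding.
    size_t units;
    foreach (char c; s)
        ++units;
    assert(units == 5);
}
```

Under the proposal, the range view would match the array view: front would return the first code unit, and range algorithms would see five elements here.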

{|
! Argument
! Current behavior<br>(code point iteration)
! Code unit iteration
! Code unit iteration + [http://forum.dlang.org/post/knrwiqxhlvqwxqshyqpy@forum.dlang.org forbid implicit <tt>char</tt> <=> <tt>dchar</tt> conversions]
! Notes
|-
!rowspan=2| Status quo
|rowspan=2| {{Yes}} The current situation, with its (dubious) advantages and known disadvantages
| {{No}} Will cause breakage
| {{No}} Will cause more breakage
|rowspan=2|
|-
| {{No}} Will cause silent breakage
| {{Yes}} No silent breakage
|-
! Searching for a particular <tt>dchar</tt> in a string.
| {{Yes}} <tt>s.canFind('é')</tt>
|rowspan=2| {{No}} Will result in a pragma warning in some places, will fail silently in others (when specified via predicate).
|rowspan=2| {{No}} Will emit a warning or fail to compile
|rowspan=2| This should not be recommended practice (not all languages have notions of characters, and not all characters (glyphs/graphemes) can be represented in one <tt>dchar</tt>).
|-
! Searching for a particular <tt>dchar</tt> in a non-normalized string.
| {{No}} Above fails for [http://forum.dlang.org/post/hxudajoutambsznfdydb@forum.dlang.org combining marks], as that requires normalization.
|-
! Case conversion, insensitive comparison in ranges for certain languages
| {{Yes}} <tt>s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é")</tt>
|rowspan=2| {{No}} Fails silently.
|rowspan=2| {{No}} Will emit a warning or fail to compile
|rowspan=2| This should not be recommended practice (correct case conversion and comparison for all languages is more complicated, and depends on locale - e.g. [http://en.wikipedia.org/wiki/Dotted_and_dotless_I Turkish I / ı and İ / i]).
|-
! Case conversion, insensitive comparison in ranges for other languages
| {{No}} Fails.
|-
! Correctness
| {{No}} Only works for certain languages and alphabets
| {{No}} Only works for ASCII
| {{No}} Only works for ASCII; enforces correct character type conversions
|
|-
! Performance
| {{No}} Implicit decoding everywhere, unless each algorithm is specialized not to
|colspan=2| {{Yes}} As fast as <tt>ubyte[]</tt>
|
|-
! Implementation difficulty
| {{No}} Requires quite a bit of scaffolding:<br><tt>ElementEncodingType</tt> - 80 instances<br><tt>isSomeString</tt> - 138 instances<br><tt>isExactSomeString</tt> - 23 instances<br><tt>isSomeChar</tt> - 129 instances
|colspan=2| {{Yes}} Strings are treated as any other arrays
|
|-
! Consistency
| {{No}} Inconsistencies between array and range types<br>{{No}} Range algorithms return values different from array algorithms
|colspan=2| {{Yes}} String ranges work like ranges of any other arrays
|
|}
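
The search rows of the comparison can be illustrated with a short D sketch. It assumes current Phobos (std.algorithm.searching.canFind, std.uni.normalize and std.uni.toLower); the helpers hasCodePoint and hasCodeUnit are hypothetical names introduced only for this example, and hasCodeUnit merely imitates what a code unit search with an unconverted dchar needle would do.

```d
import std.algorithm.searching : canFind;
import std.uni : NFC, normalize, toLower;

/// Hypothetical helper: code point search, as the current decoding behavior performs it.
bool hasCodePoint(string s, dchar needle) { return s.canFind(needle); }

/// Hypothetical helper: code unit search that compares each char with a dchar needle.
bool hasCodeUnit(string s, dchar needle)
{
    foreach (char unit; s) // explicit char type: no decoding
        if (unit == needle)
            return true;
    return false;
}

void main()
{
    dchar eAcute = '\u00E9';
    string precomposed = "caf\u00E9";  // é as a single code point
    string decomposed  = "cafe\u0301"; // e followed by a combining acute accent

    // Code point search works for the precomposed form only.
    assert(hasCodePoint(precomposed, eAcute));
    assert(!hasCodePoint(decomposed, eAcute));

    // After NFC normalization the combining sequence becomes U+00E9 and is found.
    assert(hasCodePoint(normalize!NFC(decomposed), eAcute));

    // Comparing code units with a dchar fails silently: U+00E9 is missed, and
    // U+00C3 (Ã) is reported because 0xC3 is the first UTF-8 byte of é.
    assert(!hasCodeUnit(precomposed, eAcute));
    assert(hasCodeUnit(precomposed, '\u00C3'));

    // toLower is locale-independent: 'I' maps to 'i', whereas Turkish expects ı (U+0131).
    dchar capitalI = 'I';
    assert(toLower(capitalI) == 'i');
}
```

The false positive for U+00C3 is the kind of silent failure that the third column aims to turn into a warning or compile error by forbidding implicit char <=> dchar conversions.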