01/05/2024
String.Length and similar members count correctly only characters in Unicode Plane 0 (BMP)
Almost all string-counting functions in .NET rely on String.Length, which returns the number of UTF-16 code units (char values) rather than the number of characters.
To count characters in Unicode Plane 1 and beyond, or to count the apparent number of characters, count the Runes or use the features of the StringInfo class.
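To illustrate the point above, here is a minimal sketch showing that indexes and lengths on a string are measured in UTF-16 code units, and that StringInfo can cut out a whole character instead. The class name Utf16UnitDemo and the helper FirstTextElement are hypothetical names I made up for this example.

```csharp
using System;
using System.Globalization;

internal static class Utf16UnitDemo
{
    // Returns the first text element (apparent character) of a non-empty string.
    public static string FirstTextElement(string text) =>
        new StringInfo(text).SubstringByTextElements(0, 1);

    private static void Main()
    {
        string s = "\U0001F5FFA"; // 🗿 (U+1F5FF) followed by "A"

        Console.WriteLine(s.Length);                            // 3 (UTF-16 code units)
        Console.WriteLine(s.IndexOf('A'));                      // 2 (index is in code units too)
        Console.WriteLine(char.IsHighSurrogate(s[0]));          // True (s[0] is only half of 🗿)
        Console.WriteLine(FirstTextElement(s) == "\U0001F5FF"); // True (the whole emoji)
    }
}
```

The comparison in the last line avoids printing the emoji itself, because a console may not display it properly.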
Characters used in everyday life, including those used in Japan and China, are included in Unicode Plane 0 (BMP). Everyday text normally does not use characters from Plane 1 or later, or combining characters.
Many emojis, such as 🗿, are in Unicode Plane 1, and I feel that the number of people who often use these emojis has been increasing recently. To count this kind of emoji as one character, you must not use String.Length.
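If you want to check which plane a character belongs to, System.Text.Rune exposes this directly. The following is only a sketch; the class name PlaneDemo and the helper Describe are hypothetical, and the three characters are taken from the test characters in this article.

```csharp
using System;
using System.Text;

internal static class PlaneDemo
{
    // Describes the first code point of a non-empty, well-formed string.
    public static string Describe(string text)
    {
        Rune rune = Rune.GetRuneAt(text, 0);
        return $"U+{rune.Value:X4} plane:{rune.Plane} UTF-16 units:{rune.Utf16SequenceLength}";
    }

    private static void Main()
    {
        Console.WriteLine(Describe("\u5C71"));     // 山: U+5C71 plane:0 UTF-16 units:1
        Console.WriteLine(Describe("\U0001F5FF")); // 🗿: U+1F5FF plane:1 UTF-16 units:2
        Console.WriteLine(Describe("\U00029E3D")); // 𩸽: U+29E3D plane:2 UTF-16 units:2
    }
}
```

A character outside Plane 0 always needs two UTF-16 code units (a surrogate pair), which is why String.Length reports 2 for it.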
In Japan and China, in rare cases, proper nouns such as place names and personal names may use characters from beyond Unicode Plane 0 (for example, 𩸽 is in Plane 2).
In addition, many Japanese ideographs (kanji) have variants with slightly different shapes, such as the position of a dot, and these differences can be expressed using combining characters (variation selectors). However, this usage is hardly widespread as of 2024.
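As a rough sketch of how such a variant is counted, the example below appends a variation selector to a kanji. I am assuming 葛 (U+845B) followed by U+E0100 as the sample sequence and .NET 5 or later, where text elements follow Unicode grapheme cluster rules; the names VariantDemo and Count are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Linq;

internal static class VariantDemo
{
    // Counts a string three ways: UTF-16 code units, Runes, and text elements.
    public static (int Length, int Runes, int TextElements) Count(string text) =>
        (text.Length,
         text.EnumerateRunes().Count(),
         new StringInfo(text).LengthInTextElements);

    private static void Main()
    {
        // Assumed sample: 葛 (U+845B) followed by variation selector U+E0100.
        string variant = "\u845B\U000E0100";

        Console.WriteLine(Count("\u845B")); // (1, 1, 1)
        Console.WriteLine(Count(variant));  // (3, 2, 1)
    }
}
```

The variation selector is in Plane 14, so it adds two UTF-16 code units and one Rune, but the pair is still a single text element.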
Ancient scripts, such as Egyptian hieroglyphs, and characters used for special purposes may also be in Unicode Plane 1 and beyond.
Here are the characters used in the image above so that you can test them.
A1&あ각★☃ΩⒽ山竜你𩸽𓀉𫛣🗿🗻🗼
あ゙あ゚あ̰👩👨👦👧🏴☠⚔️
I wrote a sample program as a Console App (.NET 8) that counts characters in three ways.
Depending on your environment, the characters may not be displayed properly, because the font you use may not contain them or the software you use may not be fully compatible with the Unicode specifications.
C#
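// These usings are implicit in a default .NET 8 console project,
// but are listed so that the file also compiles without implicit usings.
using System.Collections.Generic;
using System.Linq;

namespace CsConsole
{
    internal class Program
    {
        static void Main(string[] args)
        {
            PrintLength("A");
            PrintLength("あ");
            PrintLength("☃");
            PrintLength("山");
            PrintLength("𩸽");
            PrintLength("𓀉");
            PrintLength("🗿");
            // Invisible characters are written as escapes so that they survive copy and paste.
            PrintLength("あ\u3099");                         //あ + combining dakuten (U+3099)
            PrintLength("👩\u200D👨\u200D👦\u200D👧"); //Four emojis joined with ZWJ (U+200D)
            PrintLength("⚔\uFE0F");                          //⚔ + variation selector (U+FE0F)
        }

        private static void PrintLength(string text)
        {
            int len1 = text.Length; //String.Length (UTF-16 code units)
            int len2 = text.EnumerateRunes().Count(); //Count of Runes
            int len3 = EnumerateTextElements(text).Count(); //Count of TextElements
            System.Diagnostics.Debug.WriteLine($"{text} Length:{len1} Runes:{len2} Text Elements:{len3}");
        }

        // Wraps the non-generic TextElementEnumerator so that LINQ can be used on it.
        private static IEnumerable<string> EnumerateTextElements(string text)
        {
            var enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(text);
            while (enumerator.MoveNext())
            {
                yield return (string)enumerator.Current;
            }
        }
    }
}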
→ Where Debug.WriteLine output goes
VB
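Imports System
Imports System.Collections.Generic
Imports System.Diagnostics
Imports System.Linq

Module Program
    Sub Main(args As String())
        PrintLength("A")
        PrintLength("あ")
        PrintLength("☃")
        PrintLength("山")
        PrintLength("𩸽")
        PrintLength("𓀉")
        PrintLength("🗿")
        'Invisible characters are written with ChrW so that they survive copy and paste.
        PrintLength("あ" & ChrW(&H3099)) 'あ + combining dakuten (U+3099)
        PrintLength("👩" & ChrW(&H200D) & "👨" & ChrW(&H200D) & "👦" & ChrW(&H200D) & "👧") 'Four emojis joined with ZWJ (U+200D)
        PrintLength("⚔" & ChrW(&HFE0F)) '⚔ + variation selector (U+FE0F)
    End Sub

    Private Sub PrintLength(text As String)
        Dim len1 As Integer = text.Length 'String.Length (UTF-16 code units)
        Dim len2 As Integer = text.EnumerateRunes().Count() 'Number of Runes
        Dim len3 As Integer = EnumerateTextElements(text).Count() 'Number of TextElements
        Debug.WriteLine($"{text} Length:{len1} Runes:{len2} Text Elements:{len3}")
    End Sub

    'Wraps the non-generic TextElementEnumerator so that LINQ can be used on it.
    Private Iterator Function EnumerateTextElements(text As String) As IEnumerable(Of String)
        Dim enumerator = Globalization.StringInfo.GetTextElementEnumerator(text)
        While enumerator.MoveNext()
            Yield CStr(enumerator.Current)
        End While
    End Function
End Module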
→ Where Debug.WriteLine output goes
When you run the program, you will see the following result in the output window.
A Length:1 Runes:1 Text Elements:1
あ Length:1 Runes:1 Text Elements:1
☃ Length:1 Runes:1 Text Elements:1
山 Length:1 Runes:1 Text Elements:1
𩸽 Length:2 Runes:1 Text Elements:1
𓀉 Length:2 Runes:1 Text Elements:1
🗿 Length:2 Runes:1 Text Elements:1
あ゙ Length:2 Runes:2 Text Elements:1
👩‍👨‍👦‍👧 Length:11 Runes:7 Text Elements:1
⚔️ Length:2 Runes:2 Text Elements:1
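To see why the last three lines have more Runes than text elements, it may help to list the code points inside each string. This is only a sketch with hypothetical names (RuneBreakdown, CodePoints), and it assumes the family emoji is the four emojis joined by zero width joiners (U+200D), which is consistent with the 7 Runes reported above.

```csharp
using System;
using System.Linq;

internal static class RuneBreakdown
{
    // Lists the code points of a string as "U+XXXX" values separated by spaces.
    public static string CodePoints(string text) =>
        string.Join(" ", text.EnumerateRunes().Select(r => $"U+{r.Value:X4}"));

    private static void Main()
    {
        // Assumed composition of the family emoji: woman, man, boy, girl joined with ZWJ.
        string family = "\U0001F469\u200D\U0001F468\u200D\U0001F466\u200D\U0001F467";

        Console.WriteLine(CodePoints(family));
        // U+1F469 U+200D U+1F468 U+200D U+1F466 U+200D U+1F467
        Console.WriteLine(CodePoints("\u2694\uFE0F")); // U+2694 U+FE0F
        Console.WriteLine(CodePoints("\u3042\u3099")); // U+3042 U+3099
    }
}
```

Each of the four emojis takes two UTF-16 code units and each joiner takes one, which gives the Length of 11.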
Example: count char, Rune, and text element instances
The world of Unicode ~Emoji Combinations~ (This article is in Japanese)
https://qiita.com/noritsune/items/46134cb7a50236540be5