Sorting the characters in a UTF-16 string in Java









13















TLDR



Java uses two characters to represent UTF-16. Using Arrays.sort (unstable sort) messes with character sequencing. Should I convert char[] to int[] or is there a better way?



Details



Java represents a character as UTF-16. But the Character class itself wraps char (16 bit). For UTF-16, it will be an array of two chars (32 bit).



Sorting a string of UTF-16 characters using the built-in sort scrambles the data. (Arrays.sort uses a dual-pivot quicksort, and Collections.sort uses Arrays.sort to do the heavy lifting.)



To be specific, do you convert char[] to int[] or is there a better way to sort?



import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        // Three emoji, given as Unicode code points.
        int[] utfCodes = {128513, 128531, 128557};
        String emojis = new String(utfCodes, 0, 3);
        System.out.println("Initial String: " + emojis);

        // Sorting the individual chars splits the surrogate pairs.
        char[] chars = emojis.toCharArray();
        Arrays.sort(chars);
        System.out.println("Sorted String: " + new String(chars));
    }
}




Output:



Initial String: 😁😓😭
Sorted String: ??😁??
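One way to see where the question marks come from is to look at the individual UTF-16 code units. The following is a minimal sketch and not part of the original question; the class name SurrogateSketch and its helper methods are made up for illustration. It prints the code units of the three emoji before and after Arrays.sort(char[]).

```java
import java.util.Arrays;

// Hypothetical helper class; the names here are illustrative assumptions.
class SurrogateSketch {
    // Formats every UTF-16 code unit (char) as four hex digits.
    static String unitsHex(char[] units) {
        StringBuilder sb = new StringBuilder();
        for (char c : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    // Sorts the code units the same way the question does, then formats them.
    static String sortedUnitsHex(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return unitsHex(chars);
    }

    public static void main(String[] args) {
        String emojis = new String(new int[] {128513, 128531, 128557}, 0, 3);
        // Each emoji is one code point but two chars (a surrogate pair).
        System.out.println("chars: " + emojis.length()
                + ", code points: " + emojis.codePointCount(0, emojis.length()));
        System.out.println("Before sort: " + unitsHex(emojis.toCharArray()));
        System.out.println("After sort:  " + sortedUnitsHex(emojis));
    }
}
```

After sorting, the three high surrogates (D83D) come first and the three low surrogates follow, so only the third high surrogate still sits next to a low surrogate. That surviving pair is the single emoji in the output above, and the four unpaired surrogates are what the console shows as "?".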
























  • 2





    This is what we call a "Collation". You should use a library for this because there are many collations to choose from.

    – Guillaume F.
    Apr 23 at 2:33











  • I don't think that 'unstable sort' is the right term to use here: stackoverflow.com/questions/1517793/…

    – Artur Biesiadowski
    2 days ago






  • 2





    You are confusing Unicode with UTF-16. A Java char is a UTF-16 unit. Guess why it is called “UTF-16” and how it relates to the fact that a char has 16 bits. You may need two UTF-16 units to encode a single codepoint, but it’s not Java’s char to blame for that.

    – Holger
    2 days ago
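To illustrate the collation point raised in the first comment, here is a minimal sketch that orders the characters of a string with java.text.Collator from the JDK instead of by raw numeric value. The class name CollationSketch, the method sortCharacters and the sample input are assumptions made for this example, and a real application would need to pick the locale and collator strength that fit its data.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; name and method are illustrative assumptions.
class CollationSketch {
    // Splits s into code points, orders them with the locale's collation
    // rules, and joins them back into a string.
    static String sortCharacters(String s, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> units = new ArrayList<String>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            units.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);
        }
        Collections.sort(units, collator); // Collator is a Comparator<Object>
        StringBuilder sb = new StringBuilder();
        for (String unit : units) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // With English rules, "a" sorts before both "b" and "B".
        System.out.println(sortCharacters("bBa", Locale.ENGLISH));
    }
}
```

A plain Arrays.sort on the chars of "bBa" would put the uppercase "B" first, because it compares numeric values only. The collator groups letters the way a reader of the given locale would expect.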
























java string sorting utf-16






edited 2 days ago









Peter Mortensen










asked Apr 23 at 2:00









dingy














3 Answers


















11














I looked around for a bit and couldn't find any clean ways to sort an array by groupings of two elements without the use of a library.



Luckily, the codePoints of the String are what you used to create the String itself in this example, so you can simply sort those and create a new String with the result.



public static void main(String[] args) {
    // Deliberately unsorted code points for three emoji.
    int[] utfCodes = {128531, 128557, 128513};
    String emojis = new String(utfCodes, 0, 3);
    System.out.println("Initial String: " + emojis);

    // Sort whole code points rather than individual chars.
    int[] codePoints = emojis.codePoints().sorted().toArray();
    System.out.println("Sorted String: " + new String(codePoints, 0, codePoints.length));
}




Initial String: 😓😭😁



Sorted String: 😁😓😭




I switched the order of the characters in your example because they were already sorted.


























  • 1





    Haha.. my string was already sorted... I couldn't tell because I couldn't sort (pun intended). I should move to java8 =)

    – dingy
    2 days ago






  • 4





    @dingy Java 8 is EOL. You need to move to Java 12.

    – Boris the Spider
    2 days ago






  • 3





    Code point support has existed since Java 5. It’s only the Stream API, which makes it look almost like a one-liner, that requires Java 8 or newer.

    – Holger
    2 days ago


















6














If you are using Java 8 or later, then this is a simple way to sort the characters in a string while respecting (not breaking) multi-char codepoints:



int[] codepoints = someString.codePoints().sorted().toArray();
String sorted = new String(codepoints, 0, codepoints.length);


Prior to Java 8, I think you either need to use a loop to iterate the code points in the original string, or use a 3rd-party library method.
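For JDK 5 through 7, the loop mentioned above might look like the following sketch. The class name CodePointSortJava5 and the method sortCodePoints are assumptions for this example; the calls it relies on (codePointCount, codePointAt, Character.charCount and the String(int[], int, int) constructor) have been available since Java 5.

```java
import java.util.Arrays;

// Hypothetical helper class; the names are illustrative assumptions.
class CodePointSortJava5 {
    // Returns a new string whose code points are in ascending numeric order.
    static String sortCodePoints(String s) {
        int count = s.codePointCount(0, s.length());
        int[] codePoints = new int[count];
        // Walk the string one code point at a time, not one char at a time.
        for (int i = 0, k = 0; i < s.length(); k++) {
            int cp = s.codePointAt(i);
            codePoints[k] = cp;
            i += Character.charCount(cp); // 1 for BMP characters, 2 for surrogate pairs
        }
        Arrays.sort(codePoints);
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String emojis = new String(new int[] {128531, 128557, 128513}, 0, 3);
        System.out.println("Sorted String: " + sortCodePoints(emojis));
    }
}
```

This sorts by numeric code point value only, so it keeps surrogate pairs intact but does not apply any language-specific ordering.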




Fortunately, sorting the code points in a String is uncommon enough that the clunkiness and relative inefficiency of the solutions above are rarely a concern.



(When was the last time you tested for anagrams of emojis?)































  • Thanks for reply. I was looking at Java 7's documentation, I should move to java 8. BTW, I am from China and making an app where I need to sort strings in Mandarin, just kidding, but it's a valid usecase. I stumbled upon it while I was trying to understand how Java works with UTF-16. Since other answers are same, I'll select the one which came earliest. Thanks again!

    – dingy
    2 days ago











  • I didn't say invalid. I said uncommon. (And the fact that you had to make up a use-case only reinforces my point ... :-) )

    – Stephen C
    2 days ago












  • See also: chinese.stackexchange.com/questions/24053/chinese-anagrams. (First answer: "Why do you need that? We never use that in China.")

    – Stephen C
    2 days ago







  • 4





    To add fuel to the flames, a single Emoji may consist of multiple codepoints. E.g. 🤦🏻‍♀️ consists of five codepoints (seven chars). But even latin characters may be composed of multiple codepoints.

    – Holger
    2 days ago


















4














We can't use char for Unicode, because Java's Unicode char handling is broken.



In the early days of Java, Unicode code points were always 16 bits wide (a fixed size of exactly one char). However, the Unicode specification later changed to allow supplementary characters. That means Unicode characters are now variable-width in UTF-16 and can be longer than one char. Unfortunately, it was too late to change Java's char implementation without breaking a ton of production code.



So the best way to manipulate Unicode characters is by using code points directly, e.g., using String.codePointAt(index) or the String.codePoints() stream on JDK 1.8 and above.
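As a rough illustration of the difference between char-based and code-point-based access, the sketch below compares charAt with codePointAt and lists the code points of a short string. The class name CodePointIterationSketch, the method codePointLabels and the sample string are assumptions for this example; it needs Java 8 or newer because of the stream call.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class; the names are illustrative assumptions.
class CodePointIterationSketch {
    // Labels each code point of s in the usual U+XXXX notation.
    static List<String> codePointLabels(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The letter "a" followed by one emoji (code point 128513).
        String s = "a" + new String(Character.toChars(128513));
        System.out.println(s.length());                             // number of chars
        System.out.println(s.codePointCount(0, s.length()));        // number of code points
        System.out.println(Integer.toHexString(s.charAt(1)));       // high surrogate only
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // the whole code point
        System.out.println(codePointLabels(s));
    }
}
```

Note that codePointAt still takes a char index, so a loop built on it has to advance by Character.charCount(cp) rather than by one.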



Additional sources:




  • The Unicode 1.0 Standard, Chapter 2 (pg. 10 and 22)


  • Supplementary Characters in the Java Platform (Sun/Oracle)





peekay (new contributor)




















  • Thanks for reply, I completely missed the String::codePointAt api, also I think I should move to java 8. Since other answers are same, I'll select the one which came earliest.

    – dingy
    2 days ago






  • 1





    @dingy If you're planning to make the JDK jump, consider skipping Java 8 and go straight to (Open) JDK 11 LTS, which has some additional gems.

    – peekay
    2 days ago











  • Even before that change, there were combining characters, which invalidate the assumption that a single codepoint represents the entire character.

    – Holger
    2 days ago












  • @Holger To be more precise, suppose we encode the letter Á using two characters: A (U+0041 Latin Capital Letter A) plus the combining character ◌́ (U+0301 Combining Acute Accent). In this case, notice that combining characters do not change the fact that each code point still represents only one character: we have two characters and two code points to represent the letter (grapheme) Á.

    – peekay
    2 days ago











  • @MichaWiedenmann That's not correct. In Unicode 1.x a code point was always 16 bits and mapped to one Unicode character. See the Unicode 1.0 Specification. From the standard: "Unicode code points are 16-bit quantities." (pg. 22) and "All Unicode characters have a uniform width of 16 bits." (pg. 10). Code points larger than 16 bits (supplementary characters) were first assigned in Unicode 3.1. Java did not support them until JDK 5.0 (September 2004).

    – peekay
    2 days ago
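The combining-character example discussed in the comments above can be checked with java.text.Normalizer from the JDK. This is a minimal sketch; the class name CombiningSketch and its two helper methods are assumptions introduced for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper class; the names are illustrative assumptions.
class CombiningSketch {
    // Counts code points rather than chars.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Composes base letters and combining marks where a precomposed form exists.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "A\u0301"; // U+0041 followed by U+0301
        String composed = compose(decomposed);
        System.out.println(countCodePoints(decomposed)); // two code points
        System.out.println(countCodePoints(composed));   // one code point, U+00C1
        System.out.println(composed.equals("\u00C1"));
    }
}
```

This matters for sorting: ordering the code points of a decomposed string numerically can move U+0301 away from its base letter, so normalizing to NFC first may help for text like this. It does not cover every case, since not every sequence has a precomposed form.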












































edited Apr 23 at 2:51

























answered Apr 23 at 2:46









Jacob G.









edited yesterday

























answered Apr 23 at 3:11









Stephen C













  • Thanks for the reply. I was looking at Java 7's documentation; I should move to Java 8. BTW, I am from China and making an app where I need to sort strings in Mandarin. Just kidding, but it's a valid use case. I stumbled upon it while I was trying to understand how Java works with UTF-16. Since the other answers are the same, I'll select the one which came earliest. Thanks again!

    – dingy
    2 days ago











  • I didn't say invalid. I said uncommon. (And the fact that you had to make up a use-case only reinforces my point ... :-) )

    – Stephen C
    2 days ago












  • See also: chinese.stackexchange.com/questions/24053/chinese-anagrams. (First answer: "Why do you need that? We never use that in China.")

    – Stephen C
    2 days ago







  • To add fuel to the flames, a single Emoji may consist of multiple codepoints. E.g. 🤦🏻‍♀️ consists of five codepoints (seven chars). But even Latin characters may be composed of multiple codepoints.

    – Holger
    2 days ago
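The counts in the last comment can be checked with a few lines. This is only a sketch: it assumes the emoji is the sequence U+1F926, U+1F3FB, U+200D, U+2640, U+FE0F, written out with escapes so the source file encoding does not matter, and the class and method names are made up.

```java
final class EmojiUnits {

    // Assumed spelling of the facepalm emoji from the comment:
    // U+1F926, U+1F3FB, U+200D (zero width joiner), U+2640, U+FE0F.
    static final String FACEPALM = "\uD83E\uDD26\uD83C\uDFFB\u200D\u2640\uFE0F";

    // Number of Java chars, i.e. UTF-16 code units.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(FACEPALM)); // 7
        System.out.println(codePoints(FACEPALM)); // 5
    }
}
```

Sorting by code point would pull these five code points apart, so the approach in the answer keeps surrogate pairs intact but not multi-code-point emoji.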









































We can't use char for Unicode, because Java's Unicode char handling is broken.



In the early days of Java, Unicode code points were always 16 bits wide, so each one fit in exactly one char. However, the Unicode specification later changed to allow supplementary characters. That means Unicode characters are now variable width in UTF-16 and can take more than one char (a surrogate pair). Unfortunately, it was too late to change Java's char implementation without breaking a ton of production code.



So the best way to manipulate Unicode characters is by using code points directly, e.g., using String.codePointAt(index) or the String.codePoints() stream on JDK 1.8 and above.
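As a rough illustration of the difference between chars and code points, here is a minimal sketch that walks a string both ways. The class and method names are invented for the example; String.codePoints() needs Java 8, while the codePointAt loop also works on older JDKs.

```java
import java.util.stream.Collectors;

final class CodePointWalk {

    // Lists the code points of s as "U+XXXX" labels separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // Counts code points by stepping with codePointAt and charCount.
    static int countWithCodePointAt(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600, a supplementary character
        System.out.println(s.length());              // 3 chars (UTF-16 code units)
        System.out.println(countWithCodePointAt(s)); // 2 code points
        System.out.println(describe(s));             // U+0061 U+1F600
    }
}
```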



Additional sources:




  • The Unicode 1.0 Standard, Chapter 2 (pg. 10 and 22)


  • Supplementary Characters in the Java Platform (Sun/Oracle)
















New contributor




peekay is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

















edited 2 days ago






























answered Apr 23 at 3:14









peekay













  • Thanks for the reply. I completely missed the String::codePointAt API; I also think I should move to Java 8. Since the other answers are the same, I'll select the one which came earliest.

    – dingy
    2 days ago






  • @dingy If you're planning to make the JDK jump, consider skipping Java 8 and going straight to (Open) JDK 11 LTS, which has some additional gems.

    – peekay
    2 days ago











  • Even before that change, there were combining characters, which invalidate the assumption that a single codepoint represents the entire character.

    – Holger
    2 days ago












  • @Holger To be more precise, suppose we encode the letter Á using two characters: A (U+0041 Latin Capital Letter A) plus the combining character ◌́ (U+0301 Combining Acute Accent). In this case, notice that combining characters do not change the fact that each code point still only represents one character: we have two characters and two code points to represent the letter (grapheme) Á.

    – peekay
    2 days ago











  • @MichaWiedenmann That's not correct. In Unicode 1.x a code point was always 16 bits and mapped to one Unicode character. See the Unicode 1.0 Specification. From the standard: "Unicode code points are 16-bit quantities" (pg. 22) and "All Unicode characters have a uniform width of 16 bits" (pg. 10). Code points larger than 16 bits (supplementary characters) were first assigned in Unicode 3.1. Java did not support them until JDK 5.0 (September 2004).

    – peekay
    2 days ago
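The combining-character point raised in the comments matters for sorting too. The sketch below (hypothetical class and method names, Java 8 or later) shows a code point sort moving the accent of a decomposed Á onto a different letter, and how normalizing to NFC with java.text.Normalizer first avoids that for this particular input. Normalization only helps where a precomposed form exists, so treat it as a partial remedy.

```java
import java.text.Normalizer;

final class CombiningSort {

    // Sorts by code point value, as in the answers above.
    static String sortCodePoints(String s) {
        int[] cps = s.codePoints().sorted().toArray();
        return new String(cps, 0, cps.length);
    }

    // Composes base letter plus combining mark where a precomposed form exists.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "A\u0301B"; // A + combining acute accent, then B

        // Three code points: U+0041, U+0301, U+0042.
        System.out.println(decomposed.codePointCount(0, decomposed.length())); // 3

        // Plain sort gives A, B, U+0301: the accent now attaches to B.
        System.out.println(sortCodePoints(decomposed).equals("AB\u0301")); // true

        // After NFC the accented letter is the single code point U+00C1.
        System.out.println(sortCodePoints(nfc(decomposed)).equals("B\u00C1")); // true
    }
}
```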


























