When getting substring in .Net, does the new strin

2019-02-06 09:03发布

站内文章 / C#

36 0

叼着烟拽天下

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Assuming I have the following strings:

string str1 = "Hello World!";  
string str2 = str1.SubString(6, 5); // "World"

I am hoping that in the above example str2 does not copy "World", but simply ends up being a new string that points to the same memory space only that it starts with an offset of 6 and a length of 5.

In actuality I am dealing with some potentially very long strings and am interested in how this works behind the scenes for performance reasons. I am not familiar enaugh with IL to look into this.

回答1:

It's a new string.

Strings, in .NET, are always immutable. Whenever you generate a new string via a method, including Substring, it will construct the new string in memory. The only time you share references to the same data in strings in .NET is if you explicitly assign a string variable to another string (in which its copying the reference), or if you work with string constants, which are typically interned. If you know your string is going to share a value with an interned string (constant/literal from your code), you can retrieve the "shared" copy via String.Intern.

This is a good thing, btw - In order to do what you were describing, every string would require a reference (to the string data), as well as an offset + length. Right now, they only require a reference to the string data.

This would dramatically increase the size of strings in general, throughout the framework.

回答2:

As others have noted, the CLR makes copies when doing a substring operation.

As you note, it certainly would be possible for a string to be represented as an interior pointer with a length. This makes the substring operation extremely cheap.

There are also ways to make other operations cheap. For example, string concatenation can be made cheap by representing strings as a tree of substrings.

In both cases what is happening here is the result of the operation is not actually the "result" itself, per se, but rather, a cheap object which represents the ability to get at the results when needed.

The attentive reader will have just realized that this is how LINQ works. When we say

var results = from c in customers where c.City == "London" select c.Name;

"results" does not contain the results of the query. This code returns almost immediately; results contains an object which represents the query. Only when the query is iterated does the expensive mechanism of searching the collection spin up. We use the power of a monadic representation of sequence semantics to defer the calculations until later.

The question then becomes "is it a good idea to do the same thing on strings?" and the answer is a resounding "no". I have plenty of painful real-world experiments on this. I once spent a summer rewriting the VBScript compiler's string handling routines to store string concatenations as a tree of string concatenation operations; only when the result is actually being used as a string does the concatenation actually happen. It was disastrous; the additional time and memory needed to keep track of all the string pointers made the 99% case -- someone doing a few simple little string operations to render a web page -- about twice as slow, while massively speeding up the tiny, tiny minority of pages that were written using naive string concatenations.

The vast majority of realistic string operations in .NET programs are extremely fast; they compile down to memory moves that in normal circumstances stay well within the memory blocks that are cached by the processor, and are therefore blazingly fast.

Furthermore, using an "interior pointer" approach for strings complicates the garbage collector considerably; going with such an approach seems to make it likely that the GC would slow down overall, which benefits no one. You have to look at the total cost of the impact of the change, not just its impact on some narrow scenarios.

If you have specific performance needs due to your unusually large data then you should consider writing your own special-purpose string library that uses a "monadic" approach like LINQ does. You can represent your strings internally as arrays of char, and then substring operations simply become copying a reference to the array and changing the start and end positions.

回答3:

It creates a new string but that's a very intelligent question and would not be inconceivable. However I think the performance losses in the majority of cases would far outweigh the memory savings for rare cases.

I recently heard of something called "ropes" which would work the way you suggest but I don't know of any implementation in .NET.

http://en.wikipedia.org/wiki/Rope_(computer_science)

回答4:

It references a brand new string.

回答5:

Ya know what, I don't know a thing about .NET.

But, I'd like to make an observation.

Most modern String packages have "copy on write" behaviors.

Specifically, that means if you allocate a substring, it will use the existing storage of the parent string, until the string has a need to change, at which point it will copy out the underlying data in to it's own new space for use.

Now, if you have immutable Strings, where the underlying data can not change, there is little reason to NOT do this. There's no way to "write" to an immutable string, so it doesn't even need copy on write functions, just sharing. C++ has mutable strings, so they do copy on write.

For example, Java does this.

Normally this is a good thing. There's little performance impact.

Where you DON'T want this to happen, though, is say in this example:

String big1MBString = readLongHonkinStringFromTheInterTubes();
static String ittyBitty = big1MBString.substring(1, 5);

You now have a "5 character" string that consumes 1MB of memory, because it shares the underlying 1MB string buffer of the large string, but it's manifested as only a 5 character string. Since you retain the reference to the larger String, internally, you'll "never" free up that original space.

Looking at the Mono sources, they do, in fact, allocate new memory. So, perhaps .NET is an exception to what seems to be common practice today. No doubt they have their valid and informed reasons (i.e. I'm not saying .NET did it wrong), just...different from what others are doing.

回答6:

SubString creates a new string. So new memory for the new strin will be allocated.

回答7:

as Reed said, string are immutable. if you're dealing with long strings, consider using a StringBuilder, it might improve performance, depending of course on what you're trying to accomplish. if you can add some details to your question, you'll surely get suggestion on the best implementation.

回答8:

Strings are immutable, so it will create a copy of the string. However if the substring matches the exact string of another string that was known at compile time, it will actually use the same memory as that substring. That's string interning.

From MSDN: "The common language runtime automatically maintains a table, called the "intern pool", which contains a single instance of each unique literal string constant declared in a program, as well as any unique instance of String you add programmatically.

The intern pool conserves string storage. If you assign a literal string constant to several variables, each variable is set to reference the same constant in the intern pool instead of referencing several different instances of String that have identical values."

The code sample is informative. You can prevent automatic interning using the [assembly: CompilationRelaxations(CompilationRelaxations.NoStringInterning)] attribute to prevent automatic string interning. You would also have to use NGEN.exe to compile it to a native image, to prevent interning.

Note that if you use a StringBuilder it avoids interning. It's only for strings that can be matched up with other strings known at compile time.

This is a modified example of the MSDN article, notice that if I pass in part of "abcd" from the Console, it is still interned, even though that str3 is constructed at runtime. However StringBuilder avoids interning.

// Sample for String.IsInterned(String)
using System;
using System.Text;
using System.Runtime.CompilerServices;
using System.Diagnostics;

// In the .NET Framework 2.0 the following attribute declaration allows you to 
// avoid the use of the interning when you use NGEN.exe to compile an assembly 
// to the native image cache.
//[assembly: CompilationRelaxations(CompilationRelaxations.NoStringInterning)]
class Sample
{
    public static void Main()
    {
        // String str1 is known at compile time, and is automatically interned.
        String str1 = "abcd";
        Console.WriteLine("Type cd and it will be ok, type anything else and Assert will fail.");
        string end = Console.ReadLine(); // Constructed, but still interned.
        string str3 = "ab" + end;

        // Constructed string, str2, is not explicitly or automatically interned.
        String str2 = new StringBuilder().Append("wx").Append("yz").ToString();
        Console.WriteLine();
        Test(1, str1);
        Test(2, str2);
        Test(3, str3);

        // Sanity checks. 
        // Debug.Assert(Object.ReferenceEquals(str3, str1)); // Assertion fails, as expected.
         Debug.Assert(Object.ReferenceEquals(string.Intern(str3), string.Intern(str1))); // Passes
         Debug.Assert(Object.ReferenceEquals(string.Intern(str3), (str1))); // Passes
         Debug.Assert(Object.ReferenceEquals((str3), string.Intern(str1))); // Fails
         Console.ReadKey();
    }

    public static void Test(int sequence, String str)
    {
        Console.Write("{0}) The string, '", sequence);
        String strInterned = String.IsInterned(str);
        if (strInterned == null)
            Console.WriteLine("{0}', is not interned.", str);
        else
            Console.WriteLine("{0}', is interned.", strInterned);
    }
}

回答9:

In the CLR strings are immutable meaning they cannot be changed. When manipulating large strings I would suggest looking at using the string builder class.