Invalid character exception when adding Metadata t

Posted 2019-04-19 02:29

Question:

Task

Upload a file to Azure Blob Storage with the original filename and also assign the filename as meta-data to the CloudBlob

Problem

These characters are not permitted in the meta-data but are acceptable as the blob name:

š Š ñ Ñ ç Ç ÿ Ÿ ž Ž Ð œ Œ « » éèëêð ÉÈËÊ àâä ÀÁÂÃÄÅ àáâãäå ÙÚÛÜ ùúûüµ òóôõöø ÒÓÔÕÖØ ìíîï ÌÍÎÏ

Question

  • Is there a way to store these characters in the meta-data? Are we missing some setting that causes this exception?
  • Most of these characters are standard glyphs in some languages, so how should we handle that?
  • Is there any documentation available that advises about this issue? I found blob and meta-data naming conventions, but none about the data itself!

Code

var dirtyFileName      = file.FileName;
var normalizedFileName = file.FileName.CleanOffDiacriticAndNonASCII();

// The blob name accepts almost all characters that are acceptable in Windows filenames
var blob = container.GetBlobReference(dirtyFileName);

// Set the metadata and properties before uploading.
blob.Metadata["FileName"] = normalizedFileName;
blob.Attributes.Properties.ContentType = file.ContentType;

// Upload content to the blob, which will create the blob if it does not already exist.
// ERROR: Occurs here!
blob.UploadFromStream(file.InputStream);

blob.SetMetadata();
blob.SetProperties();

Error

References

  • Naming and Referencing Containers, Blobs, and Metadata
  • How to support other languages in Azure blob storage?
  • How do I remove diacritics (accents) from a string in .NET?
  • Azure CloudBlob SetMetadata fails with "The metadata specified is invalid. It has characters that are not permitted."
  • Replacing characters in C# (ascii)

Workarounds

Illegal characters in the filename are only the tip of the iceberg, magnified only for the purpose of this question! The bigger picture is that we index these files using Lucene.net and therefore need a lot of meta-data stored on the blob. Please don't suggest storing it all separately in a database, just don't! Until now we have been lucky to come across only one file with diacritic characters!

So, at the moment, our workaround is to avoid saving the filename in the meta-data!

Answer 1:

I have just had confirmation from the azure-sdk-for-net team on GitHub that only ASCII characters are valid as data within blob meta-data.

joeg commented:
The supported characters in the blob metadata must be ASCII characters. To work around this you can either escape the string (percent encode), base64 encode etc.

Source on GitHub

So as a work-around, either:

  • escape the string (percent encode), base64 encode, etc, as suggested by joeg
  • use the techniques that I have mentioned in my other answer.
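The percent-encoding option could be sketched as below. This is a minimal sketch rather than anything from the SDK: the class and method names (MetadataPercentEncoding, ToMetadataValue, FromMetadataValue) are made up for illustration, and it relies only on Uri.EscapeDataString and Uri.UnescapeDataString from the base class library, which escape non-ASCII characters as UTF-8 percent sequences.

```csharp
using System;

// Hypothetical helper, not part of the Azure SDK.
public static class MetadataPercentEncoding
{
    // Percent-encodes the value so that the result contains only ASCII characters.
    public static string ToMetadataValue(string value)
    {
        return value == null ? null : Uri.EscapeDataString(value);
    }

    // Reverses ToMetadataValue and restores the original string.
    public static string FromMetadataValue(string value)
    {
        return value == null ? null : Uri.UnescapeDataString(value);
    }
}
```

With the variables from the question, that would presumably be used as blob.Metadata["FileName"] = MetadataPercentEncoding.ToMetadataValue(file.FileName); and decoded again after reading the metadata back. Unlike stripping characters, the round trip loses nothing.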



Answer 2:

Unless I get an answer that actually solves the issue, this workaround is a solution for the above issue!

Workaround

To get this to work, I am using a combination of the below methods to:

  1. Convert all possible characters to their ASCII/English equivalent
  2. Strip out any invalid characters that escape this cleanup

But this isn't ideal as we are losing data!

Diacritics to ASCII

// Requires: using System.Globalization; using System.Linq; using System.Text;
// Extension methods must be declared inside a static class.

/// <summary>
/// Converts all Diacritic characters in a string to their ASCII equivalent
/// Courtesy: http://stackoverflow.com/a/13154805/476786
/// A quick explanation:
/// * Normalizing to form D splits characters like è into an e and a nonspacing `
/// * From this, the nonspacing characters are removed
/// * The result is normalized back to form C (I'm not sure if this is necessary)
/// Characters that have no decomposition (e.g. Ð, œ, ø) pass through unchanged
/// </summary>
/// <param name="value">The string to clean; may be null</param>
/// <returns>The string without its diacritic marks, or null if the input was null</returns>
public static string ConvertDiacriticToASCII(this string value)
{
    if (value == null) return null;

    // Decompose, then keep everything except the combining (nonspacing) marks
    var chars =
        value.Normalize(NormalizationForm.FormD)
             .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);

    return new string(chars.ToArray()).Normalize(NormalizationForm.FormC);
}

Non-ASCII Burninator

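/// <summary>
/// Removes all non-ASCII characters from the string
/// Courtesy: http://stackoverflow.com/a/135473/476786
/// Uses the .NET ASCII encoding to convert a string.
/// UTF8 is used during the conversion because it can represent any of the original characters.
/// It uses an EncoderReplacementFallback to convert any non-ASCII character to an empty string.
/// </summary>
/// <param name="value">The string to clean; may be null</param>
/// <returns>The string with every non-ASCII character removed, or null if the input was null</returns>
public static string RemoveNonASCII(this string value)
{
    if (value == null) return null;

    // ASCII encoder that replaces each character it cannot represent with an empty string
    var asciiStripper = Encoding.GetEncoding(
        Encoding.ASCII.WebName, // WebName ("us-ascii") is the IANA name that GetEncoding looks up
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    // Convert the UTF-8 bytes to ASCII bytes, dropping whatever ASCII cannot represent
    byte[] asciiBytes = Encoding.Convert(Encoding.UTF8, asciiStripper, Encoding.UTF8.GetBytes(value));
    return Encoding.ASCII.GetString(asciiBytes);
}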
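
Chained together, the two steps listed in this answer might look like the sketch below. The question calls CleanOffDiacriticAndNonASCII without showing its body, so this is a hypothetical reconstruction, not the original method. For brevity it replaces the Encoding-based removal with a plain character filter, which is a swapped-in technique with the same intent.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

public static class FileNameCleanup
{
    // Hypothetical reconstruction: the question uses this name but never shows the body.
    public static string CleanOffDiacriticAndNonASCII(this string value)
    {
        if (value == null) return null;

        // Step 1: decompose (form D) and drop the combining marks, so "é" becomes "e".
        var withoutMarks = value
            .Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);

        // Step 2: drop whatever is still outside the ASCII range
        // (simple filter, used here in place of the Encoding-based approach).
        return new string(withoutMarks.Where(c => c <= 0x7F).ToArray());
    }
}
```

For example, "Crème brûlée.pdf" becomes "Creme brulee.pdf", while "Œuvre.txt" becomes "uvre.txt" because Œ has no decomposition and is simply removed. That second case is the data loss mentioned above.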

I really hope to get an answer as the workaround is obviously not ideal, and it also doesn't make sense why this is not possible!



Answer 3:

To expand on the answer by bPratik, we've found that Base64 encoding metadata works nicely. We use these extension methods to do the encode and decode:

    public static class Base64Extensions
    {
        public static string ToBase64(this string input)
        {
            var bytes = Encoding.UTF8.GetBytes(input);
            return Convert.ToBase64String(bytes);
        }

        public static string FromBase64(this string input)
        {
            var bytes = Convert.FromBase64String(input);
            return Encoding.UTF8.GetString(bytes);
        }
    }

and then when setting blob metadata:

blobReference.Metadata["Filename"] = filename.ToBase64();

and when retrieving it:

var filename = blobReference.Metadata["Filename"].FromBase64();

For search, you would have to decode the filename before presenting it to the indexer, or use the blob's actual filename assuming you're still using the original filename there.
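
Decoding before indexing could be done in one pass over the metadata, roughly as sketched here. The MetadataDecoding class and its DecodeAll method are hypothetical, and the sketch assumes that the metadata is available as an IDictionary<string, string> and that every value was written as Base64; a value stored as plain text would make Convert.FromBase64String throw or decode to garbage.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the Azure SDK.
public static class MetadataDecoding
{
    // Returns a copy of the metadata with every Base64 value decoded back to text.
    // Assumption: all values were stored as Base64-encoded UTF-8 strings.
    public static Dictionary<string, string> DecodeAll(IDictionary<string, string> metadata)
    {
        var decoded = new Dictionary<string, string>();
        foreach (var pair in metadata)
        {
            byte[] bytes = Convert.FromBase64String(pair.Value);
            decoded[pair.Key] = Encoding.UTF8.GetString(bytes);
        }
        return decoded;
    }
}
```

The decoded dictionary, rather than the raw metadata, would then be handed to the indexer.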



Answer 4:

If the above list is exhaustive, it should be possible to HTML-encode the metadata and then decode it when you need it:

var htmlEncodedValue = System.Web.HttpUtility.HtmlEncode(value);
var originalValue = System.Web.HttpUtility.HtmlDecode(htmlEncodedValue);
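
Since this approach depends on the encoder escaping every problem character, a guard that checks the result is pure ASCII may be worth adding. The sketch below is an assumption-laden illustration: HtmlMetadata and its members are hypothetical names, and it uses System.Net.WebUtility instead of System.Web.HttpUtility so that it does not need a reference to System.Web. To my understanding, HTML encoding escapes characters in the U+00A0 to U+00FF range but may leave higher ones (such as š or Œ) untouched, which is the case the guard is meant to catch.

```csharp
using System;
using System.Linq;
using System.Net;

// Hypothetical helper, not part of the Azure SDK.
public static class HtmlMetadata
{
    // True when every character is in the ASCII range.
    public static bool IsAscii(string value)
    {
        return value.All(c => c <= 0x7F);
    }

    // HTML-encodes the value and fails fast if non-ASCII characters remain.
    public static string Encode(string value)
    {
        string encoded = WebUtility.HtmlEncode(value);
        if (!IsAscii(encoded))
        {
            throw new ArgumentException("Value still contains non-ASCII characters after HTML encoding.");
        }
        return encoded;
    }

    // Restores the original text from an HTML-encoded value.
    public static string Decode(string value)
    {
        return WebUtility.HtmlDecode(value);
    }
}
```

If the guard throws for a given filename, percent encoding or Base64, as suggested in the other answers, is the safer choice for that value.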