Why doesn't CosmosDB index arrays by default? The default index path is
"path": "/*"
Doesn't that mean "index everything"? Not "index everything except arrays".
If I add my array field to the index with something like this:
"path": "/tags/[]/?"
It will work and start indexing that particular array field.
But my question is why doesn't "index everything" index everything?
EDIT: Here's a blog post that describes the behavior I'm seeing. http://www.devwithadam.com/2017/08/querying-for-items-in-array-in-cosmosdb.html Array_Contains queries are very slow, clearly not using the index. If you add the field in question to the index explicitly then the queries are fast (clearly they start using the index).
Cosmos DB does indexes all the element of an Array. By, default, All Azure Cosmos DB data is indexed. Read more here https://docs.microsoft.com/en-us/azure/cosmos-db/indexing-policies
"New" index layout
As stated in Index Types
The below issue does not apply to the new index layout. There the default indexing policy works fine (and delivers the results in
36.55 RUs
). However pre-existing collections may still be using the old layout."Old" index layout
I was able to reproduce the issue with
ARRAY_CONTAINS
that you are asking about.Setting up a CosmosDB collection with 100,000 posts from the SO data dump (e.g. this question would be represented as below)
And then performing the following query
The query took over 2,000 RUs with default indexing policy and 93 with the following addition (as shown in your linked article)
However what you are seeing here is not that the array values aren't being indexed by default. It is just that the default range index is not useful for your query.
The range index uses keys based on partial forward paths. So will contain paths such as the following.
tags/0/azure
tags/0/c#
tags/0/oracle
tags/0/sql-server
tags/1/azure-cosmosdb
tags/1/c#
tags/1/sql-server
With this index structure it starts at
tags/0/sql-server
and then reads all of the remainingtags/0/
entries and the entirety of the entries fortags/n/
wheren
is an integer greater than0
. Each distinct document mapping to any of these needs to be retrieved and evaluated.By contrast the hash index uses reverse paths (more details - PDF)
StackOverflow theoretically allows a maximum of 5 tags per question to be added by the UI so in this case (ignoring the fact that a few questions have more tags through site admin activities) the reverse paths of interest are
sql-server/0/tags
sql-server/1/tags
sql-server/2/tags
sql-server/3/tags
sql-server/4/tags
With the reverse path structure finding all paths with leaf nodes of value sql-server is straight forward.
In this specific use case as the arrays are bounded to a maximum of 5 possible values it is also possible to use the original range index efficiently by looking at just those specific paths.
The following query took 97 RUs with default indexing policy in my test collection.