Matching strings with at least one word in common

Posted 2019-02-27 22:48

I'm writing a query to get the URIs of documents that have a specific title. My query is:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?document WHERE {
  ?document dc:title ?title.
  FILTER (?title = "…" ).
}

where "…" is actually the value of this.getTitle(), since the query string is generated by:

String queryString = "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
                "PREFIX dc: <http://purl.org/dc/elements/1.1/> SELECT ?document WHERE { " +
                "?document dc:title ?title." +
                "FILTER (?title = \"" + this.getTitle() + "\" ). }";

With the query above, I only get documents whose title is exactly equal to this.getTitle(). Suppose this.getTitle() consists of more than one word. I'd like to get a document even if only one of the words in this.getTitle() appears in the document's title. How could I do that?

1 answer
我只想做你的唯一
#2 · 2019-02-27 22:58

Let's say you've got some data like (in Turtle):

@prefix : <http://stackoverflow.com/q/20203733/1281433> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

:a dc:title "Great Gatsby" .
:b dc:title "Boring Gatsby" .
:c dc:title "Great Expectations" .
:d dc:title "The Great Muppet Caper" .

Then you can use a query like:

prefix : <http://stackoverflow.com/q/20203733/1281433>
prefix dc: <http://purl.org/dc/elements/1.1/>

select ?x ?title where {
  # this is just in place of this.getTitle().  It provides a value for
  # ?TITLE that is "Gatsby Strikes Again".
  values ?TITLE { "Gatsby Strikes Again" }

  # Select a thing and its title.
  ?x dc:title ?title .

  # Then filter based on whether the ?title matches the result
  # of replacing the spaces in ?TITLE with "|", and matching
  # case insensitively.
  filter( regex( ?title, replace( ?TITLE, " ", "|" ), "i" ))
}

to get results like

------------------------
| x  | title           |
========================
| :b | "Boring Gatsby" |
| :a | "Great Gatsby"  |
------------------------

What's particularly neat about this is that since you're generating the pattern on the fly, you could even make it based on another value from the graph pattern. For instance, if you want all pairs of things whose titles match on at least one word, you could do:

prefix : <http://stackoverflow.com/q/20203733/1281433>
prefix dc: <http://purl.org/dc/elements/1.1/>

select ?x ?xtitle ?y ?ytitle where {
  ?x dc:title ?xtitle .
  ?y dc:title ?ytitle .
  filter( regex( ?xtitle, replace( ?ytitle, " ", "|" ), "i" ) && ?x != ?y )
}
order by ?x ?y

to get:

-----------------------------------------------------------------
| x  | xtitle                   | y  | ytitle                   |
=================================================================
| :a | "Great Gatsby"           | :b | "Boring Gatsby"          |
| :a | "Great Gatsby"           | :c | "Great Expectations"     |
| :a | "Great Gatsby"           | :d | "The Great Muppet Caper" |
| :b | "Boring Gatsby"          | :a | "Great Gatsby"           |
| :c | "Great Expectations"     | :a | "Great Gatsby"           |
| :c | "Great Expectations"     | :d | "The Great Muppet Caper" |
| :d | "The Great Muppet Caper" | :a | "Great Gatsby"           |
| :d | "The Great Muppet Caper" | :c | "Great Expectations"     |
-----------------------------------------------------------------

Of course, it's very important to note that you're now generating patterns based on your data, which means that someone who can put data into your system could put in very expensive patterns that bog down the query and cause a denial of service. On a more mundane note, you could run into trouble if any of your titles contain characters that are special in regular expressions. One interesting problem arises if a title contains consecutive spaces, so that the pattern becomes The|Words|With||Two|Spaces: the empty alternative in there might make everything match. This is an interesting approach, but it comes with a lot of caveats.
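To make the double-space caveat concrete, here is a small Java sketch. It uses java.util.regex as a stand-in for the SPARQL engine's regex function, which is an assumption: SPARQL uses the XPath regex flavor, and the two are not identical, so treat this as an illustration of the empty-alternative effect rather than a statement about any particular engine. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

final class EmptyAlternativeDemo {
    // Mirrors replace(?TITLE, " ", "|") from the query:
    // every single space becomes "|", so two spaces become "||".
    static String naivePattern(String title) {
        return title.replace(" ", "|");
    }

    // Mirrors regex(?title, pattern, "i"): an unanchored,
    // case-insensitive search (Java regex, used here as a stand-in).
    static boolean matches(String candidate, String pattern) {
        return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE)
                      .matcher(candidate)
                      .find();
    }

    public static void main(String[] args) {
        String broken = naivePattern("The Words With  Two Spaces");
        System.out.println(broken);

        // No word in common, but the empty alternative still matches.
        System.out.println(matches("Great Expectations", broken));

        // Without the double space, the same title is correctly rejected.
        String fine = naivePattern("Gatsby Strikes Again");
        System.out.println(matches("Great Expectations", fine));
    }
}
```

In Java, the first pattern matches "Great Expectations" even though the two titles share no word, while the second does not.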

In general, you could do this as shown here, or by generating the regular expression in code (where you can take care of escaping, etc.), or you could use a SPARQL engine that supports some text-based extensions (e.g., jena-text, which adds Apache Lucene or Apache Solr to Apache Jena).
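As a rough sketch of the "generate the regular expression in code" option, the Java below splits the title on runs of whitespace (so no empty alternatives appear), backslash-escapes regex metacharacters in each word, and then escapes the result again for use inside a SPARQL string literal. TitleQueryBuilder and its methods are hypothetical names, not part of Jena or any other library, and the metacharacter list is an assumption that should be checked against the regex flavor your engine implements.

```java
import java.util.ArrayList;
import java.util.List;

final class TitleQueryBuilder {
    // Characters treated as regex metacharacters (an assumed list).
    private static final String META = "\\.[]{}()*+?^$|-";

    // Prefix each metacharacter in a single word with a backslash.
    static String escapeRegex(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Split on runs of whitespace, drop empty tokens, join with "|".
    static String buildPattern(String title) {
        List<String> words = new ArrayList<>();
        for (String w : title.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(escapeRegex(w));
            }
        }
        if (words.isEmpty()) {
            // An empty pattern would match every title.
            throw new IllegalArgumentException("title contains no words");
        }
        return String.join("|", words);
    }

    // Escape backslashes and double quotes, then wrap in double quotes,
    // so the value can be embedded as a SPARQL string literal.
    static String sparqlLiteral(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static String buildQuery(String title) {
        return "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
             + "SELECT ?document ?title WHERE {\n"
             + "  ?document dc:title ?title .\n"
             + "  FILTER (regex(?title, " + sparqlLiteral(buildPattern(title)) + ", \"i\"))\n"
             + "}";
    }
}
```

Calling TitleQueryBuilder.buildQuery(this.getTitle()) would then take the place of the hand-concatenated query string in the question. Note that this is still a substring match, as in the query above, so a word such as "The" would also match inside a longer word.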
