xpath: string manipulation

So in my Scrapy project I was able to isolate some particular fields. One of the fields returns something like:

[Rank Info] on 2013-06-27 14:26 Read 174 Times

which was selected by expression:

(//td[@class="show_content"]/text())[4]

I usually do post-processing to extract the datetime information, i.e., 2013-06-27 14:26. Now that I've learned a little more about XPath substring manipulation, I am wondering whether it is possible to extract that piece of information in the first place, i.e., in the XPath expression itself.

Thanks,

Tags: python xpath scrapy

3 Answers

Lonely孤独者°

#2 · 2019-02-16 01:06

Scrapy uses XPath 1.0, which has very limited string manipulation capabilities and, in particular, no regular expression functions. There are two ways to cut down a string. I demonstrate both with an example that strips the text down to the substring you're looking for.

By Character Index

This works if the character positions are fixed, even if the contents change. Note that XPath counts characters starting from 1.

substring($string, $start, $len)
substring(//td[@class="show_content"]/text(), 16, 16)
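
To check the index arithmetic outside of Scrapy, the 1-based counting of substring() can be mimicked in plain Python. This is only a sketch: xpath_substring is a hypothetical helper, not part of Scrapy or lxml, and it assumes integer arguments (real XPath 1.0 also rounds non-integer ones).

```python
SAMPLE = "[Rank Info] on 2013-06-27 14:26 Read 174 Times"


def xpath_substring(text: str, start: int, length: int) -> str:
    """Mimic XPath 1.0 substring() for integer arguments (hypothetical helper)."""
    # XPath positions are 1-based: keep positions p with start <= p < start + length.
    first = max(start, 1)
    stop = start + length  # first position that is no longer included
    if stop <= first:
        return ""
    # Convert the 1-based positions to a 0-based Python slice.
    return text[first - 1:stop - 1]


print(xpath_substring(SAMPLE, 16, 16))  # date and time
print(xpath_substring(SAMPLE, 16, 10))  # date only
```

With the sample string from the question, a length of 16 covers both the date and the time, while a length of 10 stops after the date.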

By pre-/suffix Search

This is fine if the index can change, but the contents immediately before and after the string stay the same:

substring-before($string, $needle)
substring-after($string, $needle)
substring-before(
  substring-after(//td[@class="show_content"]/text(), 'on '), ' Read')
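
For comparison, here is roughly what the nested call does, written as plain Python. The two functions are hypothetical stand-ins for the XPath functions of the same name (they are not a Scrapy API), and they assume the needle, when present, should be matched at its first occurrence.

```python
SAMPLE = "[Rank Info] on 2013-06-27 14:26 Read 174 Times"


def substring_after(text: str, needle: str) -> str:
    """Text after the first occurrence of needle, or "" if it is absent."""
    if needle == "":
        return text  # an empty needle matches at the very start
    _, found, tail = text.partition(needle)
    return tail if found else ""


def substring_before(text: str, needle: str) -> str:
    """Text before the first occurrence of needle, or "" if it is absent."""
    if needle == "":
        return ""  # nothing precedes a match at the very start
    head, found, _ = text.partition(needle)
    return head if found else ""


# Same nesting as the XPath expression: cut the prefix first, then the suffix.
print(substring_before(substring_after(SAMPLE, "on "), " Read"))
```

If either marker is missing the result is the empty string, which is also how the XPath functions signal "not found".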


贪生不怕死

#3 · 2019-02-16 01:07

In all of the other answers so far, the /text() is not only unhelpful, it is potentially (or even likely) a problem. Readers of the archive should be aware of the problems with using /text() in the node addresses passed as function arguments. In my professional work, there are very (very!) few requirements for addressing text() directly.

I'm speaking of these expressions from the other posts:

substring-after(//td[@class='show_content']/text(), 'on ')

and

substring(//td[@class='show_content']/text(), 16, 10)

Let's put aside the issue that "//" is used when it shouldn't be used. In XSLT 1.0 only the first node of the selection would be considered, and in XSLT 2.0 a run-time error would be triggered if the first argument were anything more than a singleton.

Consider this modified XML if it were the input:

   <td>[<emphasis>Rank Info</emphasis>] on 2013-06-27 14:26 Read 174 Times</td>

... where the " on " is in the second text node (the first text node has "[" in it). In XSLT 1.0, both expressions return the empty string. In XSLT 2.0 both expressions trigger run-time errors.

Consider this modified XML if it were the input:

   <td>[Rank Info]<emphasis> on </emphasis>2013-06-27 14:26 Read 174 Times</td>

In this input the text() children of <td> do not include the string "on" at all, because it sits in a descendant text node, not a child text node.

For both expressions, then, the following would work for both of the modified inputs, because one is then dealing with the value of the element, not the value of the text nodes. The value of the element is the concatenation of all descendant text nodes.

So:

substring-after(td[@class='show_content'], 'on ')

and

substring(td[@class='show_content'], 16, 10)

would act on the entire string value found in the element. But even the above is going to have cardinality problems if there is more than one <td> child, so the expression will have to be rewritten anyway.
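
The difference between the first text child and the element's string value can be seen with a short Python sketch. It uses the standard library's xml.etree.ElementTree as a stand-in (an assumption made for illustration; it is not what Scrapy uses internally), where .text plays the role of the first child text node and itertext() yields all descendant text nodes. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two modified inputs from above.
INPUTS = [
    "<td>[<emphasis>Rank Info</emphasis>] on 2013-06-27 14:26 Read 174 Times</td>",
    "<td>[Rank Info]<emphasis> on </emphasis>2013-06-27 14:26 Read 174 Times</td>",
]


def first_text_child(td: ET.Element) -> str:
    """Text before the first child element, i.e. the first child text node."""
    return td.text or ""


def string_value(td: ET.Element) -> str:
    """Concatenation of all descendant text nodes, like the element's string value."""
    return "".join(td.itertext())


for markup in INPUTS:
    td = ET.fromstring(markup)
    print(repr(first_text_child(td)), "|", repr(string_value(td)))
```

For both inputs the first text child lacks the "on " marker, while the string value is the full sentence, so a substring search on the element itself still finds the date.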

My point is, the use of text() caught my eye and I tell my students if they think they need to use text() in an XPath expression, they need to think again because in most cases they do not.


ら.Afraid

#4 · 2019-02-16 01:14

This should work; note that a length of 10 returns only the date, so use 16 to include the time as well:

substring(//td[@class="show_content"]/text(), 16, 10)

But I agree with Blender: in-code post-processing is better for this purpose.
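
A minimal sketch of such post-processing in Python, assuming the field always contains a timestamp in the YYYY-MM-DD HH:MM form shown in the question. The name extract_datetime is made up for this example.

```python
import re
from datetime import datetime
from typing import Optional

# Matches a timestamp such as "2013-06-27 14:26" anywhere in the text.
DATETIME_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}")


def extract_datetime(raw: str) -> Optional[datetime]:
    """Return the first timestamp found in raw, or None if there is none."""
    match = DATETIME_RE.search(raw)
    if match is None:
        return None
    return datetime.strptime(match.group(), "%Y-%m-%d %H:%M")


raw_field = "[Rank Info] on 2013-06-27 14:26 Read 174 Times"
print(extract_datetime(raw_field))
```

Unlike the index-based XPath, this does not depend on where the timestamp sits in the string, only on its format.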
