I am trying to match <input>
type “hidden” fields using this pattern:
/<input type="hidden" name="([^"]*?)" value="([^"]*?)" />/
This is sample form data:
<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" /><input type="hidden" name="__VIEWSTATE0" value="3" /><input type="hidden" name="__VIEWSTATE" value="" /><input type="hidden" name="__VIEWSTATE" value="" />
But I am not sure that the type
, name
, and value
attributes will always appear in the same order. If the type
attribute comes last, the match will fail because in my pattern it’s at the start.
Question:
How can I change my pattern so it will match regardless of the positions of the attributes in the <input>
tag?
P.S.: By the way I am using the Adobe Air based RegEx Desktop Tool for testing regular expressions.
I would like to use
**DOMDocument**
to extract the html code.BTW, you can test it in here - regex101.com. It shows the result at real time. Some rules about Regexp: http://www.eclipse.org/tptp/home/downloads/installguide/gla_42/ref/rregexp.html Reader.
Contrary to all the answers here, for what you're trying to do regex is a perfectly valid solution. This is because you are NOT trying to match balanced tags-- THAT would be impossible with regex! But you are only matching what's in one tag, and that's perfectly regular.
Here's the problem, though. You can't do it with just one regex... you need to do one match to capture an
<input>
tag, then do further processing on that. Note that this will only work if none of the attribute values have a>
character in them, so it's not perfect, but it should suffice for sane inputs.Here's some Perl (pseudo)code to show you what I mean:
The basic principle here is, don't try to do too much with one regular expression. As you noticed, regular expressions enforce a certain amount of order. So what you need to do instead is to first match the CONTEXT of what you're trying to extract, then do submatching on the data you want.
EDIT: However, I will agree that in general, using an HTML parser is probably easier and better and you really should consider redesigning your code or re-examining your objectives. :-) But I had to post this answer as a counter to the knee-jerk reaction that parsing any subset of HTML is impossible: HTML and XML are both irregular when you consider the entire specification, but the specification of a tag is decently regular, certainly within the power of PCRE.
In the spirit of Tom Christiansen's lexer solution, here's a link to Robert Cameron's seemingly forgotten 1998 article, REX: XML Shallow Parsing with Regular Expressions.
http://www.cs.sfu.ca/~cameron/REX.html
If you enjoy reading about regular expressions, Cameron's paper is fascinating. His writing is concise, thorough, and very detailed. He's not simply showing you how to construct the REX regular expression but also an approach for building up any complex regex from smaller parts.
I've been using the REX regular expression on and off for 10 years to solve the sort of problem the initial poster asked about (how do I match this particular tag but not some other very similar tag?). I've found the regex he developed to be completely reliable.
REX is particularly useful when you're focusing on lexical details of a document -- for example, when transforming one kind of text document (e.g., plain text, XML, SGML, HTML) into another, where the document may not be valid, well formed, or even parsable for most of the transformation. It lets you target islands of markup anywhere within a document without disturbing the rest of the document.
you can try this :
and for closer result you can try this :
you can test your regex pattern here http://regexpal.com/
these pattens are good for this:
and for random order of
type
,name
andvalue
u can use this :or
on this :
`
by the way i think you want something like this :
its not good but it works in any way.
test it in : http://regexpal.com/
//input[@type="hidden"]
. Or if you don't want to use xpath, just get all inputs and filter which ones are hidden withgetAttribute
.I prefer #2.
Result:
Oh Yes You Can Use Regexes to Parse HTML!
For the task you are attempting, regexes are perfectly fine!
It is true that most people underestimate the difficulty of parsing HTML with regular expressions and therefore do so poorly.
But this is not some fundamental flaw related to computational theory. That silliness is parroted a lot around here, but don’t you believe them.
So while it certainly can be done (this posting serves as an existence proof of this incontrovertible fact), that doesn’t mean it should be.
You must decide for yourself whether you’re up to the task of writing what amounts to a dedicated, special-purpose HTML parser out of regexes. Most people are not.
But I am. ☻
General Regex-Based HTML Parsing Solutions
First I’ll show how easy it is to parse arbitrary HTML with regexes. The full program’s at the end of this posting, but the heart of the parser is:
See how easy that is to read?
As written, it identifies each piece of HTML and tells where it found that piece. You could easily modify it to do whatever else you want with any given type of piece, or for more particular types than these.
I have no failing test cases (left :): I’ve successfully run this code on more than 100,000 HTML files — every single one I could quickly and easily get my hands on. Beyond those, I’ve also run it on files specifically constructed to break naïve parsers.
This is not a naïve parser.
Oh, I’m sure it isn’t perfect, but I haven’t managed to break it yet. I figure that even if something did, the fix would be easy to fit in because of the program’s clear structure. Even regex-heavy programs should have stucture.
Now that that’s out of the way, let me address the OP’s question.
Demo of Solving the OP’s Task Using Regexes
The little
html_input_rx
program I include below produces the following output, so that you can see that parsing HTML with regexes works just fine for what you wish to do:Parse Input Tags, See No Evil Input
Here’s the source for the program that produced the output above.
There you go! Nothing to it! :)
Only you can judge whether your skill with regexes is up to any particular parsing task. Everyone’s level of skill is different, and every new task is different. For jobs where you have a well-defined input set, regexes are obviously the right choice, because it is trivial to put some together when you have a restricted subset of HTML to deal with. Even regex beginners should be handle those jobs with regexes. Anything else is overkill.
However, once the HTML starts becoming less nailed down, once it starts to ramify in ways you cannot predict but which are perfectly legal, once you have to match more different sorts of things or with more complex dependencies, you will eventually reach a point where you have to work harder to effect a solution that uses regexes than you would have to using a parsing class. Where that break-even point falls depends again on your own comfort level with regexes.
So What Should I Do?
I’m not going to tell you what you must do or what you cannot do. I think that’s Wrong. I just want to present you with possibilties, open your eyes a bit. You get to choose what you want to do and how you want to do it. There are no absolutes — and nobody else knows your own situation as well as you yourself do. If something seems like it’s too much work, well, maybe it is. Programming should be fun, you know. If it isn’t, you may be doing it wrong.
One can look at my
html_input_rx
program in any number of valid ways. One such is that you indeed can parse HTML with regular expressions. But another is that it is much, much, much harder than almost anyone ever thinks it is. This can easily lead to the conclusion that my program is a testament to what you should not do, because it really is too hard.I won’t disagree with that. Certainly if everything I do in my program doesn’t make sense to you after some study, then you should not be attempting to use regexes for this kind of task. For specific HTML, regexes are great, but for generic HTML, they’re tantamount to madness. I use parsing classes all the time, especially if it’s HTML I haven’t generated myself.
Regexes optimal for small HTML parsing problems, pessimal for large ones
Even if my program is taken as illustrative of why you should not use regexes for parsing general HTML — which is OK, because I kinda meant for it to be that ☺ — it still should be an eye-opener so more people break the terribly common and nasty, nasty habit of writing unreadable, unstructured, and unmaintainable patterns.
Patterns do not have to be ugly, and they do not have to be hard. If you create ugly patterns, it is a reflection on you, not them.
Phenomenally Exquisite Regex Language
I’ve been asked to point out that my proferred solution to your problem has been written in Perl. Are you surprised? Did you not notice? Is this revelation a bombshell?
I must confess that I find this request bizarre in the extreme, since anybody who can’t figure that out from looking at the very first line of my program surely has other mental disabilities as well.
It is true that not all other tools and programming languages are quite as convenient, expressive, and powerful when it comes to regexes as Perl is. There’s a big spectrum out there, with some being more suitable than others. In general, the languages that have expressed regexes as part of the core language instead of as a library are easier to work with. I’ve done nothing with regexes that you couldn’t do in, say, PCRE, although you would structure the program differently if you were using C.
Eventually other languages will be catch up with where Perl is now in terms of regexes. I say this because back when Perl started, nobody else had anything like Perl’s regexes. Say anything you like, but this is where Perl clearly won: everybody copied Perl’s regexes albeit at varying stages of their development. Perl pioneered almost (not quite all, but almost) everything that you have come to rely on in modern patterns today, no matter what tool or language you use. So eventually the others will catch up.
But they’ll only catch up to where Perl was sometime in the past, just as it is now. Everything advances. In regexes if nothing else, where Perl leads, others follow. Where will Perl be once everybody else finally catches up to where Perl is now? I have no idea, but I know we too will have moved. Probably we’ll be closer to Perl₆’s style of crafting patterns.
If you like that kind of thing but would like to use it in Perl₅, you might be interested in Damian Conway’s wonderful Regexp::Grammars module. It’s completely awesome, and makes what I’ve done here in my program seem just as primitive as mine makes the patterns that people cram together without whitespace or alphabetic identifiers. Check it out!
Simple HTML Chunker
Here is the complete source to the parser I showed the centerpiece from at the beginning of this posting.
I am not suggesting that you should use this over a rigorously tested parsing class. But I am tired of people pretending that nobody can parse HTML with regexes just because they can’t. You clearly can, and this program is proof of that assertion.
Sure, it isn’t easy, but it is possible!
And trying to do so is a terrible waste of time, because good parsing classes exist which you should use for this task. The right answer to people trying to parse arbitrary HTML is not that it is impossible. That is a facile and disingenuous answer. The correct and honest answer is that they shouldn’t attempt it because it is too much of a bother to figure out from scratch; they should not break their back striving to reïnvent a wheel that works perfectly well.
On the other hand, HTML that falls within a predicable subset is ultra-easy to parse with regexes. It’s no wonder people try to use them, because for small problems, toy problems perhaps, nothing could be easier. That’s why it’s so important to distinguish the two tasks — specific vs generic — as these do not necessarily demand the same approach.
I hope in the future here to see a more fair and honest treatment of questions about HTML and regexes.
Here’s my HTML lexer. It doesn’t try to do a validating parse; it just identifies the lexical elements. You might think of it more as an HTML chunker than an HTML parser. It isn’t very forgiving of broken HTML, although it makes some very small allowances in that direction.
Even if you never parse full HTML yourself (and why should you? it’s a solved problem!), this program has lots of cool regex bits that I believe a lot of people can learn a lot from. Enjoy!