Using Python/Selenium/Best Tool For The Job to get

2019-04-16 01:49发布

I have some JavaScript from a 3rd party vendor that is initiating an image request. I would like to figure out the URI of this image request.

I can load the page in my browser, and then monitor "Live HTTP Headers" or "Tamper Data" in order to figure out the image request URI, but I would prefer to create a command line process to do this.

My intuition is that it might be possible using python + qtwebkit, but perhaps there is a better way.

To clarify: I might have this (overly simplified code).

<script>
suffix = magicNumberFunctionIDontHaveAccessTo();
url = "http://foobar.com/function?parameter=" + suffix
img = document.createElement('img'); img.src=url; document.all.body.appendChild(img);
</script>

Then once the page is loaded, I can go figure out the url by sniffing the packets. But I can't just figure it out from the source, because I can't predict the outcome of magicNumberFunction...().

Any help would be muchly appreciated!

Thank you.

5条回答
地球回转人心会变
2楼-- · 2019-04-16 02:33

Use Firebug Firefox plugin. It will show you all requests in real time and you can even debug the JS in your Browser or run it step-by-step.

查看更多
孤傲高冷的网名
3楼-- · 2019-04-16 02:37

I would pick any one of the many http proxy servers written in Python -- probably one of the simplest ones at the very top of the list -- and tweak it to record all URLs requested (as well as proxy-serve them) e.g. appending them to a text file -- without loss of generality, call that text file 'XXX.txt'.

Now all you need is a script that: starts the proxy server in question; starts Firefox (or whatever) on your main desired URL with the proxy in question set as your proxy (see e.g. this SO question for how), though I'm sure other browsers would work just as well; waits a bit (e.g. until the proxy's XXX.txt file has not been altered for more than N seconds); reads XXX.txt to extract only the URLs you care about and record them wherever you wish; turns down the proxy and Firefox processes.

I think this will be much faster to put in place and make work correctly, for your specific requirements, than any more general solution based on qtwebkit, selenium, or other "automation kits".

查看更多
甜甜的少女心
4楼-- · 2019-04-16 02:45

Why can't you just read suffix, or url for that matter? Is the image loaded in an iframe or in your page?

If it is loaded in your page, then this may be a dirty hack (substitute document.body for whatever element is considered):

var ac = document.body.appendChild;
var sources = [];

document.body.appendChild = function(child) {
    if (/^img$/i.test(child.tagName)) {
        sources.push(child.getAttribute('src'));
    }
    ac(child);
}
查看更多
Ridiculous、
5楼-- · 2019-04-16 02:50

Ultimately, I did it in python, using Selenium-RC. This solution requires the python files for selenium-rc, and you need to start the java server ("java -jar selenium-server.jar")

from selenium import selenium
import unittest
import lxml.html

class TestMyDomain(unittest.TestCase):
    def setUp(self):
        self.selenium = selenium("localhost", \
            4444, "*firefox", "http://www.MyDomain.com")
        self.selenium.start()

    def test_mydomain(self):

        htmldoc = open('site-list.html').read()
        url_list = [link for (element, attribute,link,pos) in lxml.html.iterlinks(htmldoc)]
        for url in url_list:

            try: 
                sel = self.selenium
                sel.open(url)        
                sel.select_window("null")
                js_code = '''
                myDomainWindow = this.browserbot.getUserWindow();
                for(obj in myDomainWindow) {  

                   /* This code grabs the OMNITURE tracking pixel img */
                    if ((obj.substring(0,4) == 's_i_') && (myDomainWindow[obj].src)) {              
                        var ret = myDomainWindow[obj].src;
                    } 
                }        
                ret;
                '''
                omniture_url = sel.get_eval(js_code) #parse&process this however you want


            except Exception, e:
                print 'We ran into an error: %s' % (e,)


        self.assertEqual("expectedValue", observedValue)


    def tearDown(self):
        self.selenium.stop()

if __name__ == "__main__":
    unittest.main()
查看更多
迷人小祖宗
6楼-- · 2019-04-16 02:55

The simplest thing to do might be to use something like HtmlUnit and skip a real browser entirely. By using Rhino, it can evaluate JavaScript and likely be used to extract that URL out.

That said, if you can't get that working, try out Selenium RC and use the captureNetworkTraffic command (which requires the Selenium instant be started with an option of captureNetworkTraffic=true). This will launch Firefox with a proxy configured and then let you pull the request info back out as JSON/XML/plain text. Then you can parse that content and get what you want.

Try out the instant test tool that my company offers. If the data you're looking for is in our results (after you click View Details), you'll be able to get it from Selenium. I know, since I wrote the captureNetworkTraffic API for Selenium for my company, BrowserMob.

查看更多
登录 后发表回答