I am building a Dockerised record-playback system to help me record websites, so I can design scrapers against a local version rather than the real thing. This means that I do not swamp a website with automated requests, and it has the added advantage that I do not need to be connected to the web to work.
I have used the Java-based WireMock internally, which records from a queue of site scrapes made using Wget. I am using the WireMock API to read various pieces of information from the mappings it records.
However, I have spotted from a mapping response that domain information does not seem to be recorded (except where it happens to appear in response headers). See the following response from __admin/mappings:
{
  "result": {
    "ok": true,
    "list": [
      {
        "id": "794d609f-99b9-376d-b6b8-04dab161c023",
        "uuid": "794d609f-99b9-376d-b6b8-04dab161c023",
        "request": {
          "url": "/robots.txt",
          "method": "GET"
        },
        "response": {
          "status": 404,
          "bodyFileName": "body-robots.txt-j9qqJ.txt",
          "headers": {
            "Server": "nginx/1.0.15",
            "Date": "Wed, 04 Jan 2017 21:04:40 GMT",
            "Content-Type": "text/html",
            "Connection": "keep-alive"
          }
        }
      },
      {
        "id": "e246fac2-f9ad-3799-b7b7-066941408b8b",
        "uuid": "e246fac2-f9ad-3799-b7b7-066941408b8b",
        "request": {
          "url": "/about/careers/",
          "method": "GET"
        },
        "response": {
          "status": 200,
          "bodyFileName": "body-about-careers-GhVqy.txt",
          "headers": {
            "Server": "nginx/1.0.15",
            "Date": "Wed, 04 Jan 2017 21:04:35 GMT",
            "Content-Type": "text/html",
            "Last-Modified": "Wed, 04 Jan 2017 12:52:12 GMT",
            "Connection": "keep-alive",
            "X-CACHE-URI": "/about/careers/",
            "Accept-Ranges": "bytes"
          }
        }
      },
      {
        "id": "def378f5-a93c-333e-9663-edcd30c936d7",
        "uuid": "def378f5-a93c-333e-9663-edcd30c936d7",
        "request": {
          "url": "/about/careers/feed/",
          "method": "GET"
        },
        "response": {
          "status": 200,
          "bodyFileName": "body-careers-feed-Fd2fO.xml",
          "headers": {
            "Server": "nginx/1.0.15",
            "Date": "Wed, 04 Jan 2017 21:04:45 GMT",
            "Content-Type": "application/rss+xml; charset=UTF-8",
            "Transfer-Encoding": "chunked",
            "Connection": "keep-alive",
            "X-Powered-By": "PHP/5.3.3",
            "Vary": "Cookie",
            "X-Pingback": "http://www.example.com/xmlrpc.php",
            "Last-Modified": "Thu, 06 Jun 2013 14:01:52 GMT",
            "ETag": "\"765fc03186b121a764133349f8b716df\"",
            "X-Robots-Tag": "noindex, follow",
            "Link": "<http://www.example.com/?p=2680>; rel=shortlink",
            "X-CACHE-URI": "null cache"
          }
        }
      },
      {
        "id": "616ca6d7-6e57-4c10-8b57-f6f3dabc0930",
        "uuid": "616ca6d7-6e57-4c10-8b57-f6f3dabc0930",
        "request": {
          "method": "ANY"
        },
        "response": {
          "status": 200,
          "proxyBaseUrl": "http://www.example.com"
        },
        "priority": 10
      }
    ]
  }
}
The only clear recording of a URL is in the final entry, against proxyBaseUrl, and given that I had to specify a URL in the console call, I am now worried that if I record against a different domain, the domain that each mapping came from will be lost.
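To make the observation concrete, here is a minimal Python sketch that walks a mappings response of the shape shown above and reports every place a host name appears. The function names (`hosts_in_mapping`, `summarise`) and the trimmed `SAMPLE` data are my own illustration, not part of WireMock, and the `result`/`list` layout is simply assumed from the response I pasted.

```python
import re
from urllib.parse import urlparse

# Rough pattern for absolute URLs embedded in header values (illustrative only).
URL_PATTERN = re.compile(r"https?://[^\s<>\"';]+")


def hosts_in_mapping(mapping):
    """Return sorted (location, host) pairs for every host found in one mapping."""
    found = set()
    response = mapping.get("response", {})

    # The proxy entry carries the base URL given on the command line.
    proxy = response.get("proxyBaseUrl")
    if proxy:
        found.add(("response.proxyBaseUrl", urlparse(proxy).netloc))

    # Recorded response headers may mention the host incidentally.
    for name, value in response.get("headers", {}).items():
        for url in URL_PATTERN.findall(str(value)):
            found.add((f"response.headers.{name}", urlparse(url).netloc))

    return sorted(found)


def summarise(mappings_response):
    """Map each mapping id to the host mentions found in that mapping."""
    entries = mappings_response["result"]["list"]
    return {m["id"]: hosts_in_mapping(m) for m in entries}


# Trimmed copy of the response above: one plain stub, one stub whose
# headers happen to mention the host, and the proxy entry.
SAMPLE = {
    "result": {
        "ok": True,
        "list": [
            {
                "id": "794d609f-99b9-376d-b6b8-04dab161c023",
                "request": {"url": "/robots.txt", "method": "GET"},
                "response": {"status": 404, "headers": {"Server": "nginx/1.0.15"}},
            },
            {
                "id": "def378f5-a93c-333e-9663-edcd30c936d7",
                "request": {"url": "/about/careers/feed/", "method": "GET"},
                "response": {
                    "status": 200,
                    "headers": {
                        "X-Pingback": "http://www.example.com/xmlrpc.php",
                        "Link": "<http://www.example.com/?p=2680>; rel=shortlink",
                    },
                },
            },
            {
                "id": "616ca6d7-6e57-4c10-8b57-f6f3dabc0930",
                "request": {"method": "ANY"},
                "response": {"status": 200, "proxyBaseUrl": "http://www.example.com"},
                "priority": 10,
            },
        ],
    }
}

if __name__ == "__main__":
    for mapping_id, hosts in summarise(SAMPLE).items():
        print(mapping_id, hosts)
```

Run against the sample, the /robots.txt stub yields no host at all, which is exactly the gap I am worried about: nothing in its request block ties it to www.example.com.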
That would mean that in playback mode, WireMock would only be able to play back from one domain, and I'd have to restart it and point it to another cache in order to play back different sites. This is not workable for my use case, so is there a way around this problem?
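To illustrate what I mean by keeping domains apart, the sketch below post-processes a recorded mapping so that its request block also matches on the Host header, using WireMock's `equalTo` header matcher. This is only a guess at a possible workaround: the helper `scope_to_host` is hypothetical, I have not confirmed that it fits the recording workflow, and it assumes the client sends the original site's Host header during playback (for example by using WireMock as a proxy), with no port suffix.

```python
import copy


def scope_to_host(mapping, host):
    """Return a copy of a recorded mapping that only matches requests for `host`.

    Hypothetical helper: adds a Host header matcher to the request block
    and leaves the original mapping untouched.
    """
    scoped = copy.deepcopy(mapping)
    request = scoped.setdefault("request", {})
    headers = request.setdefault("headers", {})
    headers["Host"] = {"equalTo": host}
    return scoped


if __name__ == "__main__":
    recorded = {
        "request": {"url": "/robots.txt", "method": "GET"},
        "response": {"status": 404, "bodyFileName": "body-robots.txt-j9qqJ.txt"},
    }
    print(scope_to_host(recorded, "www.example.com"))
```

If something like this worked, stubs for /robots.txt from two different sites could coexist in one WireMock instance, because each would only match its own Host value.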
(I have done a little work with Mountebank, and would be willing to switch to it, though I find WireMock generally easier to use. My limited understanding of Mountebank is that it suffers from the same single-domain problem, though I am happy to be corrected on that. I'd be happy to swap to any robust, open-source, API-driven recording HTTP proxy, if dropping WireMock is the only way forward.)