Web Page Source Annotation Tool

Posted 2019-01-20 00:37

Question:

<ul class="contact">
<li class="first">Carnegie Mellon University</li>
<li>5000 Forbes Avenue, Pittsburgh, PA 15213</li>
</ul>

=>

<ul class="contact">
<li class="first">[univ]Carnegie Mellon University[/univ]</li>
<li>[address]5000 Forbes Avenue, Pittsburgh, PA 15213[/address]</li>
</ul>

The example above shows the semantic annotation process I'd like to apply to thousands of web pages. To make my intention clear: I'd like to download all these pages and annotate them with user-defined tags (e.g., univ, address). The annotated pages will be processed later.
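For the later processing step, the bracketed tags can be read back with a small parser. The following is only a minimal sketch: the function name `extractAnnotations` is hypothetical, it assumes tags are written exactly as `[tag]...[/tag]`, and it does not handle nested tags of the same name.

```javascript
// Hypothetical helper: collect every [tag]text[/tag] span from an annotated page.
function extractAnnotations(source) {
    // \1 forces the closing tag to match the opening one;
    // [\s\S]*? lazily matches any characters, including newlines.
    var pattern = /\[([A-Za-z][\w-]*)\]([\s\S]*?)\[\/\1\]/g;
    var results = [];
    var match;
    while ((match = pattern.exec(source)) !== null) {
        results.push({ tag: match[1], text: match[2] });
    }
    return results;
}

// The annotated example from the question.
var annotated =
    '<ul class="contact">\n' +
    '<li class="first">[univ]Carnegie Mellon University[/univ]</li>\n' +
    '<li>[address]5000 Forbes Avenue, Pittsburgh, PA 15213[/address]</li>\n' +
    '</ul>';

var found = extractAnnotations(annotated);
console.log(found);
```

Each entry carries the tag name and the annotated text, so the pages can stay as ordinary HTML files until they are processed.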

The most naive method I've been using is to download the page, open the file in a text editor, edit it, and save it. It's just too cumbersome.

A better approach I've been using is to select the target text in the browser, inspect it with a tool like Firebug, edit the text in the inspector view, and save the edited page. This eases some of the burden, but there's still much room for improvement.

The ideal tool would let me select the target text on the page within a browser, pick the corresponding annotation tag (preferably from a toolbar that displays all available tags), and save the edited page with a single button.

If anyone knows of such a tool, that would be great, but I doubt one exists. Chances are I'll have to write it myself. The question is, how should I get started? I have little experience developing web applications.

Should I write a browser extension? Should I write a standalone application? In what language?

EDIT: A simpler version of the question can be found here. A complete working solution is preferred. I assume this problem is no big deal for experienced web developers but it might really take a while for me and I need to implement the function as soon as possible.

EDIT: Also, I prefer a tool bar right after selection to a drop-down list after right click. I think the tool bar for chrome extension Diigo Web Collector is quite cool.

What's more, a save button is needed on the tool bar to save the edited page to default or user specified place on computer. The buttons on the tool bar should be in this order: univ address tag3 tag4 ... tagn save. The save button comes last.

EDIT: It seems impossible to save the edited page to the local file system using JavaScript alone. In both Chrome and Firefox, after editing the page source in the inspector view, there is a save button that writes the edited page to the local file system. How is that implemented? What I dislike about that save button is that it always prompts for a download directory. I'd like my own save button that writes to a default directory.

Answer 1:

A Chrome extension can automate this. The skeleton below can be extended to cover more tags.

It adds an entry to the right-click context menu for the selection context [1].

The menu is added to the Chrome browser and becomes active once text is selected, as shown in this screenshot:

Context menu appearance after selecting text

[1] The selection context applies when text has been selected with the mouse.

Demonstration

See the jsfiddle: once the Chrome extension is installed, the selected text is annotated with the user-defined tags.

HTML Code Before

HTML Code After Selection

Select the text of an <li> in the jsfiddle output pane and apply a tag through the context menu added to Chrome; you can see that the DOM changes as well.

Code Reference

manifest.json

The manifest file binds the content script(s) and background page(s) to the extension.

{
    "name": "Annotation Tool",
    "description": "http://stackoverflow.com/questions/14244498/web-page-source-annotation-tool",
    "version": "1",
    "manifest_version": 2,
    "content_scripts": [
        {
            "matches": [
                "<all_urls>"
            ],
            "js": [
                "myscript.js"
            ],
            "all_frames": true
        }
    ],
    "permissions": [
        "contextMenus",
        "<all_urls>",
        "tabs"
    ],
    "background": {
        "scripts": [
            "background.js"
        ]
    },
    "icons": {
        "16": "screen.png",
        "48": "screen.png",
        "128": "screen.png"
    }
}

background.js

Creates the context menu entries and, when one is clicked, notifies the content script through message passing.

var _selection_univ = chrome.contextMenus.create({
    "title": "Add <univ> tag for %s ",
    "id": "_selection_univ",
    "onclick": reportclick,
    "contexts": ["selection"]
}, function () {
    console.log("Context Menu 1 Created");
});
var _selection_address = chrome.contextMenus.create({
    "title": "Add <address> tag for %s ",
    "id": "_selection_address",
    "onclick": reportclick,
    "contexts": ["selection"]
}, function () {
    console.log("Context Menu 2 Created");
});
//Add more menu entries here, one per tag
function reportclick(info, tab) {
    switch (info.menuItemId) {
        case "_selection_univ":
            chrome.tabs.sendMessage(tab.id, "univ");//Notify Content Script for univ
            break;
        case "_selection_address":
            chrome.tabs.sendMessage(tab.id, "address");//Notify Content Script for address
            break;
        default:
            console.log("Handle default case..");
    }
}

myscript.js

//Handle DOM changes here
//chrome.runtime.onMessage replaces the deprecated chrome.extension.onMessage
chrome.runtime.onMessage.addListener(function (message, sender, response) {
    //anchorNode is the standard name for the node where the selection starts
    //(baseNode is a non-standard WebKit alias)
    var node = document.getSelection().anchorNode;
    switch (message) {
        //Handle [univ] tag
        case "univ":
            if (node != null) node.parentNode.innerHTML = "[univ]" + node.parentNode.innerHTML + "[/univ]";
            break;
        //Handle [address] tag
        case "address":
            if (node != null) node.parentNode.innerHTML = "[address]" + node.parentNode.innerHTML + "[/address]";
            break;
        default:
            console.log("Handle default case..");
    }
});
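Note that the handler above wraps the entire content of the selection's parent element, not just the selected characters. As a rough illustration of wrapping only the selected span, here is a sketch that works on a plain string. The function name `wrapRange` is hypothetical, and it assumes the selection starts and ends inside a single text node, so that the standard `anchorOffset` and `focusOffset` values of the `Selection` object can be used as the two offsets.

```javascript
// Hypothetical helper: wrap the characters between two offsets in [tag]...[/tag].
function wrapRange(text, start, end, tag) {
    // A selection dragged right-to-left has its anchor after its focus,
    // so normalise the order before slicing.
    var from = Math.min(start, end);
    var to = Math.max(start, end);
    return text.slice(0, from) +
        "[" + tag + "]" + text.slice(from, to) + "[/" + tag + "]" +
        text.slice(to);
}

var line = "Visit Carnegie Mellon University today";
var wrapped = wrapRange(line, 6, 32, "univ");
console.log(wrapped);
```

In a content script the first argument would be the text of the selected node and the offsets would come from the current selection; that wiring is left out here.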

Further Extension

To add more context menu entries:

1) create a variable for context menu as shown here in background.js

var _selection_Some_Tag = chrome.contextMenus.create({
    "title": "Add [SOME TAG] tag for %s ",
    "id": "_selection_your_case", //Must be unique and match the case label in step 2
    "onclick": reportclick,
    "contexts": ["selection"]
}, function () {
    console.log("Context Menu for Some Tag Created");//In Call Back
});

2) add a case for switch in background page as shown here

case "_selection_your_case":
    chrome.tabs.sendMessage(tab.id, "_your_tag_content"); //Notify Content Script for your tag
    break;

3) handle your custom tag in content scripts by adding code as shown here

//Handle [your custom] tag (the label must match the message sent in step 2)
case "_your_tag_content":
    if (document.getSelection().baseNode != null) document.getSelection().baseNode.parentNode.innerHTML = "[your tag]" + document.getSelection().baseNode.parentNode.innerHTML + "[/your tag]";
    break;
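The three steps above repeat the same pattern for every tag, so they could also be driven from a single list. This is only a sketch of that idea, not part of the original skeleton: the `TAGS` array and the helper names are hypothetical, and the menu ids follow the `_selection_<tag>` convention used in background.js. The `chrome` calls are guarded so that the lookup logic can be tried outside the extension.

```javascript
// Hypothetical list: the only place that changes when a tag is added.
var TAGS = ["univ", "address"];

// Menu item ids follow the "_selection_<tag>" convention from background.js.
function menuIdForTag(tag) {
    return "_selection_" + tag;
}

// Returns the tag for a menu item id, or null when the id is not one of ours.
function tagForMenuId(menuItemId) {
    for (var i = 0; i < TAGS.length; i++) {
        if (menuIdForTag(TAGS[i]) === menuItemId) {
            return TAGS[i];
        }
    }
    return null;
}

// Runs only inside the extension's background page, where the chrome API exists.
if (typeof chrome !== "undefined" && chrome.contextMenus) {
    TAGS.forEach(function (tag) {
        chrome.contextMenus.create({
            "title": "Add [" + tag + "] tag for %s",
            "id": menuIdForTag(tag),
            "contexts": ["selection"],
            "onclick": function (info, tab) {
                var found = tagForMenuId(info.menuItemId);
                // Notify the content script with the tag name
                if (found !== null) chrome.tabs.sendMessage(tab.id, found);
            }
        });
    });
}
```

The content script still needs a matching case (or a generic handler) for each tag name it receives.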

Testing and Loading Extension

Check How to Load an Extension for testing and extending this script.

References

  • Chrome Extension
  • Background Page
  • Content Scripts
  • Context Menu
  • Message Passing
  • Extension API
  • DOM Selection

EDIT 1

The following Chrome extension code provides:

  • A toolbar instead of the context menu
  • Replacing only the selected text
  • Saving the file to a sandboxed location

To use this code, choose an icon for each tag (for example [univ]), put the icons in the extension directory, and reference their file names in the CSS file like this:

background-image: url(chrome-extension://__MSG_@@extension_id__/YOUR_ICON_NAME.png);

manifest.json

Registers the CSS and JavaScript for the annotation tool.

{
    "name": "Annotation Tool",
    "description": "http://stackoverflow.com/questions/14244498/web-page-source-annotation-tool",
    "version": "1",
    "manifest_version": 2,
    "content_scripts": [
        {
            "matches": [
                "<all_urls>"
            ],
            "css": [
                "myscript.css"
            ],
            "js": [
                "jquery.js",
                "myscript.js"
            ],
            "all_frames": true
        }
    ],
    "permissions": [
        "contextMenus",
        "<all_urls>",
        "tabs"
    ],
    "icons": {
        "16": "screen.png",
        "48": "screen.png",
        "128": "screen.png"
    },
    "web_accessible_resources": [
        "icon1.png",
        "icon2.png"
    ]
}

myscript.css

Binds the icons to the toolbar buttons.

#root #univ {
    display: inline-block;
    z-index: 100000;
    height: 22px;
    width: 26px;
    background-image: url(chrome-extension://__MSG_@@extension_id__/icon1.png);
}
#root #addr {
    display: inline-block;
    z-index: 100000;
    height: 22px;
    width: 26px;
    background-image: url(chrome-extension://__MSG_@@extension_id__/icon2.png);
}

myscript.js

Code for updating selected text with custom tags.

//Initialize flags to default values
var clicking = false;
var selecting = false;

//Set the toolbar to some invalid position so it will not appear unless a selection is made
var currentMousePos = {
    x: -100,
    y: -100
};

$(document).mousedown(function () {
    //Click is started
    clicking = true;
});

//Tool bar to add
$('body').append("<div id='root' style='position: absolute; left:" + currentMousePos.x + "px; top:" + currentMousePos.y + "px; display: block;'><a id='univ' href='javascript:void(0);'>&nbsp;</a><a id='addr' href='javascript:void(0);' >&nbsp;</a></div>");


$(document).mouseup(function (event) {
    if (selecting) {
        //The user is selecting text
        $("#root").attr("style", "position: absolute; left:" + currentMousePos.x + "px; top:" + currentMousePos.y + "px; display: block;");
    } else {
        //The user just clicked
        $("#root").attr("style", "display: none;");
    }
    //Reset flags
    clicking = false;
    selecting = false;
});

//The handler must receive the event explicitly rather than rely on the global window.event
$(document).mousemove(function (event) {
    if (clicking) {
        //The user did not simply click, but is selecting some text
        selecting = true;
        //Track current position to put toolbar
        currentMousePos.x = event.pageX;
        currentMousePos.y = event.pageY;
    }
});

//Replace the selected text with the same text wrapped in [tag]...[/tag]
function wrapSelection(tag) {
    //Get selected text
    var selection = document.getSelection();
    if (selection.getRangeAt && selection.rangeCount) {
        //Add your tags and prepare the replacement content
        var text = "[" + tag + "]" + selection.toString() + "[/" + tag + "]";
        //Chrome supports only one range, Firefox supports multiple ranges
        var range = selection.getRangeAt(0);
        //Remove the selection
        range.deleteContents();
        //Insert the replacement as a text node so that characters such as < or & are not parsed as HTML
        range.insertNode(document.createTextNode(text));
    }
}

$("div #addr").click(function () {
    wrapSelection("addr");
});

$("div #univ").click(function () {
    wrapSelection("univ");
});
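The toolbar markup appended in myscript.js is hard-coded for two tags. A sketch of generating it from a list instead, with a save button placed last as the question asks, might look like the following. The function name `buildToolbarHtml` and the `save` id are hypothetical, the save button has no behaviour attached here, and tag names are assumed to be simple identifiers that are safe to use as element ids.

```javascript
// Hypothetical helper: build the toolbar markup from an ordered list of tag names.
function buildToolbarHtml(tags) {
    var buttons = tags.map(function (tag) {
        return "<a id='" + tag + "' href='javascript:void(0);' title='" + tag + "'>&nbsp;</a>";
    });
    // The save button always comes last, after every tag button.
    buttons.push("<a id='save' href='javascript:void(0);' title='save'>&nbsp;</a>");
    // Hidden until a selection is made.
    return "<div id='root' style='position: absolute; display: none;'>" + buttons.join("") + "</div>";
}

var toolbarHtml = buildToolbarHtml(["univ", "addr"]);
console.log(toolbarHtml);
```

The resulting string could be passed to `$('body').append(...)` in place of the literal markup; each new id would also need an icon rule in myscript.css and a click handler.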

Output 1

Now you can replace any part of the text

Output 2

Replace any web page

Saving the file to a chosen location

It is possible to capture the page using the chrome.pageCapture API, but only to a sandboxed location.

Sample implementation using the pageCapture API

manifest.json

{
    "name": "Page Capture Demo",
    "description": "This demos Page Capture MHTML Functionality",
    "permissions": [
        "pageCapture"
    ],
    "browser_action": {
        "default_icon": "screen.png",
        "default_popup": "popup.html"
    },
    "manifest_version": 2,
    "version": "1"
}

popup.html

<html>

    <head>
        <script src="popup.js"></script>
    </head>

    <body>
        <div id="pushhere"></div>
    </body>

</html>

popup.js

function capture() {
    //Find the active tab of the current window once it has finished loading
    chrome.tabs.query({
        "active": true,
        "currentWindow": true,
        "status": "complete"
    }, function (tabs) {
        //Capture that tab as an MHTML blob
        chrome.pageCapture.saveAsMHTML({
            "tabId": tabs[0].id
        }, function (data) {
            var reader = new FileReader();
            reader.onload = function (event) {
                //Log the MHTML text and show it in the popup
                console.log(event.target.result);
                document.getElementById('pushhere').innerHTML = event.target.result;
                //window.open(event.target.result);
            };
            reader.readAsText(data);
            //window.open(data);
        });
    });
}
//Run the capture as soon as the popup opens
window.onload = capture;

Test this code by following the loading steps above, with icons of your choice. Hope this helps :)

Edit 2

  • Accessing the contents of HTML files, images, JS and CSS files is possible from a Chrome extension.
  • Accessing the local disk (reading and storing data on it) is not supported from Chrome extensions, for security reasons.
  • You can save files to a sandboxed location, which however is not generally accessible.


Answer 2:

A browser extension should be fine. A standalone application would need to utilize a full-fledged browser, which is cumbersome.

Only with a browser extension (in contrast to a bookmarklet) would you have permission to save the results directly to the filesystem. You can also add the "Annotate" button to the browser UI.
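One way this might look in Chrome is the `chrome.downloads.download` call, which is not mentioned elsewhere in this thread and needs the "downloads" permission in the manifest; treat both as assumptions. The sketch below only builds the options object, so it can be tried without the extension. The helper names and the `annotated` folder are hypothetical, and very large pages may need a different URL scheme than `data:`.

```javascript
// Hypothetical helper: derive a file name such as "www.cmu.edu_contact.html" from a page URL.
function fileNameForUrl(pageUrl) {
    // Drop the scheme, then any query string or fragment.
    var withoutScheme = pageUrl.replace(/^[a-z]+:\/\//i, "").replace(/[?#].*$/, "");
    // Replace runs of unsafe characters with "_" and trim "_" from both ends.
    var safe = withoutScheme.replace(/[^A-Za-z0-9.-]+/g, "_").replace(/^_+|_+$/g, "");
    return (safe || "page") + ".html";
}

// Builds the options object for chrome.downloads.download without calling it.
function buildDownloadOptions(pageUrl, html, folder) {
    return {
        url: "data:text/html;charset=utf-8," + encodeURIComponent(html),
        filename: folder + "/" + fileNameForUrl(pageUrl), // relative to the Downloads directory
        saveAs: false // do not prompt for a location
    };
}

// In a background page, after receiving pageUrl and html from the content script:
// chrome.downloads.download(buildDownloadOptions(pageUrl, html, "annotated"));

var options = buildDownloadOptions("http://www.cmu.edu/contact?x=1", "<p>[univ]CMU[/univ]</p>", "annotated");
console.log(options.filename);
```

The file name is relative to the browser's download directory, so this gives a fixed default location rather than an arbitrary path on disk.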

However, storing thousands of HTML files on your hard disk might not be your aim. Instead, you could set up a simple database server to which you post the annotation results. A small bookmarklet with some Ajax code would then be enough on the client side.
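A rough sketch of the client side of that idea follows. The endpoint URL, the payload shape and the function names are all hypothetical; the server would have to accept JSON and, for a bookmarklet running on arbitrary pages, allow cross-origin requests. The sender is passed in as a parameter so the logic can be exercised without a network.

```javascript
// Hypothetical payload shape: the page URL plus a list of {tag, text} annotations.
function buildAnnotationPayload(pageUrl, annotations) {
    return JSON.stringify({ url: pageUrl, annotations: annotations });
}

// Posts the payload. `send` defaults to the browser's fetch but can be replaced.
function postAnnotations(endpoint, pageUrl, annotations, send) {
    var doSend = send || fetch;
    return doSend(endpoint, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: buildAnnotationPayload(pageUrl, annotations)
    });
}

// Example with a stand-in sender that records the request instead of sending it.
var recorded = [];
postAnnotations(
    "http://localhost:8000/annotations", // hypothetical endpoint
    "http://www.cmu.edu/contact",
    [{ tag: "univ", text: "Carnegie Mellon University" }],
    function (url, request) {
        recorded.push({ url: url, request: request });
        return Promise.resolve({ ok: true });
    }
);
console.log(recorded[0].request.body);
```

Storing only the tag and text pairs keeps the server simple; whether the full page also needs to be stored depends on the later processing.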