How to get name of website from any string url [cl

Posted 2019-02-11 12:55

I am given a String that contains a valid URL. I need to extract only the name of the website from that URL, ignoring any sub domains.

For example:

http://www.yahoo.com   =>    yahoo
www.google.co.in =>      google
http://in.com    =>      in
http://india.gov.in/ => india
https://in.yahoo.com/ => yahoo
http://philotheoristic.tumblr.com/  =>tumblr
https://in.movies.yahoo.com/        =>yahoo

How can I do this?

4 answers
\"骚年 ilove
#2 · 2019-02-11 13:42

Regular expressions may help you:

 String str = "www.google.co.in";
 String [] res = str.split("(\\.|//)+(?=\\w)");
 System.out.println(res[1]);

A regular expression is a way to represent a set of strings: every string that matches the expression belongs to the set. In the code above, the argument to split is a regular expression that matches one or more "." or "//" in a row, provided the run is followed by a word character (a letter, digit or underscore). These "." and "//" substrings are the separators used to split the string into parts, and the code takes the part at index 1 as the site name.

In "www.google.co.in", the string is split this way: www, google, co, in. Since the solution uses the second element of the split array (index 1), the result is: google. Note that this only works when the name happens to be the second part; "http://www.yahoo.com" splits into http:, www, yahoo, com, so index 1 gives www.
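To see what the split actually produces for a few of the URLs from the question, here is a small sketch. The class name SplitDemo and the helper parts are made up for illustration; the pattern is the one from the answer.

```java
import java.util.Arrays;

final class SplitDemo {
    // Same pattern as in the answer: a run of "." or "//" separators
    // that is followed by a word character.
    static String[] parts(String url) {
        return url.split("(\\.|//)+(?=\\w)");
    }

    public static void main(String[] args) {
        String[] samples = {"www.google.co.in", "http://in.com", "http://www.yahoo.com"};
        for (String s : samples) {
            // Print every part so it is clear which index holds the name.
            System.out.println(s + " -> " + Arrays.toString(parts(s)));
        }
    }
}
```

Index 1 holds the name for the first two samples but holds www for the third, so the index would have to depend on the shape of the input.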

smile是对你的礼貌
#3 · 2019-02-11 13:52

There is no reliable way to work out the website name from a URL alone. But if you just want to cut out a particular part of the URL string, you can do it with string operations, as follows:

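if (url.endsWith(".co.in")) {
    // assumes url is a bare host such as "www.google.co.in"
    int suffixStart = url.length() - ".co.in".length(); // where ".co.in" begins
    int nameStart = url.lastIndexOf('.', suffixStart - 1) + 1; // after the previous dot, or 0
    website = url.substring(nameStart, suffixStart);
}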
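The same idea can be extended to more than one suffix. The sketch below is only an illustration: the class SuffixStrip, the method nameOf and the four-entry suffix list are assumptions, and a real list of suffixes would be far longer. It expects a bare host name, not a full URL.

```java
import java.util.Arrays;
import java.util.List;

final class SuffixStrip {
    // Hypothetical, deliberately tiny suffix list. Longer suffixes come first
    // so that ".co.in" is tried before ".in".
    static final List<String> SUFFIXES = Arrays.asList(".co.in", ".gov.in", ".com", ".in");

    /** Returns the label just before the first matching suffix, or null if none matches. */
    static String nameOf(String host) {
        for (String suffix : SUFFIXES) {
            if (host.endsWith(suffix)) {
                String rest = host.substring(0, host.length() - suffix.length());
                // Keep only the text after the last remaining dot (drops sub domains).
                return rest.substring(rest.lastIndexOf('.') + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(nameOf("www.google.co.in"));
        System.out.println(nameOf("in.movies.yahoo.com"));
    }
}
```

The result is only as good as the suffix list, which is why this cannot be called a general solution.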
戒情不戒烟
#4 · 2019-02-11 13:53

You can make use of the java.net.URL class.

From Documentation - http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html

import java.net.*;
import java.io.*;

public class ParseURL {
    public static void main(String[] args) throws MalformedURLException {

        URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                           + "/index.html?name=networking#DOWNLOADING");

        System.out.println("protocol = " + aURL.getProtocol());
        System.out.println("authority = " + aURL.getAuthority());
        System.out.println("host = " + aURL.getHost());
        System.out.println("port = " + aURL.getPort());
        System.out.println("path = " + aURL.getPath());
        System.out.println("query = " + aURL.getQuery());
        System.out.println("filename = " + aURL.getFile());
        System.out.println("ref = " + aURL.getRef());
    }
}

Here is the output displayed by the program:

protocol = http
authority = example.com:80
host = example.com                     // name of website
port = 80
path = /docs/books/tutorial/index.html
query = name=networking
filename = /docs/books/tutorial/index.html?name=networking
ref = DOWNLOADING

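So by using aURL.getHost() you can get the host name of the website. To ignore sub domains you can split it on ".". Because split takes a regular expression, the dot must be escaped: aURL.getHost().split("\\.") returns the individual labels, whereas split(".") matches every character and returns an empty array. Note that index 0 is the name only when the host has no sub domain; for www.yahoo.com it is www.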
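A rough sketch of the host-then-split idea, for inputs with or without a scheme. It uses java.net.URI rather than URL so that no checked exception is involved; the class HostLabels and both method names are made up for illustration, and prepending "http://" to scheme-less input is an assumption about what the caller wants.

```java
import java.net.URI;

final class HostLabels {
    /** Returns the host part, adding "http://" first when the input has no scheme. */
    static String hostOf(String url) {
        String withScheme = url.contains("://") ? url : "http://" + url;
        // URI.create throws IllegalArgumentException on malformed input;
        // getHost() may return null if no host can be parsed.
        return URI.create(withScheme).getHost();
    }

    /** Splits a host into its dot-separated labels. */
    static String[] labelsOf(String host) {
        return host.split("\\."); // escaped dot, because split() takes a regex
    }

    public static void main(String[] args) {
        for (String label : labelsOf(hostOf("https://in.movies.yahoo.com/"))) {
            System.out.println(label);
        }
    }
}
```

Which label is the website name still has to be decided separately, since it depends on how many labels belong to the suffix (com versus co.in).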

家丑人穷心不美
#5 · 2019-02-11 13:57

I found something similar, although it is somewhat different: it returns the title of the page rather than a name taken from the URL.

http://www.yahoo.com   =>    Yahoo
http://www.google.co.in =>      Google
http://in.com    => In.com Offers Videos, News, Photos, Celebs, Live TV Channels.....
http://india.gov.in/ => National Portal of India
https://in.yahoo.com/ => Yahoo India
http://philotheoristic.tumblr.com/  => Philotheoristic
https://in.movies.yahoo.com/ => Yahoo India Movies - Bollywood News, Movie Reviews &    Hindi Movie Videos

Here is the code:

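import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {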
/* the CASE_INSENSITIVE flag accounts for
 * sites that use uppercase title tags.
 * the DOTALL flag accounts for sites that have
 * line feeds in the title text */
private static final Pattern TITLE_TAG =
    Pattern.compile("\\<title>(.*)\\</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);

/**
 * @param url the HTML page
 * @return title text (null if document isn't HTML or lacks a title tag)
 * @throws IOException
 */
public static String getPageTitle(String url) throws IOException {
    URL u = new URL(url);
    URLConnection conn = u.openConnection();

    // ContentType is an inner class defined below
    ContentType contentType = getContentTypeHeader(conn);
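    // getContentTypeHeader returns null when the response has no Content-Type header
    if (contentType == null || !contentType.contentType.equals("text/html"))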
        return null; // don't continue if not HTML
    else {
        // determine the charset, or use the default
        Charset charset = getCharset(contentType);
        if (charset == null)
            charset = Charset.defaultCharset();

        // read the response body, using BufferedReader for performance
        InputStream in = conn.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        int n = 0, totalRead = 0;
        char[] buf = new char[1024];
        StringBuilder content = new StringBuilder();

        // read until EOF or first 8192 characters
        while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) {
            content.append(buf, 0, n);
            totalRead += n;
        }
        reader.close();

        // extract the title
        Matcher matcher = TITLE_TAG.matcher(content);
        if (matcher.find()) {
            /* replace any occurrences of whitespace (which may
             * include line feeds and other uglies) as well
             * as HTML brackets with a space */
            return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim();
        }
        else
            return null;
    }
}

/**
 * Loops through response headers until Content-Type is found.
 * @param conn
 * @return ContentType object representing the value of
 * the Content-Type header
 */
private static ContentType getContentTypeHeader(URLConnection conn) {
    int i = 0;
    boolean moreHeaders = true;
    do {
        String headerName = conn.getHeaderFieldKey(i);
        String headerValue = conn.getHeaderField(i);
        // HTTP header names are case-insensitive
        if (headerName != null && headerName.equalsIgnoreCase("Content-Type"))
            return new ContentType(headerValue);

        i++;
        moreHeaders = headerName != null || headerValue != null;
    }
    while (moreHeaders);

    return null;
}

private static Charset getCharset(ContentType contentType) {
    if (contentType != null && contentType.charsetName != null && Charset.isSupported(contentType.charsetName))
        return Charset.forName(contentType.charsetName);
    else
        return null;
}

/**
 * Class holds the content type and charset (if present)
 */
private static final class ContentType {
    private static final Pattern CHARSET_HEADER = Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);

    private String contentType;
    private String charsetName;
    private ContentType(String headerValue) {
        if (headerValue == null)
            throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue");
        int n = headerValue.indexOf(";");
        if (n != -1) {
            contentType = headerValue.substring(0, n);
            Matcher matcher = CHARSET_HEADER.matcher(headerValue);
            if (matcher.find())
                charsetName = matcher.group(1);
        }
        else
            contentType = headerValue;
    }
}
}

Making use of this class is simple:

 String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/");
 System.out.println(title);
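To try the title pattern without any network access, it can be applied to an HTML string held in memory. This is only a sketch: TitleRegexDemo and titleOf are illustrative names, and the reluctant (.*?) is my own variation, not what TitleExtractor above uses (it has a greedy (.*)).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleRegexDemo {
    // CASE_INSENSITIVE accepts <TITLE>; DOTALL lets "." match line feeds.
    // The reluctant (.*?) stops at the first closing tag.
    static final Pattern TITLE_TAG =
        Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the cleaned-up title text, or null if there is no title tag. */
    static String titleOf(String html) {
        Matcher m = TITLE_TAG.matcher(html);
        // Collapse runs of whitespace and stray angle brackets into one space.
        return m.find() ? m.group(1).replaceAll("[\\s<>]+", " ").trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(titleOf("<html><HEAD><TITLE>Yahoo\n  India</TITLE></HEAD></html>"));
    }
}
```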

Here is the link:

http://www.gotoquiz.com/web-coding/programming/java-programming/how-to-extract-titles-from-web-pages-in-java/

I hope this helps.
