How to get the website name from any URL string

I am given a string containing any valid URL. I have to find only the website name from the given URL, ignoring subdomains.

For example:

http://www.yahoo.com => yahoo
www.google.co.in => google
http://in.com => in
http://india.gov.in/ => india
https://in.yahoo.com/ => yahoo
http://philotheoristic.tumblr.com/ => tumblr
https://in.movies.yahoo.com/ => yahoo

How can I do this?

Source

2014-06-16 xrcwrn

Don't you know anything about string parsing or regular expressions? –

Regular expressions can help you:

String str = "www.google.co.in"; 
String [] res = str.split("(\\.|//)+(?=\\w)"); 
System.out.println(res[1]);

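A regular expression is a way of representing a set of strings: the set consists of every string that matches the expression. In the code above, the string passed to split is a regular expression that matches any "." followed by alphanumeric text, or "//" followed by alphanumeric text. These "." and "//" substrings are the delimiters used to split the string, and the piece at index 1 is taken as the website name.

In "www.google.co.in", the string is split into: www, google, co, in. Since the solution uses the element at index 1 of the split array, the result is: google.

Source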
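To see what this split produces on other inputs from the question, here is a small sketch that reuses the same pattern. The class name SplitDemo and the helper parts are made up for illustration and are not part of the answer. It suggests that index 1 holds the website name only when exactly one piece (a scheme or a single "www") comes before it.

```java
import java.util.Arrays;

public class SplitDemo {
    // Same pattern as in the answer: one or more "." or "//" delimiters
    // that are immediately followed by a word character.
    public static String[] parts(String url) {
        return url.split("(\\.|//)+(?=\\w)");
    }

    public static void main(String[] args) {
        String[] samples = {"www.google.co.in", "http://in.com", "http://www.yahoo.com"};
        for (String s : samples) {
            // Print every piece so the position of the name is visible.
            System.out.println(s + " -> " + Arrays.toString(parts(s)));
        }
    }
}
```

For "http://www.yahoo.com" the pieces are http:, www, yahoo, com, so index 1 is "www" rather than "yahoo"; inputs that carry both a scheme and a subdomain need extra handling.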

2014-06-16 05:30:00

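I wish I knew regular expressions as well as you do. Could you explain how your regular expression works, so I can learn something? –

@KickButtowski I edited my answer to include an explanation. –

Thanks. Do you know of any easy-to-understand regular expression tutorial for non-native speakers? –

You can make use of the URL class.

From the documentation - http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html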

import java.net.*;
import java.io.*;

public class ParseURL {
    public static void main(String[] args) throws MalformedURLException {

        URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                + "/index.html?name=networking#DOWNLOADING");

        System.out.println("protocol = " + aURL.getProtocol());
        System.out.println("authority = " + aURL.getAuthority());
        System.out.println("host = " + aURL.getHost());
        System.out.println("port = " + aURL.getPort());
        System.out.println("path = " + aURL.getPath());
        System.out.println("query = " + aURL.getQuery());
        System.out.println("filename = " + aURL.getFile());
        System.out.println("ref = " + aURL.getRef());
    }
}

Here is the output displayed by the program:

protocol = http 
authority = example.com:80 
host = example.com      // name of website 
port = 80 
path = /docs/books/tutorial/index.html 
query = name=networking 
filename = /docs/books/tutorial/index.html?name=networking 
ref = DOWNLOADING

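So by using aURL.getHost() you can get the host name. To ignore subdomains, you can split it on ".", so it becomes aURL.getHost().split("\\.")[0] to get the name. Note that split takes a regular expression, so the dot has to be escaped, and that index 0 is the leftmost label: it is the name for a host such as example.com, but the subdomain for a host such as www.yahoo.com.

Source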
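As a rough sketch of how the host could be reduced to the name for the inputs in the question, the following counts labels from the right instead of taking index 0. Everything here is an assumption made for illustration: the class SiteName, the method siteName, and in particular the tiny hard-coded TWO_LABEL_SUFFIXES set, which stands in for the full Public Suffix List a real application would need. It uses java.net.URI rather than URL, and prepends a scheme when the input has none so that the host can be parsed.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SiteName {
    // Hypothetical, deliberately incomplete list of two-label suffixes.
    private static final Set<String> TWO_LABEL_SUFFIXES =
            new HashSet<>(Arrays.asList("co.in", "gov.in", "co.uk"));

    public static String siteName(String input) {
        // URI only recognises the host when a scheme is present.
        String withScheme = input.contains("://") ? input : "http://" + input;
        String host = URI.create(withScheme).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in: " + input);
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        if (labels.length < 2) {
            return labels[0];
        }
        // Decide whether the suffix is one label ("com") or two ("co.in").
        String lastTwo = labels[labels.length - 2] + "." + labels[labels.length - 1];
        int suffixLabels = TWO_LABEL_SUFFIXES.contains(lastTwo) ? 2 : 1;
        // The name is the label immediately to the left of the suffix.
        int index = Math.max(labels.length - suffixLabels - 1, 0);
        return labels[index];
    }

    public static void main(String[] args) {
        System.out.println(siteName("https://in.movies.yahoo.com/"));
        System.out.println(siteName("www.google.co.in"));
    }
}
```

Any suffix missing from the set (for example "com.au") would be treated as a single label and give the wrong answer, which is why a maintained suffix list is the safer basis for anything beyond a quick script.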

2014-06-16 05:30:19

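Nice answer, but how would you end up with just "example"? –

There is no possible way to work out a valid website name from a URL. However, if you are trying to cut out a specific part of the URL string, you can do it with string manipulation as follows: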

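if (url.endsWith(".co.in")) {
    // index at which the ".co.in" suffix starts
    int suffixStart = url.length() - ".co.in".length();
    // the dot just before the site name, or -1 if there is none
    int previousDot = url.lastIndexOf('.', suffixStart - 1);
    website = url.substring(previousDot + 1, suffixStart);
}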

Source

2014-06-16 06:20:32

I found something similar that does this, although it is a little different.

http://www.yahoo.com => Yahoo 
http://www.google.co.in =>  Google 
http://in.com => In.com Offers Videos, News, Photos, Celebs, Live TV Channels..... 
http://india.gov.in/ => National Portal of India 
https://in.yahoo.com/ => Yahoo India 
http://philotheoristic.tumblr.com/ => Philotheoristic 
https://in.movies.yahoo.com/ => Yahoo India Movies - Bollywood News, Movie Reviews & Hindi Movie Videos

Here is the code:

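import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {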
/* the CASE_INSENSITIVE flag accounts for 
* sites that use uppercase title tags. 
* the DOTALL flag accounts for sites that have 
* line feeds in the title text */ 
private static final Pattern TITLE_TAG = 
    Pattern.compile("\\<title>(.*)\\</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL); 

/** 
* @param url the HTML page 
* @return title text (null if document isn't HTML or lacks a title tag) 
* @throws IOException 
*/ 
public static String getPageTitle(String url) throws IOException { 
    URL u = new URL(url); 
    URLConnection conn = u.openConnection(); 

    // ContentType is an inner class defined below 
    ContentType contentType = getContentTypeHeader(conn); 
    if (contentType == null || !contentType.contentType.equals("text/html")) 
     return null; // don't continue if there is no Content-Type header or it is not HTML 
    else { 
     // determine the charset, or use the default 
     Charset charset = getCharset(contentType); 
     if (charset == null) 
      charset = Charset.defaultCharset(); 

     // read the response body, using BufferedReader for performance 
     InputStream in = conn.getInputStream(); 
     BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset)); 
     int n = 0, totalRead = 0; 
     char[] buf = new char[1024]; 
     StringBuilder content = new StringBuilder(); 

     // read until EOF or first 8192 characters 
     while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) { 
      content.append(buf, 0, n); 
      totalRead += n; 
     } 
     reader.close(); 

     // extract the title 
     Matcher matcher = TITLE_TAG.matcher(content); 
     if (matcher.find()) { 
      /* replace any occurrences of whitespace (which may 
      * include line feeds and other uglies) as well 
      * as HTML brackets with a space */ 
      return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim(); 
     } 
     else 
      return null; 
    } 
} 

/** 
* Loops through response headers until Content-Type is found. 
* @param conn 
* @return ContentType object representing the value of 
* the Content-Type header 
*/ 
private static ContentType getContentTypeHeader(URLConnection conn) { 
    int i = 0; 
    boolean moreHeaders = true; 
    do { 
     String headerName = conn.getHeaderFieldKey(i); 
     String headerValue = conn.getHeaderField(i); 
     if (headerName != null && headerName.equals("Content-Type")) 
      return new ContentType(headerValue); 

     i++; 
     moreHeaders = headerName != null || headerValue != null; 
    } 
    while (moreHeaders); 

    return null; 
} 

private static Charset getCharset(ContentType contentType) { 
    if (contentType != null && contentType.charsetName != null && Charset.isSupported(contentType.charsetName)) 
     return Charset.forName(contentType.charsetName); 
    else 
     return null; 
} 

/** 
* Class holds the content type and charset (if present) 
*/ 
private static final class ContentType { 
    private static final Pattern CHARSET_HEADER = Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL); 

    private String contentType; 
    private String charsetName; 
    private ContentType(String headerValue) { 
     if (headerValue == null) 
      throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue"); 
     int n = headerValue.indexOf(";"); 
     if (n != -1) { 
      contentType = headerValue.substring(0, n); 
      Matcher matcher = CHARSET_HEADER.matcher(headerValue); 
      if (matcher.find()) 
       charsetName = matcher.group(1); 
     } 
     else 
      contentType = headerValue; 
    } 
} 
}

Using this class is simple:

String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/"); 
System.out.println(title);

Here is the link:

http://www.gotoquiz.com/web-coding/programming/java-programming/how-to-extract-titles-from-web-pages-in-java/

I hope it helps you.

Source

2014-06-16 08:18:29 sona
