2014-06-16 129 views
0

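How to get the website name from any URL string

I am given a string that contains a valid URL. I have to extract only the website name from it, ignoring any subdomains.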

http://www.yahoo.com => yahoo 
www.google.co.in =>  google 
http://in.com =>  in 
http://india.gov.in/ => india 
https://in.yahoo.com/ => yahoo 
http://philotheoristic.tumblr.com/ =>tumblr 
https://in.movies.yahoo.com/  =>yahoo 

How can I do this?

+1

Don't you know anything about string parsing or regular expressions? –

Answers

2

A regular expression can help you:

String str = "www.google.co.in"; 
String [] res = str.split("(\\.|//)+(?=\\w)"); 
System.out.println(res[1]); 

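A regular expression is a way of describing a set of strings: the set consists of every string that matches the expression. In the code above, the string passed to split is a regular expression that matches one or more "." or "//" delimiters, but only where a word character follows. Those "." and "//" substrings are the separators used to split the string.

For "www.google.co.in" the string is split into: www, google, co, in. The solution takes the element at index 1 of the split array, so the result is: google. Note that this relies on the name being the second piece; for "http://www.yahoo.com" the pieces are http:, www, yahoo, com, and index 1 is www.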
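To see what the split actually produces, here is a small sketch that prints the pieces for a few of the URLs from the question. The class name `SplitDemo` and the helper `pieces` are made up for illustration; the regex is the one from the answer.

```java
import java.util.Arrays;
import java.util.List;

final class SplitDemo {
    // Same regex as in the answer: one or more "." or "//" delimiters,
    // matched only where a word character follows (the lookahead).
    static List<String> pieces(String url) {
        return Arrays.asList(url.split("(\\.|//)+(?=\\w)"));
    }

    public static void main(String[] args) {
        String[] urls = {"www.google.co.in", "http://in.com", "http://www.yahoo.com"};
        for (String url : urls) {
            System.out.println(url + " -> " + pieces(url));
        }
    }
}
```

Index 1 holds the name for the first two inputs, but for the third it holds "www", so a fixed index is only reliable when you know the shape of the input in advance.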

+0

I wish I understood regular expressions as well as you do. Could you explain how your regular expression works, so I can learn something? –

+1

@KickButtowski I edited my answer to include an explanation. –

+0

Thanks. Do you know of any regular expression tutorial that is easy for a foreigner to understand? –

2

You can make use of the URL class.

From the documentation - http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html

import java.net.*; 
import java.io.*; 

public class ParseURL { 
    public static void main(String[] args) throws MalformedURLException { 

     URL aURL = new URL("http://example.com:80/docs/books/tutorial" 
          + "/index.html?name=networking#DOWNLOADING"); 

     System.out.println("protocol = " + aURL.getProtocol()); 
     System.out.println("authority = " + aURL.getAuthority()); 
     System.out.println("host = " + aURL.getHost()); 
     System.out.println("port = " + aURL.getPort()); 
     System.out.println("path = " + aURL.getPath()); 
     System.out.println("query = " + aURL.getQuery()); 
     System.out.println("filename = " + aURL.getFile()); 
     System.out.println("ref = " + aURL.getRef()); 
    } 
} 

Here is the output displayed by the program:

protocol = http 
authority = example.com:80 
host = example.com      // name of website 
port = 80 
path = /docs/books/tutorial/index.html 
query = name=networking 
filename = /docs/books/tutorial/index.html?name=networking 
ref = DOWNLOADING 

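So aURL.getHost() gives you the host name. To cut it down to a single label, split the host on the dot. String.split takes a regular expression, so the dot has to be escaped: aURL.getHost().split("\\.")[0]. This returns the first label, which is the website name only when the host has no subdomain (example for example.com, but www for www.yahoo.com).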
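One way to handle subdomains and suffixes such as co.in is to count labels from the right instead of the left. The sketch below is only an illustration under stated assumptions: the class `SiteName`, the method `of`, and the tiny `TWO_LABEL_SUFFIXES` set are all hypothetical, and it uses java.net.URI rather than URL for parsing. A robust version would consult the Public Suffix List instead of a hard-coded set.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SiteName {
    // Hypothetical, deliberately tiny list of two-label suffixes.
    // A real implementation would use the Public Suffix List instead.
    private static final Set<String> TWO_LABEL_SUFFIXES =
            new HashSet<>(Arrays.asList("co.in", "gov.in", "co.uk"));

    static String of(String input) throws URISyntaxException {
        // URI only reports a host when a scheme is present, so add one if missing.
        String withScheme = input.contains("://") ? input : "http://" + input;
        String host = new URI(withScheme).getHost();
        if (host == null) {
            throw new URISyntaxException(input, "no host found");
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        int n = labels.length;
        if (n < 2) {
            return labels[0]; // single-label host such as "localhost"
        }
        String lastTwo = labels[n - 2] + "." + labels[n - 1];
        // Skip two labels for suffixes like "co.in", otherwise one (like "com").
        int suffixLabels = (n >= 3 && TWO_LABEL_SUFFIXES.contains(lastTwo)) ? 2 : 1;
        return labels[n - 1 - suffixLabels];
    }

    public static void main(String[] args) throws URISyntaxException {
        String[] urls = {"http://www.yahoo.com", "www.google.co.in", "http://in.com",
                "http://india.gov.in/", "https://in.movies.yahoo.com/"};
        for (String url : urls) {
            System.out.println(url + " => " + of(url));
        }
    }
}
```

With this small suffix set the sketch reproduces the mappings listed in the question; any suffix that is not in the set (for example com.au) would be handled incorrectly, which is why a maintained suffix list matters.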

+0

Nice answer, but how would you end up with just "example"? –

0

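There is no general way to work out a valid website name from a URL. However, if you are just trying to cut a specific part out of the URL string, you can do it with string manipulation, as follows: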

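// Assumes url is a bare host such as "www.google.co.in" (no scheme or path).
if (url.endsWith(".co.in")) {
    // The name ends at the dot that starts ".co.in"
    int end = url.length() - ".co.in".length();
    // ...and starts just after the previous dot (or at 0 if there is none)
    int start = url.lastIndexOf('.', end - 1) + 1;
    website = url.substring(start, end);
}
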
0

I found something similar that does this, although the result is a little different.

http://www.yahoo.com => Yahoo 
http://www.google.co.in =>  Google 
http://in.com => In.com Offers Videos, News, Photos, Celebs, Live TV Channels..... 
http://india.gov.in/ => National Portal of India 
https://in.yahoo.com/ => Yahoo India 
http://philotheoristic.tumblr.com/ => Philotheoristic 
https://in.movies.yahoo.com/ => Yahoo India Movies - Bollywood News, Movie Reviews & Hindi Movie Videos 

Here is the code:

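import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor { 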
/* the CASE_INSENSITIVE flag accounts for 
* sites that use uppercase title tags. 
* the DOTALL flag accounts for sites that have 
* line feeds in the title text */ 
// (.*?) is reluctant, so the match ends at the first closing title tag
private static final Pattern TITLE_TAG = 
    Pattern.compile("\\<title>(.*?)\\</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL); 

/** 
* @param url the HTML page 
* @return title text (null if document isn't HTML or lacks a title tag) 
* @throws IOException 
*/ 
public static String getPageTitle(String url) throws IOException { 
    URL u = new URL(url); 
    URLConnection conn = u.openConnection(); 

    // ContentType is an inner class defined below 
    ContentType contentType = getContentTypeHeader(conn); 
    // getContentTypeHeader returns null when the header is missing
    if (contentType == null || !contentType.contentType.equals("text/html")) 
     return null; // don't continue if not HTML 
    else { 
     // determine the charset, or use the default 
     Charset charset = getCharset(contentType); 
     if (charset == null) 
      charset = Charset.defaultCharset(); 

     // read the response body, using BufferedReader for performance 
     InputStream in = conn.getInputStream(); 
     BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset)); 
     int n = 0, totalRead = 0; 
     char[] buf = new char[1024]; 
     StringBuilder content = new StringBuilder(); 

     // read until EOF or first 8192 characters 
     while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) { 
      content.append(buf, 0, n); 
      totalRead += n; 
     } 
     reader.close(); 

     // extract the title 
     Matcher matcher = TITLE_TAG.matcher(content); 
     if (matcher.find()) { 
      /* replace any occurrences of whitespace (which may 
      * include line feeds and other uglies) as well 
      * as HTML brackets with a space */ 
      return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim(); 
     } 
     else 
      return null; 
    } 
} 

/** 
* Loops through response headers until Content-Type is found. 
* @param conn 
* @return ContentType object representing the value of 
* the Content-Type header 
*/ 
private static ContentType getContentTypeHeader(URLConnection conn) { 
    int i = 0; 
    boolean moreHeaders = true; 
    do { 
     String headerName = conn.getHeaderFieldKey(i); 
     String headerValue = conn.getHeaderField(i); 
     if (headerName != null && headerName.equals("Content-Type")) 
      return new ContentType(headerValue); 

     i++; 
     moreHeaders = headerName != null || headerValue != null; 
    } 
    while (moreHeaders); 

    return null; 
} 

private static Charset getCharset(ContentType contentType) { 
    if (contentType != null && contentType.charsetName != null && Charset.isSupported(contentType.charsetName)) 
     return Charset.forName(contentType.charsetName); 
    else 
     return null; 
} 

/** 
* Class holds the content type and charset (if present) 
*/ 
private static final class ContentType { 
    private static final Pattern CHARSET_HEADER = Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL); 

    private String contentType; 
    private String charsetName; 
    private ContentType(String headerValue) { 
     if (headerValue == null) 
      throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue"); 
     int n = headerValue.indexOf(";"); 
     if (n != -1) { 
      contentType = headerValue.substring(0, n); 
      Matcher matcher = CHARSET_HEADER.matcher(headerValue); 
      if (matcher.find()) 
       charsetName = matcher.group(1); 
     } 
     else 
      contentType = headerValue; 
    } 
} 
} 

Using this class is simple:

String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/"); 
System.out.println(title); 
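To try the title pattern without making a network request, the following sketch applies the same idea to an in-memory HTML string. The class `TitlePatternDemo`, the method `titleOf`, and the sample markup are invented for illustration; the sketch also uses a reluctant quantifier so that the match stops at the first closing tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitlePatternDemo {
    // CASE_INSENSITIVE accepts <TITLE>; DOTALL lets "." match line feeds.
    private static final Pattern TITLE_TAG = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the normalized title text, or null when no title tag is found.
    static String titleOf(String html) {
        Matcher m = TITLE_TAG.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Collapse runs of whitespace (including line feeds) into single spaces.
        return m.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>\n  Yahoo\n  India </TITLE></head></html>";
        System.out.println(titleOf(html)); // prints "Yahoo India"
    }
}
```

Keep in mind that a page title describes the page, not the domain, so it answers a slightly different question from the one asked.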

Here is the link:

http://www.gotoquiz.com/web-coding/programming/java-programming/how-to-extract-titles-from-web-pages-in-java/

I hope this helps you.