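How do I split a URL? This is my code for splitting URLs, but it has a problem. Every link comes out with a doubled path segment, for example www.utem.edu.my/portal/portal. The /portal segment always appears twice in every link. Any suggestions on how I should extract the links from a web page?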
public String crawlURL(String strUrl) {
    String results = ""; // For return
    String protocol = "http://";
    // Assigns the input to the inURL variable and prepends http:// if no scheme is given
    String inURL = strUrl;
    if (!inURL.toLowerCase().startsWith("http://") &&
            !inURL.toLowerCase().startsWith("https://")) {
        inURL = protocol + inURL;
    }
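    // Pulls URL contents from the web
    String contectURL = pullURL(inURL);
    // If it fails over http, retry with https. Compare with isEmpty(), not ==,
    // because == tests reference identity rather than string contents.
    if (contectURL.isEmpty() && inURL.toLowerCase().startsWith("http://")) {
        protocol = "https://";
        inURL = protocol + inURL.substring("http://".length());
        contectURL = pullURL(inURL);
    }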
    // Declares some variables to be used inside the loop
    String aTagAttr = "";
    String href = "";
    String msg = "";
    // Finds A tags and stores their href values in the output variable
    String bodyTag = contectURL.split("<body")[1]; // Find 1st <body>
    String[] aTags = bodyTag.split(">"); // Splits on every tag
    // To tell the links apart from one another
    int index = 0;
    for (String s : aTags) {
        // Process only if it is an A tag and contains href
        if (s.toLowerCase().contains("<a") && s.toLowerCase().contains("href")) {
            aTagAttr = s.split("href")[1]; // Split on href
            // Split on space if it contains it
            // (contains() looks for the literal text \s, not a regex, so this
            // branch does not run for ordinary whitespace)
            if (aTagAttr.toLowerCase().contains("\\s"))
                aTagAttr = aTagAttr.split("\\s")[2];
            // Splits on the link and deals with " or ' quotes
            href = aTagAttr.split(((aTagAttr.toLowerCase().contains("\"")) ? "\"" : "\'"))[1];
            if (!results.toLowerCase().contains(href))
            //results += "~~~ " + href + "\r\n";
            /*
             * Last touches to URL before display
             * Adds http(s):// if not exist
             * Adds base url if not exist
             */
            if (results.toLowerCase().indexOf("about") != -1) {
                // Contains 'about'
            }
            if (!href.toLowerCase().contains("http://") && !href.toLowerCase().contains("https://")) {
                // http:// + baseURL + href
                if (!href.toLowerCase().contains(inURL.split("://")[1]))
                    href = protocol + inURL.split("://")[1] + href;
                else
                    href = protocol + href;
            }
            System.out.println(href); // debug
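The doubled /portal most likely comes from the step that builds href by concatenation. If the page URL is http://www.utem.edu.my/portal and the page links to /portal/news, then protocol + base + href yields www.utem.edu.my/portal/portal/news. A root-relative href should replace the base path, not be appended to it. Below is a minimal sketch of resolving links with java.net.URI instead. The class name LinkResolver, its method names, and the simplified href regex are my own assumptions for illustration, not part of the original code, and a regex is not a substitute for a real HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the question): resolves hrefs against the page URL.
class LinkResolver {
    // Simplified pattern for href="..." or href='...' inside an <a ...> tag.
    // Assumption: good enough for a sketch, not a full HTML parser.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Resolves one href against the URL of the page it was found on.
    // Absolute hrefs are returned as they are; relative ones follow RFC 3986 rules.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    // Collects the distinct absolute links found in the given HTML.
    static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                String absolute = resolve(pageUrl, m.group(2));
                if (!links.contains(absolute)) {
                    links.add(absolute);
                }
            } catch (IllegalArgumentException e) {
                // URI.create rejects malformed hrefs (e.g. containing spaces); skip them
            }
        }
        return links;
    }
}
```

With this approach, LinkResolver.resolve("http://www.utem.edu.my/portal", "/portal/news") gives http://www.utem.edu.my/portal/news, with /portal appearing only once, and an href that is already absolute passes through unchanged.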
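You have 'if (!results.toLowerCase().contains(href)) //results += "~~~ " + href + "\r\n";'. This will cause errors, because the if now applies to a different part of the code, rather than doing nothing just because its body is commented out. – martijnn2008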
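To illustrate the point in the comment: in Java, an if without braces controls the next statement, and comments do not count as statements. Commenting out the intended body therefore attaches the if to whatever statement follows. A small sketch of the effect, where the class DanglingIfDemo and the logged strings are made up for the demonstration:

```java
// Hypothetical demo (not from the question) of an if whose body was commented out.
class DanglingIfDemo {
    // Same shape as the question's code: no braces, intended body commented out.
    static String withoutBraces(boolean alreadySeen) {
        StringBuilder log = new StringBuilder();
        if (!alreadySeen)
            // log.append("collect;");  <- commented out, so the if binds to the next statement
        log.append("fix-up;");
        log.append("print;");
        return log.toString();
    }

    // With braces, the if keeps its (now empty) body and nothing else is affected.
    static String withBraces(boolean alreadySeen) {
        StringBuilder log = new StringBuilder();
        if (!alreadySeen) {
            // log.append("collect;");
        }
        log.append("fix-up;");
        log.append("print;");
        return log.toString();
    }
}
```

Here withoutBraces(true) returns "print;" because the fix-up step is silently skipped, while withBraces(true) returns "fix-up;print;". Always writing the braces avoids this kind of surprise when a line is commented out for debugging.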