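How do I retrieve a URL from a link on a website using Jsoup?
Okay, I've finished my Yelp scanner and everything works well. What I want to do now is have the program retrieve the URL of every business link on each results page, open that page, and check whether it contains: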
xlink:href="#30x30_bullhorn"></use>
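I have a pretty good idea of how I would go about this, but I can't seem to find a Jsoup method that retrieves the URL of a link. Is there somewhere in the page's HTML that holds the URL? I'm not very proficient with HTML, so 90% of what I see is gibberish. Here is an example link, in case you want to see what I'm referring to.
https://www.yelp.com/search?find_loc=nj&start=10 is the main page, and I need to get the URL of the page https://www.yelp.com/biz/la-cocina-newark. The orange bullhorn is what I'm trying to get the program to detect. Here is my code, by the way: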
import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class YelpScrapper
{
    public static void main(String[] args) throws IOException
    {
        //Variables
        String description;
        String location;
        int pages;
        int parseCount = 0;
        Document document;
        Scanner keyboard = new Scanner(System.in);

        //Perform a Search
        System.out.print("Enter a description: ");
        description = keyboard.nextLine();
        System.out.print("Enter a state: ");
        location = keyboard.nextLine();
        System.out.print("How many pages should we scan? ");
        pages = keyboard.nextInt();
        String descString = "find_desc=" + description.replace(' ', '+') + "&";
        String locString = "find_loc=" + location.replace(' ', '+') + "&";
        int number = 0;
        ArrayList<String> names = new ArrayList<String>();
        ArrayList<String> address = new ArrayList<String>();
        ArrayList<String> phone = new ArrayList<String>();

        //Fetch Data From Yelp
        for (int i = 0 ; i < pages ; i++)
        {
            //Rebuild the URL on every pass so that "start" advances with each page
            String url = "https://www.yelp.com/search?" + descString + locString + "start=" + number;
            document = Jsoup.connect(url).get();
            Elements nameElements = document.select(".indexed-biz-name span");
            Elements addressElements = document.select(".secondary-attributes address");
            Elements phoneElements = document.select(".biz-phone");
            for (Element element : nameElements)
            {
                names.add(element.text());
            }
            for (Element element : addressElements)
            {
                address.add(element.text());
            }
            for (Element element : phoneElements)
            {
                phone.add(element.text());
            }

            //Print only complete leads, so a page with fewer than 10 results
            //cannot cause an IndexOutOfBoundsException
            int available = Math.min(names.size(), Math.min(address.size(), phone.size()));
            while (parseCount < available)
            {
                System.out.println("\nLead " + parseCount);
                System.out.println("Company Name: " + names.get(parseCount));
                System.out.println("Address: " + address.get(parseCount));
                System.out.println("Phone Number: " + phone.get(parseCount));
                parseCount = parseCount + 1;
            }
            number = number + 10;
        }
    }
}
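On the "is there a Jsoup method for the URL of a link" part: the address of a link lives in the href attribute of its a tag, and Jsoup can read it with attr("href") or resolve it to a full URL with absUrl("href"). Below is a minimal sketch rather than a definitive solution. The class BizLinkExtractor, the helper extractBizUrls, and the inline markup are made up for illustration, and it assumes that the business links on the search page have hrefs beginning with /biz/, as the example URL in the question suggests.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BizLinkExtractor
{
    // Hypothetical helper: collects the absolute URL of every link
    // whose href starts with "/biz/" (an assumption about Yelp's markup).
    static List<String> extractBizUrls(Document searchPage)
    {
        List<String> urls = new ArrayList<String>();
        for (Element link : searchPage.select("a[href^=/biz/]"))
        {
            // absUrl resolves a relative href against the document's base URI
            String url = link.absUrl("href");
            if (!url.isEmpty() && !urls.contains(url))
            {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args)
    {
        // Made-up markup standing in for a search results page
        String html = "<a href=\"/biz/la-cocina-newark\"><span>La Cocina</span></a>"
                    + "<a href=\"/search?find_loc=nj&start=20\">Next</a>";
        // The second argument is the base URI used to resolve relative links
        Document page = Jsoup.parse(html, "https://www.yelp.com/search?find_loc=nj&start=10");
        System.out.println(extractBizUrls(page));
        // prints [https://www.yelp.com/biz/la-cocina-newark]
    }
}
```

With a live page, the Document returned by Jsoup.connect(url).get() already carries its base URI, so the same helper could be called on the document inside the fetch loop. The real hrefs may include query strings or use a different prefix, so the selector is worth checking against the actual page source.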
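The inspect tool has helped a ton! It highlights exactly where an element sits on the page, so I know precisely where to look.
@BrandonWoodruff. Modern web pages are so complex that building any kind of scraper without something like the inspector is a nightmare.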
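For the other half of the question, checking whether a business page contains the bullhorn marker, one possible approach is to look for any element whose xlink:href attribute equals #30x30_bullhorn. This is only a sketch under stated assumptions: the class BullhornCheck, the helper hasBullhorn, and both inline snippets (including the #18x18_phone id) are invented for illustration, and it assumes the marker is present in the HTML that the server returns.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BullhornCheck
{
    // Hypothetical helper: true if any element on the page has
    // xlink:href="#30x30_bullhorn", the marker quoted in the question.
    static boolean hasBullhorn(Document bizPage)
    {
        return !bizPage.getElementsByAttributeValue("xlink:href", "#30x30_bullhorn").isEmpty();
    }

    public static void main(String[] args)
    {
        // Made-up snippets standing in for two business pages
        Document with = Jsoup.parse(
            "<svg class=\"icon\"><use xlink:href=\"#30x30_bullhorn\"></use></svg>");
        Document without = Jsoup.parse(
            "<svg class=\"icon\"><use xlink:href=\"#18x18_phone\"></use></svg>");
        System.out.println(hasBullhorn(with));
        System.out.println(hasBullhorn(without));
    }
}
```

For a real page the document would come from Jsoup.connect(bizUrl).get(). Jsoup does not execute JavaScript, so if the icon is added by a script after the page loads, this check will not see it.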