2015-03-19 49 views
2

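Basically, I want to extract every URL from a web page, even the ones that are not clickable links. How can I extract all URLs (links) on a web page as plain text?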

For example, the page source might be:

<html> 

<title>Random Website I am Crawling</title> 

<body> 

Click <a href="http://clicklink.com">here</a> for foobar 

Another site is http://foobar.com 

</body> 

</html> 

I want both URLs to be displayed:

http://clicklink.com and http://foobar.com 

我也不想你可以把它。

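My current script grabs the URLs, but it also seems to grab a bunch of the other junk that makes the links clickable, so the results cannot be stored in the database.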

Here is my current code.

<?php 

$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false, 
                           PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)); 

$url="http://www.frozencpu.com/"; 
$data=file_get_contents($url); 
$data = strip_tags($data,"<a>"); 
$d = preg_split("/<\/a>/",$data); 
foreach ($d as $k=>$u){ 
    if(strpos($u, "<a href=") !== FALSE){ 
    //echo $u; 
    //echo "<BR>"; 
     $u = preg_replace("/.*<a\s+href=\"/sm","",$u); 
     $u = preg_replace("/\".*/","",$u); 
     //echo $u; 
     //echo "<BR>"; 
     $db->exec("INSERT INTO urls(url, crawled) VALUES('$u', '0')"); 
    } 
} 

?> 

Here is some example output:

http://www.facebook.com/pages/FrozenCPUcom/351841771499<BR>http://twitter.com/FrozenCPU<BR>/rss/frozencpu.rss<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cart.html?id=CR9RnD2g<BR>http://www.frozencpu.com/account.html?id=CR9RnD2g<BR>http://www.frozencpu.com/tracking.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/manage_carts.html?id=CR9RnD2g<BR> 

*Seems fine up to here

Then it just junks up big time 

&nbsp;&nbsp;<a href='http://www.frozencpu.com/advanced_search.html?id=CR9RnD2g' class=small>Advanced Search<BR>http://www.frozencpu.com/brands/shop_by_brand.html?id=CR9RnD2g<BR>http://www.frozencpu.com/shop_category.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g30/Liquid_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g57/EK_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g59/XSPC_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g60/LutroO_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g12/Accessories.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g40/Air_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g53/Apparel.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g34/Bay_Devices.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g54/Cabinet_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g2/Cables.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g32/Caffeine.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g1/Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g58/CaseLabs_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g45/Custom_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g43/Case_Parts-OEM.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g51/Connectors.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g48/CPU_Heatsinks.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g44/DIYMod_Parts.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g4/Electronics.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g36/Fans.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g47/Fan_Accessories.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g39/Gaming.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g6/Lighting.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g49/Phase_Change.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g11/Power_Supplies.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g55/Screws.html?id=CR9RnD2g<BR>http://w
ww.frozencpu.com/cat/l1/g35/SleevingHeatshrink.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g7/Sound_Dampening.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g52/Switches.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g8/Thermal_Interface.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g31/Travel_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g33/Ultra_Quiet.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g42/Window_Kits.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g50/Custom_Services.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?enable=1&id=CR9RnD2g<BR>http://www.frozencpu.com/products/2770/gc-01/Gift_Certificate.html?id=CR9RnD2g<BR>http://www.frozencpu.com/rebates.html?id=CR9RnD2g<BR>http://www.frozencpu.com/aboutus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/resource.html?id=CR9RnD2g<BR>http://www.frozencpu.com/career.html?id=CR9RnD2g<BR>http://www.frozencpu.com/clearance/list/p1/Clearance-Page1.html?id=CR9RnD2g<BR>http://www.frozencpu.com/contactus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/news.html?id=CR9RnD2g<BR>http://www.frozencpu.com/links.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>http://www.frozencpu.com/media.html?id=CR9RnD2g<BR>http://www.frozencpu.com/account.html?id=CR9RnD2g<BR>http://www.frozencpu.com/manage_carts.html?view_cart=Wish%2dList&wish_list=1&id=CR9RnD2g<BR>http://www.frozencpu.com/new_products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/powder_coating.html?id=CR9RnD2g<BR>http://www.frozencpu.com/press.html?id=CR9RnD2g<BR>http://www.frozencpu.com/rebates.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cart.html?id=CR9RnD2g<BR>http://www.frozencpu.com/sitemap.html?id=CR9RnD2g<BR>http://www.frozencpu.com/testimonials.html?id=CR9RnD2g<BR>http://www.frozencpu.com/tracking.html?id=CR9RnD2g<BR>http://www.frozencpu.com/stores.html?id=CR9RnD2g<BR> 


      <a href='http://www.facebook.com/pages/FrozenCPUcom/351841771499' target=<BR> 
      <a href='http://twitter.com/FrozenCPU' target=<BR> 
      <a href='/rss/frozencpu.rss' target=<BR>https://www.resellerratings.com 
<BR>https://www.securitymetrics.com/sitecertsummary.adp?s=67%2e228%2e74%2e232&amp;i=340380<BR>mailto:[email protected]?subject=WESTERN%20UNION<BR>http://www.frozencpu.com/products/23382/ex-wat-303/XSPC_Raystorm_RX240_V3_Extreme_Universal_CPU_Water_Cooling_Kit_w_D5_Variant_Pump_Included_and_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/23382/ex-wat-303/XSPC_Raystorm_RX240_V3_Extreme_Universal_CPU_Water_Cooling_Kit_w_D5_Variant_Pump_Included_and_Free_Dead-Water.html?id=CR9RnD2g 

            The XSPC Raystorm RX240 V3 Universal CPU Water Cooling Kit comes complete with everything you will need to cool your CPU. This kit is designed to handle your CPU and can be expanded to handle more blocks as well. 

The kit uses the newest XSPC CPU block, the Raystorm as the core cooling component. This block has a pure copper base and is a top o... 
            3 In Stock, Ships Today Till 6pm EST 
            $259.99 
           <BR>http://www.frozencpu.com/products/17220/ex-wat-223/XSPC_Copper_Raystorm_AX240_Extreme_Intel_CPU_Water_Cooling_Kit_w_Twin_D5_w_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/17220/ex-wat-223/XSPC_Copper_Raystorm_AX240_Extreme_Intel_CPU_Water_Cooling_Kit_w_Twin_D5_w_Free_Dead-Water.html?id=CR9RnD2g 

            The RayStorm Copper Twin D5 AX240 kit is the most powerful 240 kit XSPC have ever made. It includes a special Copper edition of our RayStorm block, our fantastic new AX240 radiator and two D5 Vario pumps in series. 

The RayStorm Copper has the same great performance as our award winning RayStorm block, but with an all metal design. The acetal top... 
            7 In Stock, Ships Today Till 6pm EST 
            $399.99 
           <BR>http://www.frozencpu.com/products/22914/cas-495/PrimoChill_Hasher_-_Rugged_Crypto_Stackable_Mining_Rack_R-HRC.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/22914/cas-495/PrimoChill_Hasher_-_Rugged_Crypto_Stackable_Mining_Rack_R-HRC.html?id=CR9RnD2g 

            PrimoChill once again provides a good lookin, easy solution to the unimaginable. Introducing, one hell of a crypto rack, The Hasher! 

Built out of rugged, 1in anodized extruded aluminum t-slot, the PrimoChill Hasher is tough but cool enough to keep out of the basement. It combines not only functionality but order to the chaos that other mining r... 
            5 In Stock, Ships Today Till 6pm EST 
            $129.99 
           <BR>http://www.frozencpu.com/products/13815/ele-933/Add2PSU_Multiple_Power_Supply_Adapter.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/13815/ele-933/Add2PSU_Multiple_Power_Supply_Adapter.html?id=CR9RnD2g 

            Small, lightweight, and true Plug N Play, the Add2Psu adapter allows you to add more power to your computer. No cutting wires or soldering, no compromising the integrity or function of your PC. 

Now there is a way to add more power to your PC. Finally a true plug and play way to manage additional power for those big video cards, bigger hard drive... 
            290 In Stock, Ships Today Till 6pm EST 
            $19.95 
           <BR>http://www.frozencpu.com/products/25635/ex-wat-335/Larkooler_SkyWater_330L_All-In-One_Liquid_Cooling_Kit_LCS0030.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25635/ex-wat-335/Larkooler_SkyWater_330L_All-In-One_Liquid_Cooling_Kit_LCS0030.html?id=CR9RnD2g 

            The SkyWater 330L is a new liquld cooling system with a variable speed pump and Fans in desktop PC. The water cooling system is designed for the best thermal solution of CPU, the most important component of your PC. The SkyWater 330L provides a low noise at low speed fans , high performance at high speed fans and reliable liquid cooling system. 

... 
            4 In Stock, Ships Today Till 6pm EST 
            $129.99 
           <BR>http://www.frozencpu.com/products/26337/ex-blc-1942/Aquacomputer_Kryographics_GTX_980_Full_Coverage_Liquid_Cooling_Block_-_Copper_Acrylic_Glass_23614.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26337/ex-blc-1942/Aquacomputer_Kryographics_GTX_980_Full_Coverage_Liquid_Cooling_Block_-_Copper_Acrylic_Glass_23614.html?id=CR9RnD2g 

            Combined GPU/RAM/VRM-cooler for graphics cards of the type nvidia GTX 980 with 4 GB RAM according to reference design. 
This cooler combines the features of a graphics chip cooler and RAM-coolers in an elegant and very flat watercooler. Additionally the voltage regulators are also cooled effectively. 

The kryographics for GTX 980 water block offe... 
            5 In Stock, Ships Today Till 6pm EST 
            $129.99 
           <BR>http://www.frozencpu.com/products/19760/bus-348/Lamptron_CW611_36W_-_6_Channel_Aluminum_Liquid_Cooling_Controller_-_Black_CW611.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/19760/bus-348/Lamptron_CW611_36W_-_6_Channel_Aluminum_Liquid_Cooling_Controller_-_Black_CW611.html?id=CR9RnD2g 

            Introducing the Lamptron CW611 Water Cooling fan controller! The first in a series of advanced control 5.25&#8243; bay devices that allow complete control over your entire PC cooling system. You can use this controller to be used with fans, liquid cooling pumps, as well as flow meters. The first in a new series of controllers this is sure to get ... 
            52 In Stock, Ships Today Till 6pm EST 
            $99.99 
           <BR>http://www.frozencpu.com/products/9350/fan-583/Noiseblocker_NB-BlackSilentFan_XM2_40mmx10mm_Ultra_Quiet_Fan_-_3800_RPM_-_14_dBA.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/9350/fan-583/Noiseblocker_NB-BlackSilentFan_XM2_40mmx10mm_Ultra_Quiet_Fan_-_3800_RPM_-_14_dBA.html?id=CR9RnD2g 

            The Noiseblocker NB-BlackSilentFan XM2 40mmx10mm Ultra Quiet Fan, manufactured by Noiseblocker, Germany's quietest fan manufacturer, the BlackSilentFan series features extraordinary life spans and near silent operation. Using the NB-Longlife advanced sleeve bearing and matched with the NB-EKA drive, the BlackSilentFan series runs more than double ... 
            20 In Stock, Ships Today Till 6pm EST 
            $12.95 
           <BR>http://www.frozencpu.com/products/25250/cst-1779/Phanteks_Enthoo_Luxe_Full_Tower_Chassis_w_Window_-_White_PH-ES614L_WT.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25250/cst-1779/Phanteks_Enthoo_Luxe_Full_Tower_Chassis_w_Window_-_White_PH-ES614L_WT.html?id=CR9RnD2g 

            Staying true to the Phanteks’ Enthoo line, the Luxe features a sandblasted front and top panel. Ambient lighting run from top to front of the case on both sides. Even though smaller in size, the Enthoo Luxe boost many features from the award-winning Enthoo Primo. The Luxe comes pre-installed with a 200mm front fan and 2x PH-F140SP fans. Phanteks’ E... 
            In Stock, Ships Today Till 6pm EST 
            $159.99 
           <BR>http://www.frozencpu.com/products/25721/ex-wat-337/MagiCool_DIY_Complete_Single_120mm_Liquid_Cooling_Kit_MC-G12V1.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25721/ex-wat-337/MagiCool_DIY_Complete_Single_120mm_Liquid_Cooling_Kit_MC-G12V1.html?id=CR9RnD2g 

            The MagiCool DIY Complete Liquid Cooling Kit comes with everything you need to set your system up on liquid. The CPU block is compatible with all current sockets giving you flexibility for now and for future upgrades as well. The radiator is a slim profile variant allowing for maximum case compatibility. 
Compression fittings are provided for dur... 
            5 In Stock, Ships Today Till 6pm EST 
            $124.99 
           <BR>http://www.frozencpu.com/products/26065/ex-blc-1936/Alphacool_NexXxoS_GPX_Nvidia_Geforce_GTX_970_M03_Liquid_Cooling_Blockw_Backplate_11199.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26065/ex-blc-1936/Alphacool_NexXxoS_GPX_Nvidia_Geforce_GTX_970_M03_Liquid_Cooling_Blockw_Backplate_11199.html?id=CR9RnD2g 

            With the new NexXxoS GPX coolers Alphacool is again a step ahead! Optimum performance and quality in a new cooling design for a great price! 

A new sophisticated injection system means the GPU is actively cooled. All other chips are sufficiently cooled by the passive cooler which is also in contact with the watercooling block for extra efficiency... 
            3 In Stock, Ships Today Till 6pm EST 
            $94.99 
           <BR>http://www.frozencpu.com/products/14175/bus-285/Alphacool_Heatmaster_II_Liquid_Cooling_PCB_Control_Board_26153.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/14175/bus-285/Alphacool_Heatmaster_II_Liquid_Cooling_PCB_Control_Board_26153.html?id=CR9RnD2g 

            The new generation of cooling control from Alphacool: The Heatmaster II 

The new Alphacool Heatmaster II was developed in Germany over multiple years, and has continuously been improved considering the experiences from the first version. Hence we are now, after a development and testing period of almost 3 years, able to present the best Heatmaste... 
            4 In Stock, Ships Today Till 6pm EST 
            $84.99 
           <BR>http://www.frozencpu.com/products/23748/ex-tub-3052/EK_ZMT_Tubing_-_38_ID_58OD_-_1_Foot_-_Black_EK-Tube_ZMT_Matte_Black_15995mm.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/23748/ex-tub-3052/EK_ZMT_Tubing_-_38_ID_58OD_-_1_Foot_-_Black_EK-Tube_ZMT_Matte_Black_15995mm.html?id=CR9RnD2g 

            EK ZMT (Zero Maintainance Tubing) is a high quality, zero maintainance industrial grade EPDM rubber tubing in stylish matte black. 

This tubing is - just like Norprene - designed to withstand harsh conditions for a very long period of time, offering a truly exceptional lifespan even under UV, ozone and heat exposure for many years. 

Unlike most... 
            62 In Stock, Ships Today Till 6pm EST 
            $2.50 
           <BR>http://www.frozencpu.com/products/25897/ex-wat-342/XSPC_Raystorm_EX360_Extreme_Universal_CPU_Water_Cooling_Kit_w_DDC_Photon_and_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25897/ex-wat-342/XSPC_Raystorm_EX360_Extreme_Universal_CPU_Water_Cooling_Kit_w_DDC_Photon_and_Free_Dead-Water.html?id=CR9RnD2g 

            The XSPC Raystorm DDC Photon EX360 Universal CPU Water Cooling Kit comes complete with everything you will need to cool your CPU. This kit is designed to handle your CPU and can be expanded to handle more blocks as well. 

The kit uses the newest XSPC CPU block, the Raystorm as the core cooling component. This block has a pure copper base and is... 
            5 In Stock, Ships Today Till 6pm EST 
            $254.99 
           <BR>http://www.frozencpu.com/products/26379/fan-1397/Alphacool_Susurro_120mm_x_25mm_Fan_-_1700RPM_24684.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26379/fan-1397/Alphacool_Susurro_120mm_x_25mm_Fan_-_1700RPM_24684.html?id=CR9RnD2g 

            A new generation of fans joins the Alphacool range. The Susurro, Spanish for Whisper. 

A fundamental review of known fan designs was used to manufacture the Susurro. The perfect harmony between the AlphaCool blue and deep blacks make a great impression. The transparent black fan is optimized to cause virtually no noise. 

But don’t be persuaded ... 
            2 In Stock, Ships Today Till 6pm EST 
            $14.99 
           <BR>http://www.frozencpu.com/products/18800/ex-res-486/Alphacool_Clip-On_Reservoir_Mount_2_Piece_Set_w_5mm_LED_Support_-_50mm.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/18800/ex-res-486/Alphacool_Clip-On_Reservoir_Mount_2_Piece_Set_w_5mm_LED_Support_-_50mm.html?id=CR9RnD2g 

            The best Alphacool reservoir mounts of all times! 

Many reservoir mounts were designed for the original tube reservoirs from the beginning of the PC water cooling sector. During the last years though, the reservoirs became larger, sized for more capacity and metal was integrated for the end caps. This resulted in heavier reservoirs, making the co... 
            1 In Stock, Ships Today Till 6pm EST 
            $10.99 
           <BR>http://www.frozencpu.com/news.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?gu=1&id=CR9RnD2g<BR>http://www.frozencpu.com/help/h25/Ordering_with_a_PO.html?id=CR9RnD2g<BR>http://www.frozencpu.com/testimonials.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/sitemap.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/contactus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/problem.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help/h15/Legal.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help/h13.html?id=CR9RnD2g<BR>http://www.getfirefox.com<BR> 

Answers

2

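If you want all the URLs, you can't just look inside <a href=, especially since the href attribute will not always be the first thing inside the <a> tag. A tag like <a target=_blank href=http://google.com> would be ignored.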
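To illustrate the point about attribute order, here is a minimal sketch of a pattern that finds href wherever it sits inside the <a> tag and accepts double-quoted, single-quoted, or unquoted values. The function name extractHrefs is made up for this example, and it is still a regex working on HTML, so treat it as a stopgap rather than a robust parser.

```php
<?php
// Hypothetical helper (not from the original post): collect href values from
// <a> tags regardless of where href appears or how the value is quoted.
function extractHrefs(string $html): array
{
    // <a + whitespace, any other attributes, then href= followed by a
    // double-quoted, single-quoted, or unquoted value. The (?| ... ) branch
    // reset puts the value in capture group 1 for all three alternatives.
    // The lookbehind skips attributes such as data-href.
    $pattern = '/<a\s[^>]*?(?<![\w-])href\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';

    preg_match_all($pattern, $html, $matches);

    // $matches[1] holds the captured href values; drop duplicates and reindex.
    return array_values(array_unique($matches[1]));
}

$sample = '<a target=_blank href=http://google.com>x</a> '
        . "<a href='/rss/frozencpu.rss' target=_top>y</a> "
        . '<a href="http://clicklink.com">here</a>';

print_r(extractHrefs($sample));
```

Note that this only sees clickable links; plain-text URLs in the page body still need a separate pass.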

If you want to find all URLs regardless of context, you can simply ignore the tags and look for a general URL pattern, something like this:

// preg_match_all() returns the number of matches; the URLs themselves end up in $matches[0] 
$count = preg_match_all('/[a-z]+:\/\/[a-zA-Z0-9?+.=%:\/]+/', $content, $matches); 

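This may need a lot of polishing, but it should do the trick to get things started. Note, however, that it will only match full URLs. Links to relative pages such as <a href="index.html"> obviously won't match.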
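If relative links matter (the sample output in the question contains /rss/frozencpu.rss, for instance), they have to be resolved against the URL of the page they came from. Below is a rough sketch of that step. resolveUrl is a hypothetical helper, and it deliberately ignores "../" segments, query-only links, and fragment-only links, so it is a starting point rather than a full implementation of URL resolution.

```php
<?php
// Hypothetical helper: turn a link found on $base into an absolute URL.
// Assumes $base is an absolute http(s) URL. Does not normalize "../" or "./".
function resolveUrl(string $base, string $link): string
{
    // Already absolute: starts with a scheme such as http:, https:, mailto:
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $link)) {
        return $link;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $parts['host'] ?? '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $origin = "$scheme://$host$port";

    // Scheme-relative link: //example.com/page
    if (strpos($link, '//') === 0) {
        return "$scheme:$link";
    }

    // Root-relative link: /page
    if (strpos($link, '/') === 0) {
        return $origin . $link;
    }

    // Path-relative link: append to the directory part of the base path
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $link;
}

echo resolveUrl('http://www.frozencpu.com/', '/rss/frozencpu.rss'), "\n";
```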

Since regular expressions are not a recommended solution for parsing HTML, I'm afraid you will have to look for a more suitable approach, such as using DOMDocument() to open the page and find the URLs properly.
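A minimal sketch of the DOMDocument route, assuming the dom extension is available: it reads href attributes from the parsed tree and then scans only the text nodes for plain-text URLs, so markup never ends up in the results. The function name extractUrlsWithDom and the simple http(s) pattern are choices made for this example, not something from the question.

```php
<?php
// Hypothetical helper: return every URL found in href attributes or in
// the visible text of the page.
function extractUrlsWithDom(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];

    // 1) Clickable links: the href attribute of every <a> element.
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') {
            $urls[] = $href;
        }
    }

    // 2) Plain-text URLs: scan text nodes, skipping script and style content.
    $xpath = new DOMXPath($doc);
    $textNodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');
    foreach ($textNodes as $node) {
        if (preg_match_all('~https?://[^\s<>"\']+~i', $node->nodeValue, $m)) {
            foreach ($m[0] as $url) {
                // Drop punctuation that usually belongs to the sentence, not the URL.
                $urls[] = rtrim($url, '.,;:!?)');
            }
        }
    }

    return array_values(array_unique($urls));
}

$page = '<html><title>Random Website I am Crawling</title><body>'
      . 'Click <a href="http://clicklink.com">here</a> for foobar '
      . 'Another site is http://foobar.com </body></html>';

print_r(extractUrlsWithDom($page));
```

Run against the sample page from the question, this should list http://clicklink.com and http://foobar.com with no surrounding markup.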

1

To match all types of URLs, the following code may help:

<?php 

$content = '<html> 

<title>Random Website I am Crawling</title> 

<body> 

Click <a href="http://clicklink.com">here</a> for foobar 

Another site is http://foobar.com 

</body> 

</html>'; 

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME 
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass 
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP 
$regex .= "(\:[0-9]{2,5})?"; // Port 
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path 
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query 
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor 


$matches = array(); //create array 
$pattern = "/$regex/"; 

preg_match_all($pattern, $content, $matches); 

print_r(array_values(array_unique($matches[0]))); 
echo "<br><br>"; 
echo implode("<br>", array_values(array_unique($matches[0]))); 


/* 
* With your code 
*/ 

$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false, 
                           PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)); 
$url="http://www.frozencpu.com/"; 
$data=file_get_contents($url); 
$matches = array(); 

preg_match_all($pattern, $data, $matches); 
$array = array_values(array_unique($matches[0])); 

// Use a prepared statement so quotes in a URL cannot break (or inject into) the query 
$insert = $db->prepare("INSERT INTO urls(url, crawled) VALUES(?, '0')"); 
foreach ($array as $foundUrl) { 
    $insert->execute(array($foundUrl)); 
} 

?> 

Here is the updated code. It seems to work, but it is very slow.

<?php 

$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false, 
                           PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)); 

$url="http://proxylists.connectionincognito.com/"; 
$content=file_get_contents($url); 

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME 
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass 
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP 
$regex .= "(\:[0-9]{2,5})?"; // Port 
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path 
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query 
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor 


$matches = array(); //create array 
$pattern = "/$regex/"; 

preg_match_all($pattern, $content, $matches); 

$unique = array_unique($matches[0]); 

// Prepare both statements once, outside the loop, and bind the URL as a parameter 
$select = $db->prepare("SELECT 1 FROM urls WHERE url = ?"); 
$insert = $db->prepare("INSERT INTO urls(url, crawled) VALUES(?, '0')"); 

foreach ($unique as $url) { 

    //Insert if none exist 
    $select->execute(array($url)); 
    $row = $select->fetch(PDO::FETCH_ASSOC); 
    $select->closeCursor(); 

    if (! $row) { 
        $insert->execute(array($url)); 
    } 
    //Insert end code 
} 
?> 

Reference:

http://php.net/manual/en/function.preg-match.php

+0

Thanks for your answer, it seems to be working well so far! Quick question: the load time on these operations is very slow, around 15 seconds. – Nick 2015-03-19 05:04:37

+0

Is that expected? I'll edit my post with the new code! – Nick 2015-03-19 05:04:57

+0

Maybe there is one thing that makes it so slow; it should be instant. – Nick 2015-03-19 05:05:09
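One way to narrow down where the 15 seconds go is to time each stage (the download, the regex pass, the database loop) separately. The sketch below is only an illustration: timeStep is a made-up helper, and the inline string stands in for the downloaded page so the snippet runs without network access.

```php
<?php
// Hypothetical helper: run $step and return [its result, elapsed seconds].
function timeStep(callable $step): array
{
    $start  = hrtime(true);            // monotonic clock, in nanoseconds
    $result = $step();
    return [$result, (hrtime(true) - $start) / 1e9];
}

// Stand-in for the page content; in the real script this would come from
// file_get_contents($url), which can be wrapped in timeStep() the same way.
$content = '<a href="http://clicklink.com">here</a> Another site is http://foobar.com';

[$count, $seconds] = timeStep(function () use ($content) {
    return preg_match_all('~https?://[^\s<>"\']+~i', $content, $m);
});

printf("regex: %d matches in %.4f s\n", $count, $seconds);
```

Wrapping the download, the preg_match_all() call, and the insert loop in separate timeStep() calls should show which of the three dominates.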
