2016-08-13 52 views
0

我需要從下面的span元素中檢索文本,而不將其分割爲文本部分。使用xpath或css查詢從跨度中檢索文本

<span class="a-size-base review-text">I purchased this from Fry's Electronics. 
 
<br/> 
 
<br/> 
 
The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. 
 
<br/> 
 
<br/> 
 
I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. 
 
<br/> 
 
<br/> 
 
The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. 
 
<br/> 
 
<br/> 
 
The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. 
 
<br/> 
 
<br/> 
 
Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). 
 
<br/> 
 
<br/> 
 
NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 
 
</span>

但是在應用我的XPath查詢

// * [包含(CONCAT( 「」,@class, 「」),CONCAT( 「」,「回顧-text」, 「」))] /文()

我得到這個:

Text='I purchased this from Fry's Electronics.' 
 
Text='' 
 
Text='The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.' 
 
Text='' 
 
Text='I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems.' 
 
Text='' 
 
Text='The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome.' 
 
Text='' 
 
Text='The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge.' 
 
Text='' 
 
Text='Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's).' 
 
Text='' 
 
Text='NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person.'

我想找回一個文字塊無破損。 我使用這個XPath儀http://www.freeformatter.com/xpath-tester.html

回答

0

scrapy選擇器的一個方便的功能是選擇鏈接,這樣你就可以用CSS選擇開始,然後應用XPath字符串的方法,如string()normalize-space()

下面是一個例子scrapy 1.1 shell會話:

~$ scrapy shell 
2016-08-16 12:20:57 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot) 
2016-08-16 12:20:57 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'} 
(...) 
In [1]: html = '''<span class="a-size-base review-text">I purchased this from Fry's Electronics. 
    ...: <br/> 
    ...: <br/> 
    ...: The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. 
    ...: <br/> 
    ...: <br/> 
    ...: I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. 
    ...: <br/> 
    ...: <br/> 
    ...: The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. 
    ...: <br/> 
    ...: <br/> 
    ...: The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. 
    ...: <br/> 
    ...: <br/> 
    ...: Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). 
    ...: <br/> 
    ...: <br/> 
    ...: NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 
    ...: </span>''' 

In [2]: import scrapy 

In [3]: selector = scrapy.Selector(text=html) 

In [4]: selector.css('span.review-text').xpath('string()').extract_first() 
Out[4]: 'I purchased this from Fry\'s Electronics.\n\n\nThe picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I\'m very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.\n\n\nI wasn\'t planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems.\n\n\nThe unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you\'re ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome.\n\n\nThe stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge.\n\n\nOverall I\'m very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry\'s).\n\n\nNOTE: If you see any strange distortion in the images it\'s likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person.\n' 

In [5]: print(selector.css('span.review-text').xpath('string()').extract_first()) 
I purchased this from Fry's Electronics. 


The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. 


I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. 


The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. 


The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. 


Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). 


NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 


In [6]: print(selector.css('span.review-text').xpath('normalize-space()').extract_first()) 
I purchased this from Fry's Electronics. The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person. 
+0

謝謝@paul trmbrth。好的解決方案 – Brayoni

0

整個<span>元素轉換爲string

string(
    //*[contains(concat(" ", @class, " a-size-base review-text"), concat(" ", "review-text", " "))] 
) 

注意,這只是第一個<span>元素的作品符合標準。在XPath 2.0中,可以使用string-join()這將<span>元素的任意數量的工作:

string-join( 
    //*[contains(concat(" ", @class, " a-size-base review-text"), concat(" ", "review-text", " "))]/text(), 
    "" 
) 
+0

我使用** ** LXML只支持_xpath 1.0_,所以我不能用'字符串join'。如果我將整個元素轉換爲'string'。 _xpath_查詢似乎返回一個字符串而不是一個列表。 – Brayoni

+0

以下內容返回scrapy shell上的列表。 'response.xpath('// * [contains(concat(「」,@class,「」),concat(「」,「review-text」,「」))]')。extract()'。 – Brayoni

0

我不得不進行後期處理,以除去使用python的正則表達式的HTML標籤。

re.sub(r'<span class="a-size-base review-text">|<br>|</span>', "", text) 

然而,我試過@ har07的建議;

  • scrapy使用LXML僅支持的XPath 1.0因此我不能趁string-join它可在的XPath 2.0
  • 當我嘗試我無法從我的XPath查詢得到選擇列表string