2017-07-22 21 views
-1

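Extracting a substring enclosed in braces and brackets from complex HTML using a regular expression in Python

(I have updated this post to give a more accurate picture of the problem, including information I originally left out.)

Everything I have tried to get the desired string results in AttributeError: 'NoneType' object has no attribute 'group'.

Here is my code: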

image = re.search("photo: /\[[^\]]+\]/", text)   
image = image.group(1) 

I am still learning regex, but this one has thrown me for a loop for too long.
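A quick way to see why that call fails, as a minimal sketch: in Python the slashes are ordinary characters rather than regex delimiters, so the pattern never matches, `re.search` returns `None`, and `.group(...)` raises the AttributeError. The pattern also has no parentheses, so `group(1)` would raise an IndexError even on a match. The `text` value and its URL below are shortened placeholders, not the real page.

```python
import re

# Shortened stand-in for the page text; the URL is a placeholder.
text = 'sellerType: "Private", photo: [{"id": "http://example.com/a.jpg"}], favorited: 1'

# The pattern from the question: the "/" characters are matched literally,
# so it looks for 'photo: /[' ... ']/' and finds nothing.
original = re.search(r"photo: /\[[^\]]+\]/", text)
print(original)  # None, which is why .group(...) raises AttributeError

# Dropping the slashes and wrapping the wanted part in parentheses
# gives a pattern that matches and actually defines group 1.
fixed = re.search(r"photo:\s*(\[[^\]]+\])", text)
print(fixed.group(1))
```

Checking the result of `re.search` against `None` before calling `.group()` avoids the exception when the page layout changes.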

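I want to grab the part of the JSON that holds the links to the photos. That is the "id" value that comes first in each entry, leaving out "uploadTime" and everything after it:

Here is the piece of JSON in question: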

photo: [{ 
    "id": "http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418383-59832.jpg", 
    "uploadTime": { 
     "sec": 1498418386, 
     "usec": 192000 
    }, 
    "extension": "jpg", 
    "md5": "6fac68fbcbdb31d17af7be277ab673be", 
    "height": 600, 
    "width": 800, 
    "description": "", 
    "originalFilePath": "", 
    "originalFileName": "photo_0D993ADA-8AFC-4A79-8F9B-18E6F6C30B94.jpg" 
}, { 
    "id": "http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418389-472609.jpg", 
    "uploadTime": { 
     "sec": 1498418392, 
     "usec": 118000 
    }, 
    "extension": "jpg", 
    "md5": "6470e562d650099a1cafe9281f951c21", 
    "height": 600, 
    "width": 800, 
    "description": "", 
    "originalFilePath": "", 
    "originalFileName": "photo_335B7BC0-F6DE-4E19-8489-3AA7B3920144.jpg" 
}, { 
    "id": "http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418397-06491.jpg", 
    "uploadTime": { 
     "sec": 1498418400, 
     "usec": 161000 
    }, 
    "extension": "jpg", 
    "md5": "5f2df3edfed164c062e739c0c3258970", 
    "height": 600, 
    "width": 800, 
    "description": "", 
    "originalFilePath": "", 
    "originalFileName": "photo_9C57A971-9748-4DBD-919D-8D532C8D7C1A.jpg" 
}, { 
    "id": "http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418403-391642.jpg", 
    "uploadTime": { 
     "sec": 1498418406, 
     "usec": 936000 
    }, 
    "extension": "jpg", 
    "md5": "098dfa4d40e33c6897f62edc471670dd", 
    "height": 600, 
    "width": 800, 
    "description": "", 
    "originalFilePath": "", 
    "originalFileName": "photo_A55BD209-3BFB-447E-AE59-40CF656664A8.jpg" 
}, { 
    "id": "http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418409-263588.jpg", 
    "uploadTime": { 
     "sec": 1498418412, 
     "usec": 789000 
    }, 
    "extension": "jpg", 
    "md5": "50b69c1db486f4bb6af723f7395a360b", 
    "height": 600, 
    "width": 800, 
    "description": "", 
    "originalFilePath": "", 
    "originalFileName": "photo_8BCDC2F0-8CBA-442C-98F5-0389455C8014.jpg" 
}, { 
    "id": "http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418415-54882.jpg", 
    "uploadTime": { 
     "sec": 1498418418, 
     "usec": 462000 
    }, 
    "extension": "jpg", 
    "md5": "34296cda28b212a6c5590f233a2dca09", 
    "height": 600, 
    "width": 800, 
    "description": "", 
    "originalFilePath": "", 
    "originalFileName": "photo_726D1636-E3A9-4515-9B95-55161FAAF730.jpg" 
}, { 
    "id": "http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418421-389128.jpg", 
    "uploadTime": { 
     "sec": 1498418424, 
     "usec": 518000 
    }, 
    "extension": "jpg", 
    "md5": "265087f19c17a99561a817f02a097b21", 
    "height": 600, 
    "width": 800, 
    "description": "", 
    "originalFilePath": "", 
    "originalFileName": "photo_09B01A71-46F2-4D8F-9153-CE0F0017495A.jpg" 
}] 

This piece of JSON is part of a larger string:

<script type="text/javascript"> 
     var listingData = {}; 
     var userData = {}; 

     window.detailPage = window.detailPage || {}; 
        window.detailPage.listingData = { 
       id: 44782446, 
       status: "Active", 
       createTime: 1498418380, 
       displayTime: 1500694902, 
       expireTime: 1503286902, 
       title: "Yamaha RX-V461", 
       description: "Great Audio\/Video 5.1 surround receiver. Great condition ", 
       city: "South Jordan", 
       state: "UT", 
       zip: 84095, 
       contactName: "Robert", 
       contactHomePhone: "801-635-6040", 
       contactCellPhone: "801-635-6040", 
       contactEmail: "hasEmail", 
       lat: 40.5693, 
       lon: -111.9672, 
       latLon: "40.5693,-111.9672", 
       price: 50, 
       category: "Electronics", 
       subCategory: "Home Audio Receivers", 
       marketType: "Sale", 
       sellerType: "Private", 
       photo: [{"id":"http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418383-59832.jpg","uploadTime":{"sec":1498418386,"usec":192000},"extension":"jpg","md5":"6fac68fbcbdb31d17af7be277ab673be","height":600,"width":800,"description":"","originalFilePath":"","originalFileName":"photo_0D993ADA-8AFC-4A79-8F9B-18E6F6C30B94.jpg"},{"id":"http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418389-472609.jpg","uploadTime":{"sec":1498418392,"usec":118000},"extension":"jpg","md5":"6470e562d650099a1cafe9281f951c21","height":600,"width":800,"description":"","originalFilePath":"","originalFileName":"photo_335B7BC0-F6DE-4E19-8489-3AA7B3920144.jpg"},{"id":"http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418397-06491.jpg","uploadTime":{"sec":1498418400,"usec":161000},"extension":"jpg","md5":"5f2df3edfed164c062e739c0c3258970","height":600,"width":800,"description":"","originalFilePath":"","originalFileName":"photo_9C57A971-9748-4DBD-919D-8D532C8D7C1A.jpg"},{"id":"http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418403-391642.jpg","uploadTime":{"sec":1498418406,"usec":936000},"extension":"jpg","md5":"098dfa4d40e33c6897f62edc471670dd","height":600,"width":800,"description":"","originalFilePath":"","originalFileName":"photo_A55BD209-3BFB-447E-AE59-40CF656664A8.jpg"},{"id":"http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418409-263588.jpg","uploadTime":{"sec":1498418412,"usec":789000},"extension":"jpg","md5":"50b69c1db486f4bb6af723f7395a360b","height":600,"width":800,"description":"","originalFilePath":"","originalFileName":"photo_8BCDC2F0-8CBA-442C-98F5-0389455C8014.jpg"},{"id":"http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418415-54882.jpg","uploadTime":{"sec":1498418418,"usec":462000},"extension":"jpg","md5":"34296cda28b212a6c5590f233a2dca09","height":600,"width":800,"description":"","originalFilePath":"","originalFileName":"photo_726D1636-E3A9-4515-9B95-55161FAAF730.jpg"},{"id":"h
ttp:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418421-389128.jpg","uploadTime":{"sec":1498418424,"usec":518000},"extension":"jpg","md5":"265087f19c17a99561a817f02a097b21","height":600,"width":800,"description":"","originalFilePath":"","originalFileName":"photo_09B01A71-46F2-4D8F-9153-CE0F0017495A.jpg"}], 
       standardFeaturedDates: [], 
       favorited: 1, 
       pageViews: 68    }; 

      window.detailPage.sellerData = { 
       sellerId: 1159545, 
       sellerAccountAge: "Nov 2010", 
       moreListingsFromSeller: [{"id":44782211,"displayTime":1500694907,"price":100,"title":"Moto Gear 3 Helmets and Alpine Star Tech 6 Boots S","photo":"http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498417151-456217.jpg"},{"id":44782400,"displayTime":1500694904,"price":30,"title":"Belts Pouch, Canteen Holsters For 2 Canteens","photo":"http:\/\/img.ksl.com\/mx\/mplace-classifieds.ksl.com\/1159545-1498418072-282620.jpg"}]    }; 

      window.detailPage.userData = { 
       testUser: Boolean(0) 
      }; 
          </script> 

How can I extract the piece I want?

Thanks for looking at my question!

+2

Whoa. This isn't HTML. It's JSON. You should use a JSON parser. –

+0

Also, you haven't specified which language you want a solution in. –

+0

Possibly a fake account of user @AlexR, who posted [this question](https://stackoverflow.com/questions/45257932/) an hour ago. –

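Answers

0

Perhaps you won't ask again? If this works, you know you're meant to mark it as accepted, eh?

If you like, you can treat this ugly thing as a list of JSON 'parsables'. I put it in a file that I could read into a script. While reading it in, I knocked off the first few characters and the final ']', leaving only the parsable items, minus their initial bits.

Then I did a split; don't forget that the first piece of the split is not used. Feeding each remaining piece into the JSON parser yields a dictionary, from which the desired item can be picked.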

>>> import json 
>>> import re 
>>> complete = open('temp.json').read()[6:-1] 
>>> pieces = re.split(r',?\s*\{\s*"id"', complete) 
>>> for piece in pieces[1:]: 
...  items = json.loads('{"id"'+piece) 
...  items['id'] 
... 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418383-59832.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418389-472609.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418397-06491.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418403-391642.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418409-263588.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418415-54882.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418421-389128.jpg' 
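As a possible simplification of the split approach above: the bracketed part is itself a valid JSON array, so it can probably be handed to `json.loads` in one go. This is only a sketch under that assumption; `snippet` is a shortened stand-in for the contents of temp.json, and its URLs are placeholders.

```python
import json

# Shortened stand-in for the file contents; URLs are placeholders.
snippet = r'''photo: [{
    "id": "http:\/\/example.com\/a.jpg",
    "uploadTime": {"sec": 1, "usec": 2}
}, {
    "id": "http:\/\/example.com\/b.jpg",
    "uploadTime": {"sec": 3, "usec": 4}
}]
'''

# Keep everything from the first "[" to the last "]"; that span is a JSON array.
array_text = snippet[snippet.index('['):snippet.rindex(']') + 1]

# json.loads turns the array into a list of dicts and unescapes "\/" to "/".
photos = json.loads(array_text)
ids = [photo['id'] for photo in photos]
print(ids)
```

Slicing on the first `[` and last `]` only works when the text holds nothing but the array, as in the original version of the question.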

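Edit: the question has changed. The approach is very similar, except that it takes a bit more work to pull out the line containing the desired information. That is the purpose of the additional match operation. Note also that the line before it has changed.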

>>> import json 
>>> import re 
>>> complete = open('temp.json').read() 
>>> m = re.match(r'.*?photo:\s+\[([^]]+)]', complete, re.DOTALL) 
>>> pieces = re.split(r',?\s*\{\s*"id"', m.group(1)) 
>>> for piece in pieces[1:]: 
...  items = json.loads('{"id"'+piece) 
...  items['id'] 
...  
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418383-59832.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418389-472609.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418397-06491.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418403-391642.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418409-263588.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418415-54882.jpg' 
'http://img.ksl.com/mx/mplace-classifieds.ksl.com/1159545-1498418421-389128.jpg' 
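An alternative sketch for the full script text, in case the `[^]]+` pattern ever meets a nested `]`: locate `photo:` with a regex, then let `json.JSONDecoder().raw_decode` parse exactly one JSON value starting at the opening bracket. The helper name `extract_photo_ids`, the sample `page` string, its URLs, and the `tags` key are all hypothetical and only serve to illustrate the idea.

```python
import json
import re

# Shortened stand-in for the <script> text; values and URLs are placeholders,
# and "tags" is a made-up key that puts nested brackets inside the array.
page = r'''
window.detailPage.listingData = {
    sellerType: "Private",
    photo: [{"id":"http:\/\/example.com\/a.jpg","tags":["x","y"]},{"id":"http:\/\/example.com\/b.jpg","tags":[]}],
    standardFeaturedDates: [],
    favorited: 1 };
'''

def extract_photo_ids(text):
    """Return the "id" of every entry in the photo array, or [] if absent."""
    # The lookahead leaves the match ending right before the opening "[".
    m = re.search(r'\bphoto:\s*(?=\[)', text)
    if m is None:
        return []
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever text follows it.
    photos, _end = json.JSONDecoder().raw_decode(text, m.end())
    return [p['id'] for p in photos]

print(extract_photo_ids(page))
```

Because the JSON decoder finds the end of the array itself, there is no need to guess where the closing bracket is.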
+0

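I don't appreciate the attitude, but I am very grateful for your time and your solution. Thank you, Bill. However, I haven't been able to reproduce the result on my machine, because I didn't give you enough context about my code! (Doh!) The snippet I provided is part of a larger chunk of "javascript/text"; I can't isolate the JSON I gave you so that it can go to the JSON parser. – Thania

+0

I have updated the question to reflect what I am struggling with. – Thania

+0

No offense intended. I really don't care whether you fake your identity. I couldn't resist teasing. –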
