2015-10-21 28 views
3

JavaScript regular expression matches some strings but fails on other, seemingly identical strings

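I'm using the Facebook API to pull daily crime reports from my county police department's page. They follow a mostly standardized format. As far as I can tell the pattern is as follows, with some annoying inconsistencies: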

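  1. The header is 3-4 lines followed by two newline characters \n\n (the code cuts this off, and it is not part of the output below)
  2. Different types of crime are grouped together; the first line of each group is an uppercase string describing the crime type. Categories are separated by two newline characters \n\n.
  3. The actual crimes follow the category header above, each (most of the time) separated by a single newline character \n
  4. As an "artifact" of wherever they copy and paste from, hyphens are occasionally replaced by various Unicode characters, including \u2013 \u2014 \u2015
  5. Every crime report starts with the string "BEAT", or in rare cases "Beat"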

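The problem I'm running into is that the code below sometimes captures a category header as described in #2 above, but in other posts the (seemingly) exact same string in the exact same situation is not captured. The code I'm using, in an Angular service, is shown below: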

me.parsePosts = function() {
    var posts = facebookService.getRandomPosts(); // Just a method to return 5 random reports for now
    angular.forEach(posts, function(post) {
        // Some reports are incorrectly double spaced and inconsistent
        // with spacing and capitalization
        var fixedPost = post.message
                .replace(/^Beat/, 'BEAT') // They were a little inconsistent back in the day
                .replace('\n\n###', '') // All posts end with a useless ###
                .replace('\u2013', '-') // Pesky unicode characters!
                .replace('\u2014', '-')
                .replace('\u2015', '-')
                .replace('\n\nARRESTED', '\nARRESTED') // would help if this was consistent
                .replace(/(?:\\[rn ]|[\r\n ]+)BEAT/gi, '\nBEAT'), // same with the reports...
            postSplit = fixedPost.split('\n\n'), // split up the post into potential categories
            header = postSplit.splice(0,1); // I don't want the standard header of the post

        // Pass in postSplit .join()'d back together for debugging
        me.getCategoriesFromPost(postSplit, postSplit.join('\n\n'));
    });
};

me.getCategoriesFromPost = function(postArray, post) {
    var categoryRegexp = /[A-Z\-&\/: ]+$/,
        categories = [], uniqCategories = [];

    angular.forEach(postArray, function(a) {
        var split = a.split('\n'), // Extract the category from the list of crimes
            potentialCategory = split[0].trim(); // There's often an unwanted trailing space

        if (potentialCategory.match(categoryRegexp)) {
            categories.push(potentialCategory);
        }
    });

    // Every blue moon they repost a category twice, I just want one
    // and I'll merge the two together afterwards
    uniqCategories = categories.filter(function(a,b) {
        return categories.indexOf(a) == b;
    });

    console.log(uniqCategories); // log off all the categories in the post
    console.log(post); // Display the actual post so i can visibly verify it all worked
};

So, as an example, for one post console.log(post); prints the following (this is the trimmed post; the original raw text as received from facebookService.getRandomPosts() was linked from the question):

BURGLARY COMMERCIAL 
BEAT E1 SPRINT WIRELESS, 7300 ASSATEAGUE DR, 3/19 0426: Unknown suspect(s) gained entry to the business by breaking the glass door. The suspect(s) stole electronics. 14-25638 
BEAT D6 MONTPELIER LIQUORS, 7500 MONTPELIER RD, 3/19 0513: Unknown suspect(s) gained entry to the business by breaking the glass door. The suspect(s) stole liquor, lottery tickets, and an ATM machine. 14-25641 
BEAT D4 MACY’S, 10300 LITTLE PATUXENT PKWY, 3/19 0501: Two unknown male suspects, wearing masks, gained entry to the business by breaking the glass door. The suspects were interrupted by a store employee and fled without taking anything. 14-25642 
SUSPECT VEHICLE: black Dodge pickup 

BURGLARY NON COMMERCIAL 
BEAT B3 6600 ASPERN DR, 3/17 2354: Four suspects gained entry to the residence via unknown means. No sign of forced entry. 14-25220 
ARRESTED: 
Karlin Lamont Harris, 23, of Pirch Way in Elkridge, charged with fourth-degree burglary 
Steven Lee Hubbard, 29, of Edgewater, charged with fourth-degree burglary 
Jessie Tyler Holt, 22, of Pine Tree Rd in Jessup, charged with fourth-degree burglary 
Brittney Victoria McEnaney, 26, of Pasadena, charged with fourth-degree burglary 
BEAT C1 6900 BENDBOUGH CT, 3/18 1400: Unknown suspect(s) gained entry to the residence via the front door. No sign of forced entry. The suspect(s) stole jewelry. 14-25392 
BEAT B4 7100 DEEP FALLS WAY, 3/18 1100-1440: Unknown suspect(s) gained entry to the residence by forcing a rear basement window. The suspect(s) stole jewelry and electronics. 14-25404 

VEHICLE THEFT & ATTEMPTS 
BEAT E2 7-11, 9600 WASHINGTON BLVD, 3/18 0409: 
05 Acura Tag 1AV8629 14-25277 (Keys left in vehicle.) 

And console.log(uniqCategories); returns:

["BURGLARY COMMERCIAL", "BURGLARY NON COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"] 

Yet for another post, console.log(post); prints the following (again the trimmed post rather than the raw text from facebookService.getRandomPosts()):

ROBBERY COMMERCIAL 
BEAT B3 ZIPS DRY CLEANING, 6500 OLD WATERLOO RD, 3/22 1900: An unknown suspect entered the business through an unlocked rear door. The suspect threatened an employee and demanded cash. The employee complied. The suspect fled the business. 14-26959 
SUSPECT: B/M, 5’8-5’9, black hoodie and pants, backpack 

ROBBERY NON COMMERCIAL 
BEAT E7 7-11 PARKING LOT, 9100 MAIER RD, 03/23 1632: Suspect stole cash from an acquaintance and caused an abrasion with an unknown sharp object. Police are investigation the possibility it may be drug related. 14-27243 
SUSPECT: B/M, 5’8, 200 lbs, dreadlocks 

BURGLARY COMMERCIAL 
BEAT E1 MEGATELECOM, 8600 WASHINGTON BLVD #106, 3/22 0933: Unknown suspect(s) gained entry to the business by breaking a window. The suspect(s) stole electronics. 14-26793 
BEAT F3 CATTAIL CREEK COUNTRY CLUB, 3600 CATTAIL CREEK DR, 03/22 1600- 03/23 0630: Unknown suspect(s) gained entry to a garage through an unlocked door. The suspect(s) stole golf carts. 14-27127 

BURGLARY NON COMMERCIAL 
BEAT E2 9300 BREAMORE CT, 03/21 1210 ATTEMPT: Two suspects attempted to gain entry via a rear slider. The resident yelled and the suspects fled, but were later caught by police. 14-26458 
ARRESTED: 
Travis Donte Mackell, 23, of Baltimore, charged with fourth-degree burglary 
Maurice Debuiel Aye, 26, of Baltimore, charged with fourth-degree burglary 
BEAT D3 5500 COLUMBIA RD, 3/21: An unknown suspect gained entry to the residence through an unlocked rear slider. The suspect woke the resident, who ultimately got the suspect to leave. It appears he may have entered the wrong residence. 14-26712 
SUSPECT: B/M, 5’8, 200 lbs 
BEAT B4 7500 HEARTHSIDE WAY, 3/22 1700- 1800: Three unknown black male suspects stole a bicycle, which was unsecured on a bike rack. 14-27185 
BEAT E3 9100 BRYANT AVE, 3/23 2213: Unknown suspects gained entry to the residence by prying open the kitchen window. Nothing appeared to be taken. 14-27308 
BEAT B3 8000 KEETON RD, 3/23 1930- 2230: Unknown suspect(s) gained entry to the residence through an unlocked window. The suspect(s) stole a computer and jewelry. 14-27314 
BEAT A3 9000 FREDERICK RD, 3/23 0205: The suspect kicked in an acquaintance’s door after a verbal altercation and assaulted him. 14-27361 
ARRESTED: Michael Wilson Sittig, 34, of Frederick Road in Ellicott City, charged with second-degree assault, third- and fourth-degree burglary, malicious destruction of property, and disorderly conduct 

VEHICLE THEFT & ATTEMPTS 
BEAT D2 5100 ELIOTS OAK DR, 03/22 2130- 3/23 0700: 
12 Hyundai Sonata Red MD 5AN2945 14-27135 

but console.log(uniqCategories); returns only:

["ROBBERY COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"] 

I would expect it to return ["ROBBERY COMMERCIAL", "ROBBERY NON COMMERCIAL", "BURGLARY COMMERCIAL", "BURGLARY NON COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]

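In this case my code clearly matches BURGLARY COMMERCIAL and BURGLARY NON COMMERCIAL in the first post, but not in the second. What gives? Also, feel free to correct me and tell me that my wall of .replace() calls is all wrong, if there is a better way. Thanks a bunch for the help!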

+0

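It might help to log `post` at the start of the loop (before any modifications are made) and see if there is something else in there... like a tab? Or a backtick (instead of a single quote) that cuts the content off in an odd place. Hard to tell for sure – ochi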

+0

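I've added the original raw text as received from FB to the post, above the trimmed and modified output, via a link to Pastebin, in case that helps narrow this down. For what it's worth, I ran `JSON.stringify()` on these strings and only saw '\n\n' before the categories. – Scott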

+0

Can you provide a JSFiddle so we can run some tests? –

Answers

1

You were missing a couple of separator replacements before your split. Namely, I added:

post.message 
... 
.replace(/\s*\n\s\n/g, '\n\n') 
.replace(/\s BEAT/g, 'BEAT') ... 

See the updated fiddle.

TL;DR (updated based on the comments)

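If you look at the messages after the original replace(...) calls and before the .split('\n\n'), some of them have a space at the end of a line, followed by a newline, then another whitespace character, then a newline.

None of your original replace() calls took care of that. In addition, some messages have only the newline, whitespace, newline pattern (which is why the first whitespace in the regex has a *). Since neither variant contains a literal '\n\n', split('\n\n') does not split there. Then, some of the BEAT keywords in the message are preceded by one or more spaces, so we remove those to make sure BEAT always comes right after a newline.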
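To make the separator problem concrete, here is a minimal sketch. The sample string is a made-up, shortened report (an assumption, not the real Facebook data) in which the second category is preceded by newline, space, newline rather than a clean blank line:

```javascript
// Hypothetical, shortened sample: the separator before the second
// category is "\n \n" (newline, space, newline), not "\n\n".
var sample =
    'ROBBERY COMMERCIAL \n' +
    'BEAT B3 ZIPS DRY CLEANING, 3/22 1900: ... 14-26959 \n' +
    ' \n' +
    'ROBBERY NON COMMERCIAL \n' +
    'BEAT E7 7-11 PARKING LOT, 03/23 1632: ... 14-27243';

// Without normalization there is no literal "\n\n" to split on,
// so both categories stay in a single chunk.
var naive = sample.split('\n\n');

// Collapse "optional whitespace, newline, one whitespace character, newline"
// into a clean blank-line separator, then split.
var normalized = sample.replace(/\s*\n\s\n/g, '\n\n').split('\n\n');

console.log(naive.length);      // 1
console.log(normalized.length); // 2
```

Because only the first line of each chunk is tested as a potential category, a category that ends up in the middle of a chunk is never examined, which would explain the missing entries.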

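If you uncomment the log lines in the fiddle and comment out this fix, you will see the array of elements at each step.

In one of them, you will see an array element that contains not only what we expect (one category with its reports) but also the next category embedded in it (which is why you were seeing fewer categories).

From there, I just looked at what was different about those line endings and checked whether the replace() calls took care of them before the split(...) call...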
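That inspection step can also be automated. The following is only a sketch, and findSuspectChunks is a hypothetical helper name, not something from the fiddle: it flags chunks that still contain a whitespace-only line, which hints at a separator that split('\n\n') did not recognize.

```javascript
// Hypothetical helper: returns the indexes of chunks that still contain
// a line consisting only of whitespace (newline, spaces/tabs, newline).
function findSuspectChunks(chunks) {
    var suspects = [];
    chunks.forEach(function (chunk, index) {
        // [^\S\n] matches any whitespace character except a newline
        if (/\n[^\S\n]+\n/.test(chunk)) {
            suspects.push(index);
            // JSON.stringify makes the hidden whitespace visible
            console.log(index, JSON.stringify(chunk));
        }
    });
    return suspects;
}

// Made-up sample: the first two categories are separated by "\n \n"
var chunks = ('ROBBERY COMMERCIAL \nBEAT B3 ... \n \n' +
              'ROBBERY NON COMMERCIAL \nBEAT E7 ...\n\n' +
              'VEHICLE THEFT & ATTEMPTS\nBEAT D2 ...').split('\n\n');

var suspects = findSuspectChunks(chunks);
console.log(suspects); // [0]
```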

Let me know if you would like me to explain it better.

+0

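The syntax was a bit rough, but I was able to use the regexes and they do work. Before I throw all the points at you, can you explain why the first '.replace()' is necessary and what it accomplishes? – Scott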

+0

@Scott Updated the answer with an explanation – ochi

+0

Thanks for the explanation! – Scott

2

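String.replace with a string pattern replaces only the first occurrence. You need to change all of the String.replace calls to use regular expressions with the g flag so that every occurrence is replaced. Something like this (although I'm not sure how unicode characters work in regular expressions):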

post.message
    .replace(/^Beat/ig, 'BEAT') // They were a little inconsistent back in the day
    .replace(/\n\n###/g, '') // All posts end with a useless ###
    .replace(/\u2013/g, '-') // Pesky unicode characters!
    .replace(/\u2014/g, '-')
    .replace(/\u2015/g, '-')
    .replace(/\n\nARRESTED/g, '\nARRESTED') // would help if this was consistent
    .replace(/(?:\\[rn ]|[\r\n ]+)BEAT/gi, '\nBEAT'), // same with the reports...
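
The first-occurrence behaviour is easy to check in isolation. This is a small sketch on a made-up string (not taken from the real reports); it also shows that wrapping a regex literal in quotes turns it into a plain string pattern that matches nothing here:

```javascript
// Made-up sample with two en dashes (\u2013)
var text = '3/22 1700\u2013 1800, 3/23 1930\u2013 2230';

// String pattern: only the first en dash is replaced
var firstOnly = text.replace('\u2013', '-');

// Regex with the g flag: every en dash is replaced
var all = text.replace(/\u2013/g, '-');

// Quoted "regex": this searches for the literal text "/\u2013/g",
// which does not occur, so the string comes back unchanged
var quoted = text.replace('/\u2013/g', '-');

console.log(firstOnly); // 3/22 1700- 1800, 3/23 1930– 2230
console.log(all);       // 3/22 1700- 1800, 3/23 1930- 2230
console.log(quoted === text); // true
```
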
+0

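From what I've read, Unicode in a regex works the same way as it does in my strings, but in any case, simply replacing everything with '.replace(/whatever/g, 'fix')' doesn't seem to do the trick. See the jsfiddle: http://jsfiddle.net/pv38Lyo2/2/ – Scott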

+0

Except for astral plane characters, which need special handling in ES5, Unicode escapes work in a regex the same way they do in a string. – nhahtdh
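
As a sketch of that point, the three dash replacements could be folded into a single character class, since \u escapes for these (non-astral) characters behave the same inside a regex as inside a string. The sample string is invented for illustration:

```javascript
// Invented sample containing an en dash, an em dash and a horizontal bar
var raw = 'BEAT B4, 3/22 1700\u2013 1800 \u2014 14\u201527185';

// One character class covers all three dash variants
var dashes = /[\u2013\u2014\u2015]/g;
var cleaned = raw.replace(dashes, '-');

console.log(cleaned); // BEAT B4, 3/22 1700- 1800 - 14-27185

// By contrast, an astral character such as U+1F600 is stored as a
// surrogate pair of two code units, which is why ES5 regexes need
// special handling for it
console.log('\uD83D\uDE00'.length); // 2
```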
