2014-05-25

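Skipping certain characters in a CSV file

I am writing a script that parses the NASDAQ file listing every company in the Technology category. The file is a comma-separated CSV. Sometimes, though, a company's name is listed as XXX, Inc. That extra comma throws off the field positions in my script, so it picks up the wrong value. I am trying to pull out each company's stock symbol, so the ', Inc.' shifts everything out of place.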

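I am fairly new to Python, so I don't have much experience. I have been doing my best and have got it reading and writing CSV, but this parsing problem is proving difficult for me. Here is what I have so far: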

try: 
    # py3 
    from urllib.request import Request, urlopen 
    from urllib.parse import urlencode 
except ImportError: 
    # py2 
    from urllib2 import Request, urlopen 
    from urllib import urlencode 

import csv 
import urllib.request 
import string 

def _request(): 
    url = 'http://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&render=download' 
    req = Request(url) 
    resp = urlopen(req) 
    content = resp.read().decode().strip() 
    content1 = content.replace('"', '') 
    return content1 

def symbol_quote(): 
    counter = 1 
    recursive = 9*counter 

    values = _request().split(',') 
    values2 = values[recursive] 
    return values2 
    counter += 1 


def csvwrite(): 
    import csv 
    path = "symbol_comp.csv" 
    data = [symbol_quote()] 
    parsing = False 

    with open(path, 'w', newline='') as csv_file: 
     writer = csv.writer(csv_file, delimiter=' ') 
     for line in data: 
      writer.writerow(line) 

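I haven't made it loop and act on the counter yet, because there is no point right now. The parsing problem is more pressing.

Can anyone please help a newbie out?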

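Whoa, stop. You're using 'csv.writer' to *write* your data, but not 'csv.reader' to *read* your data (it handles commas embedded in a field, as long as the field is enclosed in quotes). – roippi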
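
To illustrate the comment above, here is a minimal sketch comparing a plain split(',') with csv.reader on a single line. The sample line is modeled on the rows shown elsewhere on this page; that the real download wraps every field in double quotes is an assumption, suggested by the replace('"', '') call in the question.

```python
import csv
import io

# Sample line modeled on this page's output; the quoting of the real
# NASDAQ download is an assumption.
line = '"VNET","21Vianet Group, Inc.","25.87","1137471769.46"'

# Approach from the question: strip the quotes, then split on every comma.
# The comma inside the company name produces an extra field.
naive = line.replace('"', '').split(',')

# csv.reader keeps the quotes in play, so the embedded comma stays
# inside the Name field.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

With the naive split the name is cut into '21Vianet Group' and ' Inc.', so every later index is off by one; csv.reader returns the name as a single field.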

Answer

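Change _request() to wrap the response text in a StringIO object (cStringIO.StringIO() on Python 2) and pass it to csv.reader(), returning a csv.reader object that you can iterate over: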

try:
    # py3
    from urllib.request import Request, urlopen
    from io import StringIO
except ImportError:
    # py2
    from urllib2 import Request, urlopen
    from cStringIO import StringIO

import csv

def _request():
    url = 'http://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&render=download'
    req = Request(url)
    resp = urlopen(req)
    # Wrap the downloaded text in a file-like object so csv.reader can parse it
    sio = StringIO(resp.read().decode().strip())
    reader = csv.reader(sio)
    return reader

Usage:

data = _request()
# The first row holds the column names
print('fields:\n{}\n'.format('|'.join(next(data))))
for n, row in enumerate(data):
    print('|'.join(row))
    if n == 5:
        break

# fields: 
# Symbol|Name|LastSale|MarketCap|ADR TSO|IPOyear|Sector|Industry|Summary Quote| 
# 
# VNET|21Vianet Group, Inc.|25.87|1137471769.46|43968758|2011|Technology|Computer Software: Programming, Data Processing|http://www.nasdaq.com/symbol/vnet| 
# TWOU|2U, Inc.|13.28|534023394.4|n/a|2014|Technology|Computer Software: Prepackaged Software|http://www.nasdaq.com/symbol/twou| 
# DDD|3D Systems Corporation|54.4|5630941606.4|n/a|n/a|Technology|Computer Software: Prepackaged Software|http://www.nasdaq.com/symbol/ddd| 
# JOBS|51job, Inc.|64.32|746633699.52|11608111|2004|Technology|Diversified Commercial Services|http://www.nasdaq.com/symbol/jobs| 
# WUBA|58.com Inc.|37.25|2959078388.5|n/a|2013|Technology|Computer Software: Programming, Data Processing|http://www.nasdaq.com/symbol/wuba| 
# ATEN|A10 Networks, Inc.|10.64|638979699.12|n/a|2014|Technology|Computer Communications Equipment|http://www.nasdaq.com/symbol/aten| 
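
Since the question was really about collecting the stock symbols, here is a rough sketch of how the rows from a csv.reader could be reduced to the first column and written back out. The helper names extract_symbols and write_symbols are hypothetical, and the sample text stands in for the real download so the snippet runs without network access; its quoting and trailing commas are assumptions based on the output above.

```python
import csv
import io

def extract_symbols(reader):
    """Return the first column of every data row (hypothetical helper)."""
    next(reader, None)  # skip the header row, if there is one
    return [row[0] for row in reader if row]

def write_symbols(symbols, csv_file):
    """Write one symbol per row to an open file object (hypothetical helper)."""
    writer = csv.writer(csv_file)
    for symbol in symbols:
        # writerow expects a sequence of fields; passing a bare string
        # would write each character as its own field.
        writer.writerow([symbol])

# Stand-in for the downloaded file (assumed layout).
sample = (
    '"Symbol","Name","LastSale",\n'
    '"VNET","21Vianet Group, Inc.","25.87",\n'
    '"TWOU","2U, Inc.","13.28",\n'
    '"DDD","3D Systems Corporation","54.4",\n'
)

symbols = extract_symbols(csv.reader(io.StringIO(sample)))
print(symbols)

out = io.StringIO()
write_symbols(symbols, out)
print(out.getvalue())
```

To write to disk instead, something like open('symbol_comp.csv', 'w', newline='') on Python 3 can be passed in place of the in-memory buffer.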