2013-06-26 43 views

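Problem reading an HTML-based XLS file with C#

I'm trying to read an HTML-based .xls file supplied by a vendor and convert it to CSV so it can be imported into a different process. I've found plenty of solutions that read and convert such files, the most popular being to read it through OLEDB. I had this working last week in VS2010, but then I installed VS2012/.NET 4.5 and the source file is suddenly no longer recognized. Nothing I've tried has restored it. I even installed VS2010 on a different machine and it fails there too (so I don't know how it ever ran on the original machine). If I run the code as is, cnn.Open() throws an exception stating "External table is not in the expected format." If I switch the connection string to the commented-out line, the file is read, but incorrectly (not everything is read and the data is not populated correctly).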

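So, in short: what is the best way (preferably without third-party libraries or applications) to read the file at the bottom of this post using C#?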

Code below:

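// namespaces this snippet relies on
using System;
using System.Data;
using System.Data.OleDb;
using System.IO;

string excelFilePath = @"C:\Users\Dan\test.xls"; 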
string csvOutputFile = @"C:\Users\Dan\output.csv"; 
int worksheetNumber = 1; 
// connection string 
var cnnStr = String.Format("Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"Excel 12.0;IMEX=1;HDR=NO\"", excelFilePath); 
//var cnnStr = String.Format("Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"HTML Import;IMEX=1;HDR=NO\"", excelFilePath); 

var cnn = new OleDbConnection(cnnStr); 
// get schema, then data 
var dt = new DataTable(); 
try 
{ 
    cnn.Open(); 
    var schemaTable = cnn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null); 
    if (schemaTable.Rows.Count < worksheetNumber) throw new ArgumentException("The worksheet number provided cannot be found in the spreadsheet"); 
    string worksheet = schemaTable.Rows[worksheetNumber - 1]["table_name"].ToString().Replace("'", ""); 
    string sql = String.Format("select * from [{0}]", worksheet); 
    var da = new OleDbDataAdapter(sql, cnn); 
    da.Fill(dt); 
} 
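catch (Exception e)
{
    // do not swallow the error silently; report it, then let it propagate
    Console.Error.WriteLine(e);
    throw;
}
finally
{
    cnn.Close();
}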

// write out CSV data 
using (var wtr = new StreamWriter(csvOutputFile))
{
    foreach (DataRow row in dt.Rows)
    {
        bool firstLine = true;
        foreach (DataColumn col in dt.Columns)
        {
            // separate fields with a comma, except before the first one
            if (!firstLine) { wtr.Write(","); } else { firstLine = false; }
            // double any embedded quotes, then wrap the field in quotes
            var data = row[col.ColumnName].ToString().Replace("\"", "\"\"");
            wtr.Write(String.Format("\"{0}\"", data));
        }
        wtr.WriteLine();
    }
}

Below is the file I'm reading from, which was sent to us with an .xls extension.

<html xmlns:v="urn:schemas-microsoft-com:vml" 
xmlns:o="urn:schemas-microsoft-com:office:office" 
xmlns:x="urn:schemas-microsoft-com:office:excel" 
xmlns="http://www.w3.org/TR/REC-html40"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> 
<meta name="ProgId" content="Excel.Sheet"/> 
<meta name="Generator" content="Microsoft Excel 10"/> 
<!--[if !mso]> 
<style> 
v\\:* {behavior:url(#default#VML);}"); 
o\\:* {behavior:url(#default#VML);}"); 
x\\:* {behavior:url(#default#VML);}"); 
.shape {behavior:url(#default#VML);}"); 
</style>"); 
<![endif]--> 
<!--[if gte mso 9]><xml> 
<x:ExcelWorkbook> 
<x:ExcelWorksheets> 
<x:ExcelWorksheet> 
<x:Name>report</w:Name> 
<x:WorksheetOptions> 
<x:ProtectContents>False</w:ProtectContents> 
<x:ProtectObjects>False</w:ProtectObjects> 
<x:ProtectScenarios>False</w:ProtectScenarios> 
</w:WorksheetOptions> 
</w:ExcelWorksheet> 
</w:ExcelWorksheets> 
<x:ProtectStructure>False</w:ProtectStructure> 
<x:ProtectWindows>False</w:ProtectWindows> 
</w:ExcelWorkbook>"); 
</xml><![endif]--> 
<head> 

<style> 
br {mso-data-placement:same-cell;} 
</style> 
</head> 
<body> 

<style> 
table { 
mso-displayed-decimal-separator:"\."; 
mso-displayed-thousand-separator:"\,"; 
} 
</style> 
<table width="100%"> 
<tr> 
<td align=center colspan=6 valign=top> 
<span class="pageHead"> 
<nobr><h1>Status</h1></nobr></span> 
</td> 
</tr> 
<tr> 
<td align=center colspan=6 valign=top> 
<span class="pageHead"><nobr> 
Generated by User 
</nobr></span> 
</td></tr> 
<tr> 
<td>&nbsp;</td> 
</tr> 
<tr> 
<td>&nbsp;</td> 
</tr> 
</table> 
<table border="1" cellspacing="0" cellpadding="0" width="100%"> 
<tr> 
<th>Owner</th> 
<th>Project Id</th> 
<th>Event Id</th> 
<th>Event Title</th> 
<th>Event Status</th> 
<th>EventSummary</th> 
</tr> 
<tr> 
<td>User</td> 
<td>1</td> 
<td>test1</td> 
<td>event1</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>2</td> 
<td>test2</td> 
<td>event2</td> 
<td>Pending Selection</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>3</td> 
<td>test3</td> 
<td>event3</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>4</td> 
<td>test4</td> 
<td>event4</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>5</td> 
<td>test5</td> 
<td>event5</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>6</td> 
<td>test6</td> 
<td>event6</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>7</td> 
<td>test7</td> 
<td>event7</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>8</td> 
<td>test8</td> 
<td>event8</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>9</td> 
<td>test9</td> 
<td>event9</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>10</td> 
<td>test10</td> 
<td>event10</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>11</td> 
<td>test11</td> 
<td>event11</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>12</td> 
<td>test12</td> 
<td>event12</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>13</td> 
<td>test13</td> 
<td>event13</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>14</td> 
<td>test14</td> 
<td>event14</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
<tr> 
<td>User</td> 
<td>15</td> 
<td>test15</td> 
<td>event15</td> 
<td>Completed</td> 
<td>1</td> 
</tr> 
</table> 

</body></html> 
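
Because the file above is HTML despite its .xls extension, one option is to inspect the start of the file and choose the Extended Properties value accordingly. The following is a minimal sketch, not a confirmed fix: the class and method names (`XlsSniffer`, `LooksLikeHtml`, `ReadHead`, `BuildConnectionString`) are hypothetical, the two format values are the ones already used in the connection strings in this question, and the check only looks for a few common leading tags.

```csharp
using System;
using System.IO;

static class XlsSniffer
{
    // Hypothetical helper: true when the leading text looks like HTML markup.
    public static bool LooksLikeHtml(string head)
    {
        string trimmed = head.TrimStart('\uFEFF', ' ', '\t', '\r', '\n');
        return trimmed.StartsWith("<html", StringComparison.OrdinalIgnoreCase)
            || trimmed.StartsWith("<!DOCTYPE html", StringComparison.OrdinalIgnoreCase)
            || trimmed.StartsWith("<table", StringComparison.OrdinalIgnoreCase);
    }

    // Reads at most maxChars characters from the start of the file.
    public static string ReadHead(string path, int maxChars = 512)
    {
        using (var reader = new StreamReader(path))
        {
            var buffer = new char[maxChars];
            int read = reader.ReadBlock(buffer, 0, maxChars);
            return new string(buffer, 0, read);
        }
    }

    // Assumption: "HTML Import" suits HTML content and "Excel 12.0" suits
    // real workbooks, as in the two connection strings from the question.
    public static string BuildConnectionString(string path)
    {
        string format = LooksLikeHtml(ReadHead(path)) ? "HTML Import" : "Excel 12.0";
        return String.Format(
            "Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"{1};IMEX=1;HDR=NO\"",
            path, format);
    }
}
```

Calling `XlsSniffer.BuildConnectionString(excelFilePath)` would then replace the hard-coded `cnnStr`. Whether the provider reads every table correctly still has to be checked against the actual file.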

Answer


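Aha! After looking at the raw data and some other examples, I realized that the sheet contains two separate tables and that the OLEDB driver interprets them as two separate worksheets. I changed the worksheet variable to 2 and retrieved the second "table", which holds the data I'm actually interested in. So by iterating over all of the "worksheets", I should be able to get all of the data out of this file.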
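
Iterating over every "worksheet" the driver reports could look roughly like the sketch below. It assumes .NET Framework on Windows with the ACE OLEDB provider installed, and it takes the connection string as a parameter rather than assuming which one works. `HtmlXlsReader`, `GetTableNames` and `ReadAllTables` are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.OleDb;

static class HtmlXlsReader
{
    // Extracts table names from a schema table shaped like the result of
    // GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null), stripping quotes
    // the same way the question's code does.
    public static List<string> GetTableNames(DataTable schemaTable)
    {
        var names = new List<string>();
        foreach (DataRow row in schemaTable.Rows)
        {
            names.Add(row["table_name"].ToString().Replace("'", ""));
        }
        return names;
    }

    // Fills one DataTable per table ("worksheet") reported by the provider.
    public static List<DataTable> ReadAllTables(string connectionString)
    {
        var tables = new List<DataTable>();
        using (var cnn = new OleDbConnection(connectionString))
        {
            cnn.Open();
            DataTable schemaTable = cnn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);
            foreach (string name in GetTableNames(schemaTable))
            {
                string sql = String.Format("select * from [{0}]", name);
                using (var da = new OleDbDataAdapter(sql, cnn))
                {
                    var dt = new DataTable(name);
                    da.Fill(dt);
                    tables.Add(dt);
                }
            }
        }
        return tables;
    }
}
```

Each returned `DataTable` can then go through the same CSV-writing loop as in the question, either into one file or into one file per table. For the sample file, the first table would presumably hold the "Status" header block and the second the event rows.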
