Wednesday, February 6, 2008

Extract Tables from an HTML Page into a DataSet Using Regular Expressions

Sometimes we need to extract information from HTML pages, for example a table.
Here is how to do it with regular expressions. The code is written in C#:

// Requires: using System.Data; using System.Text.RegularExpressions;
private static DataSet ConvertHTMLTablesToDataSet(string HTML)
{
    DataSet ds = new DataSet();

    // \b stops <th from also matching <thead, and <tr from matching <track.
    string TableExpression = "<table\\b[^>]*>(.*?)</table>";
    string HeaderExpression = "<th\\b[^>]*>(.*?)</th>";
    string RowExpression = "<tr\\b[^>]*>(.*?)</tr>";
    string ColumnExpression = "<td\\b[^>]*>(.*?)</td>";

    // Singleline lets "." match newlines, so a cell may span several lines.
    // Multiline only changes ^ and $, which these patterns do not use.
    RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    MatchCollection Tables = Regex.Matches(HTML, TableExpression, Options);

    foreach (Match Table in Tables)
    {
        DataTable dt = new DataTable();
        int iCurrentRow = 0;

        MatchCollection Headers = Regex.Matches(Table.Value, HeaderExpression, Options);
        bool HeadersExist = Headers.Count > 0;

        if (HeadersExist)
        {
            // One column per <th> cell.
            foreach (Match Header in Headers)
            {
                dt.Columns.Add(Header.Groups[1].ToString());
            }
        }
        else
        {
            // No headers: create one generic column per <td> in the first row.
            // If the table has no rows, FirstRow.Value is empty and no columns are added.
            Match FirstRow = Regex.Match(Table.Value, RowExpression, Options);
            int ColumnCount = Regex.Matches(FirstRow.Value, ColumnExpression, Options).Count;

            for (int iColumns = 1; iColumns <= ColumnCount; iColumns++)
            {
                dt.Columns.Add("Column " + iColumns);
            }
        }

        MatchCollection Rows = Regex.Matches(Table.Value, RowExpression, Options);

        foreach (Match Row in Rows)
        {
            // When headers exist, the first row is assumed to be the header row and is skipped.
            if (!(iCurrentRow == 0 && HeadersExist))
            {
                DataRow dr = dt.NewRow();
                int iCurrentColumn = 0;

                MatchCollection Columns = Regex.Matches(Row.Value, ColumnExpression, Options);

                foreach (Match Column in Columns)
                {
                    // Ignore cells beyond the known column count instead of throwing.
                    if (iCurrentColumn >= dt.Columns.Count) break;
                    dr[iCurrentColumn] = Column.Groups[1].ToString();
                    iCurrentColumn++;
                }

                dt.Rows.Add(dr);
            }
            iCurrentRow++;
        }
        ds.Tables.Add(dt);
    }

    return ds;
}

I found this code through Google and converted it to C#.
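
To see why the Regex.Matches calls above pass RegexOptions.Singleline and RegexOptions.IgnoreCase, here is a minimal sketch of cell extraction on its own. The class name CellPatternDemo and the method ExtractCells are hypothetical names made up for this illustration, and the pattern is modelled on the post's ColumnExpression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CellPatternDemo
{
    // Cell pattern modelled on ColumnExpression from the post.
    private const string ColumnExpression = "<td[^>]*>(.*?)</td>";

    // Returns the inner HTML of every <td> cell found in rowHtml,
    // using whatever regex options the caller passes in.
    public static string[] ExtractCells(string rowHtml, RegexOptions options)
    {
        MatchCollection cells = Regex.Matches(rowHtml, ColumnExpression, options);
        string[] result = new string[cells.Count];
        for (int i = 0; i < cells.Count; i++)
        {
            // Groups[1] is the lazy (.*?) capture between the tags.
            result[i] = cells[i].Groups[1].Value;
        }
        return result;
    }
}
```

For the row "<tr><TD>a</TD><td class=\"x\">b\nc</td></tr>", passing RegexOptions.Singleline | RegexOptions.IgnoreCase yields two cells. Passing only RegexOptions.IgnoreCase yields one, because without Singleline the dot cannot cross the line break inside the second cell.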

16 comments:

kickaha said...

hi,

I'm assuming you found this code in VB.NET and converted it. If that's the case, could you please post the original code?
many thanks!

Mahmoud Alam said...

OK, this is what you need:

http://cid-38f3ba7902b36885.skydrive.live.com/self.aspx/class1%20html%20tables%20to%20ds%20by%20regex

kickaha said...

many thanks!!!

Mahmoud Alam said...

You're welcome, any time.

Anonymous said...

Good one, but I found that it's not working for nested tables.

For example, if you have table_2 in table_1, the regular expression returned only table_2.
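
The cause is the lazy pattern "<table[^>]*>(.*?)</table>": it stops at the first closing tag it meets, so an outer table is cut off at the end of the inner one. One possible way around this, as a minimal sketch and not a tested fix for the function in the post, is to scan the opening and closing table tags with a stack so that each closing tag is paired with the nearest unclosed opening tag. NestedTableScanner and FindTables are hypothetical names. The sketch assumes no ">" inside attribute values and no table tags inside comments or scripts.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedTableScanner
{
    // Matches an opening <table ...> tag or a closing </table> tag.
    // Group 1 holds "/" for a closing tag and is empty for an opening tag.
    private static readonly Regex TableTag = new Regex(
        "<(/?)table\\b[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns the full HTML of every table, nested ones included,
    // in the order in which their closing tags appear.
    public static List<string> FindTables(string html)
    {
        List<string> tables = new List<string>();
        Stack<int> openPositions = new Stack<int>();

        foreach (Match tag in TableTag.Matches(html))
        {
            bool isClosing = tag.Groups[1].Length > 0;
            if (!isClosing)
            {
                // Remember where this table starts.
                openPositions.Push(tag.Index);
            }
            else if (openPositions.Count > 0)
            {
                // Pair the closing tag with the most recent unclosed opener.
                int start = openPositions.Pop();
                tables.Add(html.Substring(start, tag.Index + tag.Length - start));
            }
            // A closing tag without a matching opener is ignored.
        }

        return tables;
    }
}
```

Each returned string is a complete table, so an inner table appears both on its own and inside its parent. Before extracting rows from a parent, the inner spans would still have to be cut out, otherwise the inner rows are counted as rows of the parent as well.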

Mahmoud Alam said...

ummmmmmmmmmm

Good catch. I will work on solving this problem soon.

Thanks for your comment .....

Anonymous said...

awesome man, thanks

Anonymous said...

Hey, I'm kind of interested in the problem with nested tables. Is there any news of that?

Anonymous said...

Your function works very well. If possible, please email me once your new C# code for handling nested tables is done. Thank you so much!!!

my email address schizoidia@yahoo.com.hk

Anonymous said...

Good job! But like others, I ran into the problem with nested tables. Is there any way to solve it? I'd be very grateful if you posted a solution :)

amackay99@gmail.com said...

Works great for me (no nested tables). Thanks!

Richard Brown said...

Great post, thanks. It really helped me out. :)

Anonymous said...

Man you are a life saver

Matt Ray said...

Thanks for the code, this fit the bill perfectly. I was able to adjust the regex for some elements to populate table names, which makes the data easier to consume on the flip side.

djmaca said...

Really good piece, man. Thanks.

sunny said...

Do you have this code in VB.NET? If so, can you post it?