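I have read several threads on Stack Overflow about getting all the words between HTML tags, and they left me confused. Some people suggest a regular expression written for one specific tag, while others recommend using a parser.

I am trying to build a web crawler. So far my program downloads the HTML of a link into a string, and I extract the links from that HTML, which is stored in my data string. Now I want to crawl the extracted links to a given depth and extract the words on each page.

I have two questions. First, how can I get the words of each web page while ignoring the tags and JavaScript? Second, how can I crawl through the links recursively?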
This is how I get the HTML into a string:
// Needs: using System; using System.IO; using System.Net; using System.Text;
// 'data' is a string field of the enclosing class; it holds the downloaded HTML.
public void getting_html_code_of_link()
{
    string urlAddress = "http://google.com";
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
    // 'using' closes the response and the reader even if reading throws.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        if (response.StatusCode == HttpStatusCode.OK)
        {
            // Use the charset reported by the server when there is one, otherwise UTF-8.
            Encoding encoding = string.IsNullOrEmpty(response.CharacterSet)
                ? Encoding.UTF8
                : Encoding.GetEncoding(response.CharacterSet);
            using (Stream receiveStream = response.GetResponseStream())
            using (StreamReader readStream = new StreamReader(receiveStream, encoding))
            {
                data = readStream.ReadToEnd();
            }
            Console.WriteLine(data);
        }
    }
}
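For the first question, one possible approach is sketched below, assuming the HTML is already in a string like data above. WordExtractor and ExtractWords are made-up names for illustration, not part of any library. The sketch relies on regular expressions only, which can mishandle malformed HTML, so a real HTML parser may turn out to be more reliable.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // Hypothetical helper: returns the visible words of an HTML string.
    public static List<string> ExtractWords(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Drop HTML comments, then every remaining tag.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);
        text = Regex.Replace(text, @"<[^>]+>", " ");
        // Turn entities such as &amp; back into plain characters.
        text = WebUtility.HtmlDecode(text);

        // Collect runs of letters and digits as words.
        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, @"[\p{L}\p{N}]+"))
        {
            words.Add(m.Value);
        }
        return words;
    }
}
```

Calling WordExtractor.ExtractWords("<p>Hello <b>world</b></p><script>var x = 1;</script>") returns the two words Hello and world; the script body is discarded before the tags are stripped.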
This is how I extract the URLs from the link references in that HTML:
// Needs: using System; using System.Text; using System.Text.RegularExpressions;
public void regex_ka_kaam()
{
    StringBuilder sb = new StringBuilder();
    //Regex hrefs = new Regex("<a href.*?>");
    // Stop at whitespace, quotes or angle brackets so the closing "> of the tag is not captured.
    Regex http = new Regex("https?://[^\\s\"'<>]+");
    foreach (Match m in http.Matches(data))
    {
        // Each match is already a URL, so append it once, followed by a space.
        sb.Append(m.Value);
        sb.Append(" ");
        //sb.Append("<br>");
    }
    Console.WriteLine(sb);
}
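For the second question, a depth-limited recursive crawl could look roughly like the sketch below. Crawler, Crawl and Visit are hypothetical names, and the page download is passed in as a function (fetch) so that code like getting_html_code_of_link above could be adapted to supply it. The href pattern only picks up absolute http and https links, which is an assumption made to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Crawler
{
    // Matches absolute http/https URLs inside href attributes.
    private static readonly Regex Href = new Regex(
        @"href\s*=\s*[""'](https?://[^""'\s>]+)", RegexOptions.IgnoreCase);

    // Hypothetical helper: visits startUrl and follows links up to maxDepth levels.
    // 'fetch' returns the HTML for a URL, or null when the page cannot be loaded.
    public static List<string> Crawl(string startUrl, int maxDepth, Func<string, string> fetch)
    {
        var visited = new HashSet<string>();
        var order = new List<string>();
        Visit(startUrl, maxDepth, fetch, visited, order);
        return order;
    }

    private static void Visit(string url, int depthLeft, Func<string, string> fetch,
                              HashSet<string> visited, List<string> order)
    {
        // Skip pages already seen; this also prevents endless loops between pages.
        if (!visited.Add(url)) return;

        string html = fetch(url);
        if (html == null) return;
        order.Add(url);
        // Word extraction for this page would go here.

        // Stop following links once the depth budget is used up.
        if (depthLeft == 0) return;
        foreach (Match m in Href.Matches(html))
        {
            Visit(m.Groups[1].Value, depthLeft - 1, fetch, visited, order);
        }
    }
}
```

The visited set is what keeps the recursion from looping when two pages link to each other. Because the walk is depth-first, a page first reached at the depth limit is not expanded again if it is later found at a shallower level; a queue-based, breadth-first loop would be one way to avoid that.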