How can I write a regular expression to replace links with no link text like this:
<a href="http://www.somesite.com"></a>
with
<a href="http://www.somesite.com">http://www.somesite.com</a>
?
This is what I was trying to do to capture the matches, and it isn't catching any. What am I doing wrong?
string pattern = "<a\\s+href\\s*=\\s*\"(?<href>.*)\">\\s*</a>";
-
I could be wrong, but I think you simply need to change the quantifier within the href group to be lazy rather than greedy.
string pattern = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";
(I've also changed the type of the string literal to use @, for better readability.)
The rest of the regex appears fine to me. That you're not capturing any matches at all makes me think otherwise, but there could be a problem in the rest of the code (or even the input data - have you verified that?).
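To see the difference between the two quantifiers, here is a minimal sketch that counts matches for the greedy pattern from the question and the lazy one from this answer. The class name, the CountMatches helper, and the example.com-style URLs are made up for illustration; only the two patterns come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyLinkDemo
{
    // Hypothetical helper: returns how many matches a pattern finds in the input.
    public static int CountMatches(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // Two empty links on one line (made-up sample input).
        string input = "<a href=\"http://a.example\"></a> <a href=\"http://b.example\"></a>";

        // Greedy: .* runs to the last "> on the line, so both links collapse into one match.
        string greedy = @"<a\s+href\s*=\s*""(?<href>.*)"">\s*</a>";
        // Lazy: .*? stops at the first "> that is followed by </a>.
        string lazy = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";

        Console.WriteLine(CountMatches(greedy, input)); // 1
        Console.WriteLine(CountMatches(lazy, input));   // 2
    }
}
```

Note that even the lazy version can run past a closing quote when a non-empty link precedes an empty one, so a negated class such as [^""]* in place of .*? may be the safer choice.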
-
I would suggest
string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";
This way, links whose href attribute is not the first attribute in the tag are captured as well.
Replace with
"$1$2$3"
The usual word of warning: HTML and regex are essentially incompatible. Use with caution, this might blow up.
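Put together with Regex.Replace, the suggestion might look like the sketch below. The pattern and replacement string are the ones given in this answer; the class name, the Fill method, and the sample markup are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FillEmptyLinks
{
    // Pattern from the answer: group 1 = opening tag, group 2 = href value, group 3 = closing tag.
    private const string Pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";

    // Hypothetical helper: inserts the href value between the opening and closing tags of empty links.
    public static string Fill(string html)
    {
        return Regex.Replace(html, Pattern, "$1$2$3");
    }

    public static void Main()
    {
        // Made-up sample where href is not the first attribute.
        string html = "<a class=\"ext\" href=\"http://www.somesite.com\"></a>";
        Console.WriteLine(Fill(html));
        // <a class="ext" href="http://www.somesite.com">http://www.somesite.com</a>
    }
}
```

Links that already contain text are left alone, because only whitespace is allowed between the opening tag and </a>.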
-
I wouldn't use a regex - I'd use the Html Agility Pack, and a query like:
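// doc is an already-loaded HtmlAgilityPack.HtmlDocument; SelectNodes returns null if nothing matches
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[.='']"))
{
    // write the href value into the empty link body
    link.InnerHtml = link.GetAttributeValue("href", "");
}
womp : +1 for my daily dose of learning something new.
Tomalak : +1 for avoiding regex shallows.
-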
Marc Gravell has the right answer: regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.