Friday, May 6, 2011

How can I write a regular expression to capture links with no link text?

How can I write a regular expression to replace links with no link text like this:

<a href="http://www.somesite.com"></a>

with

<a href="http://www.somesite.com">http://www.somesite.com</a>

?

This is what I was trying to do to capture the matches, and it isn't catching any. What am I doing wrong?

string pattern = "<a\\s+href\\s*=\\s*\"(?<href>.*)\">\\s*</a>";
From stackoverflow
  • I could be wrong, but I think you simply need to change the quantifier within the href group to be lazy rather than greedy.

    string pattern = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";
    

    (I've also switched to a verbatim string literal, prefixed with @, for better readability.)

    The rest of the regex appears fine to me. That you're not capturing any matches at all makes me think otherwise, but there could be a problem in the rest of the code (or even the input data - have you verified that?).

  • I would suggest

    string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";
    

    This way, links that carry other attributes before or after href are also captured.

    Replace with

    "$1$2$3"
    

    The usual word of warning: HTML and regex are essentially incompatible. Use with caution, this might blow up.

  • I wouldn't use a regex - I'd use the Html Agility Pack, and a query like:

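    // Assumes doc is a loaded HtmlAgilityPack.HtmlDocument and at least one empty link exists
    // (SelectNodes returns null when nothing matches).
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[.='']")) {
        // InnerText is read-only in Html Agility Pack, so set InnerHtml instead.
        link.InnerHtml = link.GetAttributeValue("href", "");
    }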
    
    womp : +1 for my daily dose of learning something new.
    Tomalak : +1 for avoiding regex shallows.
  • Marc Gravell has the right answer: regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
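
The regex suggestions in the answers above can be compared side by side. What follows is only a minimal sketch and not part of any answer: the class name EmptyLinkPatterns, its members, and the sample inputs are hypothetical. It wraps the question's greedy pattern, the lazy variant from the first answer, and the attribute-tolerant pattern from the second answer around Regex.Replace.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; none of these names come from the thread.
public static class EmptyLinkPatterns
{
    // Greedy pattern from the question: .* can run across several links on one line.
    public const string Greedy = @"<a\s+href\s*=\s*""(?<href>.*)"">\s*</a>";

    // Lazy variant from the first answer: .*? expands only as far as it has to.
    public const string Lazy = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";

    // Pattern from the second answer: the negated class cannot cross the closing
    // quote, and [^>]* allows other attributes before or after href.
    public const string Tolerant = @"(<a\b[^>]*href=""([^""]+)""[^>]*>)[\s\r\n]*(</a>)";

    // Rewrites each match of a pattern with a named "href" group so that the
    // captured URL also becomes the link text.
    public static string FillNamed(string html, string pattern)
    {
        return Regex.Replace(html, pattern, "<a href=\"${href}\">${href}</a>");
    }

    // Applies the second answer's replacement: opening tag, href value, closing tag.
    public static string FillTolerant(string html)
    {
        return Regex.Replace(html, Tolerant, "$1$2$3");
    }
}
```

With two empty links on one line, Regex.Matches finds a single match for Greedy (spanning both links) and two matches for Lazy. The lazy pattern can still over-match when a link that has text precedes an empty one on the same line, because .*? keeps expanding until it reaches a "> that is followed by </a>. The negated class in Tolerant avoids that, though the caveats about regex and HTML raised above still apply.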
