How to use RegEx to match open tags except XHTML self-contained tags

To match attribute-less opening HTML tags such as <div> while excluding self-closing XHTML tags using Regular Expressions (RegEx), you can use the following pattern:

RegEx Pattern

<([a-zA-Z][a-zA-Z0-9]*)(?![^>]*\/>)>

Explanation of the Pattern

  1. <: Matches the opening < of an HTML tag.
  2. ([a-zA-Z][a-zA-Z0-9]*): Captures the tag name, which starts with a letter and can contain letters and numbers.
    • [a-zA-Z]: Ensures the tag starts with a letter.
    • [a-zA-Z0-9]*: Allows for additional alphanumeric characters.
  3. (?![^>]*\/>): Negative lookahead that rejects the tag if /> appears before the next > (that is, if the tag is self-closing).
    • [^>]*: Matches any characters except > (ensures we’re still inside the tag).
    • \/>: Looks for the self-closing />.
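  4. >: Matches the closing > of the opening tag. Because it must come directly after the tag name, only tags with no attributes or whitespace, such as <div>, can match.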
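
The breakdown above can be written directly into the pattern. The following is a minimal sketch in Python using re.VERBOSE, which allows whitespace and comments inside the pattern. The names OPEN_TAG, sample, and tag_names are illustrative only and do not come from the original answer.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each part carries a comment.
OPEN_TAG = re.compile(
    r"""
    <                         # opening angle bracket
    ([a-zA-Z][a-zA-Z0-9]*)    # group 1: tag name, a letter then letters or digits
    (?![^>]*/>)               # negative lookahead: reject if "/>" ends this tag
    >                         # closing angle bracket, directly after the name
    """,
    re.VERBOSE,
)

sample = '<div><br /><span>Text</span></div>'

# Collect the captured tag name from every match.
tag_names = [m.group(1) for m in OPEN_TAG.finditer(sample)]
print(tag_names)  # ['div', 'span']
```

The forward slash needs no escaping in Python, so \/ is written here as /.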

Example HTML Snippet

<div>
<img src="image.jpg" />
<input type="text" />
<span>Text</span>
<br />
<p>Paragraph</p>
</div>

Matches

Using the RegEx <([a-zA-Z][a-zA-Z0-9]*)(?![^>]*\/>)>:

  1. Matches:
    • <div>
    • <span>
    • <p>
  2. Excludes:
    • <img src="image.jpg" />
    • <input type="text" />
    • <br />

Code Example

JavaScript Example

const html = `
<div>
<img src="image.jpg" />
<input type="text" />
<span>Text</span>
<br />
<p>Paragraph</p>
</div>
`;

// The g flag makes match() return every full match instead of only the first one.
const regex = /<([a-zA-Z][a-zA-Z0-9]*)(?![^>]*\/>)>/g;
const matches = html.match(regex);

console.log(matches);
// Output: [ '<div>', '<span>', '<p>' ]

Python Example

import re

html = '''
<div>
<img src="image.jpg" />
<input type="text" />
<span>Text</span>
<br />
<p>Paragraph</p>
</div>
'''

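# The pattern has one capturing group, so re.findall returns the captured
# tag names rather than the full matched tags.
regex = r'<([a-zA-Z][a-zA-Z0-9]*)(?![^>]*\/>)>'
matches = re.findall(regex, html)

print(matches)
# Output: ['div', 'span', 'p']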
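
The Python output lists tag names while the JavaScript output lists whole tags. A minimal sketch of how to get both from one pattern, using re.finditer and match groups, follows. The variable names are illustrative.

```python
import re

html = '<div><br /><span>Text</span><p>Paragraph</p></div>'
pattern = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)(?![^>]*/>)>')

# group(0) is the whole matched tag; group(1) is the captured tag name.
full_tags = [m.group(0) for m in pattern.finditer(html)]
tag_names = [m.group(1) for m in pattern.finditer(html)]

print(full_tags)  # ['<div>', '<span>', '<p>']
print(tag_names)  # ['div', 'span', 'p']
```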

Notes

  • This pattern does not account for malformed HTML, and it also matches tag-like text inside comments or scripts.
  • It assumes tags are properly closed or self-closed.
  • It matches only tags without attributes: <tag> matches, but <tag attr="value"> does not, because > must directly follow the tag name. To match tags with attributes as well, add [^>]* before the final >.
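
The pattern as written requires > directly after the tag name, so a tag such as <div class="box"> is not matched. One possible variant, sketched below, adds [^>]* after the lookahead so that attributes are consumed. It assumes no attribute value contains a literal > character, and the name OPEN_TAG_WITH_ATTRS is hypothetical.

```python
import re

# Variant of the pattern that also accepts attributes.
# Assumption: no attribute value contains a literal ">".
OPEN_TAG_WITH_ATTRS = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)(?![^>]*/>)[^>]*>')

html = (
    '<div class="box">'
    '<img src="image.jpg" />'
    '<a href="page.html">Link</a>'
    '<br />'
    '</div>'
)

full_tags = [m.group(0) for m in OPEN_TAG_WITH_ATTRS.finditer(html)]
tag_names = [m.group(1) for m in OPEN_TAG_WITH_ATTRS.finditer(html)]

print(full_tags)  # ['<div class="box">', '<a href="page.html">']
print(tag_names)  # ['div', 'a']
```

Self-closing tags are still rejected by the lookahead, and closing tags such as </a> never match because the character after < must be a letter.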
