How to match open tags except XHTML self-closing tags with RegEx
To match open HTML tags but exclude self-closing XHTML tags using Regular Expressions (RegEx), you can use the following pattern:
RegEx Pattern
<([a-zA-Z][a-zA-Z0-9]*)(?![^>]*\/>)>
Explanation of the Pattern
- <: Matches the opening < of an HTML tag.
- ([a-zA-Z][a-zA-Z0-9]*): Captures the tag name, which starts with a letter and can contain letters and digits.
  - [a-zA-Z]: Ensures the tag name starts with a letter.
  - [a-zA-Z0-9]*: Allows additional alphanumeric characters.
- (?![^>]*\/>): Negative lookahead that rejects tags containing / before the closing > (self-closing tags).
  - [^>]*: Matches any characters except > (so the check stays inside the tag).
  - \/>: Looks for the self-closing />.
- >: Matches the closing > of the opening tag. Because it must come directly after the tag name, only tags without attributes or whitespace are matched.
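The breakdown above can also be written out with Python's re.VERBOSE flag, so that each piece of the pattern carries its own comment. This is a minimal sketch for illustration only; the name OPEN_TAG and the short test string are assumptions, not part of the original pattern.

```python
import re

# The same pattern, spread over several lines with re.VERBOSE.
# OPEN_TAG is an illustrative name chosen for this sketch.
OPEN_TAG = re.compile(
    r"""
    <                         # opening angle bracket
    ([a-zA-Z][a-zA-Z0-9]*)    # group 1: tag name (a letter, then letters or digits)
    (?![^>]*\/>)              # negative lookahead: reject a tag that ends in />
    >                         # closing angle bracket, directly after the name
    """,
    re.VERBOSE,
)

result = OPEN_TAG.findall("<div><br/><p>text</p></div>")
print(result)
# Output: ['div', 'p']
```

In verbose mode whitespace inside the pattern is ignored, so the layout does not change what is matched.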
Example HTML Snippet
<div>
<img src="image.jpg" />
<input type="text" />
<span>Text</span>
<br />
<p>Paragraph</p>
</div>
Matches
Using the RegEx <([a-zA-Z][a-zA-Z0-9]*)(?![^>]*\/>)>:
- Matches:
<div>, <span>, <p>
- Excludes:
<img src="image.jpg" />, <input type="text" />, <br />
Code Example
JavaScript Example
const html = `
<div>
<img src="image.jpg" />
<input type="text" />
<span>Text</span>
<br />
<p>Paragraph</p>
</div>
`;
// The g flag makes match() return every full match instead of only the first one.
const regex = /<([a-zA-Z][a-zA-Z0-9]*)(?![^>]*\/>)>/g;
const matches = html.match(regex);
console.log(matches);
// Output: [ '<div>', '<span>', '<p>' ]
Python Example
import re
html = '''
<div>
<img src="image.jpg" />
<input type="text" />
<span>Text</span>
<br />
<p>Paragraph</p>
</div>
'''
regex = r'<([a-zA-Z][a-zA-Z0-9]*)(?![^>]*\/>)>'
# findall() returns the capture group (the tag name) for each match.
matches = re.findall(regex, html)
print(matches)
# Output: ['div', 'span', 'p']
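To get the full tag alongside its name in Python, as the JavaScript example prints, one option is re.finditer, which yields match objects instead of bare groups. The following is a small sketch using a shortened, single-line version of the sample HTML; the variable name pairs is an assumption for illustration.

```python
import re

html = '<div><img src="image.jpg" /><span>Text</span><br /><p>Paragraph</p></div>'
regex = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)(?![^>]*\/>)>')

# group(0) is the whole matched tag, group(1) is the captured tag name.
pairs = [(m.group(0), m.group(1)) for m in regex.finditer(html)]
print(pairs)
# Output: [('<div>', 'div'), ('<span>', 'span'), ('<p>', 'p')]
```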
Notes
- This pattern does not account for malformed HTML.
- It assumes tags are properly closed or self-closed.
- It matches only tags without attributes: <tag> is matched, but <tag attr="value"> is not, because > must immediately follow the tag name.
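Since the pattern requires > directly after the tag name, a tag such as <div class="box"> is skipped. One possible variant that also accepts attributes is sketched below. It is an assumption for illustration, not the original pattern: \b pins the end of the tag name so the lookahead cannot be bypassed by backtracking, and [^>]* consumes the attributes. The name OPEN_TAG_WITH_ATTRS and the sample string are hypothetical.

```python
import re

# Hypothetical variant, not the original pattern:
#   \b     ends the tag name at a word boundary
#   [^>]*  allows attributes before the closing >
OPEN_TAG_WITH_ATTRS = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)\b(?![^>]*\/>)[^>]*>')

html = '<div class="box"><img src="image.jpg" /><a href="#">Link</a><br/></div>'

tags = [m.group(0) for m in OPEN_TAG_WITH_ATTRS.finditer(html)]
print(tags)
# Output: ['<div class="box">', '<a href="#">']
```

This variant shares the limits listed above: an attribute value that contains > would still cut the match short, so it is best treated as a quick filter for well-formed input.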