How to count words in XML and HTML files?

For HTML files, you can use the command-line tool html2text (even with several files at once), together with wc:

html2text *.html | wc --words
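If html2text is not installed, a rough equivalent can be sketched with Python's standard library alone. This is not what html2text does internally: it collects the text nodes, skips the contents of script and style elements, and splits on whitespace the way wc does. The names TextExtractor and count_html_words are made up for this illustration, and the result may differ slightly from the html2text count, since html2text also renders links and other markup as text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_html_words(html):
    """Return the number of whitespace-separated words in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Text nodes are joined with a space, so a word that is split by
    # inline markup (e.g. f<b>oo</b>) is counted as two words.
    return len(" ".join(parser.chunks).split())
```

To use it on a file, read the file into a string first, for example count_html_words(open("page.html", encoding="utf-8").read()).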

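For XML files, you can use the command-line tool xml_grep instead, again together with wc. It lets you specify an XPath-like expression that selects which parts of the XML text to count. Two examples for TMX files follow: the first counts only the words in German segments, the second counts the words in all segments.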

xml_grep --text_only --cond "tuv[@xml:lang='de']/seg" languages.tmx | wc --words

xml_grep --text_only --cond seg languages.tmx | wc --words
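The same two counts can also be sketched in Python with only the standard library, which may help where xml_grep is not available. This is an illustration, not a reimplementation of xml_grep: the function name count_seg_words is hypothetical, and it assumes that every seg element is a direct child of a tuv element that carries an xml:lang attribute, as in the examples above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_seg_words(tmx_text, lang=None):
    """Count whitespace-separated words in <seg> elements.

    If lang is given, only <tuv> elements whose xml:lang equals it are
    counted; otherwise all segments are counted.
    """
    root = ET.fromstring(tmx_text)
    total = 0
    for tuv in root.iter("tuv"):
        if lang is not None and tuv.get(XML_LANG) != lang:
            continue
        for seg in tuv.findall("seg"):
            # itertext() also includes text nested in inline markup.
            total += len("".join(seg.itertext()).split())
    return total
```

Calling count_seg_words(text, "de") corresponds to the first command above and count_seg_words(text) to the second, where text holds the contents of languages.tmx.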

I also shared this solution on Stack Overflow; see there for alternative approaches as well.

