For HTML files, you can use the command-line tool html2text (even with several files at once), together with wc:
html2text *.html | wc --words
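If a per-file breakdown is more useful than one grand total, the same pipeline can be wrapped in a small loop. This is only a minimal sketch and not part of the original solution: the function name count_html_words is made up for illustration, and it assumes that html2text accepts a file name as its argument.

```shell
# Hypothetical helper: print "<word count><TAB><file name>" for each HTML file.
count_html_words() {
  for file in "$@"; do
    # Convert one file to plain text and count its words.
    # tr strips the padding that some wc implementations put around the number.
    count=$(html2text "$file" | wc -w | tr -d '[:space:]')
    printf '%s\t%s\n' "$count" "$file"
  done
}
```

Calling count_html_words *.html should then print one line per file. The sketch uses -w, the short form of --words.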
For XML files, you can use the command-line tool xml_grep instead, again together with wc. It lets you specify an XPath-like expression to define which parts of the XML text to count. Examples for TMX files (the first counts only the German segments, the second counts all segments):
xml_grep --text_only --cond "tuv[@xml:lang='de']/seg" languages.tmx | wc --words
xml_grep --text_only --cond seg languages.tmx | wc --words
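To get one count per language from the same TMX file, the first command can be repeated in a loop over language codes. Again this is just a sketch under assumptions: count_tmx_words is a hypothetical name, and the language codes you pass in must match the xml:lang attribute values in your file exactly.

```shell
# Hypothetical helper: print "<language><TAB><word count>" for each language code.
# Usage: count_tmx_words FILE LANG [LANG...]
count_tmx_words() {
  tmx_file=$1
  shift
  for lang in "$@"; do
    # Same xml_grep call as above, with the language code substituted in.
    count=$(xml_grep --text_only --cond "tuv[@xml:lang='$lang']/seg" "$tmx_file" \
      | wc -w | tr -d '[:space:]')
    printf '%s\t%s\n' "$lang" "$count"
  done
}
```

For example, count_tmx_words languages.tmx de en would report the German and English segment word counts on separate lines.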
I also shared this solution on Stack Overflow; see there for alternative approaches as well.