Character encoding: HTML files with no http-equiv meta tag, where the charset may be other than UTF-8
We are using jsoup, and it is excellent, thanks.
We may have HTML files with no http-equiv meta tag, and the charset may be something other than UTF-8. How is this best handled, please? We can have a list of encodings and try them, but we are not sure how to tell programmatically whether one is wrong. Does jsoup throw an IOException?
jsoup tries to determine the encoding from the Content-Type header or the http-equiv tag; if neither is present, it uses UTF-8. I'm not sure jsoup can do more here.
But you can try this approach:
Implement a class that reads the files for you. There you can take care of the encoding issues. The result of such a class should be a properly encoded string, or at least the encoding that is used by the input.
(html input) --> [encoding class] --normalized encoding--> [jsoup] --> (whatever)
jsoup can parse input with a known encoding.
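A minimal sketch of such an encoding class, using only the JDK. The class name `EncodingGuesser` and its methods are hypothetical, not part of jsoup. It assumes the whole file has already been read into a byte array and that you supply your own ordered list of candidate charsets. A decoder configured with `CodingErrorAction.REPORT` throws a `CharacterCodingException` instead of silently substituting characters, which is one way to tell programmatically that a candidate is wrong.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: picks the first candidate charset that decodes the bytes cleanly. */
final class EncodingGuesser {

    /** Decodes strictly; throws CharacterCodingException on malformed or unmappable input. */
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Returns the first candidate that decodes without error, or empty if none does. */
    static Optional<Charset> guess(byte[] bytes, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try {
                decodeStrict(bytes, candidate);
                return Optional.of(candidate);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

The order of the candidates matters: ISO-8859-1 accepts every byte sequence, so it should come last, and a clean decode only shows that a charset is possible, not that it is the right one. The decoded string can then be handed to `Jsoup.parse(String)`, or the detected charset name to `Jsoup.parse(File, String)`.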
I guess changes on the HTML-creation side are not possible, are they?
Some further reading:
- http://illegalargumentexception.blogspot.co.uk/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_autodetect
- Character encoding detection algorithm
- What is the most accurate encoding detector? (includes a list of implementations)
- Java text file encoding
- Detect (or best guess of) incoming string encoding in Java