Remove certain special HTML characters from string in PHP
I am scraping information from a website, and I am wondering how to ignore or replace special HTML characters such as "á", "&aacute;", "&rsquo;" and "&amp;". These characters cannot be scraped into my database. I have replaced "&nbsp;" using this:
$nbsp = utf8_decode('á');
$mystring = str_replace($nbsp, '', $mystring);
But I cannot seem to do the same with these other characters. I am scraping the website using XPath. It returns the exact content I am looking for, but it keeps the HTML characters, which I do not want because they don't seem to be allowed in my database.
Thanks for any help with this.
It sounds like you've got a collation issue. I suggest ensuring that your database collation is set to a UTF-8 one (e.g. utf8_general_ci in MySQL), and that your web page's content encoding is set to UTF-8. This may solve the problem.
The best way to strip special characters is to run the string through htmlspecialchars(), then do a case-insensitive regex find and replace using the following pattern:
&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});
This should match named HTML entities (e.g. &Omega; or &nbsp;) as well as decimal (e.g. &#1234;) and hex-based (e.g. &#x0bee;) entities. The regex will strip them out completely.
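A minimal sketch of that encode-then-strip approach might look like the following. The helper name strip_html_entities() is hypothetical, and the sketch swaps in htmlentities() for the encoding step, because htmlspecialchars() only converts &, <, > and quotes and would leave a character like "á" untouched. It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not from the answer): encode special characters
// as entities, then strip every entity the answer's pattern matches.
function strip_html_entities(string $text): string
{
    // Assumption: input is UTF-8. htmlentities() turns "á" into "&aacute;".
    // The last argument (double_encode = false) leaves entities that are
    // already present, such as "&nbsp;", as they are.
    $encoded = htmlentities($text, ENT_QUOTES, 'UTF-8', false);

    // The pattern from the answer, with the "i" flag for case-insensitivity.
    $pattern = '/&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});/i';

    // Fall back to the encoded string if the regex engine reports an error.
    return preg_replace($pattern, '', $encoded) ?? $encoded;
}
```

Calling strip_html_entities($mystring) on the scraped text removes the entities outright rather than replacing them with a plain-ASCII equivalent, so accented letters are lost. Entity names that contain digits (such as &frac12;) are not covered by the pattern and would remain.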
Alternatively, just use the output of htmlspecialchars() to store the weird characters intact. It's not ideal, but it works.
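As a rough illustration of this alternative, the sketch below stores the entity-encoded string and decodes it again when reading it back. The helper names encode_for_storage() and decode_from_storage() are hypothetical, and htmlentities() is used instead of htmlspecialchars() so that accented characters are encoded as well. It assumes the source file and the input are UTF-8.

```php
<?php
// Hypothetical helpers (not from the answer). Assumes UTF-8 input.

// Encode before writing to the database: characters such as "é" or "’"
// become ASCII-only entities like "&eacute;" and "&rsquo;".
function encode_for_storage(string $text): string
{
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

// Decode after reading from the database to get the original text back.
function decode_from_storage(string $stored): string
{
    return html_entity_decode($stored, ENT_QUOTES, 'UTF-8');
}

$original = "Señor’s café";
$stored   = encode_for_storage($original);   // safe to insert as plain ASCII
$restored = decode_from_storage($stored);    // same text as $original
```

The trade-off is that the database then holds entity-encoded text, so every reader has to decode it, and searching or sorting on the stored column works on the encoded form.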