Today I want to share my experience fixing a bug in how we scrape MangaUpdates. First I have to explain the bug.
We use various sites to power All Manga Mobile. We scrape manga sources such as MangaReader, MangaFox, etc. to get manga and images, and we go to MangaUpdates for manga information. Why do we use MangaUpdates? Because it has the most reliable manga ratings and statuses (Ongoing, Discontinued, Hiatus, or Completed) on the internet. I visit MangaUpdates at least once a day!
We use Jsoup to scrape the site. Jsoup is very handy and easy to use. I've been using it for years (just 2 years, actually). I won't discuss how I use Jsoup here since it would take a lot of time and space :(
The problem is that Jsoup accepts only very specific charset names. We can use "UTF-8", "ISO-8859-1", and "LATIN-1" as the input charset, but it will fail for something trivial like "LATIN1" or "LATIN 1". What's worse, MangaUpdates, which used to give a correct charset name, recently decided to give us "LATIN" as its charset name. As a result, our server was broken for days, and I figured it out a bit too late. I made a hack so that we wouldn't accept any URL from MangaUpdates until I had the time to fix it.
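To check how the JVM itself treats a given charset name, a small helper along these lines can be handy. This is only a sketch: the class name `CharsetNameCheck` and the method `isUsable` are made up for illustration, and it assumes Jsoup ends up handing the declared name to Java's own charset lookup, which is my understanding but not something shown here. It uses only `java.nio.charset.Charset`, so it runs without Jsoup.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Jsoup: asks the JVM whether it knows a charset name.
final class CharsetNameCheck {

    /** Returns true only if this JVM can resolve the given charset name. */
    public static boolean isUsable(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name.trim());
        } catch (IllegalCharsetNameException e) {
            // Thrown when the name itself is malformed, e.g. it contains a space.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "ISO-8859-1", "LATIN", "LATIN 1"};
        for (String name : names) {
            System.out.println(name + " -> " + isUsable(name));
        }
    }
}
```

Which aliases are accepted depends on the JVM, so it is worth running this against the exact name the site sends rather than guessing.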
The time to fix it came just now. What I did is basically get the raw bytes of the Jsoup response and parse them manually, assuming the charset is "LATIN-1" (that is, "ISO-8859-1") instead of "LATIN". Here are some snippets.
My code used to look like this:
Document mangaUpdatesDoc = Jsoup.connect(url).get();
Now it looks like this:
Connection con = Jsoup.connect(url);
Response resp = con.execute();
String html = new String(resp.bodyAsBytes(), "ISO-8859-1");
Document mangaUpdatesDoc = Jsoup.parse(html);
A bit long and inefficient. But at least it works!
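If hardcoding the charset feels too blunt, one possible variation is to trust the declared name whenever Java knows it and fall back to ISO-8859-1 only when it does not. The sketch below is an assumption on my part, not what runs on our server: `LenientDecoder`, `resolve`, and `decode` are hypothetical names, and the fallback charset is simply the one that happens to suit MangaUpdates. It is plain JDK code, so it can be tried without Jsoup.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a response body, tolerating unknown charset names.
final class LenientDecoder {

    /** Returns the declared charset if the JVM supports it, otherwise ISO-8859-1. */
    public static Charset resolve(String declaredName) {
        if (declaredName != null) {
            try {
                String name = declaredName.trim();
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalCharsetNameException e) {
                // Malformed name (for example one containing a space): use the fallback.
            }
        }
        // Assumption: ISO-8859-1 is an acceptable default for this particular site.
        return StandardCharsets.ISO_8859_1;
    }

    /** Decodes the raw body bytes with the resolved charset. */
    public static String decode(byte[] body, String declaredName) {
        return new String(body, resolve(declaredName));
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0xE9}; // the single byte for "é" in ISO-8859-1
        System.out.println(decode(body, "LATIN"));
    }
}
```

In the Jsoup version, `body` would be `resp.bodyAsBytes()`, and the declared name would have to be taken from the response's Content-Type header. The resulting string can then go to `Jsoup.parse` as before.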
Hope it helps!