Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All node response parsers assume utf8 encoding #825

Open
cmawhorter opened this issue Dec 28, 2015 · 5 comments
Open

All node response parsers assume utf8 encoding #825

cmawhorter opened this issue Dec 28, 2015 · 5 comments

Comments

@cmawhorter
Copy link

I noticed all parsers assume utf8.

e.g. text parser:

module.exports = function(res, fn){
  res.text = '';
  res.setEncoding('utf8');
  res.on('data', function(chunk){ res.text += chunk; });
  res.on('end', fn);
};

It seems like they all should get the encoding from the response content-type header and fall back to binary (application/octet-stream) if none exists.

Can PR, if desired.

@kornelski
Copy link
Contributor

I don't think octet-stream is appropriate, because the default encoding varies by MIME type, e.g. JSON is utf8 by default. To make things worse some MIME specs are out of touch with reality (e.g. latin1 for text/* IIRC), and in the browser world anything but utf8 is asking for trouble.

So I think it should be limited to using encoding that is explicitly specified, and utf8 otherwise. PR welcome.

@cmawhorter
Copy link
Author

i've moved on from the lib so i might not be the best for a pr but here's an example of why assuming utf8 is a bad idea.

// superagent (i couldn't get res.body to populate no matter what i did by my point remains valid)
var Encoding = require('encoding-japanese');
require('superagent').get('http://i1.theportalwiki.net/img/6/6b/Portal2_japanese.txt').end(function(err, res) { 
  console.log('as-is; incorrect output', res.text); 
});
require('superagent').get('http://i1.theportalwiki.net/img/6/6b/Portal2_japanese.txt').end(function(err, res) { 
  console.log('correct conversion; incorrect output', Encoding.convert(res.text, { from: 'UTF16', to: 'UNICODE', type: 'string' })); 
});
require('superagent').get('http://i1.theportalwiki.net/img/6/6b/Portal2_japanese.txt').end(function(err, res) { 
  console.log('this shouldnt work and it doesnt', Encoding.convert(res.text, { from: 'UTF8', to: 'UNICODE', type: 'string' })); 
});
// wreck returns Buffer, which is the correct way to store data of unknown encoding i.e. everything on the web
var Encoding = require('encoding-japanese');
require('wreck').get('http://i1.theportalwiki.net/img/6/6b/Portal2_japanese.txt', function(err, res, payload) { console.log('correct conversion; correct output', Encoding.convert(payload, { from: 'UTF16', to: 'UNICODE', type: 'string' })); });
require('wreck').get('http://i1.theportalwiki.net/img/6/6b/Portal2_japanese.txt', function(err, res, payload) { console.log('equiv to what superagent does; incorrect output', payload.toString('utf8')); });

There is no reason to setEncoding on the response stream. You can just concat buffers (as image parser does) and then call toString('utf8') on the resulting buffer if that's what you really want to do.

And FWIW JSON.parse works on Buffers:

var buf = new Buffer('{"blah":"i am a buffer"}')
JSON.parse(buf);

@kornelski
Copy link
Contributor

I see that in general there could be different approaches to handling files without any encoding specified, but the example URL you've mentioned clearly specifies that it's a UTF-8 file:

curl -I http://i1.theportalwiki.net/img/6/6b/Portal2_japanese.txt
HTTP/1.1 200 OK
Date: Wed, 30 Dec 2015 14:06:38 GMT
Content-Type: text/plain; charset=utf-8

So handling it as UTF-8 is perfectly correct in that case. If that file is not really a UTF-8 plain text file, it's the fault of the server for sending wrong information.

@cmawhorter
Copy link
Author

Yes, you're correct. And I can't come up with a way to retrieve that page using superagent and get correct output. any ideas?

any chance this behavior could get a mention in the docs?

@kornelski
Copy link
Contributor

I'm going to leave this bug open to add support for encoding="" argument in MIME types.

As for working around incorrect server response for that that URL, I don't think it's possible in node version of superagent. In the browser version you could set xhr.responseType to arraybuffer and convert encoding yourself.

@kornelski kornelski added this to the 2.0 milestone Jan 7, 2016
@kornelski kornelski modified the milestone: 2.0 May 29, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants