
build_index.py results in UnicodeDecodeError if terminal encoding is not set to UTF-8 #30


Open
bmerkle opened this issue Aug 31, 2024 · 2 comments


bmerkle commented Aug 31, 2024

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run python build_index.py on a Windows machine with German regional settings (where cp1252 is the default code page).
The program fails with a UnicodeDecodeError, as shown in the logs below.
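To confirm which encoding Python picks up from the environment, a short diagnostic like the following may help. It is only a sketch: the helper name `report_encodings` is hypothetical and not part of the tutorial code. On a machine affected by this issue, the preferred encoding would be expected to show cp1252 rather than UTF-8.

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python currently uses for text I/O.

    Hypothetical helper for diagnosis only; not part of build_index.py.
    """
    return {
        # Encoding used by default for text-mode files and pipes.
        "preferred": locale.getpreferredencoding(False),
        # Encoding of the console stream (may be None if stdout is replaced).
        "stdout": getattr(sys.stdout, "encoding", None),
        # True when Python runs in UTF-8 mode (e.g. started with -X utf8).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```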

The problem can be fixed by setting the console encoding to UTF-8 in the terminal/shell,
e.g. in PowerShell:
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
[Console]::InputEncoding = [System.Text.Encoding]::UTF8

We should add this information to the docs.
I can create a PR for this if you consider the information useful (I do :-) )
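For comparison, here is what a code-side approach could look like. The traceback shows the failure inside the standard library's subprocess reader thread, which decodes child output with the locale encoding when none is given. The subprocess call itself appears to live inside a library rather than in build_index.py, so this is a sketch of the general technique, not a patch: `run_and_capture` is a hypothetical helper, and whether the affected call can be changed this way is an assumption.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run a command and decode its stdout as UTF-8, independent of the locale.

    Hypothetical helper for illustration. errors="replace" substitutes
    undecodable bytes instead of raising UnicodeDecodeError.
    """
    result = subprocess.run(
        cmd,
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # The child writes the UTF-8 bytes for "Á" (0xC3 0x81) to its stdout.
    child = [sys.executable, "-c", r"import sys; sys.stdout.buffer.write(b'\xc3\x81')"]
    print(repr(run_and_capture(child)))
```

Python's UTF-8 mode (the `PYTHONUTF8=1` environment variable or `python -X utf8`) also changes the default encoding used for pipes; it has not been verified against this tutorial here.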

Any log messages given by the failure

Failed Build
(.venv) PS C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial> python build_index.py
Data directory 'C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\data/product-info/' exists and contains 20 files.
Crack and chunk files from local path: C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\data/product-info/
Start embedding using connection with id = ...
Start creating index from embeddings.
Successfully created index at C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\tutorial-index-mlindex
Method indexes: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class Index: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Exception in thread Thread-19 (_readerthread):
Traceback (most recent call last):
  File "c:\Program Files\Python311\Lib\threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "c:\Program Files\Python311\Lib\threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "c:\Program Files\Python311\Lib\subprocess.py", line 1599, in _readerthread
    buffer.append(fh.read())
                  ^^^^^^^^^
  File "c:\Program Files\Python311\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 271: character maps to <undefined>
Uploading tutorial-index-mlindex (0.0 MBs): 100%|#####################################################################################################################################| 1296/1296 [00:00<00:00, 1996.11it/s]
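The error class in the traceback can be reproduced in isolation. The sketch below assumes the subprocess output was UTF-8 (the log does not say which character was involved); it uses "Á" purely as an example of a character whose UTF-8 encoding contains the byte 0x81, which cp1252 leaves undefined. The helper name `try_decode` is hypothetical.

```python
# "Á" (U+00C1) encodes to the two bytes 0xC3 0x81 in UTF-8.
utf8_bytes = "Á".encode("utf-8")


def try_decode(data, encoding):
    """Decode bytes with the given codec, returning the error text on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc}"


if __name__ == "__main__":
    print(try_decode(utf8_bytes, "utf-8"))   # decodes cleanly
    print(try_decode(utf8_bytes, "cp1252"))  # fails on byte 0x81
```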

Fix, e.g. for PowerShell:
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
[Console]::InputEncoding = [System.Text.Encoding]::UTF8

Expected/desired behavior

With the fix above it runs fine, e.g.:
(.venv) PS C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial> python build_index.py
Data directory 'C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\data/product-info/' exists and contains 20 files.
Crack and chunk files from local path: C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\data/product-info/
Start embedding using connection with id = ...
Start creating index from embeddings.
Successfully created index at C:\work\Azure-Samples\rag-data-openai-python-promptflow\tutorial\tutorial-index-mlindex
Method indexes: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class Index: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
not OS specific

Versions

not version specific

Mention any other details that might be useful


Thanks! We'll be in touch soon.
