Bug : Clinical submitters facing issue with UTF-8-BOM Input files #66

Open
bhavikbhagat93 opened this issue Mar 13, 2025 · 1 comment
Labels
Helpdesk Helpdesk requests handled by dev / bioinfo team

Comments


bhavikbhagat93 commented Mar 13, 2025

Description

As a data submitter for clinical submissions,
I want to submit clinical data files with UTF-8-BOM encoding,
but I get an error during upload.

Troubleshooting errors

Acceptance criteria

As a user, I should be allowed to upload files to VirusSeq with no encoding restrictions.

@bhavikbhagat93 bhavikbhagat93 added the bug Something isn't working label Mar 13, 2025
@bhavikbhagat93 bhavikbhagat93 added Helpdesk Helpdesk requests handled by dev / bioinfo team and removed bug Something isn't working labels Apr 3, 2025

edsu7 commented Apr 11, 2025

Investigation

Encoding

The following were observations based on testing:

ls *.tsv | xargs -I {} sh -c "echo {};hexdump -C {} | head -n3"
csv_utf8_DH_VirusSeq_Portal.tsv
00000000  73 74 75 64 79 5f 69 64  09 73 70 65 63 69 6d 65  |study_id.specime|
00000010  6e 20 63 6f 6c 6c 65 63  74 6f 72 20 73 61 6d 70  |n collector samp|
00000020  6c 65 20 49 44 09 47 49  53 41 49 44 20 61 63 63  |le ID.GISAID acc|
csv_utf8-BOM_DH_VirusSeq_Portal.tsv
00000000  ef bb bf 73 74 75 64 79  5f 69 64 09 73 70 65 63  |...study_id.spec|
00000010  69 6d 65 6e 20 63 6f 6c  6c 65 63 74 6f 72 20 73  |imen collector s|
00000020  61 6d 70 6c 65 20 49 44  09 47 49 53 41 49 44 20  |ample ID.GISAID |
deBOMed_csv_utf8-BOM_DH_VirusSeq_Portal.tsv
00000000  73 74 75 64 79 5f 69 64  09 73 70 65 63 69 6d 65  |study_id.specime|
00000010  6e 20 63 6f 6c 6c 65 63  74 6f 72 20 73 61 6d 70  |n collector samp|
00000020  6c 65 20 49 44 09 47 49  53 41 49 44 20 61 63 63  |le ID.GISAID acc|

Note that csv_utf8-BOM_DH_VirusSeq_Portal.tsv begins with the bytes ef bb bf (the UTF-8 byte order mark, shown as ... before study_id in the hexdump), which the other two files do not have.
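The same check can be done without hexdump. The following is a minimal sketch, not part of the original investigation; the helper names starts_with_utf8_bom and file_has_utf8_bom are hypothetical. It compares the first three bytes against codecs.BOM_UTF8 from the Python standard library.

```python
import codecs

def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def file_has_utf8_bom(path: str) -> bool:
    """Read only the first three bytes of a file and test them for the BOM."""
    with open(path, "rb") as handle:
        return starts_with_utf8_bom(handle.read(len(codecs.BOM_UTF8)))
```

Running file_has_utf8_bom on each of the three TSV files above should flag only the UTF-8-BOM one, given the hexdump output shown.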

Behaviour on clinical portal

The BOM prevents the file from being uploaded; once the encoding is fixed (BOM removed), the upload succeeds.

Solution

Suggestion

  • Add an encoding step to submission that converts the submitted TSV file (regardless of its encoding) to UTF-8 without a BOM.
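As a rough illustration of what such a step might look like, here is a sketch that works on raw bytes. It is an assumption about one possible approach, not the submission service's actual code: the function name normalize_to_utf8 is hypothetical, and it only handles UTF-8 (with or without BOM) and BOM-marked UTF-16. Other encodings would need explicit detection.

```python
import codecs

def normalize_to_utf8(data: bytes) -> bytes:
    """Return the input re-encoded as UTF-8 without a BOM.

    Hypothetical helper: handles UTF-8, UTF-8 with BOM, and UTF-16 with BOM.
    Anything else raises UnicodeDecodeError so the caller can report it.
    """
    if data.startswith(codecs.BOM_UTF8):
        # 'utf-8-sig' strips the leading EF BB BF while decoding
        text = data.decode("utf-8-sig")
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # 'utf-16' reads the BOM to pick the byte order and drops it
        text = data.decode("utf-16")
    else:
        text = data.decode("utf-8")
    return text.encode("utf-8")
```

Working on bytes rather than on text-mode file handles means line endings pass through unchanged, which may matter if the downstream validator is sensitive to them.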

Python script used

import os
import argparse

def convert_encoding(bom_file, new_file):
    """Rewrite bom_file as UTF-8 without a BOM and save the result to new_file."""
    # 'utf-8-sig' strips a leading BOM if one is present
    with open(bom_file, 'r', encoding='utf-8-sig') as infile:
        content = infile.read()

    # Write the content back with utf-8 encoding (no BOM)
    with open(new_file, 'w', encoding='utf-8') as outfile:
        outfile.write(content)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Convert UTF-8-SIG encoded file to UTF-8.')
    parser.add_argument('-i', '--input_file', required=True, help='Path to the input file (UTF-8 with BOM)')
    parser.add_argument('-o', '--output_file', default=None, help='Path to save the output file (UTF-8 without BOM)')

    args = parser.parse_args()

    bom_file = os.path.abspath(args.input_file)
    if args.output_file:
        deBOMed_file = os.path.abspath(args.output_file)
    else:
        # Default: write deBOMed_<name> next to the input file
        deBOMed_file = os.path.join(os.path.dirname(bom_file), "deBOMed_%s" % os.path.basename(bom_file))

    convert_encoding(bom_file, deBOMed_file)
