Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csharp: BindStream/Bind statements fail with 'Fatal error. System.AccessViolationException: Attempted to read or write protected memory.' #2265

Open
schweers-qb opened this issue Oct 21, 2024 · 8 comments
Assignees
Labels
Type: question Usage question

Comments

@schweers-qb
Copy link

schweers-qb commented Oct 21, 2024

What would you like help with?

Hi there! I've tried and failed to correctly execute the BindStream method from the C# C wrapper API with multiple types of input streams in order to bulk ingest from an Arrow stream into DuckDb. If I use the mock stream implementation found here, my code will work. Otherwise, 'real' file and memory stream reads fail every time with the following error:

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Apache.Arrow.C.CArrowArrayExporter.ReleaseArray(Apache.Arrow.C.CArrowArray*)

I can read that stack trace just fine and understand the problem it describes, but I can't easily debug the failure of the native code in regard to free'ing memory from my current IDE (Rider). I realize the C# code is less documented and less mature than some of the other runtimes supported by ADBC (and DuckDB), so I suspect I might be calling the API or creating my stream without following a specific pattern that may be required? Here's an example of how I'm attempting to use the API that leads to this error. My code is calling the C API from C# running on .NET 8 on macOS:

public static async Task Main()
    {
        // Load the DuckDB driver following the pattern from the test directory
        using var driver = CAdbcDriverImporter.Load("/local/path/libduckdb-osx-universal/libduckdb.dylib",
            "duckdb_adbc_init");
        
        // Open a local persistent DB
        using var duckDb =
            driver.Open(new Dictionary<string, string> { { "path", "/local/path/test.db" } });
        using var connection = duckDb.Connect(null);
        
        // Create a new table on ingest
        using var statement = connection.BulkIngest("temp_table", BulkIngestMode.Create);
        
        // Create a file stream that reads in some sample Arrow stream data from the arrow-experiments repo
        // Note: if I mock this stream as mentioned above, I can get this code to work. In that scenario, no native memory access is involved in the stream binding
        await using var fileStream = new FileStream("/local/path/arrow-commits.arrows", FileMode.Open,
            FileAccess.ReadWrite);
        
        // Read the stream, bind it, execute the ingest
        using var reader = new ArrowStreamReader(fileStream);
        statement.BindStream(reader);
        await statement.ExecuteUpdateAsync();
    }
@schweers-qb schweers-qb added the Type: question Usage question label Oct 21, 2024
@CurtHagenlocher
Copy link
Contributor

Thanks for the problem report. This looks like a bug -- possibly in the Arrow C interop code -- and I can reproduce it.

@schweers-qb
Copy link
Author

schweers-qb commented Oct 21, 2024

@CurtHagenlocher you're welcome! It's the same problem for Bind I might add. I'll swap it to a bug I don't think I can do that, but if there's a preferred way to signal this is a bug, happy to do it!

@CurtHagenlocher
Copy link
Contributor

The exception is being thrown when DuckDB releases a pointer that gets passed to it. In my quick repro, the array in question claims to have three children but only the first of the child pointers seems to be valid.

        [Fact]
        public async Task InsertFromFile()
        {
            // Write temporary file
            Schema schema = new Schema([new Field("key", Int32Type.Default, false), new Field("value", StringType.Default, false)], null);
            RecordBatch recordBatch = new RecordBatch(schema, [
                new Int32Array.Builder().AppendRange([1, 2, 3]).Build(),
                new StringArray.Builder().AppendRange(["foo", "bar", "baz"]).Build()
                ], 3);
            string tempFile = _duckDb.CreateTempPath();
            using (var writeFile = new FileStream(tempFile, FileMode.Create, FileAccess.Write))
            {
                using var writer = new ArrowStreamWriter(writeFile, schema);
                writer.WriteRecordBatch(recordBatch);
            }

            // Create database
            using var database = _duckDb.OpenDatabase("insert_from_file.db");
            using var connection = database.Connect(null);
            using var statement = connection.BulkIngest("temp_table", BulkIngestMode.Create);

            // Read temporary file into new table
            using var readStream = new FileStream(tempFile, FileMode.Open, FileAccess.ReadWrite);
            using var reader = new ArrowStreamReader(readStream);
            statement.BindStream(reader);
            await statement.ExecuteUpdateAsync();
        }

@schweers-qb
Copy link
Author

@CurtHagenlocher appreciate the confirming code. How would you recommend to proceed here - open a bug at DuckDB against their ADBC implementation?

@CurtHagenlocher
Copy link
Contributor

I think it's premature to assume that the bug is in DuckDB. I need to debug the repro to figure out what exactly is going wrong. That's not likely to happen during my working day, as I have no shortage of commitments but I'll probably be able to do it this evening.

@schweers-qb
Copy link
Author

Ah I see, I read that to assume you meant DuckDB was the problem but I get that may not actually be the case. If you have a recommended way to debug from the C# side through the native ADBC and DuckDB code, please let me know. IE: however you figured this out:

the array in question claims to have three children but only the first of the child pointers seems to be valid.

Thanks for the support on whatever timeline you can manage!

@CurtHagenlocher CurtHagenlocher self-assigned this Oct 21, 2024
@schweers-qb schweers-qb changed the title csharp: BindStream/Bind statements fail with 'Fatal error. System.AccessViolationException: Attempted to read or write protected memory.'' csharp: BindStream/Bind statements fail with 'Fatal error. System.AccessViolationException: Attempted to read or write protected memory.' Oct 21, 2024
@CurtHagenlocher
Copy link
Contributor

CurtHagenlocher commented Oct 22, 2024

Ah, I should have been able to figure this out without debugging. There's a functionality gap described by apache/arrow#36057 which prevents this from working. Unfortunately, the failure mode is nearly impossible to figure out without a debugger. There's a draft PR at apache/arrow#40992 which should resolve the problem (and which I apparently need to prioritize).

You can work around the problem by cloning the data, which is of course a pretty high price to pay. I did that with

        class ClonedArrayStream : IArrowArrayStream
        {
            readonly IArrowArrayStream _arrowArrayStream;

            public ClonedArrayStream(IArrowArrayStream arrowArrayStream)
            {
                _arrowArrayStream = arrowArrayStream;
            }

            public Schema Schema => _arrowArrayStream.Schema;

            public async ValueTask<RecordBatch?> ReadNextRecordBatchAsync(CancellationToken cancellationToken = default)
            {
                var next = await _arrowArrayStream.ReadNextRecordBatchAsync(cancellationToken);
                return next?.Clone();
            }

            public void Dispose() => _arrowArrayStream.Dispose();
        }

and then binding the cloned stream instead of the original stream.

@schweers-qb
Copy link
Author

@CurtHagenlocher ok thank you for the explanation and the working sample. That's similar to what I had working myself if I copied the data / stream around, which like you said, isn't ideal 😄 but doable. This is fine for my needs at the moment though and hoping the other changes land as well to make this zero copy. Thank you again for the support!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: question Usage question
Projects
None yet
Development

No branches or pull requests

2 participants