Errno::EINVAL using each_statement with a LARGE, local NTriples file #440

@kspurgin


NOTE: This issue could be partially addressed by clarifying the documentation/examples. It could also be mitigated by refactoring so that a more informative error is raised in this situation. A full fix would add support for reading NTriples files line by line, without first reading the entire file into memory as a String.

What happened [1]

I tried to use this Gem to parse the NTriples statements in TGNOut_1Subjects.nt file (locally renamed to 1Subjects.nt) from the TGN explicit.zip I downloaded from http://vocab.getty.edu/

This file has 26,854,584 lines so I had no intention of reading the whole file into memory and do not need the entire thing as a graph. I thought this was a good way to handle the parsing of the NTriples data so I could selectively do stuff with the statements I'm interested in, one at a time.

I read through the documentation prior to trying this, looking for any warnings about problems with large files, and did not find any information about performance/in-memory requirements or limits aside from info about caching which I categorized as irrelevant to my local-only application.

Given that NTriples is a line-based format, and that the examples show Reader.open used together with each_statement, I assumed (wrongly [2]) that I could iterate through the statements of an NTriples file one at a time.

My initial code:

RDF::Reader.open("1Subjects.nt") do |reader|
  reader.each_statement do |statement|
    binding.pry
  end
end

Running my script immediately gets:

/Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:332:in `read': Invalid argument @ io_fread - 1Subjects.nt (Errno::EINVAL)
	from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:332:in `block in open_file'
	from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:322:in `open'
	from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/util/file.rb:322:in `open_file'
	from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/reader.rb:221:in `open'
	from /Users/kristina/.rbenv/versions/3.1.0/lib/ruby/gems/3.1.0/gems/rdf-3.2.11/lib/rdf/reader.rb:212:in `open'
	from tgn.rb:29:in `<main>'

Why it happened

The file at lib/rdf/util/file.rb:332 is the Ruby File object yielded by Kernel.open, and calling .read on the file indeed tries to read the entire file into memory, to be passed as a gigantic string to RemoteDocument.new.

Proposed solutions

Full fix

The NTriples data I work with always has one statement per line of the file (which I thought was a critical feature of the format), so ideally this could be fixed by handling each_statement from an NTriples::Reader by reading the file line-by-line instead of all-at-once (or providing some option to force this -- I looked for one in the code and API docs and didn't find it).
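To illustrate the kind of line-by-line access I mean, here is a minimal sketch using only Ruby's standard library. `each_ntriples_line` is a hypothetical helper, not something the rdf Gem provides, and it assumes one statement per line. It yields the raw text of each statement without parsing it into an RDF statement object.

```ruby
# Hypothetical helper (not part of the rdf Gem): yields one NTriples
# statement line at a time, skipping blank lines and comment lines.
# Assumes the file has exactly one statement per line.
def each_ntriples_line(path)
  return enum_for(:each_ntriples_line, path) unless block_given?

  # File.foreach hands over one line at a time, so the whole file is
  # never held in memory as a single String.
  File.foreach(path) do |line|
    line = line.strip
    next if line.empty? || line.start_with?("#")

    yield line
  end
end

# Usage (commented out because it needs the local data file):
# each_ntriples_line("1Subjects.nt") do |line|
#   # filter or inspect the raw statement text here
# end
```

Memory use here is bounded by the longest line rather than by the file size, which is the behaviour I expected from each_statement.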

Prevention of issue without full fix

The issue could have been prevented if the documentation examples made clear that NTriples::Reader is going to (try to) read the whole file into memory as one String.

If the documentation had been clear about that, I wouldn't have tried this and run into this issue.

Mitigation of issue by refactoring to throw a more informative error message

Errno::EINVAL means "Invalid argument. This is used to indicate various kinds of problems with passing the wrong argument to a library function." (src)

Neither my code nor the rdf Gem has passed a wrong argument, so this error is very unclear in this context. The io_fread failing because of a bad argument is somewhere in Ruby's C code and thus pretty obscure and uninformative to the average Ruby user.
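As a rough sketch of what a more informative error could look like: the helper and error class below are hypothetical names of my own, not existing rdf Gem API. It also assumes the Errno::EINVAL comes from trying to read a very large file in one call, which is my reading of the situation rather than something I have confirmed in Ruby's source.

```ruby
# Hypothetical error class, not part of the rdf Gem.
class FileTooLargeToSlurpError < StandardError; end

# Hypothetical helper: reads the whole file in one call, but translates
# a low-level Errno::EINVAL into a message that names the file, its
# size, and a possible way forward. The reader is injectable only so
# the failure path can be exercised without a huge file.
def slurp_with_context(path, reader: File.method(:read))
  reader.call(path)
rescue Errno::EINVAL => e
  size = File.exist?(path) ? File.size(path) : nil
  raise FileTooLargeToSlurpError,
        "Could not read #{path} into memory in one call " \
        "(size: #{size.inspect} bytes; original error: #{e.message}). " \
        "The file may be too large to read at once; consider streaming it."
end
```

Even without any streaming support, a message along these lines would have pointed me at the file size right away instead of at an "invalid argument".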

Footnotes

  1. Not providing system details because the issue isn't system-specific (beyond the fact that my system, like most I assume, falls over trying to read a 26-million-line file into memory as a String, as I expected it would).

  2. But not unreasonably, given the general Ruby pattern of open to create an IO-type object, and then an each... method to iterate part-by-part through the whole thing without having to hold the entire thing in memory
