@brokad brokad commented Aug 14, 2021

Semantic detection PoC

This defines a framework for more advanced, statistics-based ways of importing data into synth. It paves the way for more automation when writing synth schemas tailored to a specific data source.

Underpinning this is the semdet crate, which aims to give synth fast, zero-copy, in-memory, trainable analytics over the table instances a user provides as an import data source. It is built on arrow, ndarray and tch.

The PoC is an end-to-end implementation of a dummy model that detects the most likely fake generator based on a simple dictionary lookup. The example is simple enough to implement quickly, yet involves enough moving parts to show that more complex, data-driven inference mechanisms are feasible.
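
To make the idea of a dictionary lookup concrete, here is a minimal sketch using only the standard library. It is not the semdet implementation: the function names, the generator names and the sample values are all hypothetical, and the real PoC goes through arrow, ndarray and tch rather than plain hash maps.

```rust
use std::cmp::Reverse;
use std::collections::{HashMap, HashSet};

/// Hypothetical dictionary mapping a generator name to a few known sample
/// values (lowercase). The entries are made up for illustration.
pub fn build_dictionary() -> HashMap<&'static str, HashSet<&'static str>> {
    let mut dict = HashMap::new();
    dict.insert(
        "first_name",
        ["alice", "bob", "carol"].iter().copied().collect(),
    );
    dict.insert(
        "city_name",
        ["london", "paris", "berlin"].iter().copied().collect(),
    );
    dict
}

/// Returns the generator whose dictionary matches the most values in
/// `column`, or `None` if no value matches any dictionary.
/// Ties are broken by picking the alphabetically smallest generator name,
/// so the result does not depend on hash map iteration order.
pub fn detect_generator(
    dict: &HashMap<&'static str, HashSet<&'static str>>,
    column: &[&str],
) -> Option<&'static str> {
    dict.iter()
        .map(|(name, values)| {
            // Count how many column values appear in this dictionary,
            // ignoring case.
            let hits = column
                .iter()
                .filter(|v| values.contains(v.to_lowercase().as_str()))
                .count();
            (*name, hits)
        })
        .filter(|(_, hits)| *hits > 0)
        .max_by_key(|(name, hits)| (*hits, Reverse(*name)))
        .map(|(name, _)| name)
}
```

Under these assumptions, a column of mostly first names would be assigned the hypothetical first_name generator, while a column with no dictionary hits yields no suggestion at all.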

How to test it

Running cargo test --features torch from semdet/ runs the dummy E2E scenario, which should pass.

Roadmap to readiness

  • Composable API for the embedding of input data as valid module inputs
  • Composable API for handling prediction targets in our domain-specific application
  • Load a 'pre-trained' dummy module embedded at compile-time
  • Document the Encoder/Decoder/Module APIs
  • Attach to the CLI's import logic
    • Project down string columns from sqlx query results
  • Windows build needs fixing
  • Make tch optional so the built binary does not have to carry a dynamic dependency on libtorch
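
For the last roadmap item, one possible shape is to gate the tch-backed code path behind the existing torch Cargo feature (with tch declared as an optional dependency in Cargo.toml) and compile a fallback otherwise. The sketch below only illustrates the cfg gating. The function name and the returned labels are hypothetical, not part of semdet.

```rust
/// Name of the inference backend compiled into this build.
/// With the `torch` feature enabled, this is where tch-backed code
/// would live (assumption: tch is an optional dependency).
#[cfg(feature = "torch")]
pub fn backend_name() -> &'static str {
    "torch"
}

/// Fallback compiled when the `torch` feature is disabled, so the
/// binary does not need to link against libtorch.
#[cfg(not(feature = "torch"))]
pub fn backend_name() -> &'static str {
    "dictionary"
}
```

Exactly one of the two definitions is compiled for any given build, so callers can use backend_name() unconditionally.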

@brokad brokad marked this pull request as ready for review August 23, 2021 08:13
@brokad brokad force-pushed the feat/semantic-detection branch 2 times, most recently from 0e9e05e to 004c467 Compare August 25, 2021 08:52
@brokad brokad force-pushed the feat/semantic-detection branch from 004c467 to 8bfa1d8 Compare August 30, 2021 07:49