I wrote my first post about parsing webpages into structured data with an LLM in January, using local models. Now it's October and I did it again with current models and libraries. Boy, what a difference.
Since you seem to know your stuff: why do LLMs need so much data anyway? Humans don't. Why can't we make models aware of their own uncertainty, e.g. by feeding the variance of the next-token distribution back into the model, as a foundation for guiding their own learning? With that kind of signal, maybe LLMs could develop 'curiosity' and 'rigor' and seek out the data that best refines them. Let the AI make and test its own hypotheses, using formal mathematical systems, during training.
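For what it's worth, the kind of self-uncertainty signal described here can be as simple as the entropy of the next-token distribution. A toy sketch in plain Python, not tied to any particular model or training scheme:

```python
import math

def token_uncertainty(logits):
    """Shannon entropy of the next-token distribution: one possible
    scalar 'uncertainty' signal a model could attend to."""
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # entropy in nats: near 0 when the model is certain,
    # log(vocab_size) when it has no idea
    return -sum(p * math.log(p) for p in probs if p > 0)

# a peaked distribution has low entropy, a flat one high entropy
print(token_uncertainty([10.0, 0.0, 0.0, 0.0]))  # small
print(token_uncertainty([1.0, 1.0, 1.0, 1.0]))   # log(4) ≈ 1.386
```

Actually feeding such a signal back into training is the hard, open part; the signal itself is cheap to compute.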
IANAL, but it means the commit itself is in the public domain. Even when it is integrated into a code base with a more restrictive license, you can still use that isolated snippet however you want.
A more interesting question is whether one could strip the GPL restrictions from public code by telling an AI to rewrite the code from scratch, providing only a description of its behavior.
This could be accomplished by having the AI generate a comprehensive test suite first, and then letting it write the application code while seeing only the test suite.
Hmm, so basically automated clean room reimplementation, using coding agents? Our concepts of authorship, copying, and equivalence are getting a real workout these days!
you'd need pretty good opsec, a non-search-capable agent, and logs of all its actions/chain of thought/process to be able to truly claim a cleanroom implementation tho
The logs and traceability are the secret sauce here. It's one thing to have an artifact that mysteriously replicates the functionality of a well-known IP-protected product without just straight up copying it. It's another thing to be able to demonstrate that said artifact was generated solely from information in the public domain or otherwise legally valid to use.
if it's of interest, i was investigating this and found that all the big labs, like OpenAI, offer an indemnity clause for enterprise customers that is supposed to assure you the model doesn't output code under a non-compliant license (copyrighted, AGPL, whatever). BUT you have to accept them keeping all your logs, give them access, and let them and their lawyers build their own case if you get sued.
I guess they're mostly selling insurance to BigCos: hey, we have the money to go to court and an interest in winning such a case, so we'll handle it.
This won't apply to everyone, but if you've been writing scientific papers with LaTeX, you may have come across this issue.
You go to an online database (Inspire or ADS) to fetch some references for your paper. Then you have to copy/paste twice: the citation key into your LaTeX document, and the BibTeX entry into your .bib file. Doing redundant things is annoying, right? autobib removes the need for the latter. You still have to look up the key online and cite it in your LaTeX document, but autobib downloads the entry automatically to your .bib file.
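The core idea can be sketched in a few lines of Python. This is not autobib's actual code, and `fetch_bibtex` is a placeholder for the real lookup against Inspire or ADS:

```python
import os
import re

def ensure_bib_entry(bib_path, key, fetch_bibtex):
    """If `key` is not already in the .bib file, fetch its BibTeX
    entry and append it. `fetch_bibtex(key)` stands in for a query
    to an online database like Inspire or ADS."""
    existing = ""
    if os.path.exists(bib_path):
        with open(bib_path) as f:
            existing = f.read()
    # BibTeX entries start like '@article{Some:Key,'
    if re.search(r"@\w+\{\s*" + re.escape(key) + r"\s*,", existing):
        return False  # already present, nothing to do
    with open(bib_path, "a") as f:
        f.write(fetch_bibtex(key) + "\n")
    return True
```

In the real tool, the keys to resolve would presumably be collected from the `\cite{...}` commands in your .tex source, so citing a new key is all you ever do by hand.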
Apart from the visible changes to the user interface, I completely swapped out the foundation. iminuit consists, at its core, of Python bindings to the Minuit2 C++ library.
We used to generate those bindings with Cython, but Cython is very bad at generating bindings for C++. It does not support all modern C++ features and imposes restrictions on what you can wrap. It is also an external code generator that you have to install separately.
Cython was a real problem, so we switched to the excellent pybind11 library. It is a header-only C++ library. Generating Python bindings with it is a breeze, and it supports essentially all C++ constructs. We lost a lot of weight and awkward complexity by switching out the foundation.