Excellent project! I see that the agent modifies Google Docs using an interesting technique: convert the doc to HTML, let the AI operate over the HTML, diff the original HTML against the AI-modified HTML, and send the diff as a batchUpdate to Google Docs.
IMO, this is a better approach than the one used by Anthropic's docx editing skill.
1. Did you compare this one with other document editing agents? Did you have any other ideas on how to make AI see and make edits to documents?
2. What happens if the document is a big book? How do you manage context when loading big documents?
PS: I'm working on an AI agent for Zoho Writer (a gdocs alternative) and I've landed on a similar HTML-based approach. The difference is that I ask the AI to use my minimal commands (addnode, replacenode, removenode) to operate over the HTML, and I convert those into ops.
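For readers who want to picture the command-based variant, here is a rough sketch of applying such commands to a well-formed HTML fragment. Only the three command names come from the comment above; the command fields (`target`, `after`, `html`), the helper names, and the use of `id` attributes to address nodes are assumptions made for illustration. The conversion to editor ops is not shown.

```python
import xml.etree.ElementTree as ET


def find_by_id(root, node_id):
    """Return (parent, position, node) for the element whose id matches."""
    for parent in root.iter():
        for pos, child in enumerate(list(parent)):
            if child.get("id") == node_id:
                return parent, pos, child
    raise KeyError(node_id)


def apply_commands(html, commands):
    """Apply addnode / replacenode / removenode commands to an XHTML string.

    Assumes the fragment is well-formed XML and has no text between sibling
    elements (such tail text would be lost when a node is removed).
    """
    root = ET.fromstring(html)
    for cmd in commands:
        op = cmd["op"]
        if op == "addnode":
            # Hypothetical field: insert the new node right after `after`.
            parent, pos, _ = find_by_id(root, cmd["after"])
            parent.insert(pos + 1, ET.fromstring(cmd["html"]))
        elif op == "replacenode":
            parent, pos, old = find_by_id(root, cmd["target"])
            parent.remove(old)
            parent.insert(pos, ET.fromstring(cmd["html"]))
        elif op == "removenode":
            parent, _, old = find_by_id(root, cmd["target"])
            parent.remove(old)
        else:
            raise ValueError(f"unknown command: {op}")
    return ET.tostring(root, encoding="unicode")


doc = '<body><p id="a">Hello</p><p id="b">World</p></body>'
edited = apply_commands(doc, [
    {"op": "replacenode", "target": "b", "html": '<p id="b">There</p>'},
    {"op": "addnode", "after": "a", "html": '<h2 id="c">Title</h2>'},
    {"op": "removenode", "target": "a"},
])
print(edited)
```

The appeal of this style is that each command names the node it touches, so translating it into an editor operation does not require a diff.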
re. comparing with other editing agents - actually, I didn't find any that could work with Google Docs. Many workflows were basically "replace the whole document" - and that was a non-starter.
re. what happens if it's a big book - each "tab" in the Google Doc is a folder with its own document.xml. A top-level index.xml captures the table of contents across tabs. The agent reads index.xml and then decides what else to read. I am now improving this by giving it XPath expressions so it can directly pick the specific sections of interest.
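To make the index-then-XPath idea concrete, here is a small sketch. The XML layout below (element names `tab`, `heading`, `section`, and the `path`, `ref`, `id` attributes) is invented for illustration and is not the project's actual schema; the lookup uses the limited XPath subset supported by Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical index.xml: one entry per tab, with its headings.
INDEX_XML = """<index>
  <tab title="Overview" path="overview/document.xml">
    <heading ref="intro">Introduction</heading>
  </tab>
  <tab title="Chapters" path="chapters/document.xml">
    <heading ref="ch1">Chapter 1</heading>
    <heading ref="ch2">Chapter 2</heading>
  </tab>
</index>"""

# Hypothetical per-tab documents, keyed by the path given in the index.
DOCUMENTS = {
    "overview/document.xml":
        '<document><section id="intro"><p>Welcome.</p></section></document>',
    "chapters/document.xml":
        '<document><section id="ch1"><p>First chapter text.</p></section>'
        '<section id="ch2"><p>Second chapter text.</p></section></document>',
}


def list_outline(index_xml):
    """Return (path, ref, heading text) for every heading in the index."""
    root = ET.fromstring(index_xml)
    return [
        (tab.get("path"), heading.get("ref"), heading.text)
        for tab in root.findall("tab")
        for heading in tab.findall("heading")
    ]


def read_section(document_xml, xpath):
    """Return the text of the first node matching xpath, or None."""
    node = ET.fromstring(document_xml).find(xpath)
    if node is None:
        return None
    return "".join(node.itertext()).strip()


outline = list_outline(INDEX_XML)
path, ref, _title = outline[2]  # the agent picks "Chapter 2" from the outline
section_text = read_section(DOCUMENTS[path], f".//section[@id='{ref}']")
print(section_text)
```

Only the small index and the one selected section ever need to enter the model's context, which is the point of the approach for book-sized documents.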
Philosophically, we wanted "declarative" instead of "imperative". Our key design principle: the agent needs to "think" in terms of the business, and not worry about how to edit the document. We move all the reconciliation logic into the library, and free the agent from worrying about the Google Doc. We take the same approach in our other libraries as well.
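As a rough illustration of the declarative idea (the agent supplies the desired content and a library works out the edits), here is a plain-text sketch. It is not the project's implementation: `diff_to_requests` and `apply_requests` are hypothetical names, and the sketch ignores formatting, tables, and other structure. The request shapes follow the Google Docs API `deleteContentRange` and `insertText` batchUpdate requests. It assumes the body text starts at index 1 and that string positions match Docs indices, which holds only for characters that are a single UTF-16 code unit.

```python
import difflib


def diff_to_requests(old_text, new_text, base_index=1):
    """Turn a character-level diff into Docs-style batchUpdate requests.

    Opcodes are walked from the end of the text to the start, so each
    request's indices are still valid when it is applied in order.
    """
    matcher = difflib.SequenceMatcher(a=old_text, b=new_text, autojunk=False)
    requests = []
    for tag, i1, i2, j1, j2 in reversed(matcher.get_opcodes()):
        if tag in ("delete", "replace"):
            requests.append({"deleteContentRange": {"range": {
                "startIndex": base_index + i1,
                "endIndex": base_index + i2,
            }}})
        if tag in ("insert", "replace"):
            requests.append({"insertText": {
                "location": {"index": base_index + i1},
                "text": new_text[j1:j2],
            }})
    return requests


def apply_requests(text, requests, base_index=1):
    """Local simulation of the requests, for checking the sketch only."""
    for req in requests:
        if "deleteContentRange" in req:
            rng = req["deleteContentRange"]["range"]
            start = rng["startIndex"] - base_index
            end = rng["endIndex"] - base_index
            text = text[:start] + text[end:]
        elif "insertText" in req:
            ins = req["insertText"]
            at = ins["location"]["index"] - base_index
            text = text[:at] + ins["text"] + text[at:]
    return text


old = "The quick brown fox"
new = "The slow brown cat"
requests = diff_to_requests(old, new)
round_trip = apply_requests(old, requests)
```

Because only the changed ranges are touched, comments, suggestions, and formatting on untouched text would be left alone, which is what makes this preferable to replacing the whole document.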
How to expose my product suite's API to AI has been a roller coaster ride. First it was tool-calling hooks, then MCP, then folks found out AI is better at coding so MCPs suddenly became code-mode, then people realized skills are better at context, and now Google has launched a CLI approach.
Remember this repo is not an agent. It's just a CLI tool to operate over gsuite documents that happens to have an MCP command and a bunch of skills prebundled.
That's a new one. I guess the hope is that agents are good at navigating a CLI, and it also democratizes the ecosystem so it can be used by any agent, as opposed to Microsoft (which only allows Copilot to work in its ecosystem).
It honestly doesn't make any sense. Interestingly, India was bold enough to move its government infra to Zoho's office suite cutting all reliance on Microsoft. It's only sane that other countries do the same.
Didn't have to modify it. OG Pi is made to be extended and used as a building block, so it's closer to say I wrapped it.
As for skills, yes, and it's a virtual filesystem.
Google is really good at engineering great software and really sucks at making it enterprise-ready.
It's why they've been failing with GCP, Google Tables (shut down now, I guess), Analytics, or any product that aims for enterprise consumption. Note: they are really good at making consumer software though (take the success of Google Photos or Gsearch).
Google isn't even good at engineering great software.
They have some good people working on some good projects. If you look at the relation between the software quality of their average product and the number of developers they have... yeah, I don't know. Maybe hiring tons of new grads that are good at leetcode and then forcing them to use golang... is not what actually makes high-quality software.
I could believe that they are good at doing research though.
Most of the core products at Google are still written in pre-C++11.
I wish these services would be rewritten in Go!
That’s where a lot of the development time goes: trying to make incredibly small changes that cause cascading bugs and regressions in a massive 2000s C++ codebase that doesn’t even use smart pointers half the time.
Also, I think the outside world has a very skewed view of Go and how development happens at Google. It’s still a rather bottom-up, or at least distributed, company. It’s hard to get hundreds of teams to actually do something. Most teams just ignored those top-down “write new code in Go” directives and continued using C++, Python, and Java.
I wouldn't say most. Google is known for constantly iterating on its code internally, to the point of not getting anything done other than code churn. While there is use of raw pointers, I'd argue it's still idiomatic in C++ to use raw pointers for non-owning references that are well scoped. Using shared pointers everywhere can be overkill. That doesn't mean the codebase is pre-C++11 in style.
Rewriting a codebase in another language that has no good interop is rarely a good idea. The need to replicate multiple versions of each internal library can become incredibly taxing. Migrations need to be low risk at Google scale, and if you can't do it piecewise it's often not worth attempting either. Also worth noting that Java is just as prevalent, if not more so, in core products.
Failing with GCP? GCP has had accelerating growth the past few years, larger than the other two, and widening profit. I've used all three major clouds and overall I would choose GCP, particularly these days for their data/AI stack.
> Google is really good at engineering great software
was
While they sucked at bringing products to market and sustaining them, they did use to have a good reputation for software engineering. However, they are burning that up in the AI pivot, though it's not yet very visible externally.
Excellent work! To put the importance of this project in context: as of today there aren't many Google Docs / Word Online alternatives that are completely open source.
I have yet to dig into the code to see how pagination is implemented, but if the page breaks mimic Word's - this is huge!
It's fascinating that browsers are one of the most robust and widely available sandboxing systems, and we have yet to make a claude-code/gemini-cli-like agent that runs inside the browser.
Browsers as an agent environment open up a ton of exciting possibilities. For example, agents now have an instant way to offer UIs based on tech governed by standards (HTML/CSS) instead of platform-specific UI bindings. A way to run third-party code safely in wasm containers. A way to store information on disk with enough confidence that it won't blow up the user's disk drive. All this basically for free.
My bet is that eventually we'll end up with a powerful agentic tool that uses the browser environment to plan and execute personal agents, or to deploy business agents that don't access system resources any more than browsers do at the moment.
But there is! ChatGPT.com has a canvas feature, and that can be used to render HTML and javascript, including UI controls. It's pretty neat, albeit limited.
Generated via ChatGPT, this canvas shows a basic pyramid and has sliders that you can use to change the pyramid, and download the glTF to your local machine. You can also click the edit w/ ChatGPT and tweak the UI however you're able to prompt it into doing.
> It's fascinating that browsers are one of the most robust and widely available sandboxing systems, and we have yet to make a claude-code/gemini-cli-like agent that runs inside the browser.
It's easily explained by the fact that all the javascript code is exposed in a browser and all the network connections are trivially inspectable and blockable. It's much harder to collect data and do shady things with that level of inspectability. And it's much harder to ban alternative clients for the main paid offer. Especially if AI companies want to leave the door open to pushing ads to your conversations.
I use my phone when I want to measure stuff. Not an app, just the physical phone as a ruler. Almost always the dimensions of whatever phone I've got are published on the internet. It's a quick hack and better than carrying around A4 paper ;)
This. A language that doesn't adapt (accumulate a shitpile of baggage from other languages as times change) will eventually be a dead language.
English will always have my respect for being open/inclusive and adaptive.
Interesting fact: if you are looking for a spoken language with the cleanest, most composable grammar - it's Sanskrit. Pāṇini's grammar is actually like a programming language, where sentences are just compositions of similar lower-level units.
But like I said, it's practically dead (not used as a spoken language). Interestingly, it is used as a proxy language for translation and other NLP tasks due to its clean grammar :)
It'd be great if it supports a wasm/web backend as well.
I bet a lot of trivial text capabilities (grammar checking, autocomplete, etc) will benefit from this rather than sending everything to a hosted model.
It's possible right now with ONNX / transformers.js / TensorFlow.js - but none of them are quite there yet in terms of efficiency. Given that this targets microcontrollers, it'd be great to bring that efficiency to browsers as well.
You can compile to wasm, I have done so via the XNNPACK backend - you might have to tweak the compilation settings and upgrade the XNNPACK submodule/patch some code. But this only supports CPU, not a WebGPU or WebGL backend.
This works pretty well for me.