Seems like this is based on Google's own internal code search tooling, something most engineers at Google rely on for everyday code-level work.
I personally can't even begin to imagine how I'd navigate the gigantic codebase without it.
It’s also used for https://source.chromium.org. I now host my monorepo on Cloud Source Repositories because it has a super nice integration with the rest of their products.
What is the constant phone-home activity on that opaque container they ship as Sourcegraph? It is occasionally the case that devs have too-fast machines, so their code never gets exercised on ordinary equipment. With Sourcegraph and other inner-network dev tools, the amount of chatty traffic and build dependencies seems seriously off-putting, trending toward useless on an ordinary network.
This seems like a bad attitude. Perhaps you could constructively ask for a sourcegraph-lite that does less, in return for fewer deps and less networking complexity?
I am a dev at Sourcegraph, I'd be very open to any feedback.
You can firewall off Sourcegraph 100% for complete confidence, and aside from the first admin's email address (so we can notify them of any security updates) we only send back aggregated anonymous usage statistics which we are extremely transparent about: https://docs.sourcegraph.com/admin/pings
You are 100% correct, I really messed up here by suggesting that option. I misread our own docs. It would only disable event counts from being sent (e.g. instead of "how many jump-to-definitions were performed in a day?" we would just send a boolean "did one or more jump-to-definitions occur in a day?" based on my reading of the code[1]) -- not what I thought it did. Will send a PR to clarify the docs on this so I don't mess up like this again.
I'm human and screw up, frequently; this instance just happened to be on the ridiculously important topic of privacy -- hopefully you will forgive me for that, I wasn't trying to be malicious but certainly in retrospect I can see this being interpreted as such.. :/
The right option to turn it all off is just this one: since we only send ping data as part of the version update check, you disable that and it's all off. And you can confirm this in the code, as I just did here[2][3]: https://docs.sourcegraph.com/admin/config/site_config#update... And as I mentioned previously, you can always firewall off Sourcegraph 100%.
As an aside, I can promise you that I wouldn't have continued to work at Sourcegraph for the last 5 years if I thought our business was selling or collecting identifiable user data in ANY form. We only collect just enough information to help prioritize what features we improve and (aside from the first admin's email as I noted already above) it is all 100% anonymous and aggregated numbers that we are extremely transparent about[4]. Our person running analytics is also constantly trying to make this more transparent[5] because we all are very security and privacy aware and know the #1 way to convince people to not run software is to make them think you are spying on them or using their data in ways they would not want.
It's obvious to me this should be more clear in our docs, I'm going to forward all of this conversation onto the rest of our team to make sure we improve our docs here.
If you already know how to index, this is a completely open source alternative, likely with fewer bells and whistles.
I worked at Google and miss Code Search. But I also have lots of ideas for how one can go beyond the status quo for code reading and debugging. Join if interested.
Googler here. We have the same Code Search tool internally, this is honestly one of my favorite things about working at Google. Great to see this open sourced.
This is missing tons of functionality and layers that the internal one has, though, like all of the automatic code analysis and linting, coverage and fuzzing integration, etc.
They seem to have reduced the information density and killed readability as well with the "material ui" redesign. The old code search UI was perfect, with enough contrast to allow you to quickly grep through xrefs to locate relevant entries. The page itself was lightweight as well.
Compare that to the redesign where each xref jump has me staring at a spinner for half a second, the xref bar has no visual separation between type, filename, and code snippet, all buttons visually indistinguishable with a blue on white color scheme, the "layers" dropdown is replaced with a mishmash of buttons scattered across the layout, etc.
I really hope they're not forcing this abhorrent redesign on their developers as well.
Really loved this interface (also cs.chromium.org) while I worked at Google. It was easy for me to orient myself, find what uses this or that and where it's being used, and then it had a whole "debugging" facility:
You select your binary on Borg (think Kubernetes/Docker), and it'll fetch from the binary the CL (think Perforce changelist) it was built at, and/or additional cherry-picked CLs, then it'll somehow go back in time and present how the source code looked then.
Later (I tried it in Java, but I believe it's available for other languages too) you could inject statements right at the beginning of a function (a kind of breakpoint), and that statement could be something like "let's log how this function was called" - you were able to reference nearby statements. This could be set from the command line, and took a bit of mastery (I was a bit afraid the first time using it; it had a chilling effect on me), but then my task (with 10 or 11 instances) reported these log lines, and I was able to see them in the browser.
(I have no experience with GCP, or the public face of Google Cloud, so I don't know what's available there), but this was freakin cool.
That "codesearch" is only superficially related to this one. The main feature of _this_ codesearch that makes it so useful is the cross references to callers, callees, and overrides. Ye olde codesearch has more in common with things like livegrep.
> The main feature of _this_ codesearch that makes it so useful is the cross references to callers, callees, and overrides. Ye olde codesearch has more in common with things like livegrep.
However this part of internal codesearch is the one part that is actually (partially) open sourced: kythe.io
Working with chromium/v8, I can honestly say google's code search infra is one of the most valuable resources available. I really hope they open source the backend at some point.
The backend is open sourced, it is Kythe.io. It supports Go, C++, and Java out of the box, for some definition of out of the box. Maybe even TypeScript. Also, cross-references for protobuf-generated code work if you make the stars align ;)
As for UI, treetide/underhood I mention elsewhere is the only open option now.
But Kythe comes with command line utils and an API you can query directly as well.
What is missing from the open source is a production-ready parallel serving table builder. There is one in Go which uses Apache Beam, but last time I checked the Go workers are not well supported on the Flink runner. It didn't even work properly on the GCP runner. Hope this will change.
Question for Googlers or others: What do you think is the most well-written piece of software produced by Google? I would like to study how the world's best engineers write code. (Preferably C++, as it's the language I'm most familiar with.)
By necessity, Abseil is full of dark template magic that would very rarely be used elsewhere in the codebase. That's the point - it encapsulates a lot of useful abstractions and allows them to be used without the client code author thinking about the guts of the abstraction. But it makes it pretty unusual relative to typical Google C++.
True for much of it, but if you look at something like cord.h, it's almost free of template programming. Google C++ application code isn't all that spiffy, to be honest. I would say most of the code is dedicated to stuff that nobody outside of Google is going to care about. I think the base libraries are more interesting.
Note that you'd only be seeing the final result, not the whole process by studying source code. Also, I'd say definition of good code varies by domain.
Nice. I'm grateful for this being posted on HN, because discoverability of that page seems to be zero (I couldn't find any link to it from opensource.google). It doesn't even have a page title, so googling for it would be more complicated too.
I don't know the state of it, but Kythe is open source: https://kythe.io/
But in reality you probably want something more like SourceGraph which packages everything up nicely so that you don't need to worry about it, or something more specialized.
cs.opensource.google runs on Google's search infrastructure, so it's unlikely to be open sourced. https://github.com/google/zoekt is open source, but lacks cross-referencing, and has a more spartan UI.
If you want to do code search on your private git, mercurial, svn, cvs and other repositories try a fully open source opengrok (https://github.com/oracle/opengrok).
It’s easy to self-install and use, with good documentation, and as an added bonus it's very fast.
If you think about Google's codebase size, an IDE wouldn't cut it. You could load and analyze dependencies/imports as you go, but that would make for a terrible user experience (think about IntelliJ indexing task every time you want to check the definition of something).
Also, Code Search has a lot of goodies baked in: a history layer, cross-references, call sites, ... and it's snappy. Moreover, it's really well integrated with all the other internal tools used for coverage, code analysis, issue tracking, the web text editor, and so on.
I think an IDE (like IntelliJ IDEA) can't reach that level of integration with several other systems unless you fully buy into the ecosystem a company like JetBrains proposes (their issue tracker, their code review tool, ...).
So, summarizing, it's a tool made by Googlers for Googlers' needs, and it's amazing to use every day for all the above reasons.
You can search for non-explicit dependencies. E.g. if you're removing a command-line flag from a C++ binary, you can search for all uses of that flag across all users of your binary to make sure it is safe to remove.
Nice to see GN there. I wish more people knew about it.
For me it's as powerful as Bazel, but without the need for a JVM and all the insanity that comes with it in a desktop/dev environment.
The syntax is great and powerful (insane customization), and together with Ninja there's nothing like it.
It's in C++, and even being as powerful as Bazel, it's a light, standalone tool that can handle a huge amount of source code, dependencies, tools, and configurations.
Having tried to battle GN configs... I don't agree.
I was working on a big source tree and got frustrated that it kept rebuilding files that hadn't changed just because I switched git branches to look at one file, and then suddenly "Yay, another 18 hour full rebuild!".
I tried to fix it and found there is no option to ignore file timestamps, and someone has tried to patch it to do that[1]... But the patch requires putting an option in GN files, which seems to break them wherever I put it... I tried to patch GN, but it wouldn't ever seem to pass that option through... Ended up patching Ninja to always have the option on, but then random other operations broke (like simple file copies).
A day wasted, and problem not solved. Maybe my use case isn't common, or a bad workman blames his tools, but for me at least it wasn't a nice experience.
Sidebar question - anyone know how they've made the interaction/animation on this page [1]? It feels like a great way to show a lot of info in a concise way.
Agreed, it is a very nice little interaction! It seems like they're animating the bubbles around a circle while randomly fluctuating the speed and radius at which they rotate. Clicking on a bubble centers it by setting the rotation radius to `0` and expanding the size.
Would be interested to know how they expand the bubbles as your cursor moves closer.
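Just guessing at the mechanics described above, the per-frame math could be sketched like this (Python for illustration only; the class, easing factor, and fluctuation range are all made up, and the real page is presumably JavaScript):

```python
import math
import random

class Bubble:
    """A bubble orbiting a center point with gently fluctuating speed;
    'selecting' it eases the orbit radius toward 0, centering the bubble."""

    def __init__(self, cx, cy, radius, speed):
        self.cx, self.cy = cx, cy
        self.base_radius = radius
        self.radius = radius
        self.angle = random.uniform(0, 2 * math.pi)
        self.speed = speed  # angular speed, radians/second
        self.selected = False

    def step(self, dt):
        # Randomly nudge the angular speed a little each frame.
        self.speed += random.uniform(-0.01, 0.01)
        self.angle += self.speed * dt
        # Ease the orbit radius toward 0 when selected, or back toward
        # its base radius otherwise (simple exponential easing).
        target = 0.0 if self.selected else self.base_radius
        self.radius += (target - self.radius) * min(1.0, 5.0 * dt)

    def position(self):
        return (self.cx + self.radius * math.cos(self.angle),
                self.cy + self.radius * math.sin(self.angle))
```

The cursor-proximity growth would presumably be the same idea: ease each bubble's size toward a target that depends on its distance to the pointer.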
First impression is that it enables discoverability of code across the open sourced Google projects, but trying to find this page even on Google search is not a thing yet. Is that intentional?
So far it doesn't seem to index a lot of stuff. I searched for some terms out of my kubernetes/openshift dependencies and it didn't find them. Is this correct?
Sourcegraph CEO here. This is the same underlying code search offered for a while by Google Cloud Source Repositories for private code, and it’s cool to see this usable for Google’s own open-source code, too.
Lots of Xooglers and current Googlers use Sourcegraph, too. Just mentioning Sourcegraph because I’ve seen several other folks mention us in the comments (thanks!).
Yes! Publish a GraphQL schema that is parseable by Apollo GraphQL, please! I tried to use your API on my company's internal Sourcegraph setup and had to hand-roll my API calls because of these errors.
Seems like this is in the works already, but the boolean operators in OpenGrok are so intuitive and powerful. I use them every single day and the lack of support in Sourcegraph immediately disqualified it for us. For example yesterday I was looking for Dropwizard Managed classes not annotated with @Singleton so I did:
Managed && !"@Singleton"
(I'm omitting the fully-qualified class name for brevity)
If I also wanted to look for HealthCheck classes I could update the query to:
(Managed || HealthCheck) && !"@Singleton"
I think it also helps that OpenGrok has a separate input for filtering file paths (completely splitting the "where" and the "what" parts of the query). And this file path search supports the same boolean operators. So if I want to narrow my search to two particular repositories I could put CrmSearch || AutomationPlatform into the File Path input. And because this input only handles file paths, I don't need to remember any special syntax. Whereas if you clump the entire query into a single input, then users need a way to tell you whether a search term applies to file paths or file contents.
Engineer at Sourcegraph here. Adding boolean operators is a priority on our roadmap, and expected to go live between May and July this year. On separate inputs: definitely something we've also identified and are actively working on. One recent experimental addition is "Interactive mode" that lets you enter patterns separately for repos, files, and patterns, and so on. There's a dropdown next to the query bar to try it out--there are some kinks, and we're currently working on making this a polished feature. Thanks for the feedback, and stay tuned!
Yes, the extension supports C, C++, and C#! Sourcegraph supports over 30 languages out of the box using our basic code intelligence (search based heuristics and ctags).
https://github.com/sourcegraph/sourcegraph/blob/master/LICEN... says: "LICENSE.apache (Apache License) applies to all files in this repository, except for those in the enterprise/ and web/src/enterprise/ directories, which are covered by LICENSE.enterprise."
Thus, Apache 2.0 and some custom license requiring you to accept the terms, have a correct number of seats, and does not allow you to "copy, merge, publish, distribute, sublicense, and/or sell the Software."
I am not sure what all of this means, though. Better check out the licenses yourself :)
It's open core (Apache 2 + some non-OSS parts for enterprise features). All of the code is public and we develop in the open at https://github.com/sourcegraph/sourcegraph.
Sourcegraph only supports git repositories, so it's not very useful for enterprises with mercurial, svn, or other version control systems.
There is another open source application for code search, OpenGrok [1] (it's completely open source, unlike Sourcegraph, and supports multiple version control systems besides git).
Take a look. It's easy to install and operate on bare metal, cloud, and containers, instead of the convoluted Sourcegraph way of Kubernetes or Docker.
Sourcegraph is open core like how GitLab and VS Code are open core. You can run "Sourcegraph OSS" and get limited features, or you can run Sourcegraph (see https://docs.sourcegraph.com/#quickstart) and get all the features, but you need a license key when you hit the user limit.
I really hate that some of the elements on the page are translated into a different language, seemingly based on my IP. When did it become acceptable to ignore my browser or my system language settings? The same thing happens on other Google services (like Google Groups), but I noticed this trend on other websites too.
This annoys me to no end. I live in Hong Kong. I speak English. We have 2 official languages here, one of which is English. I travel frequently to Japan as well, with infrequent trips to either Europe or North America.
My 'preferences' and settings are a total disaster. I end up having to go onto the gray market to buy gift cards and prepaid credit cards as I seemingly never can buy stuff online when I want to, as I'm either in the wrong place, or in the wrong language. But I know I'm still me.
What is with this "100% of people in this location read/speak the same language" assumption?
What if I want to learn Russian, but I'm in China? Why can't I just tell my computer to show me Russian, and have the browser tell the site "give me Russian if you have it"?
Why is this so hard?
I really dislike things that try to make it easy for me, as all they do is prevent me from being able to function.
The actual mechanism by which your User Agent tells the server which languages you are interested in is even more robust [0]! It is a weighted list of preferences.
The reasons I've heard from web developers for why they don't use this are that they believe the user probably never set it up right, and that multiple people could be using the same browser, so they need to be able to do the right thing anyway.
What I typically do is select the best matching language from the Accept-Language HTTP header, and then override it with a session-specific value IF one is supplied. Example:
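A minimal sketch of that approach (hypothetical function names; a real implementation would follow the Accept-Language quality-value rules from RFC 7231 more strictly):

```python
def parse_accept_language(header):
    """Parse 'en-GB,en;q=0.8,fr;q=0.5' into [('en-gb', 1.0), ...],
    sorted by descending q-weight."""
    langs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        if ";q=" in part:
            tag, q = part.split(";q=", 1)
            try:
                weight = float(q)
            except ValueError:
                weight = 0.0
        else:
            tag, weight = part, 1.0
        langs.append((tag.strip().lower(), weight))
    return sorted(langs, key=lambda lw: lw[1], reverse=True)

def choose_language(header, supported, session_override=None, default="en"):
    """Pick the best supported match from the header, but let an
    explicit session/user preference win outright."""
    if session_override in supported:
        return session_override
    for tag, _weight in parse_accept_language(header):
        if tag in supported:
            return tag
        primary = tag.split("-")[0]  # 'en-gb' also matches plain 'en'
        if primary in supported:
            return primary
    return default
```

So a visitor sending `fr-CH,fr;q=0.9,en;q=0.8` against a site supporting only `en` and `de` gets English, unless they explicitly picked German in a session setting.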
You can see PART of the problem here from the web developer's perspective. This isn't a negotiation, so you have no way to know which languages the server supports. If your preferences aren't totally inclusive, you'll get something "wrong". This can be solved by exposing that information (as not done above) and allowing the user to override it (as above).
The problem is that every big website operator wants to make it work correctly for you, and they (1) have different definitions for success and (2) assume you're incompetent.
In the first point, there's someone with a requirements document that assumes every country has one official language and everyone in that country speaks that language, and so feels successful and internationalization-ready when a geo-IP served page is automatically switched to the "correct" language (much like "Falsehoods programmers believe about names").
Second, configuring a computer's locale to set a browser's request headers correctly is beyond the technical expertise of many users. It would be better if things were consistent, but at the point where some locales were set incorrectly and some were set intentionally, your analytics would have shown that you improved the situation on average by guessing the locale (screwing over the users who knew how to use their computers) rather than by respecting it and eventually getting everyone to understand how to set their desired language.
If you install a Hungarian Firefox, the Accept-Language header will reflect this (or it did when I tried it last time). Non-expert users also often choose software localized in their language. I don't have numbers, but I wouldn't be surprised if a lot of browsers were sending correct Accept-Language headers.
I don't know about IE, but it was in a very good position to guess the language of the user as well.
Agreed, but then they should give the users a way of overriding that. Google in particular knows what users want because they force you to answer it when you create your account. I've set every single location setting to UK, and yet when I travel abroad they insist on ignoring it. No excuse there.
but for some unknown reason, some components of the page are in Russian. I'm not in Russia; nothing in my browser request indicates I'd like to read Russian.
I live in the US and have a local US IP. A year or so ago, I made a site for a side project using vanilla HTML. No frameworks, no JS. Every word on the page was in English and could be found in an English dictionary.
When I first stood the site up and tested it, Chrome would always break in as soon as it loaded, with a popup to translate the site into _English_ from _Romanian_!
I was able to suppress this only by turning on every single language hint in META.
I live in Switzerland and it is really annoying, as my IP switches between the German and the French part every few weeks.
I also wonder what happens in places where people have traditionally always come from different language groups. I don't know whether there is always a single common language there.
No, it's definitely a mix, like the example in my parent comment describes. If I change my Accept header to remove the EN priority, I get the French translation that matches my French IP.
For me, on the main page it's mostly tooltips (More Elements in the navbar, Help in the searchbar) and the blue Show Project link. It's worse on the project pages where the description is in English, but the entire table (including dates) is localized.
Hasn't that been acceptable since the dawn of the web?
Lots of people log errors to some sort of monitoring system. I can't remember seeing any localisation/translation API that would log an error rather than just silently serve English. I infer from this that just serving English is universally accepted and considering it an error is so rare that I've yet to see an API that caters to it.
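For what it's worth, the API the parent describes is easy to sketch: a lookup that logs the miss instead of silently serving English (toy catalog and names, purely illustrative):

```python
import logging

logger = logging.getLogger("i18n")

# Toy message catalog; in practice this would come from .po files,
# a translation service, etc.
CATALOG = {
    "en": {"greeting": "Hello"},
    "de": {"greeting": "Hallo"},
}

def translate(key, locale, fallback="en"):
    """Return the translation for `key` in `locale`, falling back to
    English -- but logging the miss rather than hiding it."""
    messages = CATALOG.get(locale)
    if messages is not None and key in messages:
        return messages[key]
    logger.warning("missing translation: key=%r locale=%r, serving %r",
                   key, locale, fallback)
    return CATALOG[fallback][key]
```

Wiring those warnings into a monitoring system would make "we silently served English" a visible, countable event instead of an invisible default.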
I see someone downvoted, and today's a frustrating day, so let me ask for more.
English is basically the world's default language (like it or not). Sites that translate partially, but sometimes show English text instead of the language specified by the browser expect the user to understand the world's default.
A language inferred from geoip is the user's area's default language. Sites that show that language instead of that specified by the browser expect the user to understand the area's default.
These two behaviours seem really quite similar to me. Their technical backgrounds differ, but the resulting behaviour is much the same. One has been widely accepted since the dawn of the web, AFAICT, which leads me to believe that the other has been just as acceptable for just as long. And thus, my answer to "when did it become acceptable" is "it always was, you just didn't notice".
It's part of the "localization" push by governments, news/media, some consumers and tech companies themselves. I guess it's okay for passive consumers, but for tech or advanced/active consumers, it's annoying. I'm pretty sure most major sites/apps/etc all localize. So your google search results, youtube frontpage, etc will be different based on your location.
Localization is not a problem; not giving users control over their locale is. If you travel to a foreign country and suddenly can't read anything on the websites you regularly visit, that's pretty bad. If multiple languages are spoken in your country and you're forced to use one you don't speak, that's bad as well. Websites should never assume they know better which locale their users want than the users themselves.
I didn't say it was a problem for most users ( aka passive users ). I said it's a problem because they make it difficult/impossible for tech/active/advanced users to switch it.
> Websites should never assume they know better which locale their users want than the users themselves.
Yes. You just restated my comment. Not allowing tech/active/advanced users the option is the problem. I love comments that appear to debunk what you wrote but just restate it in a different way and pretend it is new.
Maybe you think that what I wrote was already implicit in your comment, but I'm still not seeing it, and since you got downvoted, evidently a few others felt the same. Next time you'd probably better write it out explicitly.
I am a grad student right now with 2 years of industry experience. Google still prefers people who are extremely good at Data Structures and Algorithms. I like doing them, but not so much to just grind them for the sake of getting into Google. I like to learn how to design big systems and grinding Data Structures and Algorithms seems like a waste of time.
I put in "only" 40 hours of refreshing on data structures and algorithms, and doing some practice coding problems, in the weeks leading up to my interview. And I got the job.
Frankly, it's been the best hourly return on investment of anything I've done in my life up to this point, by far. Assuming I wouldn't have gotten the job otherwise (which seems reasonable), each of those hours spent studying has proven to be worth several tens of thousands of dollars. I'm not exaggerating; I just did the math.
Maybe the interviewing process is broken or sub-optimal or whatever, but it is what it is, and if you can get through it by doing some additional studying, then it's absolutely worth it. Google is a good place to work on designing big systems, so if that's your interest, consider just putting in the work.
This is solid advice. Thanks! I will try to dedicate a portion of my day to brushing up on Data Structures and Algorithms, and maybe, eventually, I will get good enough to crack the interview.
Yeah, but they need people good at Data Structures and Algorithms. I think I am above average but nowhere near the quality of people that they hire. Also, I am more interested in designing big systems end-to-end and feel that doing a lot of Data Structures and Algorithms is a waste of time.
That's a bit of an oversimplification. Google's failures often garner more marketplace traction than other companies' wildest success stories.
But at the end of the day they are an advertising company. Any product that doesn't help them sell advertising -- and lots of it -- will eventually impede the progress of the careers of the managers and employees who work on it. That's when the axe falls... not when a product "fails," necessarily, but when it's no longer "sexy."
Waymo is (a) no longer affiliated with Google, and (b) basically a hobby project. It is comparable to Blue Origin for Bezos, any of Musk's numerous speculative ventures in fields from tunnel-digging to AI, or the original AppleTV for Steve Jobs.
People who work there are very well aware of that, and are OK with it. No one outside the company should allow their own business or career path to depend on Waymo, at this stage. It could vanish tomorrow at the whims of the Alphabet execs and/or directors, because their own business doesn't depend on it.
Lol Google was one company till 2015. Alphabet and Google have the same CEO. Just shows your biases in trying to prove your point. It was Google who pumped in money in Waymo. Have fun using ddg and Firefox.
That is not true at all. You are regurgitating half-baked theories of HNers who are just salty at Google (some balk at their success, some didn't get in, etc.). Google's going to keep changing the world. HN is going to keep complaining.
(I work at Google)