Friday, March 14, 2008

Analyzing Google Error in java

The following is an error that was witnessed by a Google user on July 3rd, 2006. Below I give possible explanations for what the error is saying.

****** BEGIN ERROR ******

pacemaker-alarm-delay-in-ms-overall-sum 2341989 pacemaker-alarm-delay-in-ms-total-count 7776761 cpu-utilization 1.28 cpu-speed 2800000000 timedout-queries_total 14227 num-docinfo_total 10680907 avg-latency-ms_total 3545152552 num-docinfo_total 10680907 num-docinfo-disk_total 2200918 queries_total 1229799558 e_supplemental=150000

–pagerank_cutoff_decrease_per_round=100 –pagerank_cutoff_increase_per_round=500 –parents=12,13,14,15,16,17,18,19,20,21,22,23 –pass_country_to_leaves –phil_max_doc_activation=0.5 –port_base=32311 –production –rewrite_noncompositional_compounds –rpc_resolve_unreachable_servers –scale_prvec4_to_prvec –sections_to_retrieve=body+url+compactanchors –servlets=ascorer –supplemental_tier_section=body+url+compactanchors –threaded_logging –nouse_compressed_urls –use_domain_match –nouse_experimental_indyrank –use_experimental_spamscore –use_gwd –use_query_classifier –use_spamscore –using_borg

****** END ERROR ******

Please note, I have reformatted this error to make it easier for you to read and understand. Okay, now the broad summary of this error is as follows.

The first “paragraph” is data the instance (the Google program that is throwing this error) is revealing about itself. Basically, it’s statistical data about how that instance has been running. Bear in mind I do not think this is aggregated data for the whole server, but rather data for just this one program instance. While there is definitely some interesting content here for the uber geek, what will probably interest you most is in the second paragraph.
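To make the first paragraph a little easier to poke at, here is a minimal Python sketch that splits the counter dump into name/value pairs and derives a few ratios. The counter names and numbers are copied from the error. Which counter should be divided by which is purely my assumption, and the function names (parse_stats, derived_ratios) are made up for illustration. The trailing e_supplemental=150000 is left out because it looks like the cut-off tail of a setting rather than a counter.

```python
# Counter dump copied from the first "paragraph" of the error.
RAW_STATS = (
    "pacemaker-alarm-delay-in-ms-overall-sum 2341989 "
    "pacemaker-alarm-delay-in-ms-total-count 7776761 "
    "cpu-utilization 1.28 cpu-speed 2800000000 "
    "timedout-queries_total 14227 num-docinfo_total 10680907 "
    "avg-latency-ms_total 3545152552 num-docinfo_total 10680907 "
    "num-docinfo-disk_total 2200918 queries_total 1229799558"
)


def parse_stats(raw):
    """Turn 'name value name value ...' into a dict of floats.

    A name that appears twice (num-docinfo_total) simply keeps its last value.
    """
    tokens = raw.split()
    return {name: float(value) for name, value in zip(tokens[0::2], tokens[1::2])}


def derived_ratios(stats):
    """Ratios I am guessing are meaningful; none of this is confirmed by Google."""
    queries = stats["queries_total"]
    return {
        # Assumption: the latency counter is a running sum over all queries.
        "ms_per_query": stats["avg-latency-ms_total"] / queries,
        # Assumption: timed-out queries are counted within queries_total.
        "timeout_fraction": stats["timedout-queries_total"] / queries,
        # Assumption: "disk" docinfo lookups are a subset of all docinfo lookups.
        "disk_docinfo_fraction": stats["num-docinfo-disk_total"]
        / stats["num-docinfo_total"],
    }


if __name__ == "__main__":
    for name, value in derived_ratios(parse_stats(RAW_STATS)).items():
        print(f"{name}: {value:.6f}")
```

If those guesses are right, this instance averaged a little under 3 ms per query and only a tiny fraction of its queries timed out.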

The second “paragraph” reveals how this particular instance is configured. In other words, it’s the settings Google is using for this particular instance. Please keep in mind this is just one of millions of instances, and that NOT ALL will be configured this way. You will notice that there is a - in front of each setting. This is the standard convention for passing configuration settings to a program when starting it on Linux and Unix. The - does not mean minus; it’s just there to help the program tell the settings apart when many of them are set at the same time.
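Here is a rough Python sketch of how a program might read settings written in this style. It is not Google’s parser, just an illustration of the convention. It assumes three shapes of setting: name=value, a bare name that switches something on, and a name starting with "no" that switches something off (which is how I read nouse_compressed_urls later in this post). The function name parse_flags is made up.

```python
def parse_flags(command_line):
    """Parse dash-prefixed settings into a dict.

    name=value  -> {"name": "value"}
    name        -> {"name": True}
    noname      -> {"name": False}   (assumed negation convention)
    """
    flags = {}
    for token in command_line.split():
        # Strip ASCII hyphens plus the en/em dashes that blog software
        # tends to substitute for a double hyphen.
        name = token.lstrip("-\u2013\u2014")
        if name == token or not name:
            continue  # no leading dash, so not a setting
        if "=" in name:
            key, value = name.split("=", 1)
            flags[key] = value
        elif name.startswith("no"):
            # Simplification: a real parser would check the name against a
            # list of known flags so that e.g. "notify" is not read as "no" + "tify".
            flags[name[2:]] = False
        else:
            flags[name] = True
    return flags


if __name__ == "__main__":
    sample = "--port_base=32311 --production --nouse_compressed_urls --servlets=ascorer"
    for key, value in parse_flags(sample).items():
        print(key, "=", value)
```

Values come back as strings here; a real program would also convert 32311 to a number and split lists like the one given to parents.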

So let me break this down and give you my THEORETICAL explanation for each setting. Please understand this is only a guess. I do not work for Google. This is simply one programmer analyzing an error message from a Google query. Also keep in mind that I did not make the query, so I’m missing some of the context the error relates to.

Finally, before I begin, I want to define a term I will be using often throughout this post, and that’s “instance”. Many of you should know what Apache is. It’s the web server software that powers a large portion of the web pages on the Internet. When you browse to a web site, your browser talks to Apache, which is running on the server you are talking to, and Apache sends the web page back to your browser. What many people don’t know is that in most server configurations, Apache is not running as a single program. What happens is, when Apache is started up, a master copy of Apache starts running. This master copy then creates duplicates of itself. So in essence Apache could be running as many as 500 copies of itself, each copy capable of handling one of many simultaneous requests for web pages. This is why one web server can efficiently serve up several hundred web pages in the space of a second. It’s duplicating itself to share the load. There are other reasons it does this, but this is the simplest way to explain it. Anyway, each copy of Apache that is running on the server is known as an Apache process or an Apache “instance”. So you would say most web servers on the Internet are running many instances of Apache at the same time.

If you extrapolate this idea, you will understand that each Google server that is returning search results is running a program that Google has written that is more than likely running many copies of itself at the same time on the same server.
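To make the master-and-copies idea concrete, here is a toy Python sketch of a master process that duplicates itself and lets each copy handle a share of the requests. It only runs on Unix-like systems because it uses os.fork. It is a simplification of what Apache really does (real workers pull connections from a shared listening socket instead of being handed a fixed share), and the names run_prefork and handle_request are made up.

```python
import os


def handle_request(path):
    """Stand-in for the real work of building a page."""
    return f"<html>page for {path}</html>"


def run_prefork(requests, num_workers=3):
    """Fork num_workers copies of this process; each handles a slice of requests.

    Returns {path: (worker_pid, body)} collected by the master over pipes.
    """
    children = []
    for worker_id in range(num_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: an exact copy of the master, now acting as one instance.
            try:
                os.close(read_fd)
                share = requests[worker_id::num_workers]
                lines = [f"{os.getpid()}\t{p}\t{handle_request(p)}" for p in share]
                with os.fdopen(write_fd, "w") as out:
                    out.write("\n".join(lines))
            finally:
                os._exit(0)  # never fall back into the master's code
        # Master: keep the read end and go on to create the next copy.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = {}
    for pid, read_fd in children:
        with os.fdopen(read_fd) as inp:
            for line in inp.read().splitlines():
                worker_pid, path, body = line.split("\t")
                results[path] = (int(worker_pid), body)
        os.waitpid(pid, 0)  # reap the finished instance
    return results


if __name__ == "__main__":
    for path, (pid, body) in run_prefork(["/a", "/b", "/c", "/d"], 2).items():
        print(f"instance {pid} served {path}: {body}")
```

Each line of output shows a different process id doing the serving, which is all “instance” means in the rest of this post.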

–pagerank_cutoff_decrease_per_round=100 –pagerank_cutoff_increase_per_round=500 A whole myriad of ideas pop into my head for describing these two settings. But here’s my guess. When you do a search you land on page one of the results. You can then click the next link or the page 2 link to move to the next set of results. Well, Google may be setting a minimum and maximum threshold that it evaluates against page rank to determine which pages are allowed to show on the first page as opposed to the second page. These values may be telling the instance how much to adjust those thresholds for each successive series of listings.

–parents=12,13,14,15,16,17,18,19,20,21,22,23 This setting tells this particular instance who its parents are. Parents could be defined as master instances to which this instance is a slave (probable). It could also be referring to master servers (physical boxes rather than programs), though I don’t think that’s the case. It could also be referring to some other hierarchical design Google is using to segregate its instances.

–pass_country_to_leaves This seems to be telling the instance to pass the country the query was placed in as data along with the other data it’s sending up or down its node tree. This would make sense considering Google has to filter results for places like China. It also allows them to do contextual searches based on locale.

–phil_max_doc_activation=0.5 I have no flippin idea what this is.

–port_base=32311 I’m betting this is telling this instance which TCP port to listen on for requests. It could also be indicating which version of the software is currently running, but I’m leaning toward the former.

–production This more than likely indicates that this instance is running in a live (also known as production) environment rather than as a test, which makes sense when you consider this error was retrieved from a live user on Google’s site.

–rewrite_noncompositional_compounds This refers to how the instance should or should not modify the syntax of the search that was placed. For example, common words like a, and, the can be superfluous in many searches. But I think this particular setting possibly tells this instance how to deal with compound phrases.
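As a purely hypothetical illustration of what rewriting a query could look like, the Python sketch below drops a few superfluous words and keeps known compound phrases together as a quoted unit. The word lists and the function name rewrite_query are invented for this example. I have no idea how Google actually does it.

```python
# Invented examples of compounds whose meaning is not the sum of their words.
KNOWN_COMPOUNDS = {("hot", "dog"), ("red", "herring"), ("couch", "potato")}

# Invented list of words that rarely change what a search is about.
STOP_WORDS = {"a", "an", "and", "the"}


def rewrite_query(query):
    """Drop stop words and quote known two-word compounds."""
    words = query.lower().split()
    rewritten = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in KNOWN_COMPOUNDS:
            # Keep the compound together so it is matched as a phrase.
            rewritten.append('"' + " ".join(pair) + '"')
            i += 2
        elif words[i] in STOP_WORDS:
            i += 1  # superfluous word, skip it
        else:
            rewritten.append(words[i])
            i += 1
    return " ".join(rewritten)


if __name__ == "__main__":
    print(rewrite_query("the best hot dog recipe"))
```

The point is only that “hot dog” has to be treated as one thing, because searching for “hot” and “dog” separately means something quite different.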

–rpc_resolve_unreachable_servers RPC is short for remote procedure call. This is an acronym that describes how one program contacts another program and exchanges information. The setting seems to be telling the instance to do additional checking to resolve servers it can’t initially find.

–scale_prvec4_to_prvec prvec4 could be translated to mean Page Rank Vector 4. If so, it’s telling this instance to convert Page Rank Vector 4 algorithms down to some baseline Page Rank Vector.

–sections_to_retrieve=body+url+compactanchors This is simply telling this instance what pieces of information it is responsible for finding/handling: the body, URL and compact anchors for the listings it will show. I’m not exactly sure what it means by compact anchors.

–servlets=ascorer Servlet is generally a reference to an individual instance (program) in an environment of many other instances, and is many times an indicator that Java is being used. See, because computers can do computations so fast, it doesn’t make sense to run one single program at a time. You maximize the server’s effectiveness by running LOTS of programs (which in some cases, like Apache, can actually be exact copies of each other) at the same time.

The ascorer reference could mean that this instance is being told to be a scorer or one who does scoring of listings for a request. This would fit right in with all the other configurations in here that reference Page Rank and other scoring mechanisms.

–supplemental_tier_section=body+url+compactanchors This is similar to sections_to_retrieve, except it’s probably telling this instance which information should be shown in the supplemental section of the resulting listings.

–threaded_logging This probably means that this instance should be logging its activity: maybe the search term that was used, maybe information about the user, who knows. The threaded word indicates that it’s logging in an environment where many other instances are logging to the same place. So it needs to be careful not to interfere with other logging activity.
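To show what “careful not to interfere” can mean in practice, here is a small Python sketch where several threads inside one program write to the same log. Python’s standard logging module takes a lock around each write, so every line comes out whole instead of being jumbled together with another thread’s line. This only illustrates threads within a single process. Separate instances writing to one place would need some other form of coordination, and none of this is known to be what Google’s setting does. The function name run_logging_demo is made up.

```python
import io
import logging
import threading


def run_logging_demo(num_threads=4, messages_per_thread=50):
    """Have several threads log to one shared destination; return the log lines."""
    stream = io.StringIO()  # stands in for the shared log file
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))

    logger = logging.getLogger("threaded_logging_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo out of the root logger
    logger.addHandler(handler)

    def worker():
        for n in range(messages_per_thread):
            logger.info("handled query %d", n)

    threads = [
        threading.Thread(target=worker, name=f"worker-{i}")
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    logger.removeHandler(handler)
    return stream.getvalue().splitlines()


if __name__ == "__main__":
    lines = run_logging_demo()
    print(len(lines), "log lines, first few:")
    for line in lines[:5]:
        print(" ", line)
```

The order of the lines changes from run to run, but no line is ever cut in half by another thread.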

–nouse_compressed_urls This is telling the instance not to use compressed URLs. What do they mean by compressed? Well, it could mean a WHOLE LOT of things I don’t have time to go into right now. But I’m guessing it means not to change the URL in any way when displaying it to the user.

–use_domain_match This may be telling this instance that it should be looking for related listings within a domain for each listing returned. You know how you sometimes see one listing, and then there is another listing indented right below it from the same site?
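Here is a quick Python sketch of that indented-listing behaviour as I imagine it: results from the same site are pulled together and every result after the first is indented. The two-per-site limit, the example URLs and the function name group_by_domain are all my own assumptions.

```python
from urllib.parse import urlparse


def group_by_domain(urls, max_per_domain=2):
    """Group result URLs by host, keeping first-seen order.

    The first URL for each host is shown flush left and later ones are
    indented under it, up to max_per_domain URLs per host.
    """
    grouped = {}  # host -> list of URLs (dicts keep insertion order)
    for url in urls:
        host = urlparse(url).netloc
        bucket = grouped.setdefault(host, [])
        if len(bucket) < max_per_domain:
            bucket.append(url)

    lines = []
    for bucket in grouped.values():
        first, *rest = bucket
        lines.append(first)
        lines.extend("    " + url for url in rest)
    return lines


if __name__ == "__main__":
    results = [
        "http://example.com/a",
        "http://example.org/x",
        "http://example.com/b",
        "http://example.com/c",
    ]
    print("\n".join(group_by_domain(results)))
```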

–nouse_experimental_indyrank This instance is being told NOT to use some experimental ranking mechanism they have dubbed “indyrank”.

–use_experimental_spamscore This instance is being told that it SHOULD use some experimental ranking mechanism they have dubbed “spamscore”. The name alone would imply they are trying out solutions to combat search engine spam. Very interesting.

–use_gwd I think this is telling the instance that it should look in the “Google Web Directory” for results, or maybe include results from the GWD in results from other sources. It’s just a guess, but it does have a bit of logic to it.

–use_query_classifier This is telling the instance that it should be classifying its search terms. What kind of classification? I don’t know.

–use_spamscore This could be telling the instance to use a more stable version of a spam mechanism dubbed “spamscore” along with the experimental one referenced earlier, maybe to do some kind of error checking or to make sure the results from the experimental score don’t deviate too far from its baseline spam score. It could also just tell the instance to turn on spam scoring, and the previous spamscore setting could be telling it which spam score mechanism to use.

–using_borg You may immediately think Star Trek, or at least I did when I saw this. I have a feeling this may be some reference to an internal routing system in the Google cluster as a whole. I could be totally off base on this though.

Anyway, I found this rather interesting and thought you might find it an interesting viewpoint into the inner workings of Google. Again, this is simply my THEORY about what these configurations might imply. I have absolutely no hard evidence to back any of this up. It’s just my opinion.
