by Jack Krupansky of Base Technology
This paper describes the process that I went through to produce my "software agent link" page which is designed to be a resource for myself and others who want to know what's been going on in the software agent field on the web. It was not designed for casual surfers, or to highlight the hottest links, or prioritize the relative value of any page, but to be simply a list of virtually ALL references to software agents. It doesn't really achieve that grandly ambitious goal, as this paper explains, but does do a fairly adequate job. If you follow all the links on my page, you won't miss much.
Around March 1997 I decided that software agents were going to be "the next big thing." I poked around the web a little bit with search engines and found lots of references. I started keeping bookmarks in the browser, but that list began growing very quickly. Initially I wasn't sure how to deal with the large number of hits for "agents", but gradually I evolved a strategy.
The clunky old browser I had been using, CompuServe's version of Mosaic, just wasn't up to the task. It would actually crash after I tried to view more than just a few pages of search results. Plus, I realized that the five hours of access time that CompuServe provides for its standard pricing plan just wasn't going to be enough.
Searching for "agent" got lots of hits, but they included real estate and insurance agents, among others. So I tried "software agent" and decided to go with it since the number of hits was more manageable, fewer than 5,000.
I tried several of the popular search engines, including Lycos, InfoSeek, HotBot, and AltaVista. I finally settled on AltaVista since it let me set very narrow date ranges, so I could slowly advance through all the hits a month, a week, or even one day at a time.
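The date-slicing idea can be sketched in a few lines of Python. This is only an illustration of the approach, not what I actually ran (I did it all by hand): `date_slices` cuts a period into fixed-width windows, and `narrow` keeps halving a window until it holds a manageable number of hits. The `count_hits` argument is a hypothetical stand-in for asking the search engine how many hits fall in a date range; no real search engine API is assumed.

```python
from datetime import date, timedelta


def date_slices(start, end, days):
    """Yield (first_day, last_day) windows covering start..end inclusive."""
    current = start
    while current <= end:
        last = min(current + timedelta(days=days - 1), end)
        yield current, last
        current = last + timedelta(days=1)


def narrow(first, last, count_hits, limit):
    """Halve a window until each piece has at most `limit` hits or is one day.

    `count_hits(first, last)` is a hypothetical callable supplied by the
    caller; it stands in for a date-restricted search query.
    """
    if first == last or count_hits(first, last) <= limit:
        return [(first, last)]
    middle = first + (last - first) // 2
    return (narrow(first, middle, count_hits, limit)
            + narrow(middle + timedelta(days=1), last, count_hits, limit))


# Week-at-a-time windows for one month.
weekly = list(date_slices(date(1997, 8, 1), date(1997, 8, 31), 7))
for first_day, last_day in weekly:
    print(first_day.isoformat(), "to", last_day.isoformat())


# Made-up hit counter for demonstration only: ten hits per day.
def fake_count(first, last):
    return ((last - first).days + 1) * 10


pieces = narrow(date(1997, 8, 1), date(1997, 8, 31), fake_count, 50)
```

Halving is just one way to pick the window width; stepping down from a month to a week to a day, as I did manually, would work equally well.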
I continued using CompuServe since I was having money problems and felt comfortable with their $9.95 per month plan. And my "slicing" approach ("salami" approach?) sidestepped the browser crashes.
In August I made a fair amount of progress, literally examining EVERY hit and adding the URL to my agent "link" page if it seemed even vaguely relevant to "software agents".
My task would have been much simpler if I had some specific, narrow sub-interest in mind or a specific application or even category of agent applications. But my interest was in everything having to do with agents since my goal was to try to figure out how to build a business based on "agent technology." My bias is towards tools and infrastructure, but I also realized that these days tools and infrastructure don't always make much money and highly-targeted applications can command hefty sticker prices. I'm also interested in agent research, since commercialization of research is certainly one way to make money.
I also subscribed to the "Agents Mailing List" which typically emails me several messages a day on whatever the senders seem to think is relevant to them at the moment. I captured most of the URLs mentioned in those messages (from the time I subscribed to the list) and added them to my list.
Initially I had been doing my search mostly just to get a handle on what's out there. It wasn't till August that I decided that doing a complete and exhaustive search was a valid project in itself. So I continued the search long after it was giving me any more than diminishing returns. Many of the hits were just references to URLs I had already visited.
With my access time going up and the general sluggishness of the CompuServe browser/network, I decided to switch to a simple unlimited access Internet service provider. On someone else's recommendation I went with DataBlast which gives me unlimited access for $14.95 per month. I still use CompuServe for email since thousands of people worldwide have my CompuServe email address. With DataBlast's snappier net access I was able to quickly proceed through my search results.
My search was never a full-time project, mostly just a bunch of hours each weekend. By early December I finally finished! Or so I thought. But I grew suspicious since the search through the second half of 1997 went fairly quickly. I did some experiments, re-searching a few selected time periods, and realized there was a problem. I had inadvertently started typing "Software Agent" for my AltaVista search string rather than "software agent". Lo and behold, the hits were different. So, I did a few more experiments and decided that I needed to re-search from August on. The increased hit rate slowed me considerably. I finally finished on January 25, 1998.
But there was one other anomaly that was bugging me. There just didn't seem to be many hits from around the middle of October. In fact, there was a decided drop on October 18th. There were fewer hits from October 13th till the end of the year (as well as January) than in just the week of October 7th!
I went over to HotBot, which doesn't have a date range feature, but you can select a time period ending today. I requested the last three months and got over five hundred hits. So, my conclusion is that the hits are there but AltaVista either wasn't detecting them or wasn't reporting them.
I know a little bit about how a search engine "web crawler" works and it just seems like something has gone screwy with AltaVista. But in the interests of "science" I decided that I should just finish my "experiment" and try to keep it as "controlled" as possible.
Unfortunately, my agent link page is not strictly the controlled result of the search. Like I said, I began with a bunch of bookmarks (maybe two dozen) before my search as well as URLs from the Agent Mailing List. I also added "interesting" URLs that I encountered when reviewing pages that showed up on the search.
Another problem with my search is that the phrase "software agent" is just not as inclusive as I'd like. If someone talks about "agents" but doesn't explicitly use the word "software" immediately before "agent", then there is no hit. Fortunately, people generally seem to use both the single word and the phrase. I only realized this when I visited a page referenced by a hit page and noticed that its date was earlier than my current date slice. Granted, I'm doing all this by hand and I'm only human and quite error-prone, but for this experiment I was being quite careful and paying attention to detail. I mulled over the issue for a few minutes, manually searched the mystery page and then realized the problem.
A related problem was that occasionally I'd get a hit and when I visited the hit page I could find no references to "software agent". Some of this was due to formatting and some due to the page having been updated (common for "news" pages). I'd look at the offending page very carefully, but quickly go on if I couldn't ascertain the source of the confusion. But I quickly discovered that the search engine's crawler indexes phrases a bit differently than the way the browser's Find command works. The common cause of my confusion was simply a sentence or list item ending with the word "software" and the next sentence or list item beginning with the word "agent". Sometimes these were relevant and sometimes not (e.g., when the "software" was for "agents", but for the real estate and insurance kind).
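The difference between the two kinds of matching is easy to reproduce. The sketch below is my guess at the general mechanism, not a description of AltaVista's actual indexer: `find_literal` imitates a browser's Find command (a plain substring search), while `find_indexed` imitates a word-based index that throws away punctuation and compares consecutive whole words. Both function names and the sample sentences are made up for illustration.

```python
import re


def find_literal(text, phrase):
    """Imitate a browser Find command: case-insensitive substring search."""
    return phrase.lower() in text.lower()


def find_indexed(text, phrase):
    """Imitate a word-based index: punctuation is dropped and the phrase
    matches wherever its words appear consecutively as whole words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    target = re.findall(r"[a-z0-9]+", phrase.lower())
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


# A sentence ending in "software" followed by one starting with "Agent".
boundary_page = "We sell accounting software. Agent commissions are tracked monthly."
print(find_literal(boundary_page, "software agent"))   # False: the period intervenes
print(find_indexed(boundary_page, "software agent"))   # True: punctuation is ignored

# Whole-word matching also explains why a singular search misses the plural.
plural_page = "Software agents are everywhere."
print(find_literal(plural_page, "software agent"))     # True: substring of "agents"
print(find_indexed(plural_page, "software agent"))     # False: "agents" != "agent"
```

Under these assumptions the index reports a hit the browser cannot find, and the browser finds a plural the singular index search misses.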
As of January 25, 1998, AltaVista advanced search reported a total of 3143 hits on "software agent". The distribution by year:
Year           Hits
before 1993    none
1993              2
1994             32
1995            116
1996            636
1997           1994
1998             12
The first hit was dated November 6, 1993. But just because the hits were low or non-existent in the "early" years does not necessarily mean that software agents were not talked about very much on the web during that time. Pages could have been updated or copied and some older HTTP servers don't necessarily report correct dates.
As for the AltaVista drop-off, the hit numbers are:
Period         Hits
Jan 97           87
Feb 97           91
Mar 97          130
Apr 97          181
May 97          178
Jun 97          246
Jul 97          242
Aug 97          416
Sep 97          720
Oct 97          244
  Oct 1-12      233
  Oct 13-31      11
Nov 97           16
Dec 97            6
Jan 98           12
Again, I vaguely noticed the hits dropping off in November, but the mental focus needed just to methodically slog through all these pages prevented my higher-level thinking processes from putting two and two together. After recognizing the problem in early December, I didn't notice the number of October and December hits go up by the end of the month or, for that matter by the end of January. This suggests that the problem is not just a matter of the AltaVista web crawler "catching up" (crawler "latency"), but some sort of AltaVista database update snafu.
Once I had completed the "data gathering" phase of the experiment, I was finally able to turn my brain's higher-level thought processes back on. I quickly realized a number of issues:
Trying several other popular search engines for "software agent":
Engine           Hits     Notes
Lycos              345    ? only last few months?
InfoSeek         6,054    But plural doesn't affect results
Excite               -    Lots of hits, but no count
HotBot           5,209
Northern Light   8,467
AltaVista        3,143    Advanced Search
Some other interesting searches:
My link list has both the raw URL as well as descriptive titles. These titles are usually just a phrase or simple one-liner. Unfortunately, the titling of MOST pages was not sufficient for my needs. The common problems:
In most cases it was quite easy for me to just glance at the first screen of the page and simply copy and paste some text with just a little editorial work to contrive a nice one-liner. If someone's personal resource page had sufficient agent links to warrant inclusion in my list, I usually just added the person's name to the phrase "agent links". In some cases it actually took me a fair amount of searching backwards in the URL path to find the person's name.
In many, but not all, cases I was able to synthesize a reasonable title from the first paragraph of text. Frequently some phrasing could be exactly copied. Sometimes some phrasing could be rearranged. But sometimes I had to sit and reread the "introduction" for a couple of minutes before I could come up with a one-liner that adequately described the page for my list. Sometimes I took the easy route and used several lines of narrative text to describe the page.
Most documents were in English, but a number of them were in European languages or Japanese. A lot of these pages were reported by the search engine due to references to English sites or papers.
I added many of these hits to my list if it "seemed" that the material was relevant to software agents.
The search engine would report hits for each page in a site that referred to software agents (singular, of course). I then had to decide whether to just put in one entry for the whole site or to add separate entries for all or some of the individual pages. Most situations were cut and dried. Some exceptions:
Sometimes (maybe 1 out of 20) I was unable to visit the page for a hit. The common causes:
If the connection could not be made, I would usually try a few times before moving on. If the page was "obviously" relevant and I could easily use or synthesize its title, then I added it to the list anyway.
In some cases the owner of the page was kind enough to create a "forward" link when the page was moved. I didn't think about it at the time, but the owner might typically have added a forward link only for the home page of their site (e.g., "site/home page has moved") rather than for each individual page, so if I had just rummaged around for the owner's home page then I might have found more of these "lost" pages.
Now that I have this list, I will certainly add to it as I run across agent references. But I have no intention of doing any more exhaustive searching since I'm exhausted! But I might very well continue to do incremental enhancement to the list on a weekly or monthly basis.
I'm considering how to best format (a two-column table with description and URL as columns?), structure, and organize the list. I could split the large list into obvious sections such as conferences (as well as coming conferences versus historical references), research labs, projects, products, companies, people, etc. Or I could just tag each entry with a code for the relevant categories and keep the single list intact. I like having one list since I can see it all, all at once. But I could still have multiple copies of the list, each sorted for relevancy to some category/purpose.
I was careful to use "structured text" to make it easy for a simple program to extract the information from the list so that it could be easily reorganized or placed in a database.
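As a rough illustration of what such a "simple program" might look like, here is a minimal Python sketch. The layout it assumes (one or more description lines, then the raw URL on the last line, with a blank line between entries) is a guess made for the example and may not match the actual list; the sample entries and `example.com` URLs are invented.

```python
import re


def parse_entries(text):
    """Split structured text into {'title', 'url'} records.

    Assumed layout (hypothetical): description line(s), then the URL on
    the final line of each entry; entries are separated by blank lines.
    """
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 2:
            continue  # skip fragments that lack a description or a URL
        *description, url = lines
        entries.append({"title": " ".join(description), "url": url})
    return entries


sample = """Agent links maintained by a hypothetical researcher
http://example.com/agents/links.html

A longer description that
runs over two lines
http://example.org/lab/agents.html
"""

entries = parse_entries(sample)
for entry in entries:
    print(entry["title"], "|", entry["url"])
```

Once the entries are records like these, writing them out as a two-column table or loading them into a database is a small step.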
I thought of doing a better job of categorizing the list or at least adding codes for each entry, but it would have been simply too much additional work. The project took a lot longer than I anticipated as it was.
My hope is that my experiences with this project will lead to the development of a software agent which could recreate my work automatically at any time. This agent could then be continually tuned to improve its filtering or efficiency. The entire list could then be regenerated to be more accurate. This agent could run on an incremental basis to catch future hits. It could also go back and detect dead pages which can no longer be accessed or whose content has been changed to eliminate their relevancy.
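Two pieces of such an agent are simple enough to sketch now: merging a new batch of hits into the list without re-adding URLs already visited, and re-checking old entries for dead or no-longer-relevant pages. This is a sketch under stated assumptions, not a design. The `fetch` argument is a hypothetical callable that returns a page's text, or `None` if the page cannot be reached, so that any retrieval method can be plugged in; the status names "dead", "stale", and "ok" are also invented here.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce trivial URL variations so repeat hits compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def merge_hits(known, new_hits):
    """Add unseen URLs to the `known` set; return only the new ones."""
    added = []
    for url in new_hits:
        key = normalize(url)
        if key not in known:
            known.add(key)
            added.append(url)
    return added


def check_page(url, fetch, phrase="software agent"):
    """Classify a listed page as 'dead', 'stale', or 'ok'.

    `fetch(url)` is a hypothetical callable: page text, or None if the
    page is unreachable.
    """
    text = fetch(url)
    if text is None:
        return "dead"
    if phrase.lower() not in text.lower():
        return "stale"  # reachable, but the content no longer mentions the phrase
    return "ok"


# Demonstration with made-up URLs and an in-memory stand-in for the web.
known = set()
first_run = merge_hits(known, ["http://example.com/agents.html"])
second_run = merge_hits(known, ["HTTP://Example.COM/agents.html#top",
                                "http://example.org/news.html"])

fake_web = {
    "http://example.com/agents.html": "A survey of software agent toolkits.",
    "http://example.org/news.html": "This week's headlines.",
}
statuses = {url: check_page(url, fake_web.get)
            for url in list(fake_web) + ["http://example.net/gone.html"]}
print(second_run)
print(statuses)
```

Running `merge_hits` on each new date slice gives the incremental behavior; running `check_page` over the whole list from time to time catches pages that have disappeared or changed.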
The agent could have options and categories or specializations so individual users or researchers could dynamically generate subset lists that met their individual needs, possibly with additional search criteria.
And obviously users would like to be able to be put on a mailing list to be notified of additions or changes. The only real addition to a traditional Internet mailing list would be that the user could tune their special interests to be much narrower than everything about agents or all types of agents. For example, someone might only want to know about agent development tools which have been commercially available for at least a year. Or someone else might want the latest and greatest experimental development tools that are available for free. To each his own. That's one of the qualities that the agents of the future should be able to deliver.
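To make the two example interests concrete, here is a small Python sketch of how narrow subscriptions might be expressed as filters over tagged entries. Everything in it is hypothetical: the `Entry` fields, the category codes, and the sample entries are invented for illustration, since my list carries no such tags today.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    title: str
    categories: frozenset  # hypothetical category codes, e.g. {"tools"}
    commercial: bool
    free: bool
    released: date


def mature_commercial_tools(entry, today):
    """Development tools commercially available for at least a year."""
    return ("tools" in entry.categories and entry.commercial
            and (today - entry.released).days >= 365)


def free_experimental_tools(entry, today):
    """Experimental development tools that are available for free."""
    return ("tools" in entry.categories
            and "experimental" in entry.categories and entry.free)


def matches(entries, interest, today):
    """Titles of the entries that one subscriber should hear about."""
    return [e.title for e in entries if interest(e, today)]


today = date(1998, 1, 25)
entries = [
    Entry("Toolkit A", frozenset({"tools"}), True, False, date(1996, 6, 1)),
    Entry("Toolkit B", frozenset({"tools", "experimental"}), False, True, date(1997, 12, 1)),
    Entry("Conference C", frozenset({"conferences"}), False, True, date(1997, 3, 1)),
]

print(matches(entries, mature_commercial_tools, today))
print(matches(entries, free_experimental_tools, today))
```

Each subscriber's interest is just a function, so new kinds of interests can be added without touching the list itself.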
TBD
Seriously, draw your own and let me know. I'm still thinking about what I really learned from this experiment and how I would do it again or how I would construct AND control an agent to do it for me.
Please contact us with any questions or comments.
Updated: August 25, 2001 04:53:04 PM -0600
Copyright © 2001 John W. Krupansky d/b/a Base Technology