Software agents under the hood: What do their guts look like?
Antarctica Starts Here. 2015-12-22
Summary:
In my last post I went into the history of semi-autonomous software agents in a fair amount of detail, going as far back as the late 1970's and the beginning of formal research in the field in the early 1980's. Now I'm going to pop open the hood and go into some detail about how agents are architected: how they work, some design issues and constraints, and some of the other technologies they can use or bridge. I'm also going to talk a little about agents' communication protocols, both those used to communicate amongst themselves and those used to communicate with their users.
Software agents are meant to run autonomously once they're activated on their home system. They connect to whatever resources are set in their configuration files and then tend to settle into a poll-wait loop, where they hit their configured resources about as fast as the operating system will let them. Each time they hit their resources they look for a change in state or a new event, and they examine every change detected to see if it fits their programmed criteria. If there's a match the agent fires an event, then goes back to its poll-wait loop. Other agents use a scheduler design pattern instead of a poll-wait loop: they ping their data sources periodically but then go to sleep for a certain period of time, which can be anywhere from a minute to days or even months. This reduces CPU load (a poll-wait loop can hit a resource dozens or even hundreds of times a second, which causes the CPU to spend most of its time waiting for I/O to finish) as well as network utilization. Some agents may be designed to sleep by default but register themselves with an external scheduler process that wakes them up somehow, possibly by sending them a command over IPC or using an OS signal to touch them off.

At first glance, software agents seem pretty straightforward to design. All an agent has to do is start up, fork(2) itself into the background, and tickle some server someplace every once in a while. Right? Not quite. Conceptually speaking that's a decent high-level explanation, but from a technical perspective it's naive in the sense that it skips over a lot of design issues that an aspiring agent developer needs to be cognizant of. The first issue, which I mentioned briefly above the cut, is scheduling. How do you schedule your agents to run so that they don't step on one another and lock up? How do you prevent two or more of your agents from colliding when they both try to access a shared resource? How do you schedule them so they don't get throttled or banned outright from some service they're using on your behalf, because you're violating the service's terms of service or accidentally DoSing it? Does the user actually have to be there to respond to the agent, or does the agent's message not need to be acknowledged immediately (if at all)?
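To make the difference between those two patterns concrete, here's a rough sketch of a scheduler-style agent loop in Python. The resource URL, the polling interval, the matching criteria, and the event handler are all stand-ins I made up for illustration; a real agent would pull them out of its configuration file and do something more interesting than print to stdout.

    #!/usr/bin/env python3
    # A minimal sketch of the scheduler-style agent loop described above.
    # Everything named here is a hypothetical placeholder, not part of any
    # particular agent framework.

    import time
    import urllib.request

    POLL_INTERVAL = 300                           # seconds to sleep between polls
    RESOURCE_URL = "https://example.com/status"   # placeholder resource

    def matches_criteria(state):
        # Stand-in for the agent's programmed criteria.
        return "ALERT" in state

    def fire_event(state):
        # Stand-in for whatever the agent does on a match: message the user,
        # poke another agent over IPC, push onto a queue, and so on.
        print("event fired:", state[:80])

    def main():
        last_state = None
        while True:
            with urllib.request.urlopen(RESOURCE_URL) as response:
                state = response.read().decode("utf-8", errors="replace")
            # Only react to changes in state, not to every poll.
            if state != last_state and matches_criteria(state):
                fire_event(state)
            last_state = state
            # Sleeping between polls is what keeps this from degenerating into
            # a tight poll-wait loop that hammers both the resource and the CPU.
            time.sleep(POLL_INTERVAL)

    if __name__ == "__main__":
        main()

Drop the time.sleep() call and you're back to the poll-wait variant, hitting the resource as fast as the OS allows. An agent that registers with an external scheduler would instead block on something like signal.pause() and let a signal handler or an incoming IPC message kick off the same poll-and-compare cycle. The sleep interval is also the crudest possible knob for staying under a service's rate limits.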
Another thing to consider is the prioritization of events sent by a given agent. Does the priority of an event differ depending on whether it's being sent to another agent, a service, or the agent's user? Should it, and if so, how can it be marked as such? What does a CRASH!/CRITICAL/HIGH/Medium/low alert mean in the context of an agent, will an agent define alerts of all of those types, and what action, if any, should an agent take when it generates or receives an event matching one of those priority levels? When should the agent escalate to the next higher priority level?

What happens to the rest of an agent network if and when an agent goes offline? How would the agent network recover? How would the agents closest to the dead agent in the network handle it? If the agent system is object-oriented in nature, what happens when one of the agent prototypes has a bug? Will they all crash, or just the one or two that tickle it? What happens if the scheduler dies (which is a huge problem)? How would an agent network recover if multiple points of failure occur? Could the network recover unassisted? How long could the network operate in a degraded state without repairs before collapsing entirely? How would the user migrate part of an agent network, or all of it, to another processing substrate (at a different provider, on a different OS, or on a different kind of hardware entirely)?
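One way to start pinning down the priority and escalation questions is to make the levels explicit in code. The scheme below is purely illustrative, not any kind of standard; the level names just mirror the ones I used above.

    import enum

    class Priority(enum.IntEnum):
        # Hypothetical alert levels, lowest to highest.
        LOW = 0
        MEDIUM = 1
        HIGH = 2
        CRITICAL = 3
        CRASH = 4

    def escalate(priority: Priority) -> Priority:
        # Bump an unacknowledged event to the next level up, topping out at CRASH.
        return Priority(min(priority + 1, Priority.CRASH))

    # For example, a MEDIUM event that sits unacknowledged past some deadline
    # becomes HIGH, then CRITICAL, and finally CRASH, at which point the agent
    # might switch from e-mail to something harder for its user to ignore.
    assert escalate(Priority.MEDIUM) is Priority.HIGH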