Introduction

This document summarizes the structure of the SWAD-E semantic web portal tool. It gives details on how the portal is architected, what the main components are and how they interact. The companion documents provide more information on customizing a portal installation and administering a running portal. The structure of a portal instance is illustrated in the figure below:
Whilst the block diagram shows a single database, in fact the portal takes data from multiple files loaded into memory, as well as from an optional database. Typically the display templates are simple files (either local or retrieved via HTTP URLs) and are loaded and managed by a template engine. The ontologies are typically local RDF (RDFS or OWL) files and are loaded into memory. The data is usually kept in a database but can be loaded into memory from static files in the case of simple portal sites.

The portal viewer itself adopts a conventional Model-View-Controller (MVC) design. The Model is provided by a linked set of Java classes. A portal instance is specified by a DataSource which in turn gives access to representations of a filter state, a set of facets, a data store and wrapped versions of the RDF API provided by the Jena library. The View component uses the Jakarta Velocity template engine to render the displayed resources according to a set of display templates. The Controller is a Java servlet which has a small number of built-in actions and the ability to invoke an arbitrary Velocity template. The design is not completely pure MVC in that several actions are implemented by their own servlets, though these typically return control to the user by forwarding back to the main controller servlet.

Whilst the main portal implementation uses Velocity and not JSPs, some parts of the system do use JSPs (the separate input form support and the authentication controls for accessing administration functions), so the portal web application needs to run in a full JSP container such as Tomcat. It would be relatively easy to modify the portal viewer to require only a simple servlet container if needed; this is left as an exercise for the reader.

The rest of this document outlines the MVC components in more detail before moving on to discuss the aggregator.

Controller and web interface notes
The controller part of the MVC design is implemented as a servlet
called
Specifying a datasource

A single instance of the portal web application can provide views onto
multiple independent sources of data, each with their own style sheets
and template sets if required. The configuration file (

The datasource specifies all aspects of the portal behaviour. As well as the obvious things like the data, ontologies and navigation facets, it also specifies the templates to be used to display resources and the base address for all templates (so that each source can have a completely independent set of templates if required).

Encoding resources in parameter strings
At several points in the portal web interface the client needs to
identify an RDF resource to the portal. It's not necessary to understand
how this is done so long as the built in functions for doing this are used.
In general, most of the model objects provide a

However, this seems as good a place as any to document the concrete encoding used. Objects such as facets and datasources are given explicit encodings
in the config file which can be used to identify them across HTTP requests.
If those encodings are changed then any existing bookmarks to portal searches
will fail. For encoding RDF resources we use a convention that the first
letter identifies the type of the resource, the remainder is an encoding
of the resource and the whole set is URLEncoded to escape any non-URI
characters. So that

Encoding filter states

The main portal viewing operation (

As a worked example, the query request string to view all resources
in the portal from datasource Who's who in the Environment (encoding
string

Internally the portal represents the state of a filter using a model object (see later) which makes it easy to determine what refinements are possible from a given filter and what resources match the current filter state. These state objects can be slow to generate because they may require traversal of the ontologies and counting of the number of resources that match candidate refinements. To ensure this process is not a bottleneck the state objects are cached. The parameter strings used to describe the filter state as part of the request URL are used as the keys to the cache.

Views

All of the main portal display pages are generated from templates. We want it to be easy to change these to provide a new look-and-feel and new navigation support without having to write Java code.

A particular requirement of the design was that we want to be able to write templates for displaying RDF data which could reuse embedded templates. For example, we wanted it to be possible to produce templates for displaying address fields described using different ontologies and then, in a template displaying an organization, simply ask for the address to be displayed inline, leaving it to the system to decide which type of address it is and so which embedded template to use.

In order to provide that capability we chose the Jakarta Velocity template engine. This offers a simple and compact scripting language suited to the task of generating portal views.

When the controller servlet receives a request to display a resource, or a set of search results matching some filter, it determines an appropriate template to use and hands over to the Velocity engine to display a response using the selected template. The choice of the template to use depends on the action being performed (viewing a search, viewing a page display of a resource), the type of the object being displayed (in the case of a page display operation) and the data source configuration.
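The template choice described above can be pictured roughly as follows. This is only a sketch under stated assumptions: the class `TemplateChooser`, its methods, the action names `"search"` and `"page"`, and the template file names are all hypothetical and are not taken from the portal code base. It shows one plausible way to combine the action, the resource's types and a per-datasource base address into a template path.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the portal code base.
class TemplateChooser {
    private final String baseAddress;          // base address for all templates of one datasource
    private final String searchTemplate;       // template used for search-result views
    private final String defaultPageTemplate;  // fallback for resources with no registered type
    private final Map<String, String> pageTemplatesByType = new LinkedHashMap<>();

    TemplateChooser(String baseAddress, String searchTemplate, String defaultPageTemplate) {
        this.baseAddress = baseAddress;
        this.searchTemplate = searchTemplate;
        this.defaultPageTemplate = defaultPageTemplate;
    }

    // Associates an RDF type URI with the template used to display resources of that type.
    void registerPageTemplate(String typeUri, String templateName) {
        pageTemplatesByType.put(typeUri, templateName);
    }

    // Picks a template from the action and, for page views, the types of the resource.
    String choose(String action, List<String> typeUris) {
        if ("search".equals(action)) {
            return baseAddress + searchTemplate;
        }
        for (String typeUri : typeUris) {
            String name = pageTemplatesByType.get(typeUri);
            if (name != null) {
                return baseAddress + name;  // first type with a registered template wins
            }
        }
        return baseAddress + defaultPageTemplate;
    }

    public static void main(String[] args) {
        TemplateChooser chooser = new TemplateChooser("templates/", "search.vm", "resource.vm");
        chooser.registerPageTemplate("http://example.org/ont#Organization", "organization.vm");
        System.out.println(chooser.choose("page",
                Arrays.asList("http://example.org/ont#Organization")));
        System.out.println(chooser.choose("page", Collections.<String>emptyList()));
    }
}
```

Keeping the base address in the chooser mirrors the idea that each datasource can carry a completely independent template set; the same lookup could equally drive the choice of an embedded template for an inline address.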
Through the data source configuration all the important templates can be reconfigured to point to different locations. See the customization documentation for more details.

When the controller invokes the Velocity processing engine it places some data objects into the "context" of the engine so that they are available as variables within the Velocity scripts. These variables are:
In order to write new viewing templates you need to understand the Velocity scripting language (see http://jakarta.apache.org/velocity/ for documentation) and the methods available via the "model" data objects listed above (see below and the javadoc). The customization documentation provides more information on the default templates provided and how to adjust them.

Model

The "model" in the MVC triumvirate is provided by the Java objects which are placed into the Velocity context variables described above. The definitive documentation on this is provided in the javadoc and a highlight of the most important functions is provided in the customization appendix. Here we just give an overview of the four groups of model components:

a. Model - filters and facets

The core idea of search in the portal is to divide the search space into a set of dimensions, called facets. Each facet specifies some property of the objects being searched over. This might be a simple keyword value or a hierarchical classification. The Facet Java interface is used to represent the definition of a single facet, the FacetState interface defines a search constraint for a single facet and a FilterState is a collection of FacetStates (one for each Facet configured for the portal). The default viewing templates use these interfaces to generate a faceted browsing user interface for navigating the portal data.

b. Model - datasources and stores

The DataSource object encapsulates all the configuration information for a given portal instance and provides access to the ontology and instance data via a DataStore abstraction. DataStores store the data itself using an implementation of the proposed Jena MultiModel interface. This interface allows a collection of separate RDF Models to be treated as a single composite Model, while the individual source Models can still be updated and the origin of any statement in the composite can be traced back to the source Model.
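To make the composite-model idea concrete, here is a minimal sketch using plain Java collections. It deliberately does not use the Jena API or the actual MultiModel interface; the class `LabelledComposite`, the nested `Statement` type and all method names are hypothetical. It only illustrates the three properties mentioned above: a merged view, independently updatable components, and tracing a statement back to its source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch with plain collections; this is not the Jena MultiModel interface.
class LabelledComposite {
    // A minimal stand-in for an RDF statement (subject, predicate, object as strings).
    static final class Statement {
        final String subject, predicate, object;

        Statement(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override public boolean equals(Object other) {
            if (!(other instanceof Statement)) return false;
            Statement t = (Statement) other;
            return subject.equals(t.subject) && predicate.equals(t.predicate)
                    && object.equals(t.object);
        }

        @Override public int hashCode() {
            return Objects.hash(subject, predicate, object);
        }
    }

    // One component per source, keyed by a label such as the source URI.
    private final Map<String, Set<Statement>> components = new LinkedHashMap<>();

    // Adds a statement to the component with the given label, creating it if needed.
    void add(String sourceLabel, Statement st) {
        components.computeIfAbsent(sourceLabel, k -> new LinkedHashSet<>()).add(st);
    }

    // Replaces one component wholesale, leaving the other components untouched.
    void replaceComponent(String sourceLabel, Set<Statement> fresh) {
        components.put(sourceLabel, new LinkedHashSet<>(fresh));
    }

    // The composite view: the union of all component statements.
    Set<Statement> compositeView() {
        Set<Statement> all = new LinkedHashSet<>();
        for (Set<Statement> component : components.values()) {
            all.addAll(component);
        }
        return all;
    }

    // Labels of every component that contains the statement.
    List<String> sourcesOf(Statement st) {
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, Set<Statement>> e : components.entrySet()) {
            if (e.getValue().contains(st)) {
                labels.add(e.getKey());
            }
        }
        return labels;
    }
}
```

Because a statement may be asserted by more than one source, `sourcesOf` returns a list rather than a single label; a real implementation would need to decide how to present that case.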
There are two DataStore implementations provided, one which stores the data in a database and one which works from memory and loads the data from files when the portal webapp starts up.

c. Model - RDF wrappers

The SWAD-E portal is built on top of the Jena semantic web toolkit, which in turn provides a very rich interface for manipulating RDF and OWL data. There is nothing to prevent one accessing the Jena Models which represent the data using the DataStore access functions. However, it is generally more convenient to access the RDF data via a set of wrapper objects which are slightly easier to script. In particular these wrappers enable scripts to access RDF property values just using a qname string (no need to construct a Property object).

d. Model - rendering support
This block is really just a single object, the VMRendermanager, which provides a miscellaneous collection of utilities to
help with generating view pages from the data. Most such work
is done via the specific model objects above and the render manager
(context variable

Text Search

The portal supports free text search through an embedded instance of the Lucene engine.

External access to the text search support is via a SearchServlet. This supports three parameters:

Internally the search support is packaged up using a ModelIndex object associated with each DataSource object. This provides support for query, for index generation and for incrementally adding new data to the index. Lucene stores the inverted index information as files in the file system; the location of the index files is configurable (see customization for details). There is no need to build an initial index; if one does not exist then it will automatically be built the first time a query is issued.

Since the data being indexed is RDF rather than a set of text documents, the way we map the RDF to Lucene is important. We treat each RDF resource as if it were a separate document and index it according to all of the text on all of the property values attached to the resource. Thus we are searching for resources based on their property values, not searching for individual RDF statements.

In the case of properties whose value is a string literal we use the default Lucene tokenization to split the text into stemmed words (this is only helpful for English text). In the case of properties whose values are resources things are more complex. If the resource is a blank node then we also add to the index any property values attached to that bNode (and so on, recursively). Thus, for example, if an Organization has an address represented using vCard (which uses bNodes) then a search on words in the address will return the Organization (it will also return the address bNode itself). If the resource is not a blank node but is a concept in a known ontology then we index it based on the concept's label (checking both rdfs:label and skos:prefLabel), otherwise we simply index on the localName part of the resource's URI.

One issue with this mapping from RDF to pseudo documents is that it is hard to keep the index up to date when the RDF changes incrementally. If a new entry is added to the portal that data file can be incrementally indexed.
However, if an existing entry is changed that can affect a number of RDF statements. In the current implementation a changed resource will be incrementally indexed under its new values but its old values won't be removed. This is unlikely to be a problem in practice but it may be worth periodically rebuilding the text index. If it does become a problem then a more sophisticated implementation that handled RDF updates more completely would be possible.

Aggregator

The aggregator provides support for periodically scanning a set of known RDF source sites and copying any changed data from them into the portal's database for viewing. Conceptually this is a different operation from the portal itself but since the implementation treats the aggregator as simply a service of the portal we describe the structure here.

Each DataSource has (optionally) an associated HarvestManager which performs the scans of known sources. The HarvestManager polls all of the known source sites on a regular basis (the interval is configurable). The set of known sites and their status is described in RDF and is normally kept in a database (configurable but normally separate from the content database). That database can be bootstrapped from a set of RDF data files and can be manually edited using the administration UI tools. By default the HarvestManager does not start scanning when the portal first starts up and scanning must be manually started using the administration pages.

Each known site is described using the harvester vocabulary. Some of the properties defined by that vocabulary are used internally by the HarvestManager to track when the site was last loaded and whether it has changed. Some of them are used to control the way the site is treated by the harvester and are set and displayed via the admin pages. See the administration docs for more details on how this access is done. Essentially each site is either "known" or "new" and if "known" can also optionally be "trusted" or "block".
When a site is blocked the harvester will no longer attempt to poll it or upload information. When a site is trusted then not only will any changed data be uploaded but if any rdfs:seeAlso links are found in the data then those links will be added to the set of sites to be polled (i.e. a trusted site can introduce new sites to the aggregator). The "new" status is there so that newly registered sites can be brought to the attention of the portal operators, who have to manually acknowledge the site via the administration interface before the site moves to being "known".

The advantage of tightly binding the portal to the aggregator is that it is easy to support administration pages as part of the portal that control the aggregation process. The disadvantage is that if the portal web application is restarted then an old aggregator thread may be left running in the servlet container.

Provenance support

The portal provides support for identifying where statements originate from. The underlying support for this is quite general purpose but there are various simplifications in the way this is exposed to the Velocity template engine.
Each separate source of portal RDF data, whether a locally held file or a
remote URL polled by the aggregator, is added to the DataStore MultiModel
as a separate component model with its URI given as the label for that model.
This means that any single RDF statement in the DataStore can be traced back
to the source URI from whence it came. Thus a view template has the option of
checking the source of any statement it consults when constructing the view.
Access to this information from the scripts is provided by the
A common pattern is that the bulk of information on a given object comes
from a standard location and just a few annotations come from other sources.
To simplify template building in such circumstances a portal configuration
can define an RDF property to use as the "primary" property when determining
the source of a displayed resource. For example, one might choose the name
of an organization as its primary property. When viewing the information
on such a resource then the

To simplify the display of the information on the sources, the DataSource object provides two convenience functions to get a piece of text describing the source and a banner message associated with the source. These values are in turn found by consulting both the harvester database and the main portal database for rdfs:label and harvest_vocab:banner properties associated with the source. Thus a new site contributing data to a portal could include in its data a description of itself to display in the provenance side-bars of the portal pages. This is not a hugely secure approach: in principle a display page might need to check the provenance of such source descriptions!
(c) Copyright 2004, Hewlett-Packard Development Company, LP, all rights reserved.