SWAD-E Portal structure

Contents

Introduction

Controller

Views

Model

Text search

Aggregator

Provenance support

Documentation index

Portal structure

Portal administration

Portal customization

Introduction

This document summarizes the structure of the SWAD-E semantic web portal tool. It describes how the portal is architected, what the main components are and how they interact. The companion documents provide more information on customizing a portal installation and administering a running portal.

The structure of a portal instance is illustrated in the figure below:

The main component is the portal viewer. This runs as a web application in a Java servlet container and provides a web interface onto a set of semantic web data held in the portal. The current implementation is not a generic semantic web browser and does not give dynamic access to semantic web data held outside the portal. The second component is an aggregator which periodically scans a list of known source sites and uploads any changed RDF data to the portal database so that it can be displayed by the viewer.

Whilst the block diagram shows a single database, in fact the portal takes data from multiple files loaded into memory, as well as from an optional database. Typically the display templates are simple files (either local or retrieved via HTTP URLs) and are loaded and managed by a template engine. The ontologies are typically local RDF (RDFS or OWL) files and are loaded into memory. The data is usually kept in a database but can be loaded into memory from static files in the case of simple portal sites.

The portal viewer itself adopts a conventional Model-View-Controller (MVC) design. The Model is provided by a linked set of Java classes. A portal instance is specified by a DataSource which in turn gives access to representations of a filter state, a set of facets, a data store and wrapped versions of the RDF API provided by the Jena library. The View component uses the Jakarta Velocity template engine to render the displayed resources according to a set of display templates. The Controller is a Java servlet which has a small number of built in actions and the ability to invoke an arbitrary Velocity template. The design is not completely pure MVC in that several actions are implemented by their own servlets, though these typically return control to the user by forwarding back to the main controller servlet.

Whilst the main portal implementation uses Velocity and not JSPs, some parts of the system do use JSPs (the separate input form support and the authentication controls for accessing administration functions), so the portal web application needs to run in a full JSP container such as Tomcat. It would be relatively easy to modify the portal viewer to require only a simple servlet container; this is left as an exercise for the reader.

The rest of this document outlines the MVC components in more detail before moving on to discuss the aggregator.

Controller and web interface notes

The controller part of the MVC design is implemented as a servlet called Entry which carries out the operation specified by the action parameter. There are only a few built in actions required:

v
The default view operation. If there is no filter specified in the other request parameters (see below) then this displays the top level browse page for the portal showing all the available facets, otherwise it shows the list of resources which match the given filter together with the remaining ways that the filter can be refined.
page
Display a single resource in page view mode. The resource to be displayed is given via a resource parameter encoded as described below.
message
Used internally to allow action servlets to report a message back to the user via the main controller. It displays the text in the message attribute using the message template in portal://templates/message.vm
[other]
Any unrecognized action is assumed to be implemented via some template. The controller will find the template to be used by appending Action.vm to the action argument.
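The fallback rule for unrecognized actions can be illustrated with a minimal sketch. The class and method names here (ActionTemplates, isBuiltIn, templateFor) are hypothetical and are not taken from the portal code; only the three built in action names and the Action.vm suffix come from the description above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of the portal code base. */
final class ActionTemplates {
    // The built in actions listed above; the controller handles these itself.
    private static final Set<String> BUILT_IN =
            new HashSet<>(Arrays.asList("v", "page", "message"));

    /** True if the action is one of the built in controller actions. */
    static boolean isBuiltIn(String action) {
        return BUILT_IN.contains(action);
    }

    /** Template name for any other action: the action argument plus "Action.vm". */
    static String templateFor(String action) {
        if (isBuiltIn(action)) {
            throw new IllegalArgumentException("built in action: " + action);
        }
        return action + "Action.vm";
    }
}
```

Under this sketch a request with action=about would be rendered by a template named aboutAction.vm, resolved relative to the template base address of the datasource.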

Specifying a datasource

A single instance of the portal web application can provide views onto multiple independent sources of data, each with their own style sheets and template sets if required. The configuration file (WEB-INF/config/sources.n3) defines all these sources and gives each a priority and an encoding string which can be used to identify the datasource in HTTP requests. All of the controller actions take an optional Ds parameter which specifies the desired datasource according to the configured encoding string. The specified sources are ordered (by assigning them an order number in the config file) so that if there is no Ds parameter then the first (lowest order number) source is taken as the default datasource.

The datasource specifies all aspects of the portal behaviour. As well as the obvious things like the data, ontologies and navigation facets it also specifies the templates to be used to display resources and the base address for all templates (so that each source can have a completely independent set of templates if required).
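The selection rule for the Ds parameter can be sketched as follows. SourceEntry and SourceSelector are hypothetical stand-ins for the configured sources, not the portal's own classes, and rejecting an unknown encoding string with an exception is an assumption: the behaviour in that case is not described here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for one configured source: encoding string plus order number. */
final class SourceEntry {
    final String encoding;
    final int order;

    SourceEntry(String encoding, int order) {
        this.encoding = encoding;
        this.order = order;
    }
}

/** Hypothetical selector implementing the "lowest order number is the default" rule. */
final class SourceSelector {
    private final List<SourceEntry> sources;

    SourceSelector(List<SourceEntry> configured) {
        if (configured.isEmpty()) {
            throw new IllegalArgumentException("at least one source is required");
        }
        this.sources = new ArrayList<>(configured);
        // Sort once so that the default source is always at index 0.
        this.sources.sort(Comparator.comparingInt((SourceEntry s) -> s.order));
    }

    /** dsParam is the value of the Ds request parameter, or null if it was absent. */
    SourceEntry select(String dsParam) {
        if (dsParam == null) {
            return sources.get(0);
        }
        for (SourceEntry s : sources) {
            if (s.encoding.equals(dsParam)) {
                return s;
            }
        }
        // Assumption: unknown encodings are treated as an error.
        throw new IllegalArgumentException("unknown datasource encoding: " + dsParam);
    }
}
```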

Encoding resources in parameter strings

At several points in the portal web interface the client needs to identify an RDF resource to the portal. It's not necessary to understand how this is done so long as the built in functions for doing this are used. In general, most of the model objects provide a getEncoding() method so that from within a template you can simply use $object.encoding to insert the object encoded as a request parameter.

However, here seems like as good a place as any for documenting the concrete encoding used.

Objects such as facets and datasources are given explicit encodings in the config file which can be used to identify them across HTTP requests. If those encodings are changed then any existing bookmarks to portal searches will fail. For encoding RDF resources we use a convention that the first letter identifies the type of the resource, the remainder is an encoding of the resource and the whole string is URL-encoded to escape any non-URI characters. Thus AanonID indicates a blank node with Jena internal identifier anonID, Llang|dt|foo encodes a literal with language lang, datatype URI dt and lexical form foo, and Uprefix:local identifies a non-blank resource with an optional namespace prefix. In the case of non-blank resources, if the source data models have a declared XML namespace prefix which can be used to shorten the URI then that will be used. This means that saved bookmarks are also dependent upon the namespace prefixes not changing. This may not be an ideal solution but it does help us keep URLs relatively short. Without some trick like this we might need to move to POSTs to overcome URL length limits in some proxies.
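The encoding convention can be sketched in a few lines of Java. This is only an illustration: ResourceEncodingSketch is a hypothetical class, the use of java.net.URLEncoder (form-style escaping, Java 10 or later for the Charset overload) is an assumption about the escaping step, and representing a missing language or datatype as an empty string is also an assumption.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of the A / L / U prefix convention. */
final class ResourceEncodingSketch {

    /** Blank node: "A" followed by the Jena internal identifier. */
    static String blankNode(String anonId) {
        return escape("A" + anonId);
    }

    /** Literal: "L" followed by lang|dt|lexical form (empty strings assumed for missing parts). */
    static String literal(String lang, String datatypeUri, String lexicalForm) {
        return escape("L" + lang + "|" + datatypeUri + "|" + lexicalForm);
    }

    /** Non-blank resource: "U" followed by prefix:local, or the full URI if no prefix applies. */
    static String resource(String prefixedNameOrUri) {
        return escape("U" + prefixedNameOrUri);
    }

    // Assumption: standard form encoding is close enough to the portal's escaping step.
    private static String escape(String raw) {
        return URLEncoder.encode(raw, StandardCharsets.UTF_8);
    }
}
```

For instance, resource("swed_toi:farming") yields Uswed_toi%3Afarming, the same shape as the parameter values that appear in bookmarked portal URLs.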

Encoding filter states

The main portal viewing operation (action=v) views those resources which match a "filter". The filter specifies the values of different search "facets" or free text searches (actually the free text search is encoded as a pseudo facet). The types of facet can be openly extended but out of the box the portal supports three types of facet: flat (simple property-value matching with the values taken from a fixed set of values), hierarchical (property-value matching but where the legal values are arranged in a hierarchy) and alphaRange (matching on the first letter of a literal-valued property such as a name). The filter state is encoded as a separate parameter for each facet which is in use. The parameter name for the facet is currently the facet's label with spaces removed. For flat and hierarchical facets the parameter value will be an RDF resource encoded as described above. For alphaRange facets we use a pseudo literal match of the form a*. For free text searches we use an internal pseudo-facet called textSearch whose parameter value is a Lucene query string.

As a worked example, the query request string to view all resources in the portal from datasource Who's who in the Environment (encoding string wwite), which match the text search butterfly and whose Topic of interest facet matches the concept farming in the namespace identified by the XML namespace prefix swed_toi is: http://host/servlet/Entry?action=v&Ds=wwite&Topicofinterest=Uswed_toi%3Afarming&textSearch=butterfly
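The query part of this worked example can be reproduced with a short sketch. FilterUrlSketch is a hypothetical helper, not a portal class, and it assumes java.net.URLEncoder is an acceptable stand-in for the escaping used by the portal; the parameter names and values are the ones from the example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper that assembles the query string for action=v. */
final class FilterUrlSketch {

    /** Facet parameter name: the facet label with spaces removed. */
    static String paramName(String facetLabel) {
        return facetLabel.replace(" ", "");
    }

    /**
     * facetValues maps facet labels to unescaped parameter values, for example
     * "Topic of interest" -> "Uswed_toi:farming". Iteration order is preserved,
     * so pass a LinkedHashMap if the parameter order matters.
     */
    static String queryString(String dsEncoding, Map<String, String> facetValues) {
        StringBuilder sb = new StringBuilder("action=v&Ds=").append(enc(dsEncoding));
        for (Map.Entry<String, String> e : facetValues.entrySet()) {
            sb.append('&').append(enc(paramName(e.getKey())))
              .append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

Appending the result to http://host/servlet/Entry? gives the request URL shown in the example.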

Internally the portal represents the state of a filter using a model object (see later) which makes it easy to determine what refinements are possible from a given filter and what resources match the current filter state. These state objects can be slow to generate because they may require traversal of the ontologies and counting of the number of resources that match candidate refinements. To ensure this process is not a bottleneck the state objects are cached. The parameter strings used to describe the filter state as part of the request URL are used as the keys to the cache.
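The caching idea can be sketched as a map from request parameter string to state object. FilterStateCacheSketch is hypothetical; the generic type S stands in for the real state object, and the unbounded ConcurrentHashMap is an assumption, since no eviction or invalidation policy is described here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache keyed by the filter's request parameter string. */
final class FilterStateCacheSketch<S> {
    private final Map<String, S> cache = new ConcurrentHashMap<>();
    private final Function<String, S> builder;

    /** builder performs the expensive construction of a state object from its parameter string. */
    FilterStateCacheSketch(Function<String, S> builder) {
        this.builder = builder;
    }

    /** Returns the cached state for this parameter string, building it on first use. */
    S get(String parameterString) {
        return cache.computeIfAbsent(parameterString, builder);
    }

    int size() {
        return cache.size();
    }
}
```

One consequence of keying on the raw parameter string is that two URLs which list the same facets in a different order would occupy separate cache entries unless the string is normalized first.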

Views

All of the main portal display pages are generated from templates. We want it to be easy to change these to provide new look-and-feel and new navigation support without having to write Java code. A particular requirement of the design was that we wanted to be able to write templates for displaying RDF data which could reuse embedded templates. For example, we wanted it to be possible to produce templates for displaying address fields described using different ontologies and then, in a template displaying an organization, simply ask for the address to be displayed inline, leaving it to the system to decide which type of address it is and so which embedded template to use. In order to provide that capability we chose the Jakarta Velocity template engine. This offers a simple and compact scripting language suited to the task of generating portal views.

When the controller servlet receives a request to display a resource, or a set of search results matching some filter, it determines an appropriate template to use and hands over to the Velocity engine to display a response using the selected template. The choice of the template to use depends on the action being performed (viewing a search, viewing a page display of a resource), the type of the object being displayed (in the case of a page display operation) and the data source configuration. Through the data source configuration all the important templates can be reconfigured to point to different locations. See the customization documentation for more details.

When the controller invokes the Velocity processing engine it places some data objects into the "context" of the engine so that they are available as variables within the Velocity scripts. These variables are:

request
The HttpServletRequest which prompted the action which led to this view being called. This can be used to access additional parameters and session state.
datasource
The DataSource object through which all the configuration information and data can be accessed. Note that in the template this variable is called datasource whereas in the HTTP request the parameter which specifies it is Ds. This is purely for historical reasons: there is no strong reason why they should have the same name, but no good reason why they have ended up being distinct either.
filter
A FilterState object which defines the current search filter. If there is no search in progress then this will be the root filter state.
rm
A VMRenderManager object which provides services to assist with rendering views. In particular it has some methods to simplify generating URLs to access various portal services and support for recursively calling the Velocity engine in order to render embedded object values.
resource
A NodeWrapper object which wraps up the RDF resource, if any, which is being processed.

In order to write new viewing templates you need to understand the Velocity scripting language (see http://jakarta.apache.org/velocity/ for documentation) and the methods available via the "model" data objects listed above (see below and the javadoc). The customization documentation provides more information on the default templates provided and how to adjust them.

Model

The "model" in the MVC triumvirate is provided by the Java objects which are placed into the Velocity context variables described above. The definitive documentation on this is provided in the javadoc and a highlight of the most important functions is provided in the customization appendix. Here we just give an overview of the four groups of model components:

a. Model - filters and facets

The core idea of search in the portal is to divide the search space into a set of dimensions, called facets. Each facet specifies some property of the objects being searched over. This might be a simple keyword value or a hierarchical classification. The Facet Java interface is used to represent the definition of a single facet, the FacetState interface defines a search constraint for a single facet and a FilterState is a collection of FacetStates (one for each Facet configured for the portal). The default viewing templates use these interfaces to generate a faceted browsing user interface for navigating the portal data.
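To make the relationship between facet constraints and a filter concrete, here is a minimal sketch. It does not use the portal's Facet, FacetState or FilterState interfaces (their definitive signatures are in the javadoc); FacetSketch is hypothetical, resources are simplified to single-valued property maps, and the case-insensitive letter match is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in: each facet constraint is a predicate over a resource. */
final class FacetSketch {

    /** Flat facet constraint: the property must have exactly the given value. */
    static Predicate<Map<String, String>> flat(String property, String value) {
        return resource -> value.equals(resource.get(property));
    }

    /** alphaRange constraint: the property value must start with the given letter. */
    static Predicate<Map<String, String>> alphaRange(String property, char letter) {
        return resource -> {
            String v = resource.get(property);
            return v != null && !v.isEmpty()
                    && Character.toLowerCase(v.charAt(0)) == Character.toLowerCase(letter);
        };
    }

    /** A filter is the conjunction of its facet constraints. */
    static List<Map<String, String>> apply(List<Predicate<Map<String, String>>> facetStates,
                                           List<Map<String, String>> resources) {
        List<Map<String, String>> matches = new ArrayList<>();
        for (Map<String, String> resource : resources) {
            boolean ok = true;
            for (Predicate<Map<String, String>> constraint : facetStates) {
                ok = ok && constraint.test(resource);
            }
            if (ok) {
                matches.add(resource);
            }
        }
        return matches;
    }
}
```

An empty list of constraints matches every resource, which corresponds to the root filter state shown on the top level browse page.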

b. Model - datasources and stores

The DataSource object encapsulates all the configuration information for a given portal instance and provides access to the ontology and instance data via a DataStore abstraction. DataStores store the data itself using an implementation of the proposed Jena MultiModel interface. This interface allows a collection of separate RDF Models to be treated as a single composite Model but the individual source Models can still be updated and the origin of any statement in the composite can be traced back to the source Model. There are two DataStore implementations provided, one which stores the data in a database and one which works from memory and loads the data from files when the portal webapp starts up.

c. Model - RDF wrappers

The SWAD-E portal is built on top of the Jena semantic web toolkit, which in turn provides a very rich interface for manipulating RDF and OWL data. There is nothing to prevent one accessing the Jena Models which represent the data using the DataStore access functions. However, it is generally more convenient to access the RDF data via a set of wrapper objects which are slightly easier to script. In particular these wrappers enable scripts to access RDF property values just using a qname string (no need to construct a Property object).

d. Model - rendering support

This block is really just a single object, the VMRenderManager, which provides a miscellaneous collection of utilities to help with generating view pages from the data. Most such work is done via the specific model objects above and the render manager (context variable rm) is just a convenient place to put anything which doesn't fit elsewhere.

Text Search

The portal supports free text search through an embedded instance of the Lucene engine.

External access to the text search support is via a SearchServlet. This supports three parameters: Ds to optionally specify the DataSource to be accessed, query which should be a Lucene query string, and search=last which specifies that the search should be constrained to the last set of search results (any other value or no value indicates a new global search). The SearchServlet will forward the results on to the controller to generate a view of the filtered results.

Internally the search support is packaged up using a ModelIndex object associated with each DataSource object. This provides support for query, for index generation and for incrementally adding new data to the index. Lucene stores the inverted index information as files in the file system; the location of the index files is configurable (see customization for details). There is no need to build an initial index. If one does not exist then it will automatically be built the first time a query is issued.

Since the data being indexed is RDF rather than a set of text documents the way we map the RDF to Lucene is important. We treat each RDF resource as if it were a separate document and index it according to all of the text on all of the property values attached to the resource. Thus we are searching for resources based on their property values, not searching for individual RDF statements. In the case of properties whose value is a string literal we use the default Lucene tokenization to split the text into stemmed words (this is only helpful for English text). In the case of properties whose values are resources things are more complex. If the resource is a blank node then we also add to the index any property values attached to that bNode (and so on, recursively). Thus, for example, if an Organization has an address represented using vCard (which uses bNodes) then a search on words in the address will return the Organization (it will also return the address bNode itself). If the resource is not a blank node but is a concept in a known ontology then we index it based on the concept's label (checking both rdfs:label and skos:prefLabel), otherwise we simply index on the localName part of the resource's URI.
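The mapping from a resource to its pseudo document can be sketched without Lucene or Jena. Everything in this block is hypothetical: the graph is a plain map from subject to property values, blank nodes are recognized by a "_:" prefix, concept labels are passed in as a lookup table, and localName is approximated by splitting at the last '#' or '/'. The cycle guard is an addition of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: collect the text that would be indexed for one resource. */
final class IndexTextSketch {

    /** A property value: either literal text or a reference to another node. */
    static final class Value {
        final String text;      // lexical form for literals, node id or URI otherwise
        final boolean literal;

        private Value(String text, boolean literal) {
            this.text = text;
            this.literal = literal;
        }
        static Value literal(String text) { return new Value(text, true); }
        static Value node(String idOrUri) { return new Value(idOrUri, false); }
    }

    private final Map<String, List<Value>> graph;     // subject -> its property values
    private final Map<String, String> conceptLabels;  // URI -> label for known concepts

    IndexTextSketch(Map<String, List<Value>> graph, Map<String, String> conceptLabels) {
        this.graph = graph;
        this.conceptLabels = conceptLabels;
    }

    /** Text fragments to index for the given subject, in property-value order. */
    List<String> textFor(String subject) {
        List<String> out = new ArrayList<>();
        collect(subject, out, new HashSet<>());
        return out;
    }

    private void collect(String subject, List<String> out, Set<String> visited) {
        if (!visited.add(subject)) {
            return; // already expanded; guards against cycles among blank nodes
        }
        for (Value v : graph.getOrDefault(subject, Collections.emptyList())) {
            if (v.literal) {
                out.add(v.text);
            } else if (v.text.startsWith("_:")) {
                collect(v.text, out, visited);   // blank node: pull in its values recursively
            } else if (conceptLabels.containsKey(v.text)) {
                out.add(conceptLabels.get(v.text));
            } else {
                out.add(localName(v.text));
            }
        }
    }

    // Approximation of localName: the part after the last '#' or '/'.
    static String localName(String uri) {
        int i = Math.max(uri.lastIndexOf('#'), uri.lastIndexOf('/'));
        return uri.substring(i + 1);
    }
}
```

In the real portal the collected text would then be tokenized and added to the Lucene index as a single document keyed by the resource.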

One issue with this mapping from RDF to pseudo documents is that it is hard to change the RDF incrementally. If a new entry is added to the portal that data file can be incrementally indexed. However, if an existing entry is changed that can affect a number of RDF statements. In the current implementation a changed resource will be incrementally indexed under its new values but its old values won't be removed. This is unlikely to be a problem in practice but it may be worth periodically rebuilding the text index. If it does become a problem then a more sophisticated implementation that handled RDF updates more completely would be possible.

Aggregator

The aggregator provides support for periodically scanning a set of known RDF source sites and copying any changed data from them into the portal's database for viewing. Conceptually this is a different operation from the portal itself but since the implementation treats the aggregator as simply a service of the portal we describe the structure here.

Each DataSource has (optionally) an associated HarvestManager which performs the scans of known sources. The HarvestManager polls all of the known source sites on a regular basis (the interval is configurable). The set of known sites and their status is described in RDF and is normally kept in a database (configurable but normally separate from the content database). That database can be bootstrapped from a set of RDF data files and can be manually edited using the administration UI tools.

By default the HarvestManager does not start scanning when the portal first starts up and scanning must be manually started using the administration pages.

Each known site is described using the harvester vocabulary. Some of the properties defined by that vocabulary are used internally by the HarvestManager to track when the site was last loaded and whether it has changed. Some of them are used to control the way the site is treated by the harvester and are set and displayed via the admin pages. See the administration docs for more details on how this access is done. Essentially each site is either "known" or "new" and, if "known", can also optionally be "trusted" or "blocked". When a site is blocked the harvester will no longer attempt to poll it or upload information. When a site is trusted then not only will any changed data be uploaded but if any rdfs:seeAlso links are found in the data then those links will be added to the set of sites to be polled (i.e. a trusted site can introduce new sites to the aggregator). The "new" status is there so that newly registered sites can be brought to the attention of the portal operators, who have to manually acknowledge the site via the administration interface before the site moves to being "known".
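The site states described above can be summarized as a small state model. HarvestSiteSketch is hypothetical and does not reflect the harvester vocabulary or the HarvestManager code. Two points are assumptions rather than documented behaviour: that a "new" site is not polled until it has been acknowledged, and that sites introduced through rdfs:seeAlso links start out as "new".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one source site as seen by the harvester. */
final class HarvestSiteSketch {
    enum Status { NEW, KNOWN }
    enum Treatment { NONE, TRUSTED, BLOCKED }

    final String url;
    private Status status = Status.NEW;
    private Treatment treatment = Treatment.NONE;

    HarvestSiteSketch(String url) {
        this.url = url;
    }

    Status status() {
        return status;
    }

    /** Manual acknowledgement via the administration interface: new -> known. */
    void acknowledge() {
        status = Status.KNOWN;
    }

    /** Only known sites can additionally be marked trusted or blocked. */
    void setTreatment(Treatment t) {
        if (status != Status.KNOWN) {
            throw new IllegalStateException("site must be known first: " + url);
        }
        treatment = t;
    }

    /** Assumption: unacknowledged sites are not polled; blocked sites never are. */
    boolean shouldPoll() {
        return status == Status.KNOWN && treatment != Treatment.BLOCKED;
    }

    /** Trusted sites may introduce further sites through rdfs:seeAlso links. */
    List<HarvestSiteSketch> followSeeAlso(List<String> seeAlsoUrls) {
        List<HarvestSiteSketch> added = new ArrayList<>();
        if (status == Status.KNOWN && treatment == Treatment.TRUSTED) {
            for (String link : seeAlsoUrls) {
                added.add(new HarvestSiteSketch(link)); // assumed to start as NEW
            }
        }
        return added;
    }
}
```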

The advantage of tightly binding the portal to the aggregator is that it is easy to support administration pages as part of the portal that control the aggregation process. The disadvantage is that if the portal web application is restarted then an old aggregator thread may be left running in the servlet container.

Provenance support

The portal provides support for identifying where statements originate from. The underlying support for this is quite general purpose but there are various simplifications in the way this is exposed to the Velocity template engine.

Each separate source of portal RDF data, whether a locally held file or a remote URL polled by the aggregator, is added to the DataStore MultiModel as a separate component model with its URI given as the label for that model. This means that any single RDF statement in the DataStore can be traced back to the source URI from whence it came. Thus a view template has the option of checking the source of any statement it consults when constructing the view. Access to this information from the scripts is provided by the MultiStatementWrapper.getStatementSource methods.

A common pattern is that the bulk of information on a given object comes from a standard location and just a few annotations come from other sources. To simplify template building in such circumstances a portal configuration can define an RDF property to use as the "primary" property when determining the source of a displayed resource. For example, one might choose the name of an organization as its primary property. When viewing the information on such a resource the ResourceWrapper.getPrimarySource method will find the first instance of that primary property and return the source URI from which that statement came. The view template can then compare that "primary source" with the source of any other individual property value.
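The primary source comparison can be sketched with statements that carry the URI of the source they were loaded from. ProvenanceSketch and its methods are hypothetical and are not the MultiModel or wrapper API; the sketch only mirrors the logic described above, namely finding the first statement with the primary property and comparing its source with the source of the other statements.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical store of statements tagged with the URI of their source. */
final class ProvenanceSketch {

    static final class SourcedStatement {
        final String subject, property, value, sourceUri;

        SourcedStatement(String subject, String property, String value, String sourceUri) {
            this.subject = subject;
            this.property = property;
            this.value = value;
            this.sourceUri = sourceUri;
        }
    }

    private final List<SourcedStatement> statements = new ArrayList<>();

    void add(String subject, String property, String value, String sourceUri) {
        statements.add(new SourcedStatement(subject, property, value, sourceUri));
    }

    /** Source of the first statement using the primary property, or null if there is none. */
    String primarySource(String subject, String primaryProperty) {
        for (SourcedStatement s : statements) {
            if (s.subject.equals(subject) && s.property.equals(primaryProperty)) {
                return s.sourceUri;
            }
        }
        return null;
    }

    /** Statements about the subject that did not come from its primary source. */
    List<SourcedStatement> fromOtherSources(String subject, String primaryProperty) {
        String primary = primarySource(subject, primaryProperty);
        List<SourcedStatement> others = new ArrayList<>();
        for (SourcedStatement s : statements) {
            if (s.subject.equals(subject) && !s.sourceUri.equals(primary)) {
                others.add(s);
            }
        }
        return others;
    }
}
```

A template could use the equivalent of fromOtherSources to flag annotations contributed by third parties, for example in a provenance side-bar.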

To simplify the display of information on the sources, the DataSource object provides two convenience functions to get a piece of text describing the source and a banner message associated with the source. These values are in turn found by consulting both the harvester database and the main portal database for rdfs:label and harvest_vocab:banner properties associated with the source. Thus a new site contributing data to a portal could include in its data a description of itself to display in the provenance side-bars of the portal pages. This is not a hugely secure approach: in principle a display page might need to check the provenance of such source descriptions!
