IRI and URI Templates

I’ve been having a number of discussions with folks like DeWitt Clinton, Dave Johnson, Elias Torres and Rob Yates about URI templating.

A URI template is a character string containing a number of tokens that, when replaced, form a valid URI. They are used, for instance, in OpenSearch to provide a way of specifying what a query URL should look like. The challenge with the way OpenSearch’s templates work, however, is that they are incapable of representing every possible form of URI. So the other day DeWitt and I worked up a fairly complete syntax that would be capable of representing any URI format. I’m in the process of writing up a draft specification, but figured it would be worthwhile documenting what we came up with here.

There are two requirements:

  1. We MUST be able to template any valid Internationalized Resource Identifier (IRI).
  2. The template MUST NOT assume or require any particular form of URI. For example, OpenSearch currently requires that the querystring consist of name=value params separated by ampersands. RFC3987, however, does not require any particular format for the querystring.

The ABNF grammar we came up with is this:

IRITemplate    = tscheme ":" thier-part [ "?" tquery ]
                 [ "#" tfragment ]
thier-part     =  "//" tauthority tpath-abempty
               / tpath-absolute
               / tpath-rootless
               / ipath-empty
tauthority     = [ tuserinfo "@" ] thost [ ":" tport ]
tuserinfo      = *(tsegment / iuserinfo)
thost          = *(tsegment / ihost)
tport          = tsegment / iport
tscheme        = scheme / tsegment
tpath-abempty  = *( "/" isegment / tsegment )
tpath-absolute = "/" [ isegment-nz / tsegment
                 *( "/" isegment / tsegment ) ]
tpath-rootless = isegment-nz / tsegment
                 *( "/" isegment / tsegment )
tquery         = *( ipchar / iprivate / tsegment / "/" / "?" )
tfragment      = *( ipchar / tsegment / "/" / "?" )
tsegment       = "{" ttoken / ttokens "}"  

ttoken         = [tprefix SP] [ 1*ipchar ":" ] 1*ipchar
                 [SP trepeat SP tdelim ] [SP ":" SP tparams ]
ttokens        = ?( "(" ttoken ")" ) SP trepeat SP tdelim
tprefix        = *ipchar
trepeat        = "?" / "+" [DIGIT] / [DIGIT] "*" [DIGIT]
tdelim         = *ipchar
tparams        = *(" " tparam )
tparam         = *ipchar

That’s likely a bit difficult to read, so how about some examples :-)

If you want a simple, OpenSearch-style template, you can do http://example.org/{foo}?q={bar}. The {foo} and {bar} tokens are replaced with whatever value you want to fill in the template. For instance, the template could be processed to produce the IRI http://example.org/a?q=b. The token name can be qualified with a prefix. For example, http://example.org/filterByLastName?lastName={xcard:family}.
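As a rough illustration of the simplest case, here is a minimal Python sketch. The helper name expand_simple and the sample value "Smith" are made up for this example, and the sketch only understands bare {name} and {prefix:name} tokens (no repeat markers, value lists, or percent-encoding).

```python
import re

# Hypothetical helper, not part of any draft: matches only the simplest
# tokens, i.e. a name with no spaces such as {foo} or {xcard:family}.
SIMPLE_TOKEN = re.compile(r"\{([^{} ]+)\}")

def expand_simple(template, values):
    """Replace each {name} token with values[name]; KeyError if a name is missing."""
    return SIMPLE_TOKEN.sub(lambda m: values[m.group(1)], template)

print(expand_simple("http://example.org/{foo}?q={bar}", {"foo": "a", "bar": "b"}))
print(expand_simple(
    "http://example.org/filterByLastName?lastName={xcard:family}",
    {"xcard:family": "Smith"},  # sample value, chosen only for illustration
))
```

The first call prints http://example.org/a?q=b, matching the example above.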

Now suppose there are a fixed number of possible values for the token replacement. For example, Roller offers views of a blog in HTML, Atom and RSS. The grammar above allows us to produce a template like http://example.org/{view : html atom rss}/{user}. When processed, this template can produce URIs like http://example.org/html/jasnell or http://example.org/atom/dewitt, etc.
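One way a processor might enforce the value list is sketched below in Python. The function names parse_choice_token and expand_choice are assumptions made for this example; the sketch handles only a name followed by an optional " : " and a space-separated list.

```python
def parse_choice_token(token):
    """Split '{view : html atom rss}' into ('view', ['html', 'atom', 'rss'])."""
    inner = token.strip()[1:-1]             # drop the surrounding braces
    name, sep, params = inner.partition(" : ")
    allowed = params.split() if sep else []  # an empty list means "any value"
    return name.strip(), allowed

def expand_choice(token, value):
    """Return value if the token's value list permits it, else raise ValueError."""
    name, allowed = parse_choice_token(token)
    if allowed and value not in allowed:
        raise ValueError(f"{value!r} is not an allowed value for {name!r}")
    return value

uri = "http://example.org/" + expand_choice("{view : html atom rss}", "atom") + "/dewitt"
print(uri)
```

Passing a value outside the list, such as "pdf", raises ValueError instead of producing a URI the server never advertised.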

The grammar also supports optional and repeating tokens. For example, the template http://example.org{/ path *} can be processed to match http://example.org, http://example.org/foo, http://example.org/foo/bar, http://example.org/foo/bar/woo, and so on. The asterisk (*) in the token specifies that the token can be processed zero or more times. The forward slash in the token is a literal prefix, meaning that it is prepended to whatever value is selected for the token replacement. A template like http://example.org/{path *} would have yielded results like http://example.org/foobarwoo (where each token replacement is run together with no delimiter).

For repeating tokens, it is also possible to specify a repeat delimiter. For example, the template http://example.org/foo?{q= query * &} would produce a result like http://example.org/foo?q=a&q=b&q=c. Without the repeat delimiter (e.g., http://example.org/foo?{q= query *}), the result would have been http://example.org/foo?q=aq=bq=c.
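The interaction between the literal prefix and the repeat delimiter can be sketched in a few lines of Python. The function expand_repeated is a hypothetical helper that takes the already-parsed prefix and delimiter as arguments; it does not parse the token syntax itself.

```python
def expand_repeated(prefix, values, delim=""):
    """Prepend prefix to every value, then join the pieces with delim."""
    return delim.join(prefix + v for v in values)

base = "http://example.org"
print(base + expand_repeated("/", ["foo", "bar", "woo"]))              # {/ path *}
print(base + "/" + expand_repeated("", ["foo", "bar", "woo"]))         # {path *}
print(base + "/foo?" + expand_repeated("q=", ["a", "b", "c"], "&"))    # {q= query * &}
print(base + "/foo?" + expand_repeated("q=", ["a", "b", "c"]))         # {q= query *}
```

With an empty list of values the helper returns an empty string, which is how {/ path *} can also yield plain http://example.org.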

The asterisk indicates that the token is applied zero or more times. A plus sign (+) indicates that the token is to be applied one or more times. A question mark (?) indicates that the token is to be applied zero or one times. No repeat indicator means that the token is to be applied exactly once. The asterisk may be used with leading and trailing digits to indicate the specific cardinality, e.g., 2*3 means to apply the token 2 or 3 times. The plus sign may be used with a trailing digit to indicate the specific cardinality, e.g., +3 means to apply the token 1, 2 or 3 times.
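To make the cardinality rules concrete, here is a small Python sketch that turns a repeat indicator into a (minimum, maximum) pair. The function name cardinality and the convention that None means "no upper bound" are assumptions for this example; a missing leading digit on the asterisk form is read as zero.

```python
import re

# Mirrors the trepeat rule: "?" / "+" [DIGIT] / [DIGIT] "*" [DIGIT]
TREPEAT = re.compile(r"(\?)|\+(\d?)|(\d?)\*(\d?)")

def cardinality(trepeat=None):
    """Return (minimum, maximum) applications; a maximum of None means unbounded."""
    if not trepeat:
        return (1, 1)  # no repeat indicator: apply exactly once
    m = TREPEAT.fullmatch(trepeat)
    if m is None:
        raise ValueError(f"not a valid repeat indicator: {trepeat!r}")
    optional, plus_max, star_min, star_max = m.groups()
    if optional:
        return (0, 1)
    if trepeat.startswith("+"):
        return (1, int(plus_max) if plus_max else None)
    return (int(star_min) if star_min else 0,
            int(star_max) if star_max else None)

for indicator in (None, "?", "+", "+3", "*", "2*3"):
    print(indicator, cardinality(indicator))
```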

A token may consist of one or more mutually exclusive options. For instance, http://example.org/foo?{(q= query : foo)(p= param : bar)} indicates that the URI produced may either be in the form http://example.org/foo?q=foo or http://example.org/foo?p=bar.

Cardinality may be specified for mutually exclusive choices. For example, in the template http://example.org/foo?{(q= query : foo)(p= param : bar)* & }, the produced IRI may contain any number of q or p parameters listed in any order. Specifically, this means that the token is applied zero or more times, and that each application of the token is a choice between either of the two given forms. If you wanted to require that the query and param tokens appear in a given order, but still allow for any number of them, you would separate the tokens as in http://example.org/foo?{q= query * & : foo}{&p= param * : bar}.
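A repeating choice can be pictured as a sequence of picks, each naming one of the alternatives. The Python sketch below assumes the token has already been parsed into a mapping from token name to literal prefix; the names ALTERNATIVES and expand_choices are invented for this example.

```python
# Hypothetical parsed form of a two-way choice: token name -> literal prefix.
ALTERNATIVES = {"query": "q=", "param": "p="}

def expand_choices(alternatives, picks, delim=""):
    """Apply the token once per pick; each pick is a (token_name, value) pair."""
    parts = []
    for name, value in picks:
        if name not in alternatives:
            raise KeyError(f"unknown alternative: {name!r}")
        parts.append(alternatives[name] + value)
    return delim.join(parts)

picks = [("query", "foo"), ("param", "bar"), ("query", "foo")]
print("http://example.org/foo?" + expand_choices(ALTERNATIVES, picks, "&"))
```

Because the order comes from the list of picks rather than from the template, q and p parameters can be interleaved freely, which is the behaviour described for the single repeating token.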

With this syntax, every part of the IRI may be templated. For example, {scheme : http https}://{host}.example.org/{foo}{?q= bar}{# fragment}.

It’s also possible to use templating with non-HTTP IRIs, e.g., tag:example.org,2006:/2006/06{/ path } and mailto:{user}-{id}@{host}.example.org.

There’s likely more, and it’s quite likely that there are a number of improvements we could make to this scheme.

Thoughts? Complaints? Praise?

17 Responses to “IRI and URI Templates”

  1. DeWitt Clinton Says:

    Fantastic writeup, James.

    I like the part you added regarding mutually exclusive options. That appears to address a scenario I occasionally bumped into with the OpenSearch Query Syntax. OpenSearch required two Url elements to handle mutual exclusivity; this syntax can get by with just one.

    Worth noting from our discussion is a point about how self contained this syntax is. A client should be able to read only this line and, if it knows the profile or application context in which it appears, parse it and generate valid IRIs.

    And you are giving me far too much credit! (This is James’ hard work folks, he is just being generous.)

    The next step (which I’ll try to tackle in parallel) will be to rewrite a parser that can handle this. The one that I’m particularly interested in exploring, for the practical challenge of it, is using XSLT to parse the syntax and make OpenSearch queries.

    BTW, I believe we can make this syntax a superset of the current OpenSearch Query Syntax. I.e., valid OpenSearch 1.1 Query Syntax should be a valid URI Template, and a URI Template parser may be able to parse OpenSearch 1.1.

    If this is true then there will be a very smooth path going from OpenSearch 1.1 to a future search syndication format that supports this.

    And at the very least it should be possible to use a profile to tell OpenSearch clients what template syntax the Url element is using. I am planning on introducing simple profiles for the Url anyway to help OpenSearch work well with CQL.

    …More in a bit, just wanted to post a big initial thumbs up. Nicely done!

  2. Lee Feigenbaum Says:

    Hi James,

    I read this and I can’t help but wonder, why not simply use an already-popular regular-expression syntax (e.g. Java’s, but really any of the more straightforward (read: not full Perl r.e.’s :-) reg. exp. syntaxes would do) for the templated parts of an IRI?

    Benefits would be:

    + reuse of existing reg. exp. parsers and engines
    + no need for authors to learn a new syntax

    The named parts of the templates would correspond to named capturing groups in a regular expression, and template parts could also be accessed positionally as per standard regular expression capturing conventions (i.e. $1, $2, etc. in Perl or Ruby).

    Simple patterns would still look, well, simple, and more complex patterns would look familiar to anyone who’s ever written regular expressions in a myriad of environments.

    I think that URI/IRI templating is a very important space (c.f. Ruby on Rails dispatch mechanism), but I’d rather not see what seems to be a reinvention of the wheel! :-)

    Lee

  3. Thomas Broyer Says:

    James, have you minted a regex for parsing/decomposing a part into the template and non-template subparts, and then parse/decompose a template into its subparts?

    I mean, it would be cool if I could do:
    1. use regex given in RFC3986 (or a modified one) to decompose the URI into its parts (scheme, authority, path, query and fragment)
    2. for each part:
       a. use a regex to split the part into template and non-template subparts (with, say, non-template parts always at even indices and template parts always at odd indices, eventually using empty matches)
       b. for each template subpart:
          aa. use a regex to decompose the template syntax into its parts
          bb. process the template and replace with the value
       c. reconstruct the part
    3. reconstruct the URI/IRI
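Step 2 of this outline can be sketched with Python's re.split, which keeps the separators at odd indices when the pattern has a capturing group. The names split_part and rebuild_part are hypothetical, and the pattern assumes that braces never appear inside a template or a literal.

```python
import re

# A capturing group makes re.split keep the {...} pieces in the result,
# so literals land at even indices and templates at odd indices.
TSEGMENT_SPLIT = re.compile(r"(\{[^{}]*\})")

def split_part(part):
    """Split one URI part into alternating literal and template subparts."""
    return TSEGMENT_SPLIT.split(part)

def rebuild_part(part, process):
    """Run process() on each template subpart, then reconstruct the part."""
    pieces = split_part(part)
    for i in range(1, len(pieces), 2):
        pieces[i] = process(pieces[i])
    return "".join(pieces)

print(split_part("/{format}/{user}"))
values = {"{format}": "atom", "{user}": "dewitt"}  # sample data for the sketch
print(rebuild_part("/{format}/{user}", lambda token: values[token]))
```

Adjacent templates such as "{a}{b}" produce empty literals between them, which keeps the even/odd convention intact.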

    Also, why a need for “+[DIGIT]” if it’s equivalent to “1*[DIGIT]”? It unnecessarily complexifies the syntax just to save 1 byte…

    Otherwise, really good work, I can already imagine libraries in all sorts of languages to easily deal with these things ;-)

  4. James Says:

    Lee.. my biggest gripe with regards to using regular expressions is that they tend to become very unreadable, very quickly. Even seasoned developers often have difficulty understanding exactly what is going on in relatively complex regex’s.

    Thomas.. no, I have not minted the regex for parsing these things. It’s on the todo list but if someone wanted to come up with one, I wouldn’t complain :-). And you’re right, + [DIGIT] isn’t required. Good catch.

  5. Lee Feigenbaum Says:

    Hi James,

    I’d be curious to see a side-by-side example of a URI-templating example which is particularly more complex using an off-the-shelf regular expression syntax compared to your templating syntax. Especially if you take a page from the meaningless-whitespace of Perl’s /x modifier, it’s hard for me to imagine the two syntaxes being much different on the readability scale.

    Anyway, just my two cents.

    Lee

  6. James Says:

    Lee, I’ll take you up on that challenge :-) … Given that I positively stink at producing decent regular expressions, perhaps you could do the honors. Template the following:

    Scheme can be http or https.
    Host can be either www1.example.org or www2.example.org
    The path contains three segments, the first is always /blog, the second can be either /html, /atom, or /rss. The third is optional, and specifies a user id.
    Querystring parameters can include a tag=?, startdate=?, enddate=?, count=?, offset=? and search=? parameters. tag=? and search=? are mutually exclusive. Tag may be specified zero or more times.
    The URL can contain a fragment identifier that is always in the pattern ABC###.

    The IRI Template would look like (wrapped for readability)

    {scheme : http https}://{host : www1 www2}.example.org/blog/
    {format : html atom rss}{/ user ?}?{(tag= tag *)(search= search)?}
    {&startdate= startdate ?}{&enddate= enddate ?}{&count= count ?}
    {&offset= offset ?}{#ABC fragment ?}

    What would an equivalent regex look like?

  7. Lee Feigenbaum Says:

    Hi James,

    Great, thanks for the example. Regardless of the outcome it’s good for everyone involved to see as many examples as possible =)

    Assuming I’m parsing the template correctly, the part that matches http or https gets parsed into a component (variable? some other term?) called “scheme”, and similarly for “host”, “format”, “user”, etc. etc. I’m going to model these using .Net style named capturing groups:

    In the proposed templating scheme, it’s unclear to me whether the order of the parameters is important (does tag or search always need to come before the rest? I’m guessing yes since the others all include a leading &. But in that case, I’m not sure what in the templating mechanism will insert an & in between repeated tag parameters. Sorry if I’m being dense :-) ).

    I also don’t see what in the template restricts the characters following the #ABC to be 3 digits. Perhaps I’m reading your intentions incorrectly?

    (? https?)://(? ww[12])\.example\.org/blog/
    (? html|atom|rss)/?(? .*?)\?
    (?:(?:&?tag=(? .*?))*|search=(? .*?))?
    (?:&startdate=(? .*?))?
    (?:&enddate=(? .*?))?
    (?:&count=(? .*?))?
    (?:&offset=(? .*?))?
    (?:#ABC(? \d{3}))?

    pros and cons: in my opinion, *both* are ugly. both could be made easier to read by judicious use of ignored whitespace. both have similar expressivity, and both get tripped up a bit by allowing parameters in arbitrary order and by allowing repeating parameters separated by ampersands.

    The regular expression method allows us to easily restrict parts of the template to digits or to limited character ranges.

    In the end, I just don’t see much benefit in the new template syntax over an existing reg. exp. syntax, but I’m wondering if the proposed template system contains more implicit semantics than I’m giving it credit for - for example, does something in the templating imply that a URI can have re-ordered parameters or something along those lines?

    Lee

  8. Thomas Broyer Says:

    I’ve been playing a bit with python and your ABNF yesterday evening (actually trying to get a “parsing regex”) but I had a little problem:

    In your example http://example.org{/ path *}, is the tsegment part of tauthority or tpath-abempty? Given the prefix, it’s obviously a tpath-abempty but your ABNF rules don’t enforce this. You’d have to have specialized tsegment variants for each non-terminal where a template can occur.

    The other way of solving the problem is to define the template processing to take place before URI/IRI parsing (i.e. treat the templated-URI/IRI as an opaque string, process templates, then parse the result as an URI/IRI). This would however lead to a “generic” templating “language”, without any dependency over URI/IRI syntax.

    I think it’s important to start with decomposing the templated-URI/IRI into its parts and only then process templates, so that the “processor” can raise an error if the value of a variable does not match syntax constraints of this URI/IRI part. For example, a scheme is defined as starting with ALPHA followed by any “unreserved” except colon.

    Another minor problem: an ipchar might be a paren, this would break ttokens parsing (if the tprefix or the “variable name” starts with an opening paren, a parser will believe it’s parsing a ttokens; if any ipchar part of a ttoken, used in a ttokens, contains a closing paren, a parser will believe it’s at the end of the ttoken; etc.).
    I also wonder how an URI containing parens or curly brackets (or actually any “reserved” –particularly “sub-delims”–) can be templated without confusing parsers…

    I’ll try to work on it a bit more this evening (it might rather be tomorrow, as we’re having a barbecue at work tonight ;-) )

  9. James Says:

    Lee: thank you for posting the regex! Interpretation of the template is correct and yes, the order of the tokens in the template is significant. There is nothing in the template that restricts the fragment to digits, mainly because I didn’t really want to make the template syntax *that* expressive. That is, things start to get very complicated very quickly once we start adding value patterns to the template. However, it would be possible to insert value patterns in place of the value list, e.g., {#ABC fragment ? : \d(3)}, where the backslash is used as an escape character. I’m not sure I like that, but it is possible to do.

    Regarding the comparison between the regex and the template, yes, both are fairly ugly. The template has a semantic advantage in that it is easily parseable into familiar URI components (scheme, host, path, query, fragment, etc). Also, its use of named parameters makes it easier for a consumer to identify exactly which pieces of data are intended to go where in the template, as well as making it easier to move things around in the template later on. For example, suppose I have an existing template http://example.org/~{user}/index.{format} that I want to change to http://example.org/{format}/{user}. If I have an implementation API like template.set("format", "atom"), it would continue to just work properly with either template. It’s simply not possible to do that with your regex.

  10. James Says:

    Thomas: thank you for the thorough review of the ABNF syntax. I’ll see what I can come up with to fix the issues. In the meantime (and after your bbq ;-) ..) if you want to post a correction, I wouldn’t complain ;-)

  11. Lee Feigenbaum Says:

    Hi James,

    I’m not sure I understand your comment about named parameters?

    My regular expression uses named capturing groups, which provides the exact same capabilities as far as I can tell. What’s the difference?

    Lee

  12. James Says:

    Hmm.. I hadn’t recognized them as named groups. So if I read this correctly, given a pattern like (?:&count=(? .*?))?, I’m assuming I would use the name “count” to match? How about the format, scheme, host and fragment tokens? I don’t see names for those.

  13. Lee Feigenbaum Says:

    ahhhh LOL

    here’s what happened:

    the blog software didn’t escape my angle brackets, and swallowed them instead, which are how the named groups are expressed. let me try again:

    (?<scheme> https?)://(?<host> ww[12])\.example\.org/blog/
    (?<format> html|atom|rss)/?(?<user> .*?)\?
    (?:(?:&?tag=(?<tag> .*?))*|search=(?<search> .*?))?
    (?:&startdate=(?<startdate> .*?))?
    (?:&enddate=(?<enddate> .*?))?
    (?:&count=(?<count> .*?))?
    (?:&offset=(?<offset> .*?))?
    (?:#ABC(?<fragment> \d{3}))?

    hopefully that will come through correctly this time - sorry for the confusion, I should have proofread my original comment :-)

    Lee

  14. James Says:

    Heh.. for a second there I thought I was going nuts. Unfortunately, the use of the names makes the regex look worse ;-).. Hmm.. will have to think about it a bit more.

  15. Thomas Broyer Says:

    James: here’s the regex from RFC3986, Appendix B, adapted for templated-URIs:
    ^(?:((?:[^:/?#{]+|\{[^}]+\})+):)?(?://((?:[^/?#{]*|\{[^}]+\})*))?((?:[^?#{]*|\{[^}]+\})*)(?:\?((?:[^#{]*|\{[^}]+\})*))?(?:#((?:[^{]*|\{[^}]+\})*))?$
    I’ve also changed the matching groups so that a “.groups()” call (in Python) gives you a (scheme, authority, path, query, fragment) tuple.

    The following regexes (in Python) are for “ttoken”, “ttokens” and “tsegment” respectively:
    TREPEAT = r'(\?|\+\d*|\d*\*\d*)'
    TTOKEN = r'(?:([^ :)}]*) )?([^ )}]+)(?: (\?|\+\d*|\d*\*\d*) ([^ )}]*))?(?: : ([^)}]*))?'
    TTOKENS = ''.join([r'(?:', r'\(', TTOKEN, r'\)', r')* (\?|\+\d*|\d*\*\d*) ([^)}]*)'])
    TSEGMENT = ''.join([r'\{', '(?:', TTOKEN, '|', TTOKENS, ')', r'\}'])

    You can “.search()” with TSEGMENT to split each URI part into “literals” and tsegments, then use matching groups to extract “tprefix”, the variable name, “trepeat”, “tdelim” and “tparams”. “tparams” then has to be split on spaces.

    I’ve written some Python code to try this out. It seems to work (except for “http://example.org{/ path *}” where the tsegment is found in authority, not path; the fix is just a minor tweak in the first regex above).

    Yesterday (just before my bbq :-P ) I’ve been asked to go to a client’s office this monday and tuesday, so I probably won’t have time to play with templated-URIs/IRIs, or even have an internet connection… So you’ll have to wait ’til wednesday for some more feedback :-P

    About the ABNF, I’ll try to provide a “fix” tomorrow.
    I think “trepeat” should keep “?” and “+”, for simplicity (and given that these are widely used notations), but change “+” to be strictly equal to “1*” (no “+3” notation, use “1*3” instead).

  16. 虚拟主机 Says:

    Hi James,
    I’m not sure I understand your comment about named parameters. It’s complex :(

  17. Dan Brickley Says:

    Very interesting piece of work. I’ve just pointed to this URL during a telecon of W3C’s Content Labelling incubator group, where we are looking for RDF/XML ways of picking out groups of resources via pattern matching URIs (and IRIs).

    See http://www.w3.org/2005/Incubator/wcl/ and nearby, esp. http://www.w3.org/2005/Incubator/wcl/matching.html and http://lists.w3.org/Archives/Member/member-xg-wcl/2006Jun/0079.html (latter W3C member only). When I get off the telecon I’ll have a look around for the context behind your work here (i.e., what are you all collaborating on?). In particular, I’m wondering whether anybody here has interest in an RDF expression of these structures. Elias perhaps?