Full Text Search for Trac with Apache Solr



[A few minutes of projector trouble at the start; it was working last night.] Sorry for the slight delay, folks. My name is Alex, and I have been working on Trac, giving it some full-text search facilities through the Apache Solr project. Just a quick show of hands: who's familiar with Trac? Thought as much. Who's familiar with Solr? OK, I thought as much. So, just on the off-chance there is anybody on the planet who hasn't seen Trac: it's an integrated bug tracker, wiki, and

Subversion front end, all written in Python. It often gets deployed when projects need something in a hurry, because it's fairly self-contained; it has a bit of a reputation for being hard to configure, although I found the opposite, but I won't labour it. The problem we have today is this search box here. This is what we would like it to be like; this is what it is actually like: it basically boils down to SQL, LIKE field one, LIKE field two, and so on. That's great, and it lets you search changelogs, wiki text, and a few other things, but it doesn't let us search any of the attachments to our pages, or the contents of the repository itself, which would be very nice to do. For that we need to replace the built-in search, and the first thing we need to do is implement our own search source.

The internal architecture of Trac is done with interfaces. For those of you familiar with them, we have here a very minimal example. This is defined in Trac itself; it's just here to show off. ISearchSource is an interface which will be implemented by many different classes: it inherits from Interface, so that Trac can keep track of it, and it defines two methods. I hesitate to use the word "abstract", because that's a bit too Java-ish, but basically, when we want to provide a search source, we implement those two methods. When we implement them we need to inherit from the Trac-provided class Component, and what that gets us is that Trac will instantiate our object for us; it actually acts as a singleton, one per Trac process. The way this works is that when somebody does a search, Trac goes through all of the things that have implemented ISearchSource and calls get_search_filters. These are the tick boxes in the built-in search ("I want to search tickets and wiki"), where tickets and wiki are things provided by a search source somewhere in Trac. Then, once the user has chosen those, typed in some terms, and actually clicked
search, this method is called for us, and we get the search terms that were entered (just a list of strings) plus the filters that we had provided and that were chosen by the user. This is a very naive implementation, just to show the concept. Here is the other side; again very cut down, just to show the concept, this is what actually happens inside Trac to call all of our components that have provided an ISearchSource. Here is the list of components: everything that inherited from Component and declared that it implements ISearchSource will be in this class-level attribute. When something in the web interface causes find_cute_pets to be called (that might just be routed to /cute_pets, or whatever you want), we first get all the search filters that are available; we then check which were chosen (in this toy example, whether they're cute), and those go in a list of chosen filters. The chosen filters are then provided to all of the components in their get_search_results calls, and

we finally return those results, as get_search_results yields them. So it's a very simple loop of loops, and this is the way the built-in Trac search is implemented, with SQL queries. The difference in the real Trac is that these filters are mapped to what are called realms: there is a realm for wiki, a realm for ticket, and a realm for changelog, and this is what we would like to extend. So I will now switch to the actual code. In order to implement full-text search, we are providing our own ISearchSource. We have here a component called FullTextSearch, which implements a large number of interfaces; most of these are so it can gather the information as it's updated in Trac, and the critical one right here is ISearchSource. We then implement a number of configurable search realms; the code around it is just so that you can choose whether to use full-text search or the built-in search on a per-realm basis. The additional realms we are adding that aren't in the built-in Trac are source and attachment. Doing it this way also allows us to interoperate with things that we don't know about but which provide their own search source; for example, Trac has a plug-in for mailing lists which can also provide search results. So in our actual implementation we are basically taking the index (which I'll describe in a second), submitting a search request to Solr, massaging the results slightly so that they fit the search form Trac expects, and then sending them back. I won't go too much into this code because time is tight, but please ask me if you want to see anything.

Right, I'm about to jump into the internals of Solr itself; does anybody have Python questions first? [Audience: You just showed us how to get data out of the search engine. How do you get the data into the search engine?] That's kind of split over the two sections of this talk.
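As a rough sketch of the dispatch just described: this uses plain Python stand-ins rather than Trac's real trac.core machinery (Interface, Component, ExtensionPoint), and the class and function names here (WikiSearcher, do_search) are invented for illustration, echoing the cute-pets toy example from the slide.

```python
# Plain-Python sketch of Trac's extension-point pattern; the real
# machinery lives in trac.core and is only mimicked here.

class ISearchSource(object):
    """The two methods a search source must provide."""
    def get_search_filters(self, req):
        raise NotImplementedError
    def get_search_results(self, req, terms, filters):
        raise NotImplementedError

class WikiSearcher(ISearchSource):
    """Hypothetical implementer; a real source would run SQL LIKE queries."""
    def get_search_filters(self, req):
        yield ('wiki', 'Wiki pages')
    def get_search_results(self, req, terms, filters):
        if 'wiki' in filters:
            yield ('/wiki/WikiStart', 'WikiStart', 'matched: ' + ' '.join(terms))

def do_search(components, req, terms, chosen):
    """The 'loop of loops': ask every source for its filters, keep the
    ones the user ticked, then gather results from every source."""
    results = []
    for source in components:
        available = [name for name, label in source.get_search_filters(req)]
        filters = [f for f in available if f in chosen]
        if filters:
            results.extend(source.get_search_results(req, terms, filters))
    return results
```

In real Trac the `components` list is populated automatically for every class that subclasses Component and declares `implements(ISearchSource)`; here you would pass the instances in by hand.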
So, the way you get data in (and believe it or not, this is better than Linux on its own, on a MacBook, through a projector; I think it is Unity 2D) is that amongst the other interfaces Trac provides for you as a plug-in developer, you can implement interfaces such as the attachment change listener or the wiki change listener. If you declare that you implement one of these interfaces and provide its methods in your class, Trac will call them every time a wiki page is changed, created, or deleted, and we basically use that to submit the new information to Solr, or delete it from the index.

[Audience: Is the reason for using Solr to stay database-independent? I think we could do something like that with Postgres's full-text search, though I don't know if it would be more complex; I've never used it.] If we used the full-text search built into Postgres, that would buy us simplicity: we're already using Postgres as our back-end database, so we wouldn't have to run another process, and we wouldn't have to ship as much data over a socket or over the wire. What it doesn't get us, which I'll go into in a second, is some of the features of Solr, in that Solr can parse and index binary documents such as Word documents, PDFs, and Open Document Format. [Audience: Postgres would be able to do the wiki, because it's just plain text at the end of the day?] As can Solr, yes; but if we were using the full-text search built into Postgres, we would have to implement our own binary data extraction. Ultimately, the reason Solr was used is that the previous version used Postgres, and the previous team had some performance troubles. I haven't done a detailed comparison, but the principal thing you get is the binary files.

OK, so moving on to Solr itself. It is deep in the land of Java; it comes from the Apache project, hence it is Apache-licensed. It runs as a servlet, so you need a servlet container such as Apache Tomcat or Jetty to host it. All your interaction with Solr as a programmer is over HTTP, RESTful APIs; you can actually do a complete indexing and search with Solr using nothing but wget, though I don't advise it. All the exchanges are XML in our case, although JSON is an option.
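The change-listener route described a moment ago can be sketched in miniature. This is not the plug-in's code: the real thing uses the Sunburnt library and Trac's actual IWikiChangeListener interface (whose methods take more arguments than shown here); the method signatures are simplified, and the POST step is injected as a callable so nothing here talks to a live Solr.

```python
from xml.sax.saxutils import escape

def solr_add_xml(doc):
    """Serialize a dict of fields into the <add><doc> packet that
    Solr's XML update handler accepts."""
    fields = ''.join(
        '<field name="%s">%s</field>' % (escape(name), escape(str(value)))
        for name, value in sorted(doc.items()))
    return '<add><doc>%s</doc></add>' % fields

class WikiSolrListener(object):
    """Sketch of a wiki change listener that mirrors each change into
    Solr; in the plug-in this would be a Component implementing
    IWikiChangeListener and POSTing the packet to .../solr/update."""
    def __init__(self, post):
        self.post = post  # callable taking the XML packet to send

    def wiki_page_changed(self, name, text):
        self.post(solr_add_xml(
            {'id': 'wiki:' + name, 'title': name, 'body': text}))

    def wiki_page_deleted(self, name):
        self.post('<delete><id>wiki:%s</id></delete>' % escape(name))
```

A real deployment would batch the adds and follow them with a commit, as the talk describes later.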
You might want to consider JSON for reasons I'll go into later, and everything is UTF-8. The major way Solr works is that it has a concept of a document: a document is just a lump of key-value pairs to be indexed. You can customize the indexing on a per-field basis, but a field could be ten megabytes of text, so it's not a database; it does have some database-like properties, though, and if you're clever you can actually use it in some ways like a database. When you submit a document, Solr parses it into tokens. Tokens are usually strings; they can also be numbers. Tokens are stored in an index on disk by Solr, and the index is defined by a schema, which I'll show an example of in a second. The critical thing to realize is that this is not something you use in-process; like PostgreSQL, you have to run it as a server, and it needs managing and monitoring as a server. It is more complex, but the complexity does buy you a few features that an embedded search library might not. So, just a quick warning: we are deep in

the land of Java and Apache, and there is a lot of XML going on here, so my apologies in advance. I have tried to trim it as much as possible, but when the slides are published, please don't use these examples directly; they are very cut down to fit on a slide. There are two things you need to configure to run Solr. The first is solrconfig.xml; you can pretty much use the one provided in the Solr example download almost directly, with very little customization. The second is schema.xml, which is where you define all the fields that you want to index, how you want to index them, whether they're required, and whether you index them or just store them (you have both choices). This one will be heavily customized to your application, but again, start with the examples you can download from the Apache Solr website; starting from scratch is not advisable, as these files are almost a complete programming language in themselves.

Starting with schema.xml: this is very heavily based on the example schema that Solr ships. We have here very basic field types; very little parsing and tokenizing goes on for these fields, and they are the closest thing you're going to get to a database field in Solr. That's pretty much as it appears in the file. The name is an arbitrary string; it's just a convenient way to refer to a type you've defined, in multiple fields, without having to copy the definition again and again. So if you like, this is the actual type, but we'll refer to it by this name when we define our fields. A more useful example, for instance in the case of a wiki page where you've got the actual body of the text and you want to parse it, would be text_general. Again, this is a field type that we are defining, not a field, and in this case we are going full-on: parse this into tokens, index it, split it, make it as searchable as possible. Starting from the top,
we've got our name and a class, as before. The interesting part is that every field type you define, where it is not just something like an int, must have exactly one tokenizer. A tokenizer takes a stream of characters (not necessarily bytes) and splits it up into tokens. That could be as simple as splitting on whitespace, or whitespace plus a few punctuation characters; there are a large number of tokenizers to choose from. In our case we are using the URL/email tokenizer factory, because it lets us search on the components of URLs, domain names, and email addresses: a search for "gmail.com" would find anything that contained a Gmail address, and a search for "lucene.apache.org" would find anything with a link to the Apache Lucene website. I won't go into too much detail on this because time is limited (it's the subject of a book), but basically this is the workhorse of our search facility. Once we have our field types we can define our fields, and they are relatively simple. Up here we

have the fields that are let through unmolested, if you like; they are used as identifiers and for filtering and sorting, so you generally want to keep them as they came in, in the input data. You then get to the actual full-text fields that you want to form queries on, and in this case we are indexing the title and the body; there are others as well, but again, only what I could fit on a slide. I'll show you the full examples when we do the demo at the end. That concludes schema.xml. The only other file to configure is solrconfig.xml, which is actually what decides whether this particular instance accepts updates, accepts searches, or just one or the other. It's roughly equivalent to your root urls.py in Django, I guess. Here we define a request handler, in this case for our searches, and we define a load of defaults; all of these can be overridden at runtime with the right URL parameters. Going from top to bottom: Solr automatically paginates your results, so if you do a search that matches a thousand documents, it won't give you a thousand documents straight away; it'll give you ten, in this case. The defType refers to the default query type that this handler uses, edismax, which is short for "extended dismax" (sorry, I forget the full term). What it means is this: there are many search syntaxes you could use. If we weren't using Solr and were just using plain Apache Lucene, there's a very formal search syntax; you can't just type in some words and hope it'll search across all fields for them. Edismax is a nice middle ground between a full search syntax, where you can restrict to particular fields, do range queries, that sort of thing, and just typing in a couple of keywords with some sensible defaults. In this case we are using AND queries by default; this is not the Solr default, and if there's time at the end I'll explain the reasoning behind that.
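Before leaving schema.xml behind, here is roughly what the pieces described above look like assembled. It is cut down just as the slides were, and the field names, types, and attributes are an illustrative reconstruction based on the stock Solr example schema, not the plug-in's actual file.

```xml
<schema name="tracfulltext" version="1.4">
  <types>
    <!-- pass-through type: no tokenizing; the closest thing to a
         database column, used for identifiers, filtering and sorting -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <!-- full analysis: split into tokens and lowercase; this tokenizer
         also indexes the components of URLs and email addresses -->
    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <!-- let through unmolested -->
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="realm" type="string" indexed="true" stored="true"/>
    <!-- the actual full-text fields queries are formed over -->
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="body" type="text_general" indexed="true" stored="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
```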
The next one, qf, is short for query fields: with edismax, if your query doesn't specify which fields to search, it will use these. The little hat symbol with the number after it is what's called a boost. So, for instance, if we search for "john" and it matches in the body, that is only a quarter as interesting to us as a match in the title, and if we match "john" in the author field then that is really interesting and we want to promote it. Basically, the search results Solr returns are, by default, in order of interest; it tries to rank them like Google would, the most interesting first, rather than most recent or alphabetical or something like that. I won't go into pf right now, because it's a rather large topic I don't fully understand. So this is our search handler. The name here matches the URL component, so to use this particular handler we would submit a GET request to hostname/

solr/search?q=john, and if we didn't specify anything else, all those other defaults would take effect, and we would get back a bundle of XML with scores and the fields we had specified elsewhere as being in the default search. Then we would do a second query to the same URL, but with a couple of parameters that specify an offset, so you get the next ten results, and so on.

OK, going back to the subject: that's how you get data out of Solr; how do you get it in? Again, it's a request handler, but with a different name and a different class implementing it. We've got two update request handlers here. The first one is the default: you construct your own XML packet and you POST it, and that's it. If you get the right HTTP response back, it means Solr has indexed those documents. You then either send a second XML packet, which is basically just <commit/>, and that actually writes the documents so they are available to searches, or you can specify update?commit=true and it will commit automatically. As with databases, your best strategy for performance is not to commit on every document; committing every 10 or 50 documents will serve you nicely. The second handler here ties into the real power of Solr, in that it can index arbitrary files; well, not arbitrary files, but a lot of file formats. So if we wished to index this presentation, for instance, we would perform a POST to update/extract; it would be handled by this extracting request handler, and all the text in the document would be transformed into XHTML and mapped to the body field. That's how we get our attachments and our Subversion repository content into Solr.

All right, that's a whistle-stop tour; I do apologize, it's a lot to fit into a talk of this length, but I'd like to finish by showing you some actual files and a quick demo.
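And here is the solrconfig.xml side of what was just described, again a cut-down, illustrative reconstruction rather than the plug-in's real file: the handler names follow the talk, while the boost values and defaults are assumptions.

```xml
<config>
  <!-- the handler name maps to the URL: GET /solr/search?q=... -->
  <requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
      <int name="rows">10</int>             <!-- paginate ten at a time -->
      <str name="defType">edismax</str>     <!-- extended dismax queries -->
      <str name="mm">100%</str>             <!-- AND-like: all terms must match -->
      <!-- qf: fields searched when the query names none; ^n is the boost -->
      <str name="qf">body^0.25 title^1.0 author^4.0</str>
    </lst>
  </requestHandler>
  <!-- plain XML updates: POST <add><doc>...</doc></add>, then <commit/> -->
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler"/>
  <!-- Tika-backed extraction of binary files: POST to /update/extract -->
  <requestHandler name="/update/extract"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="fmap.content">body</str>   <!-- map extracted text to body -->
    </lst>
  </requestHandler>
</config>
```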
[Audience: This is a Trac plug-in that you're going to show us now; is that available for 0.12?] Yes. [Audience: OK, so we don't have to wait until 0.13 to actually put it in there?] No, this isn't part of the core of Trac; it's a plug-in that is compatible with 0.12, and at the end of the talk I'll show you the URLs to download all of this from. OK, perfect, thanks. So, demos; please shout if the text is still too small. What I have here is a Trac environment already pre-populated with a repository, some wiki pages, attachments, and so on. I've cheated a little, in that I have a fairly simple script which runs Solr on a non-privileged port with Jetty, runs Trac, and afterwards cleans up after itself. I'm just going to run that now.

If you're not familiar with Trac already: once you install it you get two commands. tracd is a simple web server, for running it in development or just on an intranet; if you want to run it in production then it's the usual story, mod_wsgi with Apache, or nginx with whatever the plugin for nginx is called. The second command is trac-admin, which is a command-line interface for administering your Trac environment, and it comes with a load of subcommands. This plug-in adds an extra one called fulltext, which has some commands of its own. index just updates the index in case something has been missed, but doesn't redo anything that has already been indexed. If you want to wipe everything and completely re-index, to be absolutely sure, that is reindex. remove just deletes entries; list lets you inspect what is in the index; and optimize just passes through to an Apache Solr function which is basically the equivalent of a Postgres VACUUM or a Windows defrag: it coalesces the index and hopefully speeds it up. I've already indexed this environment, so index does very little, but if I do a reindex, that actually deletes the index and starts again, looping over all of the documents in all of the realms and submitting them to Solr. To give you an idea of the speed: we have a Trac project with a few gigabytes of Subversion, a few gigabytes of attachments, and a few hundred wiki pages, and that takes about twelve hours to index in a virtual machine on a laptop, and about eight hours on a virtual server. [Audience: You said it's taking rather long to index. What if the wiki or the attachments are changing in that time, while you are running the index?] The indexing routine and the update listeners should catch everything between them. I apologize; I wish this had
more unit test coverage. I still need to figure out how to get Solr into a test harness so that I can do integration-level tests; but it has been running in production for about six months now, and we haven't found any document that it failed to index. If at any time you're not sure, you can run fulltext index, and that will loop over all of the documents in all of the realms and check that there is a matching entry in Solr. So to be absolutely sure, just run fulltext reindex and then, a few minutes later, fulltext index, and you should be doubly sure. So now to show you the actual thing: Trac itself is running on port 9000,

which is just what the script asked of it. What your users see that is different from Trac's built-in search is that we have these two extra realms. If we do a search, for instance, for "iptables", we should find this attachment, the SSH scan, on WikiStart (in fact I've uploaded that several times), and that takes us back to the document we saw before. So that is Trac with full-text search and Solr; thank you very much for your time. I also promised download URLs, so a quick apology and semi-word-of-warning about this first URL. Sunburnt is a Python library for interfacing to Solr, so you don't have to implement your own XML construction. The plug-in as written at the moment needs a slightly customized version of Sunburnt, so that URL is just a fork which allows Sunburnt to use the extracting request handler, making it easier to submit binary documents to Solr. I am not the original author of Sunburnt, and I'm not trying to claim credit for it, but you do need to use that version for now; I do intend to get the changes upstreamed over the rest of the summer. The second URL is the plug-in itself, and it includes instructions on how to add it and how to configure Trac so that the built-in search is disabled and this can replace it. There's also a minor patch to Trac that you'll need to apply to your 0.12 install; it only affects one file. Again, my apologies: there wasn't an interface for overriding the built-in search sources, but I hope to work with the Trac people so that a future release will require no patching, and you'll just need to download the plug-in and use it. [Chair: Well, the time is finished, but I think you can continue with the questions.] [Audience: Which alternatives did you look at, and what made you choose Solr?] I'm afraid I can't answer that question fully, because I didn't begin this plug-in
from the start; it was something I picked up after another member of the team left. I can give what I believe were the reasons, though. The previous version of our product used Postgres's built-in full-text search, and there were performance problems; I don't know if that was because it was configured poorly, or a limitation of that version of Postgres. It also didn't do the indexing of binary files, and that, for us, is the killer feature: Solr builds on another Apache product called Tika, and that supports a vast array of file formats that we couldn't hope to handle ourselves. That's almost certainly the reason we went for Solr over other indexing engines. [Audience: Just a small remark: in your configuration you weighted the title higher than the tag. Is that correct, or is it useful?] That might be something I need to revisit; I'll have a look at that

in a second. The title is a bit of a lie, in that Trac doesn't keep perfectly consistent naming: a wiki page has a name, but a ticket has a summary, and we map "title" in the code to all of those things that act as the human identifier. That's possibly why it's weighted higher: it's the actual name of the thing, as opposed to a tag attached to it. [Audience: Two questions. One is whether you fully support Unicode; the second, because I don't know anything about Solr, is whether Solr supports multiple languages. I know from other search engines that you have to configure the languages being supported, because a search engine uses stop words so as not to index, for example in English, "a" or "the".] It's not something I can answer in much depth, because we took the coward's way out and only support English. In terms of Unicode, Solr is UTF-8 all the way through, so there is no problem with national character sets. There is a minor gotcha, in that XML, which is the default data format for all communication and is what Sunburnt uses, doesn't support all Unicode characters. Yes, it surprised me as well: in XML 1.0, which is what just about everything out there supports, there are a number of characters at the very low end of ASCII, in the control-character space, that are illegal in an XML document. So that is something to watch out for; the rest of Unicode is fine. In terms of the human-language support of Solr, there are tokenizers and filters which, we believe, can detect the language of a document, annotate it as such, and adapt the stop words and stemming to that language; but it is safe to say that English is by far the best supported. It comes out of the box with classes named things like EnglishStemmer, and the example schema has special field types for English text, as opposed to the plain text_general, with better customized stemming support.
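Coming back to the XML 1.0 gotcha mentioned just above: it is easy to defend against by filtering the illegal code points out of field values before building the update packet. This is a sketch of one way to do it, not necessarily how the plug-in or Sunburnt handles it.

```python
import re

# XML 1.0 permits, below U+0020, only tab, newline and carriage
# return; the rest of the C0 control block is illegal even when
# escaped, as are U+FFFE and U+FFFF.
_ILLEGAL_XML10 = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')

def strip_illegal_xml(text):
    """Drop characters that cannot appear in an XML 1.0 document."""
    return _ILLEGAL_XML10.sub('', text)
```

Run over every string field before serializing, this keeps a stray control character in, say, a commit message from making Solr reject the whole update packet.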
So you can do other languages, but you need to go deeper into it; it's something that would take longer to get integrated, let's say. [Audience: Just a short question: you showed the result page of your search in the demo; do you intend to use Solr's snippets feature to show extracts of the found documents, things like that?] I want to. This was originally implemented as part of my day job, and released with the full blessing of the company. I can't promise that I will be working as heavily on it in the future, because we have other priorities at the moment, I'm afraid, but I will be sprinting on this and on Trac at least on Saturday, so if you'd like to join me then, I can certainly aid you, or spend some time on it myself. Obviously all bug reports, patches, complaints, and questions are gratefully received, either on Twitter or at my work email. If anybody wants to go into more depth about how to configure Solr, there are two books available; I'm about halfway through them both, and I have them with me here if anybody wants a look, with a fair bit of advice on how to configure and install Solr. Right, I think I've kept you long enough. Thank you very much for your time, everybody.
