Archive

Archive for December, 2011

Lucene example, free text search on .NET DataSet

If you don’t have Free Text Search for your SQL server database, then you can use Lucene.NET to run free text searches on DataSets.

First you have to build up a Full text index (on disk), based on an SQL statement:

        var strIndexRoot = Server.MapPath("FT");
        if (System.IO.File.Exists(strIndexRoot + @"\write.lock"))
        {
            System.IO.File.Delete(strIndexRoot + @"\write.lock");
        }
        Directory directory = FSDirectory.GetDirectory(strIndexRoot);
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriter writer = new IndexWriter(directory, analyzer);        
        var strSQL = "select id,text from yourTable";       
        System.Data.DataSet ds = ExecuteDataSet(strSQL);
        foreach (DataRow dr in ds.Tables["sql"].Rows)
        {
            Document doc = new Document();
            doc.Add(new Field("id", dr["id"].ToString() , Field.Store.YES, Field.Index.NO));
            doc.Add(new Field("postBody", dr["text"].ToString(), Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);            
        }
        writer.Optimize();
        writer.Flush();
        writer.Close();

You need to have a folder called “FT” (Free text) as a subfolder of your ASP.NET root folder.
I’ve omitted the code for “ExecuteDataSet”, it’s just a call to the database with an SQL command.

The FT folder will have the files:
segments.gen
segments_2
_0.cfs

Once the Free Text Index is generated.

Then to Query the free Text Index, I used the code:

       var strIndexRoot = Server.MapPath("FT");        
        if (IO.File.Exists(strIndexRoot + @"\write.lock"))
        {
            IO.File.Delete(strIndexRoot + @"\write.lock");
        }
        Directory directory = FSDirectory.GetDirectory(strIndexRoot);
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriter writer = new IndexWriter(directory, analyzer);
        QueryParser parser = new QueryParser("postBody", analyzer);
        Query query = parser.Parse("User Query goes here");
        //Setup searcher
        IndexSearcher searcher = new IndexSearcher(directory);
        //Do the search
        Hits hits = searcher.Search(query);
        int hitNumber = hits.Length();        
        for (int i = 0; i < hitNumber; i++)
        {
            Document doc = hits.Doc(i);
            var id = doc.Get("id");
            var text = doc.Get("postBody");
        }
        searcher.Close();
        directory.Close();
Categories: Uncategorized

Getting a numeric value for the similarity of two text values in SQL Server

If you don’t have the luxury of a full text index on your free SQL server express database, or can’t set it up, because someone else manages or hosts, it, then a work around is this function that returns a numeric value for the similarity of two text values (which can be used as an order-by clause later)

create function func_similarity
(@TextBody varchar(8000),
@SearchTerm varchar(4000))
returns int
as
BEGIN
if ( @SearchTerm = @TextBody ) return 2
if ( @Textbody like '%' + @SearchTerm + '%') return 1
if ( @SearchTerm like '%' + @Textbody + '%') return 1
return 0
END

In my case, I was looking for a ‘kind of’ free text lookup on a table of US zip codes:

create proc FindUSPostcode
@city varchar(100)
as
select top 10 * from postcodeus
order by
dbo.func_similarity(Town,@city)
+ dbo.func_similarity(City,@city)
+dbo.func_similarity(state,@city)
desc

This meant that a lookup for

FindUSPostcode('Detroit')

Returns a top result for Detroit, Maine (Somerset county). But ‘Detroit, MI’ returns a top result for Detroit Michigan.

An API will come soon! (postcodefinder.org.uk)

Categories: Uncategorized