Archive
Full text searching in MSDE (without MSSEARCH)
One of the drawbacks of MSDE is that Full Text Searching (FTS) us unavailable. However, you can put together your own version of FTS with a stored procedure and a few tables.
The benifit of FTS is that it can match queries against english text without great difficullty. Say we had some textual data "The cat sat on the mat" and we wanted to make a select statement for text containing the word "cat" we would use: select from sometable where text like ‘% cat %’. This has two knock on effects – (a) text "like" comparisons are relatively slow. and (b) since we have no idea how many records the query will return, we cannot use select top to speed up the query. Furthermore, if we searched for a collection of words such as "cat", "mat" and "floor", the above text should still come with a match, albeit with a lower degree of certainty, since the word "floor" is not in the data.
Instead, if we create a table conatining all the words in our table, joined on another table containing a link to each word, and a link to the document’s primary key. we can then do queries such as
select ID into #phraseIDs from phrases where phrase = ‘ cat’
select DocumentID into #DocIDs from links L join #phraseIDs P on P.ID = L.ID
select * from documents d join #DocIDs d2 on d.ID = D2.DocumentID
Although this looks more complex, since we are only using exact string comparison and joins on integers (assuming all the relevant columns are properly indexed). The query should work out to execute quicker.
The trick to doing this starts with a nice user defined function called fn_Split (http://www.windowsitpro.com/Windows/Articles/ArticleID/21071/pg/2/2.html)
Then I created two temporary tables in a seperate database
create database dinfoFTS
go
create table dinfoFTS..Phrases
( id int identity(1,1) not null,
Phrase varchar(1000)
)
go
create table dinfoFTS..Links
( id int identity(1,1) not null,
PhraseID int,
DocumentID int
)
go
I then created a stored procedure to populate these tables, making use of fn_split
alter procedure CreateFTS as
SET NOCOUNT on
delete from dinfoFTS..Phrases
delete from dinfoFTS..Links
create table #phrases (phrase varchar(1000),minID int)
DECLARE @name varchar(1000)
DECLARE @id int
DECLARE @lastInserted int
DECLARE companies_cursor CURSOR FOR
SELECT name + ‘ , ‘ + address,id
FROM companies
OPEN companies_cursor
FETCH NEXT FROM companies_cursor INTO @name,@id
WHILE @@FETCH_STATUS = 0
BEGIN
print @id
— parse the company name and address
insert into #phrases (phrase)
select value from dbo.fn_Split(@name,’ ‘)
— set the minID for any duplicates
update #phrases
set #phrases.minID = dinfoFTS..Phrases.id
from dinfoFTS..Phrases
where dinfoFTS..Phrases.Phrase = #phrases.phrase
set @lastInserted = scope_identity()
— insert any non-duplicates
insert into dinfoFTS..Phrases (phrase)
select phrase from #phrases where minID is null
— set the minID for recently inserted phrases
update #phrases
set #phrases.minID = dinfoFTS..Phrases.id
from dinfoFTS..Phrases
where dinfoFTS..Phrases.Phrase = #phrases.phrase
and #phrases.minID is null
and dinfoFTS..Phrases.id>@lastInserted
— Then insert #phrases into the dinfoFTS..Links table
insert into dinfoFTS..Links (PhraseID,DocumentID)
select minID,@id from #phrases
— clear out #phrases
delete from #phrases
FETCH NEXT FROM companies_cursor INTO @name,@id
END
CLOSE companies_cursor
DEALLOCATE companies_cursor
drop table #phrases
As a benchmark it took 20 seconds to read 1000 rows, and I’m now running it on a 11M row database!.
If anybody has any optimization hints, just post a comment.
Screen scraping windows applications
I’ve often found the need for extracting text from other windows applications, for example to extract text from MSN messenger, or the Internet explorer address bar or whatever. Instead of opening up Spy++ to get the Hwnd, and working with scraps of Win32 API calls I decided to write a generic class which recursively extracts text from every window on the screen, and stores it in an XmlDocument.
I used a VB6 version of this class before, so I decided to port it to .NET to modernize the code a bit, hense it’s written in VB.NET.
First I declared a few API functions thus:
Private Declare Function GetWindow Lib "user32.dll" (ByVal hWnd As Integer, ByVal wCmd As Integer) As Integer
Private Declare Function GetDesktopWindow Lib "user32.dll" () As Integer
Private Declare Function SendMessage Lib "user32.dll" Alias "SendMessageA" (ByVal hWnd As Integer, ByVal Msg As Integer, ByVal wParam As Int32, ByVal lParam As Int32) As Integer
Private Declare Function SendMessage Lib "user32.dll" Alias "SendMessageA" (ByVal hWnd As Integer, ByVal Msg As Integer, ByVal wParam As Int32, ByVal lParam As String) As Integer
Private Const GW_HWNDNEXT As Short = 2 ‘The window below the given window in the Z-order.
Private Const GW_CHILD As Short = 5 ‘The topmost of the given window’s child windows. This has the same effect as using the GetTopWindow function.
Private Const WM_GETTEXT As Short = &HDS ‘ Get Window text
Private Const WM_GETTEXTLENGTH As Short = &HES ‘ Get the length of window text
Notice that SendMessage is declared twice, with an overloaded parameter lParam. I found this to be a better approach to using "Object" as the type of the parameter, as it kept causing Ole errors.
The enty function was simply
Public Function Analyze() As System.Xml.XmlDocument
Return Analyze(GetDesktopWindow())
End Function
This then calls the Analyze(hWnd) function thus:
Private
Function Analyze(ByVal hWnd As Integer) As System.Xml.XmlDocument
Dim childHwnd As Integer
Dim XMLRootElement As Xml.XmlElement
Dim XMLRoot As Xml.XmlElement
xmlDoc = New XmlDocument
childHwnd = GetWindow(hWnd, GW_CHILD)
XMLRootElement = xmlDoc.CreateElement("window")
XMLRoot = xmlDoc.AppendChild(XMLRootElement)
AppendAnalysis(XMLRoot, hWnd, Nothing,0)
recurseChildren(XMLRoot, childHwnd, hWnd)
Return xmlDoc
End Function
Which then calls the recursive function recurseChildren
Private Sub recurseChildren(ByRef XMLParent As System.Xml.XmlElement, ByRef childHwnd As Integer, ByRef parentHwnd As Integer)
Dim grandChildHwnd As Integer
Dim ZCount As Integer = 1
Dim XMLChildElement As XmlElement
Dim XMLChild As XmlElement
Do Until childHwnd = 0
XMLChildElement = xmlDoc.CreateElement("window")
XMLChild = XMLParent.AppendChild(XMLChildElement)
AppendAnalysis(XMLChild, childHwnd, parentHwnd, ZCount)
ZCount = ZCount + 1
grandChildHwnd = GetWindow(childHwnd, GW_CHILD)
recurseChildren(XMLChild, grandChildHwnd, childHwnd)
childHwnd = GetWindow(childHwnd, GW_HWNDNEXT)
Loop
End Sub
This calls AppendAnalysis to build up the XmlDocument
Private Sub AppendAnalysis(ByRef Parent As System.Xml.XmlElement, ByRef childHwnd As Integer, ByRef parentHwnd As Integer, ByRef ZCount As Integer)
Dim Description As String
Dim XMLNodeWindowText As XmlElement
XMLNodeWindowText = xmlDoc.CreateElement("Text")
XMLNodeWindowText.InnerText = getTextFromHwnd(childHwnd)
Parent.AppendChild(XMLNodeWindowText)
If XMLNodeWindowText.InnerText = "" Then
Description = "#" & ZCount
Else
Description = XMLNodeWindowText.InnerText
End If
End Sub
And finally, we get down and dirty with the Win32 API
Private Function getTextFromHwnd(ByVal hwnd As Integer) As String
‘ use <?xml version="1.0" encoding="UTF-8"?> as a header.
Dim wintext As String ‘ receives the copied text from the target window
Dim slength As Integer ‘ length of the window text
Dim retval As Integer ‘ return value of message
slength = SendMessage(hwnd, WM_GETTEXTLENGTH, 0, 0) + 1
wintext = New String(ChrW(0), slength)
retval = SendMessage(hwnd, WM_GETTEXT, slength, wintext)
wintext = wintext.Substring(0, retval)
Return wintext
End Function
Here, I used SendMessage WM_GETTEXT rather than getWindowText, since I found that the SendMessage technique works better accross process boundaries.
The resultant XML looks like this
<window><Text></Text><window><Text></Text></window><window><Text>AppAnalyze.NET – Microsoft Visual Basic .NET [run] – Appanalyze.vb [Read Only]</Text>…
Importing a malformed flat file using DTS
If you ever had the pleasure of importing a large text file into SQL server using DTS Import and Export data, and had it fail on row 10.000, just to drop you back to the start again, you might find this a life saver.
You can get a DTS import to ignore malformed data using this process:
When using DTS import and export data, select "Save as DTS package". Run the task as usual and when it fails, open Enterprise manager, select your database, Select Data Transformation services > Local packages. Right click on the DTS package, and select "Design package"
Go to menu > package > Disconnected Edit then go to Tasks and click on the task that is failing due to errors. Located within this task properties is MaximumErrorCount and change that to a large number. Click the close button. Press package then execute to run the query
see: http://sqlservercentral.com/forums/shwmessage.aspx?forumid=92&messageid=173561 for more info…
Using Response.Filter in ASP.NET
http://www.webtropy.com/projects/textonly/
I was once asked if it were possible to easily create "Text only" versions of websites, so that they wouldn’t cost a fortune to look at when using a smart phone or PDA – which can cost 2 euros per megabyte, or more. I had forgotten about it for a while until I picked up the september 2004 edition of MSDN, which was sitting under my coffee table, and came across an article about Response.Filter, and it’s application for custom ASP.NET traces.
I developed this in VS.NET 2005 / .NET 2.0, but the code is the same for previous versions.
I activate the filter from a button click thus:
void Button1_Click(object sender, EventArgs e)
{
Response.Filter = new TextOnlySteam(Response.Filter);
}
And now, I define my custom stream, TextOnlyStream
public class TextOnlySteam : MemoryStream
{
private Stream strmOutput;
/// <summary>
/// Constructor, initializes the output stream
/// </summary>
public TextOnlySteam(Stream strmConstructor)
{
strmOutput = strmConstructor;
}
/// <summary>
/// Replaces occurrances of src="whatever.jpg" with src=""
/// </summary>
public override void Write(byte[] buffer, int offset, int count)
{
Encoding enc = HttpContext.Current.Response.ContentEncoding;
string strHtml = enc.GetString(buffer, offset, count);
strHtml = Regex.Replace(strHtml, "src=\"\w.*\"", "src=""",
RegexOptions.IgnoreCase);
byte[] bHtml = enc.GetBytes(strHtml);
strmOutput.Write(bHtml, 0, strHtml.Length);
}
}
The code should be improved to support other types of image inclusion such as table background and the CSS background url tag. Also, HTML does not dictate that inverted commas need to delimit tag parameters, apostrophes, or spaces can suffice. For this I would expect that I would use the HTML Agility pack to normalize the HTML to XML before parsing.
Oh, and I also set live two new sites www.listOfTaxis.info and www.listOfBanks.info (the DNS mightn’t propogate until monday morning)
set IDENTITY_INSERT is connection specific?
Just noticed a glitch in SQL server 2000. Which is quite ironic, since this blog is more than likely hosted on SQL 2000, being with Msn spaces.
use Database1
set IDENTITY_INSERT Database1.dbo.operator on
use Database2
set IDENTITY_INSERT Database2.dbo.operator on
IDENTITY_INSERT is already ON for table Database1.dbo.Operator’. Cannot perform SET operation for table Database2.dbo.operator’.
When I opened a new connection to server, it allowed me set the identity insert on database2. (the names of the tables involved have been changed for client privacy). So does this mean that SET variables are connection specific, and that the IDENTITY_INSERT variable can only be applied to one table at a time? – I can’t really think of a reason for making a multi-table insert requiring idenity inserts within the same connection, but I have to complain about something….
Anyway, just bought myself a new laptop today, off micromart, After my last one died when I spilt coffee on it!. If anybody would like a dead toshiba laptop for spare parts or to repair it, please post a comment on this blog, and I’ll get back to you.
Three new sites launched
Just completed three new sites. SQL server 2005 back end, with ASP 2.0
http://www.listofbuilders.info
http://www.listofElectricians.info
http://www.listofrestaurants.info/
All pretty much the same, but I was just learning about templates.
COM interoperability with Matlab 7 in C#
If you try to import Matlab 7.0’s com object into C#, you get this error when you try to call it using early binding.
MLApp.MLAppClass MyMatlab = new MLApp.MLAppClass();
textBox1.Text = MyMatlab.Execute("version");
results in:
An unhandled exception of type ‘System.InvalidCastException’ occurred
in Test version7.exe
Additional information: QueryInterface for interface MLApp.DIMLApp
failed.
This is apparently due to a mix up with GUID’s for mlapp.tlb, so, instead of hacking with the registry to correct their GUID, you can alternatively use late binding on the object thus;
Type objFactoryType;
objFactoryType = Type.GetTypeFromProgID("Matlab.Application");
object objFactory;
objFactory = Activator.CreateInstance(objFactoryType);
object[] args = new object[1];
args[0]="version";
object retval;
retval = objFactoryType.InvokeMember("Execute",BindingFlags.InvokeMethod,null,
objFactory,args);
MessageBox.Show(retval.ToString());
Which works fine…
Accessing a database with Macomedia Director
Working on a university project which included a need for Macromedia Director to access a database (in my case, MS Access). I downloaded ADOXTRA.X32 from XtraMania.com, copied it into my D:Program FilesMacromediaDirector 8.5Xtras folder, then with a bit of tinkering, I wrote the following Lingo function:
on GetDatabaseField (SQL,Column)
ADO = xtra("ADOxtra")
ADO.Init(true)
gRst = ADO.CreateObject(#Recordset)
gRst.ActiveConnection = "Provider=Microsoft.Jet.OLEDB.4.0;Mode=Read;Data Source=" & the moviePath & "prototype.mdb"
gRst.cursorLocation=gRst.adUseClient
gRst.LockType=gRst.adLockReadOnly
gRst.CursorType=gRst.adOpenStatic
gRst.Source=SQL
gRst.Open()
if gRst.Failed then
alert gRst.LastError
halt
end if
gRst.MoveFirst()
retVal = gRst.fields[Column]
gRst.Close()
return retVal
end
Which is called with something along the lines as:
on
mouseDown me
set the text of member "DynamicText" = GetDatabaseField ("select firstname from people where id = 1 ","firstname")
end
Retrieving web pages with foreign characters
When you make a request to a web page using code such as:
HttpWebRequest httprequest = (HttpWebRequest)WebRequest.Create(requestURI);
HttpWebResponse httpresponse = (HttpWebResponse)httprequest.GetResponse();
Stream responsestream = httpresponse.GetResponseStream();
StreamReader httpstream = new StreamReader(responsestream);
string bodytext = httpstream.ReadToEnd();
You may find that certain characters may be missing from the string returned, such as the copyright © character, or foreign characters, such as é (e acute). In order to get around this you need to use Latin encoding (ISO 8859) in the StreamReader thus:
StreamReader httpstream =
new StreamReader(responsestream, Encoding.GetEncoding("iso8859-1"));
… had me stumped for ages!
Indexed views SQL Server 2005
In an attempt to speed up queries coming from my 11 million row table "Companies", I decided to dabble in a bit of indexed views, which are now available in sql 2005.
Here’s a transscript of how I got on (warts and all)
Microsoft Windows [Version 5.2.3790]
(C) Copyright 1985-2003 Microsoft Corp.
C:Documents and SettingsAdministrator>sqlcmd
1>
Sqlcmd: Warning: Last operation was terminated by the user by pressing Ctrl-C.
C:Documents and SettingsAdministrator>sqlcmd -S.SQLEXPRESS -E
1> use dinfo
2> create view VCompanies with schemabinding as
3> select * from companies
4> go
Msg 111, Level 15, State 1, Server OPENMERCHANTACCSQLEXPRESS, Line 2
‘CREATE VIEW’ must be the first statement in a query batch.
Msg 156, Level 15, State 1, Server OPENMERCHANTACCSQLEXPRESS, Line 3
Incorrect syntax near the keyword ‘select’.
Sqlcmd: Error : Microsoft OLE DB Provider for SQL Server : One or more errors oc
curred during processing of command..
1> use dinfo
2> go
Changed database context to ‘dinfo’.
1> create view VCompanies with schemabinding as
2> select * from companies
3> go
Msg 1054, Level 15, State 6, Server OPENMERCHANTACCSQLEXPRESS, Procedure VCompa
nies, Line 2
Syntax ‘*’ is not allowed in schema-bound objects.
1> create view VCompanies with schemabinding as
2> select id,name,address,city,PostCode,Telephone from companies
3> go
Msg 4512, Level 16, State 3, Server OPENMERCHANTACCSQLEXPRESS, Procedure VCompa
nies, Line 2
Cannot schema bind view ‘VCompanies’ because name ‘companies’ is invalid for sch
ema binding. Names must be in two-part format and an object cannot reference its
elf.
1> select id,name,address,city,PostCode,Telephone from dinfo..companies
2> create view VCompanies with schemabinding as
3> select id,name,address,city,PostCode,Telephone from dinfo..companies
4> go
Msg 111, Level 15, State 1, Server OPENMERCHANTACCSQLEXPRESS, Line 2
‘CREATE VIEW’ must be the first statement in a query batch.
Msg 156, Level 15, State 1, Server OPENMERCHANTACCSQLEXPRESS, Line 3
Incorrect syntax near the keyword ‘select’.
1> create view VCompanies with schemabinding as
2> select id,name,address,city,PostCode,Telephone from dinfo..companies
3> go
Msg 4512, Level 16, State 3, Server OPENMERCHANTACCSQLEXPRESS, Procedure VCompa
nies, Line 2
Cannot schema bind view ‘VCompanies’ because name ‘dinfo..companies’ is invalid
for schema binding. Names must be in two-part format and an object cannot refere
nce itself.
1> create view VCompanies with schemabinding as
2> select id,name,address,city,PostCode,Telephone from dbo.companies
3> go
1> CREATE UNIQUE CLUSTERED INDEX VDiscountInd ON Vdiscount1 (ProductID)
2> go
Msg 1088, Level 16, State 12, Server OPENMERCHANTACCSQLEXPRESS, Line 1
Cannot find the object ‘Vdiscount1’, because it does not exist or you do not hav
e permission.
1> create unique clustered index idxVcompanies on VCompanies (name,address,city,
postcode,telephone)
2> go
Msg 1935, Level 16, State 1, Server OPENMERCHANTACCSQLEXPRESS, Line 1
Cannot create index. Object ‘VCompanies’ was created with the following SET opti
ons off: ‘QUOTED_IDENTIFIER’.
Sqlcmd: Error : Microsoft OLE DB Provider for SQL Server : One or more errors oc
curred during processing of command..
1> SET ANSI_NULLS ON
2> SET ANSI_PADDING ON
3> SET ANSI_WARNINGS ON
4> SET CONCAT_NULL_YIELDS_NULL ON
5> SET NUMERIC_ROUNDABORT OFF
6> SET QUOTED_IDENTIFIER ON
7> SET ARITHABORT ON
8>
9> go
1> create unique clustered index idxVcompanies on VCompanies (name,address,city,
postcode,telephone)
2> go
Msg 1935, Level 16, State 1, Server OPENMERCHANTACCSQLEXPRESS, Line 1
Cannot create index. Object ‘VCompanies’ was created with the following SET opti
ons off: ‘QUOTED_IDENTIFIER’.
1> drop view vcompanies
2> go
1> create view VCompanies with schemabinding as
2> select id,name,address,city,PostCode,Telephone from dbo.companies
3> go
1> create unique clustered index idxVcompanies on VCompanies (name,address,city,
postcode,telephone)
2> go..
Warning! The maximum key length is 900 bytes. The index ‘idxVcompanies’ has maxi
mum length of 1275 bytes. For some combination of large values, the insert/updat
e operation will fail.
Msg 1105, Level 17, State 2, Server OPENMERCHANTACCSQLEXPRESS, Line 1
Could not allocate space for object ‘SORT temporary run storage: 49609231545139
2′ in database ‘dinfo’ because the ‘PRIMARY’ filegroup is full. Create disk spac
e by deleting unneeded files, dropping objects in the filegroup, adding addition
al files to the filegroup, or setting autogrowth on for existing files in the fi
legroup.
The statement has been terminated.
Sqlcmd: Error : Microsoft OLE DB Provider for SQL Server : One or more errors oc
curred during processing of command..
alter database dinfo
ADD FILE
(
NAME = dinfot2,
FILENAME = ‘C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLDatadinfo2.ndf’,
SIZE = 5MB,
MAXSIZE = UNLIMITED,
FILEGROWTH = 5MB
)
GO
… and re-ran the Create index statement..