The previous experience about realization of correlation coefficients, we need first to realize the similar funcation of search engine - word segmenting, indexing, and then statistically comparing TF/IDF in order to judge the relevance of the contents. With regard to the this technology, we can get a good reference from this article: Beauty of Number, which wirtten by Mr. Wu Jun, a researcher of Google.
The creation of Tag, is a excellent complement to realization of correlation coefficients. Because it provides a way that we can change the method from getting the key words by machine (robot) to getting those key words by individual's description. The accuracy and applicability of the result people do are much better than machine does.
In a simple system, we often use one content table, one tag table and one relevance table to realize Tag function. For examle, the three tables are content table namely Blogs, tag table namely Tags and relevance table namely Marks, both Blogs and Tags have primary-key id, the relevance table connect three tables by comparing with blog_id and tag_id:
<pre class=brush:sql> CREATE TABLE `blogs` (
`id` int(10) auto_increment,
`title` varchar(255) ,
PRIMARY KEY (`id`)
);
CREATE TABLE `tags` (
`id` int(10) auto_increment,
`tag` varchar(100) ,
`count` int(10),
PRIMARY KEY (`id`)
);
CREATE TABLE `marks` (
`tag_id` int(10),
`blog_id` int(10),
PRIMARY KEY (`tag_id`,`blog_id`)
) ; </pre>
In such kind of system, if we want to get the relevant content of article A, we can only retrieve all tags of this article, and then retrieve articles which associated with those tags on it. For instance, if we want to get the relevant blog of id 1 blog, then we can realize it in <a href="http://en.wikipedia.org/wiki/Join_(SQL)|left join" />this way</a> in Marks table:
<pre class=brush:sql> SELECT B.blog_id, B.tag_id FROM marks AS A LEFT JOIN marks AS B ON A.tag_id = B.tag_id WHERE A.blog_id =1 and B.blog_id != 1 </pre>
the result will show in blog_id field with all blogs which have connection with id=1
Of course there is also some disadvantages of doing it. First of all, this is a result that considering the important of all tags are same. However, in fact all tags have their own weight. For instance, in EASTHV's internal tech-blog, there are 12 articles tagged as "Javascript" and there are 2 articles tagged as "Apache". If every article only has one tag, then obviously the articles which tagged as "Apache" have more interaction than the articles which tagged as "Javascript".
Thereupon we can use the concept of TF/IDF to calculate the weight of Tag. If one blog have 100 articles, then the weight of Tag "Javascript" is Ln(100/12)=2.120, the weight of Tag "Apache" is Ln(100/2) is 3.912. Showing that in the SQL query is:
<pre class=brush:sql> SELECT B.blog_id, B.tag_id, LOG( 100 / C.count ) AS weight FROM marks AS A LEFT JOIN marks AS B ON A.tag_id = B.tag_id LEFT JOIN tags AS C ON B.tag_id = C.id WHERE A.blog_id =1 AND B.blog_id !=1 </pre>
We can get the weight of Tag by relating Tag table and the count value (the articles of one Tag).
Then the following question is, usually one article have more than one Tags, such as:
The Tags article A have are: "Javascript", "Apache", "Css", "Ajax", the Tags article B have are: "Javascript", "Apache", "Google", the Tags article C have are: "Json", "Apache", "XML".
If all Tags have same weight, then obviously the correlation of three articles is AB>AC. Because AB have two same Tags and AC only have one.
How do we apply this in SQL conjunctive query, only require some few changes:
<pre class=brush:sql> SELECT B.blog_id, B.tag_id, SUM( LOG( 100 / C.count ) ) AS weight FROM marks AS A LEFT JOIN marks AS B ON A.tag_id = B.tag_id LEFT JOIN tags AS C ON B.tag_id = C.id WHERE A.blog_id =1 AND B.blog_id !=1 GROUP BY B.blog_id ORDER BY weight DESC </pre>
Here we use SUM function and GROUP BY query.
Thus, we only used one SQL query to gain the correlation sorted according to the sequence.



