date: 2024-12-24
title: ML-Distance Measure
status: DONE
author:
  - AllenYGY
tags:
  - NOTE
publish: True

ML-Distance Measure
Distance measures are key to clustering. “Similarity” and “dissimilarity” are also commonly used terms.
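There are numerous distance functions for different types of data (for example numeric, binary, nominal, and text).
We denote the distance between two data points (vectors) $\mathbf{x}_i$ and $\mathbf{x}_j$ with $dist(\mathbf{x}_i, \mathbf{x}_j)$.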
The most commonly used functions for numeric data are Euclidean distance and Manhattan (city block) distance.
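
As a minimal sketch of the two functions named above, the following assumes data points are plain Python sequences of numbers of equal length. The function names `euclidean` and `manhattan` are illustrative, not taken from the note.

```python
import math


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan (city block) distance: sum of absolute differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical points chosen for easy arithmetic.
p = [1.0, 2.0]
q = [4.0, 6.0]
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Euclidean distance is the straight-line distance, while Manhattan distance sums the per-attribute differences, so it is never smaller than the Euclidean distance for the same pair of points.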
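Definition (Simple Matching Coefficient, SMC): Measures the proportion of matching elements (both 1s and 0s) between two binary vectors.
Formula:
$$
SMC = \frac{f_{11} + f_{00}}{f_{11} + f_{00} + f_{10} + f_{01}}
$$
- $f_{11}$: number of positions where both vectors are 1 (vector 1 is 1, vector 2 is 1).
- $f_{00}$: number of positions where both vectors are 0 (vector 1 is 0, vector 2 is 0).
- $f_{10}$: number of positions where vector 1 is 1, vector 2 is 0.
- $f_{01}$: number of positions where vector 1 is 0, vector 2 is 1.

Range: $[0, 1]$; 1 indicates perfect matching.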
Use Case: Suitable when both 1s and 0s carry equal importance.
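Definition (Jaccard Similarity): Measures the similarity between two binary vectors by considering only the matches for 1s. Positions where both vectors are 0 are ignored.
Formula:
$$
J = \frac{f_{11}}{f_{11} + f_{10} + f_{01}}
$$
- $f_{11}$: number of positions where both vectors are 1 (vector 1 is 1, vector 2 is 1).
- $f_{10}$: number of positions where vector 1 is 1, vector 2 is 0.
- $f_{01}$: number of positions where vector 1 is 0, vector 2 is 1.

Range: $[0, 1]$; 1 indicates perfect similarity.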
Use Case: Ideal for sparse data or cases where 1s are more significant than 0s.
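Definition (Hamming Distance): Measures the total number of differing bits between two binary vectors. It counts mismatched positions.
Formula, for binary vectors $x$ and $y$ of length $n$:
$$
H(x, y) = \sum_{k=1}^{n} |x_k - y_k|
$$
Range: $[0, n]$; 0 indicates no differences (identical vectors).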
Use Case: Suitable for measuring the difference between binary strings or vectors.
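| Measure | Formula | Focus | Range | Best Use Case |
|---|---|---|---|---|
| SMC | $\frac{f_{11} + f_{00}}{f_{11} + f_{00} + f_{10} + f_{01}}$ | Matches for 1s and 0s | $[0, 1]$ | Equal importance for 1s and 0s |
| Jaccard | $\frac{f_{11}}{f_{11} + f_{10} + f_{01}}$ | Matches for 1s only | $[0, 1]$ | Sparse data, where 1s matter more |
| Hamming | $f_{10} + f_{01}$ | Mismatched positions | $[0, n]$ | Binary strings or sequences |

Here $f_{ab}$ is the number of positions where vector 1 is $a$ and vector 2 is $b$, and $n$ is the vector length.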
Given two binary vectors:
Simple Matching Coefficient (SMC):
Jaccard Similarity:
Hamming Distance:
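
The three binary measures can be sketched in a few lines of Python. The vectors `x` and `y` below are hypothetical values chosen for illustration, not the ones from the original example, and the helper names are assumptions. Here `f11`, `f00`, `f10`, and `f01` count the positions where (vector 1, vector 2) take the values (1, 1), (0, 0), (1, 0), and (0, 1).

```python
def binary_counts(x, y):
    """Return (f11, f00, f10, f01) for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    f11 = f00 = f10 = f01 = 0
    for a, b in zip(x, y):
        if a not in (0, 1) or b not in (0, 1):
            raise ValueError("vectors must contain only 0s and 1s")
        if a == 1 and b == 1:
            f11 += 1
        elif a == 0 and b == 0:
            f00 += 1
        elif a == 1:
            f10 += 1
        else:
            f01 += 1
    return f11, f00, f10, f01


def smc(x, y):
    """Simple Matching Coefficient: all matches over all positions."""
    f11, f00, f10, f01 = binary_counts(x, y)
    return (f11 + f00) / (f11 + f00 + f10 + f01)


def jaccard(x, y):
    """Jaccard similarity: 1-1 matches over positions that are not 0-0."""
    f11, _f00, f10, f01 = binary_counts(x, y)
    denom = f11 + f10 + f01
    if denom == 0:
        # Both vectors are all zeros; the ratio is undefined.
        raise ValueError("Jaccard similarity is undefined for two all-zero vectors")
    return f11 / denom


def hamming(x, y):
    """Hamming distance: number of mismatched positions."""
    _f11, _f00, f10, f01 = binary_counts(x, y)
    return f10 + f01


# Hypothetical example vectors (not from the original note).
x = [1, 0, 0, 1, 1, 0]
y = [1, 1, 0, 0, 1, 0]
print(binary_counts(x, y))  # (2, 2, 1, 1)
print(jaccard(x, y))        # 0.5
print(hamming(x, y))        # 2
```

For these vectors SMC is 4/6, higher than the Jaccard value of 0.5, because the two shared 0 positions count as matches for SMC but are ignored by Jaccard.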
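Nominal attributes: attributes with more than two states or values.
The commonly used distance measure is also based on the simple matching method.
Given two data points $\mathbf{x}_i$ and $\mathbf{x}_j$, let the number of attributes be $r$ and the number of attribute values that match in $\mathbf{x}_i$ and $\mathbf{x}_j$ be $q$:
$$
dist(\mathbf{x}_i, \mathbf{x}_j) = \frac{r - q}{r}
$$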
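
A minimal sketch of the simple matching distance for nominal data, assuming the usual form $(r - q)/r$, where $r$ is the number of attributes and $q$ is the number of attributes whose values match. The function name and the example records are hypothetical.

```python
def nominal_distance(x, y):
    """Simple matching distance: fraction of attributes whose values differ."""
    if len(x) != len(y):
        raise ValueError("data points must have the same number of attributes")
    r = len(x)
    if r == 0:
        raise ValueError("data points must have at least one attribute")
    q = sum(1 for a, b in zip(x, y) if a == b)  # number of matching values
    return (r - q) / r


# Hypothetical records with three nominal attributes: colour, size, shape.
a = ("red", "small", "round")
b = ("red", "large", "round")
print(nominal_distance(a, b))  # 0.3333333333333333
```

Only equality is tested, so the measure works for any category labels and imposes no ordering on them.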
This section explains how text documents are represented and how distances or similarities between them are measured.
| Term | Document 1 | Document 2 | 
|---|---|---|
| aid | 0 | 1 | 
| back | 1 | 0 | 
| dog | 1 | 0 | 
| men | 0 | 1 | 
| ... | ... | ... | 
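
The note does not state here which measure is applied to these term vectors. As an assumption, the sketch below uses cosine similarity, a common choice for comparing documents represented as term vectors. It uses only the four rows visible in the table; the elided rows are not filled in.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Term order follows the visible table rows: aid, back, dog, men.
terms = ["aid", "back", "dog", "men"]
doc1 = [0, 1, 1, 0]
doc2 = [1, 0, 0, 1]
print(cosine_similarity(doc1, doc2))  # 0.0
```

The result is 0 because, within the visible rows, the two documents share no terms. Cosine similarity depends on the angle between the vectors rather than their length, which makes it less sensitive to document size than Euclidean distance.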
In the Euclidean space, standardization of attributes is recommended so that all attributes can have equal impact on the computation of distances.
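Standardize attributes: force the attributes to have a common value range.
Interval-scaled attributes: their values are real numbers following a linear scale.
There are two main approaches to standardizing interval-scaled attributes: range and z-score. Let $x_{if}$ be the value of attribute $f$ in data point $\mathbf{x}_i$.
Range:
$$
range(x_{if}) = \frac{x_{if} - \min(f)}{\max(f) - \min(f)}
$$
Z-score: transforms the attribute values so that they have a mean of zero and a mean absolute deviation of 1. The mean absolute deviation of attribute $f$, denoted $s_f$, is computed from the $n$ values of $f$ and their mean $m_f$:
$$
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right), \qquad s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
$$
Z-score:
$$
z(x_{if}) = \frac{x_{if} - m_f}{s_f}
$$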
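
Both standardization approaches can be sketched for a single attribute as follows. The z-score here divides by the mean absolute deviation, as described above, rather than the standard deviation. The function names and the sample values are hypothetical.

```python
def range_standardize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("attribute is constant; range standardization is undefined")
    return [(v - lo) / (hi - lo) for v in values]


def z_score_mad(values):
    """Z-score using the mean absolute deviation as the scale."""
    n = len(values)
    mean = sum(values) / n
    mad = sum(abs(v - mean) for v in values) / n  # mean absolute deviation
    if mad == 0:
        raise ValueError("attribute is constant; z-score is undefined")
    return [(v - mean) / mad for v in values]


# Hypothetical values of one interval-scaled attribute.
ages = [20.0, 30.0, 40.0, 50.0]
print(range_standardize(ages))
print(z_score_mad(ages))  # [-1.5, -0.5, 0.5, 1.5]
```

The standardized values have a mean of 0 and a mean absolute deviation of 1. Each attribute is standardized separately before distances are computed.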