This application uses advanced natural language processing techniques to quantify and explore semantic relationships between different research fields based on their detailed descriptions. It enables researchers, administrators, and knowledge mappers to accurately analyze how different academic disciplines relate to one another through multiple similarity dimensions.
- Core Features
- Technical Architecture
- Advanced Similarity Algorithm
- Installation Guide
- Usage Guide
- API Reference
- Configuration Guide
- Technical Requirements
- Data Model
- Extending the Application
- Multidimensional Field Comparisons: Calculate similarity between research fields using a composite approach that incorporates semantic, lexical, and domain-specific measures
- Hierarchical Research Taxonomy: Organize fields in a structured hierarchy of groups and subgroups with relationship-aware similarity scoring
- Domain-Specific Boosting: Dynamically adjust similarity scores based on shared domain terminology and conceptual overlap
- Faceted Field Descriptions: Capture different aspects of research fields (definitions, methodologies, applications) for nuanced comparison
- Interactive Exploration: Browse, compare, and visualize field similarities through an intuitive web interface with dynamic visualizations
- Extensible Architecture: Modular design allowing new similarity measures or domain taxonomies to be incorporated with minimal code changes
The application follows a modular, service-oriented architecture designed for maintainability and extensibility:
python/
├── app.py # Application factory and entry point
├── config.py # Centralized configuration parameters
├── data/ # Data persistence
│ ├── nested_descriptions_research_groups.json # Field hierarchies
│ └── final_outputs_enhanced_multi/ # Output directory
│ └── field_similarities.json # Calculated similarities
├── services/ # Core business logic
│ ├── __init__.py
│ ├── data_service.py # Data access layer
│ └── similarity/ # Similarity calculation modules
│ ├── __init__.py
│ ├── embedding.py # Neural embedding (Sentence Transformers)
│ ├── domain.py # Domain-specific terminology detection
│ └── final_calculation.py # Composite similarity orchestration
├── routes/ # API endpoints
│ ├── __init__.py
│ └── api.py # RESTful interfaces
├── utils/ # Shared utilities
│ ├── __init__.py
│ └── text_processing.py # Text normalization and scaling functions
├── static/ # Static assets
│ ├── css/ # Stylesheets
│ │ └── styles.css
│ └── js/ # JavaScript files
│ └── app.js # Frontend application logic
└── templates/ # UI templates
└── final_app.html # Web interface
- Data Service: Handles loading and saving field and similarity data to JSON files
- Embedding Service: Computes semantic similarity using state-of-the-art language models
- Domain Service: Detects domain-specific terminology and calculates domain-based similarity
- Field Similarity Service: Orchestrates the overall similarity calculation process
- API Routes: Expose functionality through a RESTful interface
- Frontend Application: Provides an interactive user interface for exploring similarities
The system employs a sophisticated multi-component similarity calculation approach that combines multiple NLP techniques:
Uses the all-mpnet-base-v2 sentence transformer model to compute embedding vectors for field descriptions and measures their cosine similarity. This captures deep semantic relationships even when different terminology is used to describe similar concepts.
Identifies research domains present in each field by matching against specialized vocabulary across 17 technical domains:
- AI/ML
- Security
- Data Analytics
- HCI
- Graphics/Media
- Software Development
- Hardware Systems
- Healthcare
- Quantum Computing
- Blockchain
- Bioinformatics
- Environmental Science
- Cloud Computing
- NLP
- Fintech
- AR/VR
- Robotics
The system then computes similarity based on domain overlap, using a pre-defined domain relationship matrix to capture cross-domain relationships. This allows the system to understand that certain domains (like AI/ML and Data Science) are inherently more related than others.
Individual facets of field descriptions are compared separately with specific weights:
- Definition (60%): Core description of the field
- Methodologies (20%): Research approaches and techniques
- Applications (20%): Real-world use cases and implementations
This multi-faceted approach ensures that fields with similar applications but different methods (or vice versa) receive appropriate similarity scores.
The raw similarity score is calibrated based on hierarchical relationships:
- Fields in the same subgroup receive a baseline similarity of 0.75 plus weighted calculated similarity
- Fields in the same general group receive a baseline similarity of 0.7 plus weighted calculated similarity
- Cross-group similarities are capped at 0.7 to preserve taxonomic structure
Fields with significant domain overlap receive similarity boosts of up to 15%, especially when specialized terminology from the same domain is heavily present in both fields. This boosting only occurs when domain similarity exceeds a configurable threshold (default: 0.7).
- Python 3.8 or higher
- pip (Python package manager)
- Git (optional, for cloning the repository)
Clone the repository:
git clone https://github.com/Mohammed87FS/Tags_Similarity_Calcualtion.git
cd Tags_Similarity_CalcualtionAlternatively, download and extract the ZIP file from GitHub.
Create and activate a virtual environment:
# Create virtual environment
python -m venv .conda
# Activate on Windows
.conda\Scripts\activate
# Activate on macOS/Linux
source .conda/bin/activateInstall the required packages:
pip install -r requirements.txtDownload the required spaCy language model:
python -m spacy download en_core_web_smEdit python/config.py to adjust:
- Similarity component weights
- Group baseline similarities
- Domain boosting factors
- Domain terminology sets
- File paths
Navigate to the python directory and start the Flask server:
cd python
python app.pyThe application will be accessible at http://localhost:5000 in your web browser.
From the project root directory:
cd python
python app.pyThen open your web browser and navigate to http://localhost:5000.
The web interface is divided into three main sections:
- Add new research fields to the database
- Located at the top of the page
- Accessible via the "Add Field" link in the navigation bar
- Remove fields from the database
- Located in the middle of the page
- Accessible via the "Delete Field" link in the navigation bar
- Compare and analyze similarities between fields
- Located at the bottom of the page
- Accessible via the "View Similarities" link in the navigation bar
-
Navigate to the "Add Field" section
-
Fill in the form:
- Field Name: Enter a unique name (required)
- Group: Select an existing group or create a new one (required)
- Subgroup: Select an existing subgroup or create a new one (required)
- Definition: Provide a comprehensive definition of the field (highly recommended)
- Methodologies: Describe the research approaches and methods used in this field
- Applications: Explain how this field is applied in practice
-
Click "Add Field & Calculate Similarities"
-
Wait for the calculation to complete (may take a few moments)
-
Review the success message and download the updated data if needed
- Navigate to the "Delete Field" section
- Select the field you want to delete from the dropdown menu
- Click "Delete Field"
- Confirm the deletion in the modal that appears
- Wait for the system to delete the field and recalculate all similarities
- Review the results and download the updated data if needed
- Navigate to the "View Similarities" section
- Select a field from the dropdown menu
- Click "View Similarities"
- Review the sorted list of all fields with their similarity scores
- Toggle between sorting by similarity score or alphabetically
- Click the eye icon to view detailed comparison between any two fields
- Explore the similarity interpretation and field details in the comparison modal
- Navigate to the "Recalculate Similarities" section
- Click "Recalculate All Similarities"
- Confirm the operation in the modal that appears
- Wait for the calculation to complete (may take several minutes)
- Review the results and download the updated data if needed
The application provides a RESTful API for programmatic access and integration with other tools:
- Endpoint:
POST /api/add_field - Description: Add a new research field and calculate its similarities with existing fields
- Form Parameters:
name: Field name (required)group: Group name (required)subgroup: Subgroup name (required)definition: Field definition (optional)methodologies: Field methodologies (optional)applications: Field applications (optional)
- Response: JSON with success status and operation result
- Endpoint:
POST /api/delete_field - Description: Delete a field and update similarity records
- JSON Body:
{"fieldName": "Field Name"} - Response: JSON with success status and operation result
- Endpoint:
POST /api/delete_field_all - Description: Delete a field and recalculate all similarities from scratch
- JSON Body:
{"fieldName": "Field Name"} - Response: JSON with success status and operation result
- Endpoint:
GET /api/get_subgroups?group={group_name} - Description: Get all subgroups for a specific group
- Parameters:
group- The name of the group - Response: JSON with list of subgroup names
- Endpoint:
GET /api/get_similarity?field1={field1}&field2={field2} - Description: Get similarity between two specific fields
- Parameters:
field1: Name of the first fieldfield2: Name of the second field
- Response: JSON with similarity score and field data
- Endpoint:
GET /api/get_all_similarities_for_field?field={field_name} - Description: Get all similarities for a specific field
- Parameters:
field- The name of the field - Response: JSON with sorted list of all fields and their similarity scores
- Endpoint:
POST /api/recalculate_similarities - Description: Recalculate all pairwise similarities between fields
- Response: JSON with success status and operation result
- Endpoint:
GET /api/download_similarities - Description: Download the complete similarity data in JSON format
- Response: JSON file with tags array and similarities array
- Endpoint:
GET /api/test - Description: Check system status and data availability
- Response: JSON with system status and debug information
The config.py file contains all configurable parameters for the application:
DATA_DIR = 'data'
NESTED_DESCRIPTIONS_FILE = os.path.join(DATA_DIR, 'nested_descriptions_research_groups.json')
NESTED_DATA_FILE = os.path.join(DATA_DIR, 'nested_descriptions_research_groups.json')
SIMILARITY_FILE = os.path.join(DATA_DIR, 'final_outputs_enhanced_multi/field_similarities.json')SAME_GROUP_BASELINE = 0.7 # Baseline similarity for fields in same general group
SAME_SUBGROUP_BASELINE = 0.75 # Baseline similarity for fields in same subgroup
SIMILARITY_WEIGHT_SUB = 0.2 # Weight of calculated similarity to add to baseline for subgroup
SIMILARITY_WEIGHT_GENERAL = 0.15 # Weight of calculated similarity to add to baseline for general group
MAX_CROSS_GROUP_SIMILARITY = 0.7 # Maximum similarity for fields not in same group/subgroupDESCRIPTION_WEIGHTS = {
"definition": 0.6, # Weight for definition comparison
"methodologies": 0.2, # Weight for methodologies comparison
"applications": 0.2, # Weight for applications comparison
}COMPONENT_WEIGHTS = {
"embedding": 0.4, # Weight for neural embedding similarity
"domain": 0.4, # Weight for domain-specific similarity
"facet": 0.2 # Weight for faceted comparison
}ENABLE_DOMAIN_BOOSTING = True # Whether to apply domain-based score boosting
MAX_BOOST_FACTOR = 0.15 # Maximum boost to apply (0.15 = up to 15% boost)
DOMAIN_BOOST_THRESHOLD = 0.7 # Minimum domain similarity to trigger boost- Python 3.8+: Core programming language
- Flask 2.2+: Web framework for API and interface
- Sentence Transformers 2.2+: Neural embeddings for semantic similarity
- scikit-learn 1.2+: Machine learning utilities and vector operations
- spaCy 3.5+: Natural language processing toolkit
- NumPy 1.24+: Numerical computing library
- Bootstrap 5.2+: Frontend framework for responsive design
- jQuery 3.6+: JavaScript library for DOM manipulation
The application organizes research fields in a three-level hierarchy:
Categories (high-level research domains)
└── Subgroups (specialized areas within domains)
└── Fields (specific research disciplines)
Example:
Computer Science
└── Artificial Intelligence
├── Machine Learning
├── Deep Learning
└── Reinforcement Learning
Each field is represented as a JSON object with the following structure:
{
"name": "Field Name",
"description": {
"definition": "Comprehensive definition of the field...",
"methodologies": "Research approaches and methods used in this field...",
"applications": "Real-world applications and implementations..."
}
}Similarities are stored as a JSON object with two main components:
tags: Array of all unique field namessimilarities: Array of field pairs with calculated scores
{
"tags": ["Machine Learning", "Deep Learning", "Robotics", "..."],
"similarities": [
{
"field1": "Machine Learning",
"field2": "Deep Learning",
"similarity_score": 0.85
},
{
"field1": "Machine Learning",
"field2": "Robotics",
"similarity_score": 0.42
},
// Additional pairs...
]
}