This document outlines the detailed requirements and specifications for the PredictStream application. This is a demonstration project meant to showcase a polished, high-quality Streamlit dashboard for data exploration and predictive analytics.
- File Upload: Allow users to upload CSV and Excel files (.xlsx, .xls) with intuitive UI feedback
- Sample Data: Provide 2-3 sample datasets for users who want to explore without uploading files
- Data Preview: Show a preview of loaded data with pagination for larger datasets
- Data Validation: Check for common issues (missing values, inappropriate data types) and provide warnings
- Data Transformation: Enable basic transformations like:
- Handling missing values (drop, fill with mean/median/mode)
- Type conversion (numeric, categorical, datetime)
- Feature encoding (one-hot, label encoding)
- Scaling/normalization options (min-max, standard)
- Summary Statistics: Provide automated statistical summaries for numeric and categorical fields
- Data Quality Assessment: Show missing value percentages, outlier detection, and data type breakdown
- Correlation Analysis: Generate and visualize correlation matrices with options to highlight strong correlations
- Distribution Analysis: Create histograms and density plots for numeric variables
- Category Analysis: Generate bar charts and pie charts for categorical variables
- Time Series Analysis: If time-based data is detected, provide time series plots and decomposition
- Interactive Charts: Create interactive visualizations with Plotly
- Chart Types: Support various visualization types:
- Scatter plots (with optional trend lines)
- Bar charts and histograms
- Box plots and violin plots
- Heatmaps
- Pair plots for feature relationships
- Customization: Allow basic customization of chart appearance (titles, labels, colors)
- Export: Enable saving visualizations as PNG/JPG files
- Problem Type Detection: Suggest classification or regression based on target variable
- Model Selection: Offer a curated selection of models:
- Classification: Logistic Regression, Random Forest, XGBoost
- Regression: Linear Regression, Decision Trees, Random Forest
- Feature Selection: Allow users to select features for model training
- Training-Test Split: Configure train/test split ratio with option for random seed
- Cross-Validation: Enable k-fold cross-validation for more robust evaluation
- Hyperparameter Selection: Provide simplified hyperparameter options for each model
- Performance Metrics:
- Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC
- Regression: MAE, MSE, RMSE, R-squared
- Visualizations:
- Classification: Confusion Matrix, ROC Curve, Precision-Recall Curve
- Regression: Actual vs Predicted Plot, Residual Plot
- Feature Importance: Show and visualize importance of features in the model
- Model Explanation: Provide basic model interpretability using techniques like SHAP values
- New Data Prediction: Allow prediction on new data points using UI inputs
- Batch Prediction: Enable predictions on new datasets (uploaded separately)
- Export Results: Provide option to download prediction results as CSV
- Multi-Page Structure: Organize functionality across logical pages with clear navigation
- Responsive Design: Ensure good user experience on different screen sizes
- Sidebar Navigation: Use sidebar for primary navigation and global controls
- Progress Indicators: Show loading states during computations
- Guided Workflow: Present a clear, step-by-step process for analysis
- Interactive Elements: Use appropriate Streamlit widgets (sliders, dropdowns, etc.)
- Tooltips & Help: Provide guidance and explanations for technical concepts
- Error Handling: Show clear error messages with suggestions for resolution
- State Persistence: Maintain application state between page navigation
- Consistent Styling: Maintain consistent colors, fonts, and spacing
- Theming: Support light/dark mode toggle
- Branding: Include NeurArk branding (logo, colors) in a subtle manner
- Visual Hierarchy: Prioritize important information through layout and typography
- Python Version: Requires Python 3.11 or later
- Core Libraries: Compatible with latest versions of Streamlit, Pandas, NumPy, and Scikit-learn
- Development Tools: Uses pytest for testing
- Efficient Data Handling: Use appropriate techniques for larger datasets
- Caching: Implement caching for expensive operations (data processing, model training)
- Optimization: Balance functionality with performance considerations
- Modular Structure: Organize code into logical modules and functions
- Documentation: Include comprehensive docstrings and comments
- Error Handling: Implement robust exception handling
- Type Hints: Use Python type hints for improved code quality
- Dependencies: Minimize and properly document all dependencies
- Environment: Ensure compatibility with standard deployment platforms
- Intuitiveness: The application should be usable without extensive training
- Accessibility: Follow basic accessibility best practices
- Responsiveness: The UI should respond quickly to user interactions
- Polish: The application should appear professional and well-crafted
- Robustness: Handle edge cases and unexpected inputs gracefully
- Consistency: Maintain consistent terminology and interaction patterns
The application should excel at demonstrating these common scenarios:
- Quick EDA: Upload a dataset and quickly generate key insights and visualizations
- Model Comparison: Train multiple models on the same dataset and compare their performance
- Feature Analysis: Identify the most important features driving predictions
- Interactive Exploration: Allow stakeholders to interactively explore data relationships
As this is a demonstration project, the following limitations are acceptable:
- Performance with very large datasets (>100MB) may be limited
- No user authentication or multi-user support
- No persistent storage of results between sessions
- Limited to tabular data (no image, text, or specialized data formats)
- User Experience: Focus on creating an intuitive, polished interface
- Visual Appeal: Ensure visualizations and UI components look professional
- Core Functionality: Implement the core features completely before adding extras
- Performance: Ensure the application runs smoothly with reasonable-sized datasets