@@ -15,11 +15,12 @@ llm-stylometry/
1515│ └── workflows/ # Test automation workflows
1616├── llm_stylometry/ # Python package with analysis tools
1717│ ├── analysis/ # Statistical analysis utilities
18+ │ ├── classification/ # Text classification module (word count-based)
1819│ ├── core/ # Core experiment and configuration
1920│ ├── data/ # Data loading and tokenization
2021│ ├── models/ # Model utilities
2122│ ├── utils/ # Helper utilities
22- │ ├── visualization/ # Plotting and visualization
23+ │ ├── visualization/ # Plotting and visualization (GPT-2 + classification)
2324│ └── cli_utils.py # CLI helper functions
2425├── code/ # Training and CLI scripts
2526│ ├── generate_figures.py # Main CLI entry point
@@ -30,6 +31,7 @@ llm-stylometry/
3031├── data/ # Datasets and results
3132│ ├── raw/ # Original texts from Project Gutenberg
3233│ ├── cleaned/ # Preprocessed texts by author
34+ │ ├── classifier_results/ # Text classification results (pkl files)
3335│ └── model_results.pkl # Consolidated model training results
3436├── models/ # Trained models (80 baseline + 240 variants = 320 total)
3537│ └── {author}_tokenizer=gpt2_seed={0-9}/ # Baseline models
@@ -211,6 +213,185 @@ fig = generate_all_losses_figure(
211213
212214**Note**: T-test figures (2A, 2B) never apply fairness thresholding since they require all 500 epochs for statistical calculations.
213215
216+ ## Text Classification Analysis
217+
218+ In addition to GPT-2 stylometry, the project includes word count-based text classification using scikit-learn. This provides a complementary approach to authorship attribution through traditional machine learning.
219+
220+ ### Running Classification Experiments
221+
222+ Use the `--classify` flag to run text classification instead of GPT-2 training:
223+
224+ ```bash
225+ # Run baseline classification (all unique words)
226+ ./run_llm_stylometry.sh --classify
227+
228+ # Run variant classifications
229+ ./run_llm_stylometry.sh --classify --content-only # Content words only
230+ ./run_llm_stylometry.sh --classify --function-only # Function words only
231+ ./run_llm_stylometry.sh --classify --part-of-speech # POS tags only
232+ ```
233+
234+ ### Classification Methodology
235+
236+ 1. **Feature Extraction**: `CountVectorizer` extracts word counts from all books
237+    - **No stop-word filtering** (`stop_words=None`), which is critical for a fair comparison across variants
238+    - Baseline: All unique words across the corpus
239+    - Content variant: Only content words (function words masked as `<FUNC>`)
240+    - Function variant: Only function words (content words masked as `<CONTENT>`)
241+    - POS variant: POS tag counts (words replaced with tags)
242+
243+ 2. **Cross-Validation**: Leave-one-book-out per author
244+    - Each split holds out exactly 1 book from each of the 8 authors (8 books total)
245+    - Up to 1,000 randomly sampled combinations
246+    - Ensures all books are tested and results are robust
247+
248+ 3. **Classifier**: Output-code multi-class classifier
249+    - Base estimator: Logistic regression (`max_iter=1000`, `solver='lbfgs'`)
250+    - Author-specific feature weights via back-solving: `input = W_pinv @ (output - bias)`
251+    - Returns different word importance weights for each author
252+
253+ 4. **Metrics**: Classification accuracy with bootstrap-estimated 95% confidence intervals
254+    - Seaborn's automatic bootstrap (`n_boot=1000`)
255+    - Computed separately for each author and overall
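
The leave-one-book-out-per-author scheme in step 2 can be pictured with a small standard-library sketch. This is an illustration under assumptions, not the package's `generate_cv_splits`: the function name `sample_cv_splits`, the toy book IDs, and the index-decoding trick are all hypothetical. Each split picks one held-out book per author, and at most `max_splits` distinct combinations are sampled.

```python
import math
import random


def sample_cv_splits(books_by_author, max_splits=1000, seed=42):
    """Sample splits that hold out exactly one book per author.

    books_by_author: dict mapping author -> list of book IDs (hypothetical shape).
    Returns a list of dicts mapping author -> held-out book ID.
    """
    authors = sorted(books_by_author)
    sizes = [len(books_by_author[a]) for a in authors]
    total = math.prod(sizes)  # number of possible one-book-per-author combinations

    rng = random.Random(seed)
    if total <= max_splits:
        indices = range(total)  # few enough combinations: use all of them
    else:
        # Sample distinct combination indices without enumerating every combination.
        indices = rng.sample(range(total), max_splits)

    splits = []
    for index in indices:
        held_out = {}
        # Decode the index as a mixed-radix number: one digit per author.
        for author, size in zip(authors, sizes):
            index, choice = divmod(index, size)
            held_out[author] = books_by_author[author][choice]
        splits.append(held_out)
    return splits


# Toy corpus with made-up book IDs (3 * 2 * 2 = 12 possible splits).
toy_books = {
    "austen": ["a1", "a2", "a3"],
    "baum": ["b1", "b2"],
    "dickens": ["d1", "d2"],
}
splits = sample_cv_splits(toy_books, max_splits=5, seed=0)
print(len(splits), splits[0])
```

Sampling indices rather than materializing the full product keeps memory use small when each author has many books, since the number of combinations grows multiplicatively with the number of authors.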
256+
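The bootstrap confidence interval in step 4 can be approximated by hand as a percentile bootstrap over per-prediction accuracies (1.0 for correct, 0.0 for incorrect). The sketch below is a simplified stand-in, assuming a hypothetical helper `bootstrap_ci`; seaborn's own resampling and percentile interpolation may differ in detail, so the numbers need not match its error bars exactly.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    tail = (100 - ci) / 2 / 100  # e.g. 0.025 for a 95% interval
    lower = means[int(tail * (n_boot - 1))]
    upper = means[int((1 - tail) * (n_boot - 1))]
    return lower, upper


# Made-up outcomes: 80 correct and 20 incorrect predictions.
accuracies = [1.0] * 80 + [0.0] * 20
lower, upper = bootstrap_ci(accuracies)
print(f"mean={statistics.fmean(accuracies):.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```
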
257+ ### Classification Results
258+
259+ **Output files:**
260+ - **Results**: `data/classifier_results/{variant}.pkl` (`baseline.pkl` for the baseline)
261+ - **Accuracy charts**: `paper/figs/source/classification_accuracy_{variant}.pdf`
262+ - **Word clouds**: `paper/figs/source/wordcloud_{author}_{variant}.pdf`
263+   - One overall word cloud showing general feature importance
264+   - One per author showing author-specific discriminative features
265+   - Vector PDF output using the `wordcloud` library
266+
267+ **Results structure:**
268+ ```python
269+ import pickle
270+
271+ # Load classification results
272+ with open('data/classifier_results/baseline.pkl', 'rb') as f:
273+     data = pickle.load(f)
274+
275+ # Contents:
276+ # data['results']: pd.DataFrame with predictions and accuracies (long format)
277+ # data['vectorizer']: Fitted CountVectorizer
278+ # data['feature_names']: List of vocabulary words
279+ # data['variant']: Analysis variant (None for baseline)
280+ # data['n_splits']: Number of CV splits
281+ # data['seed']: Random seed used
282+ ```
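
Because the results are stored in long format (one row per held-out book), per-author accuracy is a mean over rows that share the same true author. The sketch below illustrates that aggregation with plain dictionaries so it runs without the pickle file. The helper `accuracy_by_author` and the toy rows are hypothetical, and the column names `author` and `accuracy` are assumed to match the results DataFrame.

```python
from collections import defaultdict


def accuracy_by_author(rows):
    """Average the 0/1 accuracy values of long-format rows, grouped by true author."""
    totals = defaultdict(lambda: [0.0, 0])  # author -> [sum of accuracy, row count]
    for row in rows:
        entry = totals[row["author"]]
        entry[0] += row["accuracy"]
        entry[1] += 1
    return {author: correct / count for author, (correct, count) in totals.items()}


# Toy rows mimicking the long-format layout (values are made up).
rows = [
    {"split_id": 0, "author": "austen", "accuracy": 1.0},
    {"split_id": 0, "author": "baum", "accuracy": 1.0},
    {"split_id": 1, "author": "austen", "accuracy": 0.0},
    {"split_id": 1, "author": "baum", "accuracy": 1.0},
]
per_author = accuracy_by_author(rows)
print(per_author)
```

With the real DataFrame, the same summary should be available through pandas as `data['results'].groupby('author')['accuracy'].mean()`, assuming those column names.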
283+
284+ ### Python API
285+
286+ ```python
287+ from llm_stylometry.classification import run_classification_experiment
288+ from llm_stylometry.visualization import (
289+     generate_classification_accuracy_figure,
290+     generate_word_cloud_figure
291+ )
292+ from llm_stylometry.core.constants import AUTHORS
293+
294+ # Run classification experiment
295+ result_path = run_classification_experiment(
296+     variant='content',  # 'content', 'function', 'pos', or None for baseline
297+     max_splits=1000,    # Maximum CV splits
298+     seed=42             # Random seed for reproducibility
299+ )
300+
301+ # Generate accuracy bar chart
302+ generate_classification_accuracy_figure(
303+     data_path='data/classifier_results/content.pkl',
304+     output_path='paper/figs/source/classification_accuracy_content.pdf',
305+     variant='content'
306+ )
307+
308+ # Generate overall word cloud
309+ generate_word_cloud_figure(
310+     data_path='data/classifier_results/content.pkl',
311+     author=None,  # None for overall, or specific author name
312+     output_path='paper/figs/source/wordcloud_overall_content.pdf',
313+     variant='content',
314+     max_words=100
315+ )
316+
317+ # Generate per-author word clouds
318+ for author in AUTHORS:
319+     generate_word_cloud_figure(
320+         data_path='data/classifier_results/content.pkl',
321+         author=author,
322+         output_path=f'paper/figs/source/wordcloud_{author}_content.pdf',
323+         variant='content'
324+     )
325+ ```
326+
327+ ### Advanced Usage
328+
329+ **Custom data loading:**
330+ ```python
331+ from llm_stylometry.classification import (
332+     load_books_by_author,
333+     create_count_vectorizer,
334+     vectorize_books
335+ )
336+
337+ # Load books
338+ books = load_books_by_author(data_dir='data/cleaned', variant=None)
339+ # Returns: Dict[author] -> [(book_id, text), ...]
340+
341+ # Create vectorizer (stop_words=None is critical!)
342+ vectorizer = create_count_vectorizer(books)
343+ print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
344+
345+ # Vectorize
346+ vectors = vectorize_books(books, vectorizer)
347+ # Returns: [(author, book_id, vector), ...]
348+ ```
349+
350+ **Custom cross-validation:**
351+ ```python
352+ from llm_stylometry.classification import (
353+     generate_cv_splits,
354+     run_cross_validation,
355+     OutputCodeClassifier
356+ )
357+
358+ # Generate custom CV splits
359+ splits = generate_cv_splits(vectors, max_splits=100, seed=42)
360+
361+ # Run CV
362+ results_df = run_cross_validation(vectors, splits, random_state=42)
363+
364+ # Results DataFrame (long format for seaborn):
365+ # - split_id: int
366+ # - author: str (true author)
367+ # - accuracy: float (1.0 if correct, 0.0 if incorrect)
368+ # - held_out_book_id: str
369+ # - predicted_author: str
370+ # - classifier: OutputCodeClassifier object
371+
372+ # Overall accuracy
373+ print(f"Accuracy: {results_df['accuracy'].mean():.4f}")
374+ ```
375+
376+ **Extract author-specific feature weights:**
377+ ```python
378+ # Get classifier from results
379+ clf = results_df.iloc[0]['classifier']
380+ feature_names = vectorizer.get_feature_names_out().tolist()
381+
382+ # Get author-specific weights (via back-solving)
383+ weights = clf.get_feature_weights(feature_names)
384+
385+ # weights['baum']: {word: weight, ...}
386+ # weights['austen']: {word: weight, ...}
387+ # weights['overall']: {word: avg_weight, ...}
388+
389+ # Top words for Baum
390+ baum_weights = weights['baum']
391+ top_baum = sorted(baum_weights.items(), key=lambda x: abs(x[1]), reverse=True)[:10]
392+ print("Top Baum features:", top_baum)
393+ ```
394+
214395## Training Models from Scratch
215396
216397**Note**: Training requires a CUDA-enabled GPU and takes significant time (80 models per condition, 320 total for all conditions).