
METHODS & CODE BLOCKS
Python


Initial Processing
1. Import the contents of transcripts downloaded from the Chinese court documents archive: 120 court case transcripts.
(40 each for Tibet with a female defendant, Tibet with a male defendant, Shanghai with a female defendant, and Shanghai with a male defendant.)
2. Filter the content of the documents, removing all punctuation and tags.
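
A minimal sketch of the two steps above, using only the standard library. The folder layout (one UTF-8 .txt file per transcript), the function names, and the reading of "tags" as HTML-style markup are assumptions made for illustration, not details of the original pipeline.

```python
import re
import unicodedata
from pathlib import Path

# Assumption: "tags" means HTML-style markup such as <p> or <div class="...">.
TAG_PATTERN = re.compile(r"<[^>]+>")


def load_transcripts(folder):
    """Read every .txt file in a (hypothetical) folder into a {filename: text} dict."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.txt"))
    }


def strip_tags_and_punctuation(text):
    """Remove markup tags, then drop every character in a Unicode punctuation category."""
    without_tags = TAG_PATTERN.sub("", text)
    # Categories starting with "P" cover both ASCII and full-width Chinese punctuation.
    return "".join(
        ch for ch in without_tags
        if not unicodedata.category(ch).startswith("P")
    )
```

Checking the Unicode category rather than a fixed ASCII list means full-width marks such as "，" and "。" are removed along with their ASCII counterparts.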
Tokenization
1. Import a word segmentation library such as jieba
(jieba uses a dictionary to recognize and separate continuous Chinese characters into words).
2. Use other dictionaries (stop-word lists) to identify and remove common stop words
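
The tokenization steps could look roughly like the sketch below. `jieba.lcut` is jieba's list-returning segmentation call. The stop-word file format (one word per line) and the helper names are assumptions.

```python
def load_stop_words(path):
    """Load a stop-word list, assuming one word per line in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def tokenize(text):
    """Split continuous Chinese text into a list of words with jieba."""
    # Imported here so the stop-word helpers can be used without jieba installed.
    import jieba
    return jieba.lcut(text)


def remove_stop_words(tokens, stop_words):
    """Drop whitespace-only tokens and any token found in the stop-word set."""
    return [tok for tok in tokens if tok.strip() and tok not in stop_words]
```

A typical call would be `remove_stop_words(tokenize(cleaned_text), load_stop_words("stopwords.txt"))`, where both the variable and the file name are placeholders.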





Visualisation
1. Recombination
The tokenized documents are recombined into a single string for further analysis.
2. The recombined data is rendered as visual output in the form of word clouds.
The specifics of the visual output are then adjusted.
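
A sketch of the recombination and word cloud steps, assuming the `wordcloud` package is used. The image size, background colour, word limit, and output file name are illustrative settings, not the ones used in the study. A font that contains Chinese glyphs has to be supplied through `font_path`, otherwise the characters will not render properly.

```python
def recombine(tokens):
    """Join a tokenized document back into one space-separated string."""
    return " ".join(tokens)


def make_word_cloud(text, font_path, out_path="wordcloud.png"):
    """Render a word cloud image from space-separated text (illustrative settings)."""
    # Imported here so recombine() can be used without the wordcloud package.
    from wordcloud import WordCloud

    cloud = WordCloud(
        font_path=font_path,       # path to a font with Chinese glyphs (assumption)
        width=800,
        height=600,
        background_color="white",
        max_words=100,
    )
    cloud.generate(text)
    cloud.to_file(out_path)
```

Separating tokens with spaces lets the word cloud generator treat each jieba token as a separate word.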
TF-IDF Analysis
1. Translation
The keywords of the processed documents are translated into English to facilitate analysis.
2. TF-IDF
TF-IDF analysis is performed on the previously defined keywords, weighting how frequently each one appears, to examine the relationship between the causes of a divorce trial and the geographical location/gender of the defendant in the case brought to court.
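
To make the weighting concrete, here is a small pure-Python sketch of one common TF-IDF variant: term frequency is the keyword count divided by the document length, and inverse document frequency is log(N / df). The original analysis may have used a library implementation with different smoothing, and the sample tokens in the usage note are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents, keywords):
    """Score each keyword in each document.

    documents: list of token lists; returns one {keyword: score} dict per document.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain the keyword at least once.
    doc_freq = {kw: sum(1 for doc in documents if kw in doc) for kw in keywords}
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        row = {}
        for kw in keywords:
            tf = counts[kw] / total
            idf = math.log(n_docs / doc_freq[kw]) if doc_freq[kw] else 0.0
            row[kw] = tf * idf
        scores.append(row)
    return scores
```

For example, `tf_idf([["violence", "affection"], ["affection"]], ["violence", "affection"])` gives "affection" a score of zero in both documents, because a keyword that appears in every document carries no distinguishing weight under this variant.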





Heat Maps
After the TF-IDF results are obtained, heat maps are generated.
Keywords are replaced with their English translations for the visual output. For each group, the frequency of each keyword is displayed on a heat map.
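
One way to go from per-document scores to a group-by-keyword heat map is sketched below. Averaging the scores within each group, the use of matplotlib, and the colour map are assumptions. The group labels would be the four location/gender combinations described under Initial Processing.

```python
from collections import defaultdict


def group_means(scores, groups, keywords):
    """Average each keyword's score within each group.

    scores: list of {keyword: score} dicts, one per document.
    groups: list of group labels, aligned with scores.
    Returns (sorted group names, matrix with one row per group).
    """
    buckets = defaultdict(list)
    for row, group in zip(scores, groups):
        buckets[group].append(row)
    names = sorted(buckets)
    matrix = [
        [sum(row[kw] for row in buckets[name]) / len(buckets[name]) for kw in keywords]
        for name in names
    ]
    return names, matrix


def plot_heat_map(names, keywords, matrix, out_path="heatmap.png"):
    """Draw the group-by-keyword matrix as a heat map and save it to a file."""
    # Imported here so group_means() can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="viridis", aspect="auto")
    ax.set_xticks(range(len(keywords)))
    ax.set_xticklabels(keywords, rotation=45, ha="right")
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    fig.colorbar(image, ax=ax, label="mean score")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
```

Keeping the aggregation separate from the plotting makes it easy to inspect the matrix before any figure is drawn.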

