METHODS & CODE BLOCKS

Python

Initial Processing

1. Import the contents of transcripts downloaded from the Chinese court documents archive: 120 court case transcripts.

(40 each for Tibet with a female defendant, Tibet with a male defendant, Shanghai with a female defendant, and Shanghai with a male defendant.)

2. Filter the content of the documents, removing all punctuation and tags.
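The two steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the folder layout, the `.txt` extension, the UTF-8 encoding, and the function names `clean_text` and `load_transcripts` are all hypothetical, and "tags" are assumed to be angle-bracket markup.

```python
import re
from pathlib import Path
from typing import Dict

# Assumption: tags are angle-bracket markup such as <p> or <br/>.
TAG_RE = re.compile(r"<[^>]+>")
# Anything that is neither a word character nor whitespace counts as punctuation.
# In Python 3 this is Unicode-aware, so full-width Chinese punctuation is covered.
PUNCT_RE = re.compile(r"[^\w\s]|_")


def clean_text(raw: str) -> str:
    """Strip tags and punctuation, then collapse runs of whitespace."""
    no_tags = TAG_RE.sub("", raw)
    no_punct = PUNCT_RE.sub("", no_tags)
    return " ".join(no_punct.split())


def load_transcripts(folder: str) -> Dict[str, str]:
    """Read every .txt file in a (hypothetical) folder and clean its content."""
    return {
        path.name: clean_text(path.read_text(encoding="utf-8"))
        for path in sorted(Path(folder).glob("*.txt"))
    }


sample = clean_text("<p>原告：张某，女。</p>")
```

Keeping one folder per group (for example, one for Shanghai cases with a female defendant) would make it easy to keep the four groups separate in later steps.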

Tokenization

1. Import a word-segmentation library such as jieba

(jieba uses a built-in dictionary to recognize and separate continuous Chinese characters into words.)

2. Use other dictionaries (stop-word lists) to identify and remove common stop words
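A minimal sketch of the tokenization steps above. `jieba.lcut` is jieba's list-returning segmentation call; the stop words shown are a tiny made-up sample, and the helper names `load_stop_words` and `tokenize` are hypothetical. If jieba is not installed, the sketch falls back to one token per character so that it still runs, which is only a stand-in for real segmentation.

```python
from typing import Callable, Iterable, List, Set

try:
    import jieba  # third-party segmenter named in the text

    default_segment: Callable[[str], List[str]] = jieba.lcut
except ImportError:
    # Fallback (assumption): one token per character, not real word segmentation.
    default_segment = list


def load_stop_words(lines: Iterable[str]) -> Set[str]:
    """Build a stop-word set from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def tokenize(
    text: str,
    stop_words: Set[str],
    segment: Callable[[str], List[str]] = default_segment,
) -> List[str]:
    """Segment text into words and drop stop words and whitespace-only tokens."""
    return [tok for tok in segment(text) if tok.strip() and tok not in stop_words]


# Made-up sample; a real run would read a stop-word file line by line.
stop_words = load_stop_words(["的", "了", "在"])
tokens = tokenize("原告的要求", stop_words)
```

Passing the segmenter in as a parameter keeps the stop-word filtering independent of which segmentation library is used.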

Visualisation

1. Recombination

Tokenized documents are recombined into a single string for further analysis.

2. The data is processed into visual output in the form of word clouds

The specifics of the visual output are adjusted.
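The recombination step and the word counts behind a word cloud can be sketched as below. The token list is made up, and the helper names are hypothetical. `save_word_cloud` assumes the third-party `wordcloud` package and a font file that contains Chinese glyphs; the font path is left to the caller because the source does not say which font was used, and the size and background values are only illustrative defaults.

```python
from collections import Counter
from typing import Dict, List


def recombine(tokens: List[str]) -> str:
    """Join tokens into one space-separated string."""
    return " ".join(tokens)


def word_frequencies(tokens: List[str], top_n: int = 100) -> Dict[str, int]:
    """Count tokens and keep the top_n most frequent ones."""
    return dict(Counter(tokens).most_common(top_n))


def save_word_cloud(freqs: Dict[str, int], font_path: str, out_path: str) -> None:
    """Render the frequencies as a word cloud image (requires `wordcloud`)."""
    from wordcloud import WordCloud  # imported lazily: third-party dependency

    cloud = WordCloud(
        font_path=font_path,       # must point to a font with Chinese glyphs
        width=800,                 # illustrative output size
        height=400,
        background_color="white",
    )
    cloud.generate_from_frequencies(freqs)
    cloud.to_file(out_path)


# Made-up tokens standing in for one tokenized transcript.
tokens = ["离婚", "感情", "离婚", "子女"]
text = recombine(tokens)
freqs = word_frequencies(tokens)
```

Adjusting "the specifics of the visual output" would then amount to changing the arguments passed to `WordCloud`.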

TF-IDF Analysis

1. Translation

The keywords of the processed documents are translated into English to facilitate analysis.

2. TF-IDF

TF-IDF analysis is performed to measure how prominently the previously defined keywords appear, in order to examine how the causes cited in a divorce trial relate to the geographical location of the case and the gender of the defendant.
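A minimal sketch of the two steps above, using the textbook definition of TF-IDF (term frequency times log(N / document frequency)); the source does not say which TF-IDF variant or library was used, so this is an assumption. The glossary, the keywords, and the token lists are made up for illustration, and each group is treated as a single combined document, which is also an assumption.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical glossary: Chinese keyword -> English label.
GLOSSARY = {"暴力": "violence", "赌博": "gambling", "分居": "separation"}


def tf_idf(
    groups: Dict[str, List[str]], keywords: List[str]
) -> Dict[str, Dict[str, float]]:
    """Score each keyword in each group with tf * log(N / df)."""
    n_groups = len(groups)
    counts = {name: Counter(tokens) for name, tokens in groups.items()}
    # Document frequency: in how many groups does the keyword appear at all?
    df = {kw: sum(1 for c in counts.values() if c[kw] > 0) for kw in keywords}
    scores: Dict[str, Dict[str, float]] = {}
    for name, tokens in groups.items():
        total = len(tokens) or 1  # avoid division by zero for empty groups
        scores[name] = {}
        for kw in keywords:
            tf = counts[name][kw] / total
            idf = math.log(n_groups / df[kw]) if df[kw] else 0.0
            scores[name][kw] = tf * idf
    return scores


# Made-up token lists standing in for two of the four groups.
groups = {
    "Shanghai, female defendant": ["暴力", "暴力", "赌博"],
    "Tibet, male defendant": ["赌博", "分居"],
}
scores = tf_idf(groups, list(GLOSSARY))

# Relabel the keywords in English for analysis.
english_scores = {
    group: {GLOSSARY[kw]: value for kw, value in row.items()}
    for group, row in scores.items()
}
```

With this definition, a keyword that appears in every group gets a score of zero, so the scores highlight keywords that distinguish one group from the others.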

Heat Maps

After TF-IDF results are obtained, heat maps are generated.

Keywords are replaced with their English translations to generate the visual output. The frequency of each keyword within each group is displayed on a heat map.
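One way to sketch the heat map step: arrange the scores as a matrix with one row per group and one column per English keyword, then draw it. The scores below are made-up placeholders rather than results, the helper names are hypothetical, and `plot_heat_map` assumes the third-party matplotlib package; the source does not say which plotting library was used.

```python
from typing import Dict, List


def to_matrix(
    scores: Dict[str, Dict[str, float]], groups: List[str], keywords: List[str]
) -> List[List[float]]:
    """Build a groups x keywords matrix; missing keywords count as 0.0."""
    return [[scores[g].get(k, 0.0) for k in keywords] for g in groups]


def plot_heat_map(
    matrix: List[List[float]], groups: List[str], keywords: List[str], out_path: str
) -> None:
    """Draw the matrix as a heat map and save it (requires matplotlib)."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="viridis", aspect="auto")
    ax.set_xticks(range(len(keywords)))
    ax.set_xticklabels(keywords, rotation=45, ha="right")
    ax.set_yticks(range(len(groups)))
    ax.set_yticklabels(groups)
    fig.colorbar(image, ax=ax, label="TF-IDF score")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)


# Placeholder scores for illustration only.
scores = {
    "Shanghai, female defendant": {"violence": 0.4, "gambling": 0.1},
    "Tibet, male defendant": {"gambling": 0.3},
}
groups = list(scores)
keywords = ["violence", "gambling"]
matrix = to_matrix(scores, groups, keywords)
```

Calling `plot_heat_map(matrix, groups, keywords, "heatmap.png")` would write the figure to a file; the file name here is only an example.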
