SubRip Text - Application

SubRip Text is an application created in 2005.

#638 on PLDB

19 Years Old

522k Repos

There are at least 521,758 SubRip Text repos on GitHub
File extensions for SubRip Text include .srt
GitHub supports syntax highlighting for SubRip Text

Example from the web:

168
00:20:41,150 --> 00:20:45,109
- How did he do that?
- Made him an offer he couldn't refuse.
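
The cue above has three parts: a sequence number, a timing line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, and one or more lines of text. As a minimal sketch of how the timing line might be read, assuming exactly that layout, the following Python converts it into start and end times. The names parse_timestamp and parse_timing_line are hypothetical helpers written for this illustration, not part of any SubRip tool or library.

```python
import re
from datetime import timedelta

# Assumed timestamp layout: HH:MM:SS,mmm (a comma precedes the milliseconds).
_TIMESTAMP = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def parse_timestamp(text):
    """Convert one 'HH:MM:SS,mmm' timestamp into a timedelta (hypothetical helper)."""
    match = _TIMESTAMP.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {text!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis)


def parse_timing_line(line):
    """Split a 'start --> end' timing line into two timedeltas (hypothetical helper)."""
    start, arrow, end = line.partition("-->")
    if not arrow:
        raise ValueError(f"missing '-->' separator: {line!r}")
    return parse_timestamp(start), parse_timestamp(end)


# Timing line copied from the cue above.
start, end = parse_timing_line("00:20:41,150 --> 00:20:45,109")
duration = end - start  # how long the cue stays on screen
print(start, end, duration)
```

Using timedelta keeps the arithmetic exact to the millisecond, so the on-screen duration of a cue is simply end minus start.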

Example from Linguist:

1
00:00:01,250 --> 00:00:03,740
Adding NCL language.

2
00:00:04,600 --> 00:00:08,730
Thanks for the pull request! Do you know if these files are NCL too?

3
00:00:09,800 --> 00:00:13,700
Those are poorly-named documentation files for NCL functions.

4
00:00:14,560 --> 00:00:17,200
- What's better?
- This is better.

5
00:00:18,500 --> 00:00:23,000
- Would it be correct to recognise these files as text?
- Yes.

6
00:00:23,890 --> 00:00:30,000
In that case, could you add "NCL" to the text entry in languages.yml too?

7
00:00:30,540 --> 00:00:35,250
I added the example to "Text" and updated the license in the grammar submodule.

8
00:00:38,500 --> 00:00:42,360
Cloning the submodule fails for me in local with this URL.

9
00:00:42,360 --> 00:00:45,250
Could you use Git or HTTPS...?

10
00:00:46,810 --> 00:00:50,000
I updated the grammar submodule link to HTTPS.

11
00:00:51,100 --> 00:00:57,000
It's still failing locally. I don't think you can just update the .gitmodules file.

12
00:00:57,750 --> 00:01:03,000
You'll probably have to remove the submodule and add it again to be sure.

13
00:01:04,336 --> 00:01:11,800
- I'll see first if it's not an issue on my side...
- I removed the submodule and added it back with HTTPS.

14
00:01:13,670 --> 00:01:18,000
I tested the detection of NCL files with 2000 samples.

15
00:01:18,000 --> 00:01:25,000
The Bayesian classifier doesn't seem to be very good at distinguishing text from NCL.

16
00:01:25,000 --> 00:01:30,740
We could try to improve it by adding more samples, or we can define a new heuristic rule.

17
00:01:31,300 --> 00:01:36,200
- Do you want me to send you the sample files?
- Yes, please do.

18
00:01:37,500 --> 00:01:39,500
In your inbox.

19
00:01:41,285 --> 00:01:48,216
- So if I manually go through these and sort out the errors, would that help?
- Not really.

20
00:01:48,540 --> 00:01:55,145
It's a matter of keywords so there's not much to do there except for adding new samples.

21
00:01:55,447 --> 00:02:02,000
If adding a few more samples doesn't improve things, we'll see how to define a new heuristic rule.

22
00:02:04,740 --> 00:02:09,600
- I added quite a few NCL samples.
- That's a bit over the top, isn't it?

23
00:02:10,250 --> 00:02:16,000
We currently can't add too many samples because of #2117.

24
00:02:18,000 --> 00:02:20,830
(sigh) I decreased the number of added samples.

25
00:02:21,630 --> 00:02:25,300
Could you test the detection results in local with the samples I gave you?

26
00:02:26,000 --> 00:02:28,670
- What is the command to run that test?
- Here...

27
00:02:28,716 --> 00:02:38,650
[Coding intensifies]

28
00:02:38,650 --> 00:02:43,330
It is getting hung up on a false detection of Frege in one of the Text samples.

29
00:02:43,540 --> 00:02:46,115
Do you have any suggestions for implementing a heuristic?

30
00:02:47,640 --> 00:02:55,200
#2441 should fix this. In the meantime, you can change this in "test_heuristics.rb"

31
00:02:55,165 --> 00:02:57,240
Why did you have to change this?

32
00:02:57,777 --> 00:03:04,480
- It doesn't work for me unless I do that.
- Hum, same for me. Arfon, does it work for you?

33
00:03:04,920 --> 00:03:08,830
Requiring linguist/language doesn't work for me either.

34
00:03:09,300 --> 00:03:13,885
We restructured some of the requires a while ago and I think this is just out-of-date code.

35
00:03:14,065 --> 00:03:20,950
From a large sample of known NCL files taken from Github, it's now predicting with about 98% accuracy.

36
00:03:21,183 --> 00:03:28,000
For a large sample of other files with the NCL extension, it is around 92%.

37
00:03:27,880 --> 00:03:30,950
From those, nearly all of the errors come from one GitHub repository,

38
00:03:30,950 --> 00:03:34,160
and they all contain the text strings, "The URL" and "The Title".

39
00:03:35,660 --> 00:03:43,260
- Do you mean 92% files correctly identified as text?
- Yes, it correctly identifies 92% as text.

40
00:03:44,000 --> 00:03:46,150
I'd really like to see this dramatically reduced.

41
00:03:46,150 --> 00:03:51,150
What happens if we reduce to around 5 NCL sample files?

42
00:03:51,150 --> 00:03:52,600
Does Linguist still do a reasonable job?

43
00:03:53,470 --> 00:03:58,190
I reduced it to 16 NCL samples and 8 text samples.

44
00:03:58,190 --> 00:04:01,720
It correctly classifies my whole set of known NCL files.

45
00:04:01,870 --> 00:04:05,730
I tried with 5 samples but could not get the same level of accuracy.

46
00:04:06,670 --> 00:04:10,400
It incorrectly classifies all of the NCL files in this GitHub repository.

47
00:04:11,130 --> 00:04:14,660
All of these files contain the text strings, "THE_URL:" and "THE_TITLE:".

48
00:04:14,660 --> 00:04:19,500
It did not misclassify any other text-files with the extension NCL.

49
00:04:19,970 --> 00:04:25,188
With 100% accuracy? Does that mean it that the results are better with less samples??

50
00:04:25,610 --> 00:04:31,190
I also removed a sample text-file which should have been classified as an NCL file.

51
00:04:31,000 --> 00:04:35,895
I think that probably made most of the difference, although I didn't test it atomically.

52
00:04:35,895 --> 00:04:38,370
Okay, that makes more sense.

53
00:04:39,515 --> 00:04:43,450
I don't get the same results for the text files. Full results here.

54
00:04:44,650 --> 00:04:50,000
They all look correctly classified to me, except for the ones in Fanghuan's repository.

55
00:04:50,000 --> 00:04:55,920
I manually went through all of the ones where I didn't already know based on the filename or the repository owner.

56
00:04:56,526 --> 00:05:00,000
[Presses button] It now correctly classifies all of my test files.

57
00:05:00,000 --> 00:05:05,970
R. Pavlick, thanks for this. These changes will be live in the next release of Linguist. In the next couple of weeks.

58
00:05:05,970 --> 00:05:07,450
Great! Thanks.
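
The Linguist sample is a sequence of cues separated by blank lines, and a few of its cues begin slightly before the previous one ends (cue 31 starts at 00:02:55,165 while cue 30 ends at 00:02:55,200). The sketch below is one possible way to split such text into cues and flag those overlaps. It assumes cues are separated by blank lines and that each cue starts with a numeric index followed by a timing line; the names Cue, parse_srt, and SAMPLE are hypothetical and chosen only for this illustration.

```python
import re
from dataclasses import dataclass

# Assumed timing-line layout: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


@dataclass
class Cue:
    index: int
    start_ms: int
    end_ms: int
    text: str


def _to_ms(hours, minutes, seconds, millis):
    """Convert the four timestamp fields into total milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(source):
    """Parse SRT text into a list of Cue objects, skipping malformed blocks."""
    cues = []
    # Assumption: cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", source.strip()):
        lines = block.splitlines()
        if len(lines) < 2 or not lines[0].strip().isdigit():
            continue
        match = TIMING.fullmatch(lines[1].strip())
        if match is None:
            continue
        fields = match.groups()
        cues.append(
            Cue(
                index=int(lines[0]),
                start_ms=_to_ms(*fields[:4]),
                end_ms=_to_ms(*fields[4:]),
                text="\n".join(lines[2:]),
            )
        )
    return cues


# Cues 30 and 31, copied from the Linguist sample above.
SAMPLE = """\
30
00:02:47,640 --> 00:02:55,200
#2441 should fix this. In the meantime, you can change this in "test_heuristics.rb"

31
00:02:55,165 --> 00:02:57,240
Why did you have to change this?
"""

cues = parse_srt(SAMPLE)
# A cue overlaps its predecessor when it starts before the predecessor ends.
overlaps = [
    (prev.index, cur.index)
    for prev, cur in zip(cues, cues[1:])
    if cur.start_ms < prev.end_ms
]
print(overlaps)
```

Storing times as integer milliseconds keeps comparisons between neighbouring cues simple. Whether a brief overlap is a problem depends on the player, so the sketch only reports it rather than correcting it.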